CN117787366B - Hardware accelerator and scheduling method thereof - Google Patents

Hardware accelerator and scheduling method thereof

Info

Publication number
CN117787366B
CN117787366B (granted from application CN202410219345.XA)
Authority
CN
China
Prior art keywords
matrix
calculation
unit
layer
data
Legal status: Active (granted)
Application number
CN202410219345.XA
Other languages
Chinese (zh)
Other versions
CN117787366A (en)
Inventor
杨宏斌
董刚
赵雅倩
曹其春
胡克坤
梁玲燕
Current Assignee / Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority claimed from CN202410219345.XA
Publication of CN117787366A
Application granted
Publication of CN117787366B
Legal status: Active


Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a hardware accelerator and a scheduling method thereof. The hardware accelerator includes: a scheduling core unit, a feature buffer unit, a weight buffer unit, and a plurality of computing components, the computing components comprising a matrix multiplication calculating section, a Softmax calculating section, and a layer normalization calculating section. The scheduling core unit is used for receiving and caching starting association data transmitted by an external device, and for receiving a start calculation command transmitted by the external device; upon receiving the start calculation command, it controls and schedules all computing components based on the starting association data so that the computing components execute the computing tasks of each computing layer of the neural network model. The accelerator effectively accelerates the computation of each computing layer of the neural network model, realizes efficient data storage and scheduling, and offers strong feasibility at low cost.

Description

Hardware accelerator and scheduling method thereof
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a hardware accelerator and a scheduling method thereof.
Background
Compared with the traditional recurrent neural network (RNN, Recurrent Neural Network) or convolutional neural network (CNN, Convolutional Neural Network), the Transformer model (a neural network model based on the attention mechanism) offers better parallelism and longer-range context modeling capability, and shows superior performance in fields such as NLP (Natural Language Processing), image recognition, object detection, and object tracking.
The current Transformer model basically adopts an encoder-decoder structure, and the computation of the Transformer model involves a large number of matrix multiplication operations, Softmax (normalized exponential function) operations, Layernorm (LN, Layer Normalization) operations, residual operations and activation operations. Given the large amount of computation of the Transformer model, how to realize efficient data storage and scheduling and how to realize computation acceleration according to the network structure of the Transformer model is a problem to be solved in the field.
Disclosure of Invention
The hardware accelerator and the scheduling method thereof provided by the invention are used for solving the problems in the prior art of how to realize efficient data storage and scheduling and how to realize calculation acceleration according to the network structure of the Transformer model.
The present invention provides a hardware accelerator comprising:
A scheduling core unit, a feature buffer unit, a weight buffer unit, and a plurality of computing units, wherein the computing units comprise: a matrix multiplication calculating section, a Softmax calculating section, and a layer normalization calculating section;
The scheduling core unit is used for receiving and caching starting association data transmitted by external equipment, and the starting association data comprises: memory access parameters, calculation control parameters, quantization parameters and start signals of the calculation components; and receiving a start calculation command transmitted by the external device, and controlling and scheduling all the calculation components based on the start association data under the condition of receiving the start calculation command so as to enable the calculation components to execute calculation tasks of each calculation layer of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading data cached in the feature cache unit and the weight cache unit; the scheduling core unit is also used for receiving task completion signals returned by all the computing components and prompting the external equipment to read task processing results of all the computing components based on the task completion signals.
According to the hardware accelerator provided by the invention, the characteristic cache unit comprises a plurality of groups of shared cache units; the shared buffer unit is used for buffering the feature matrix data transmitted by the external device and intermediate results generated in the process of executing the calculation tasks by the calculation components, wherein the intermediate results refer to task processing results obtained in the process of executing the calculation tasks of the calculation components of the calculation layers before final calculation results of the current feature matrix data are obtained, and the final calculation results refer to results output after the current feature matrix data are processed by the neural network model.
According to the present invention, there is provided a hardware accelerator, the neural network model including: the input layer, the first normalization layer, the Q generation layer, the K generation layer, the V generation layer, the dot product attention layer, the Softmax layer, the attention multiplication layer, the projection layer, the second normalization layer, the first full connection layer and the second full connection layer are sequentially connected; the Q generation layer is used for obtaining a Q matrix, the K generation layer is used for obtaining a K matrix, and the V generation layer is used for obtaining a V matrix;
The plurality of groups of shared cache units comprise: the system comprises a first shared cache unit, a second shared cache unit, a third shared cache unit and a fourth shared cache unit, wherein the cache depth of the first shared cache unit is the scale of a group of target feature matrix data, the target feature matrix data refer to feature matrix data output by each target calculation layer of the neural network model, and the target calculation layer refers to any one layer of the first layer normalization layer, the Q generation layer, the K generation layer, the V generation layer, the dot product attention layer, the Softmax layer, the attention multiplication layer, the projection layer, the second layer normalization layer and the second full connection layer; the buffer depths of the second shared buffer unit and the first shared buffer unit are the same, and the buffer depth of the third shared buffer unit is 2 times of the buffer depth of the first shared buffer unit or the second shared buffer unit so as to store the Q matrix and the V matrix simultaneously; the depth of the fourth shared cache unit is 4 times of the depth of the first shared cache unit or the second shared cache unit, and the fourth shared cache unit is used for storing data to be cached in the first full-connection layer task processing process.
According to the hardware accelerator provided by the invention, when the computing task of the input layer is executed, the first shared cache unit is used for caching the currently input feature matrix data, and the second shared cache unit is used for carrying out residual backup on the currently input feature matrix data;
when executing the calculation task of the first layer normalization calculation unit, the first shared cache unit is used for providing a data source for executing the calculation task for the layer normalization calculation unit, and the task processing result of the layer normalization calculation unit is stored in the first shared cache unit;
When executing the computing task of the Q generation layer, the first shared buffer unit is one of the data sources of the computing task executed by the matrix multiplication computing unit, the weight buffer unit is the second data source of the computing task executed by the matrix multiplication computing unit, the task processing result of the matrix multiplication computing unit is stored in the first buffer space of the third shared buffer unit, the third shared buffer unit comprises the first buffer space and the second buffer space, and the depths of the first buffer space and the second buffer space are the same;
When executing the calculation task of the K generation layer, the first shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the weight buffer unit is the second data source of the calculation task executed by the matrix multiplication calculation unit, and the task processing result of the matrix multiplication calculation unit is stored in the fourth shared buffer unit;
When executing the calculation task of the V generation layer, the first shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the weight buffer unit is the second data source of the calculation task executed by the matrix multiplication calculation unit, and the task processing result of the matrix multiplication calculation unit is stored in the second buffer space;
When the computing task of the dot product attention layer is executed, the first buffer space is one of data sources of the computing task executed by the matrix multiplication computing component, the fourth shared buffer unit is two of data sources of the computing task executed by the matrix multiplication computing component, and the task processing result of the matrix multiplication computing component is stored in the first shared buffer unit;
When executing the calculation task of the Softmax layer, the first shared cache unit is a data source for executing the calculation task by the Softmax calculation unit, and a task processing result of the Softmax calculation unit is stored in the first shared cache unit;
when executing the calculation task of the attention multiplication layer, the first shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the second buffer space is the second data source of the calculation task executed by the matrix multiplication calculation unit, and the task processing result of the matrix multiplication calculation unit is stored in the fourth shared buffer unit;
When executing the calculation task of the projection layer, the fourth shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the weight buffer unit is the second data source of the calculation task executed by the matrix multiplication calculation unit, the second shared buffer unit is the third data source of the calculation task executed by the matrix multiplication calculation unit, the task processing result of the matrix multiplication calculation unit is stored in the first shared buffer unit, and the third shared buffer unit is used for carrying out residual backup on the task processing result of the matrix multiplication calculation unit;
when executing the calculation task of the second layer normalization layer, the first shared cache unit is used for providing a data source for executing the calculation task for the layer normalization calculation component, and the task processing result of the layer normalization calculation component is stored in the first shared cache unit;
When executing the computing task of the first full connection layer, the first shared buffer unit is one of the data sources of the computing task executed by the matrix multiplication computing unit, the weight buffer unit is two of the data sources of the computing task executed by the matrix multiplication computing unit, and the task processing result of the matrix multiplication computing unit is stored in the fourth shared buffer unit;
when executing the computing task of the second full connection layer, the fourth shared buffer unit is one of the data sources of the computing task executed by the matrix multiplication computing unit, the weight buffer unit is two of the data sources of the computing task executed by the matrix multiplication computing unit, the third shared buffer unit is three of the data sources of the computing task executed by the matrix multiplication computing unit, the task processing result of the matrix multiplication computing unit is stored in the first shared buffer unit, and the second shared buffer unit is used for carrying out residual backup on the task processing result of the matrix multiplication computing unit.
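By way of illustration only, the per-layer buffer mapping described above can be summarized as a small lookup table. The following Python sketch is not part of the claimed hardware; the names URAM1 to URAM4 and URAM3_up / URAM3_down (the two halves of the third shared cache unit) are labels taken from this description, and weight_buf stands for the weight cache unit:

# Illustrative model of the per-layer buffer mapping described above.
BUFFER_MAP = {
    # layer            : (data sources,                     destination,  residual backup)
    "input"            : (["external"],                     "URAM1",      "URAM2"),
    "layer_norm_1"     : (["URAM1"],                        "URAM1",      None),
    "q_generation"     : (["URAM1", "weight_buf"],          "URAM3_up",   None),
    "k_generation"     : (["URAM1", "weight_buf"],          "URAM4",      None),
    "v_generation"     : (["URAM1", "weight_buf"],          "URAM3_down", None),
    "dot_product_attn" : (["URAM3_up", "URAM4"],            "URAM1",      None),
    "softmax"          : (["URAM1"],                        "URAM1",      None),
    "attn_times_v"     : (["URAM1", "URAM3_down"],          "URAM4",      None),
    "projection"       : (["URAM4", "weight_buf", "URAM2"], "URAM1",      "URAM3"),
    "layer_norm_2"     : (["URAM1"],                        "URAM1",      None),
    "fc1"              : (["URAM1", "weight_buf"],          "URAM4",      None),
    "fc2"              : (["URAM4", "weight_buf", "URAM3"], "URAM1",      "URAM2"),
}

def sources(layer):
    # read ports a computing unit would be pointed at for this layer
    return BUFFER_MAP[layer][0]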
According to the hardware accelerator provided by the invention, when data writing or storage is carried out on the target shared storage, the original data in the target shared storage is replaced by the current data to be stored, and the target shared storage is any one of the first shared cache unit, the second shared cache unit, the third shared cache unit and the fourth shared cache unit.
According to the hardware accelerator provided by the invention, the scheduling core unit is an offline operation unit, and the scheduling core unit is specifically configured to send corresponding parameters of each computing component in the startup association data to the corresponding computing component under the condition that the startup computing command is received, so that the computing component executes computing tasks of each computing layer of the neural network model;
The matrix multiplication calculation component is configured to execute a calculation task of the calculation layer in the neural network model, where matrix multiplication calculation is required, under the condition that corresponding data issued by the scheduling core unit is received, and an execution process includes: loading a first to-be-multiplied matrix from the feature buffer unit, loading a second to-be-multiplied matrix from the feature buffer unit or the weight buffer unit, and performing matrix multiplication operation based on the first to-be-multiplied matrix and the second to-be-multiplied matrix to obtain a matrix multiplication result; and under the condition that the current matrix multiplication result is an intermediate result, caching the current matrix multiplication result to the feature caching unit so that the Softmax computing component and the layer normalization computing component obtain data to be computed from the feature caching unit when executing computing tasks, wherein the intermediate result refers to task processing results obtained by the computing component in the process of executing computing tasks of each computing layer before obtaining final computing results of the current feature matrix data, and the final computing results refer to results output after the current feature matrix data are processed by a neural network model.
According to the hardware accelerator provided by the invention, under the condition that the current matrix multiplication result is the final calculation result of the current feature matrix data, the current matrix multiplication result is cached to the feature caching unit, and the current matrix multiplication result is transmitted to a preset block random access memory for storage; transmitting a calculation completion signal to the dispatching core unit, wherein the calculation completion signal is used for indicating the dispatching core unit to prompt the external equipment to read task processing results of all the calculation components; the block random access memory is used for feeding back all the task processing results to the external equipment under the condition that the reading command of the external equipment is received.
According to the hardware accelerator provided by the invention, the scheduling core unit is further used for receiving the bias parameters of each computing layer transmitted by the external device before the computing task starts; respectively issuing all the bias parameters to the matrix multiplication calculation part and the layer normalization calculation part;
the matrix multiplication calculation component and the layer normalization calculation component are both used for carrying out on-chip caching on the bias parameters;
The scheduling core unit is further configured to send bias parameter marking information corresponding to a current computing task to the matrix multiplication computing unit or the layer normalization computing unit in a task execution process, where the bias parameter marking information is used to instruct the matrix multiplication computing unit or the layer normalization computing unit to determine a target bias parameter from all bias parameters cached on a self chip, and the target bias parameter is used to complete execution of the current computing task.
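For illustration only, the bias-parameter marking mechanism described above can be modeled in a few lines of Python; the class and method names below are placeholders and do not correspond to registers or interfaces of the accelerator:

# Minimal sketch: the scheduling core issues all bias parameters once, then
# sends only a small marker per task; the computing unit looks up the target
# bias from its on-chip copy. All names are illustrative.
class ComputeUnitBiasCache:
    def __init__(self):
        self.on_chip_bias = {}          # marker -> bias vector, cached once

    def preload(self, all_bias):        # issued by the scheduling core before the tasks start
        self.on_chip_bias.update(all_bias)

    def select(self, bias_marker):      # marker sent with the current computing task
        return self.on_chip_bias[bias_marker]

unit = ComputeUnitBiasCache()
unit.preload({"q_generation": [0.1, 0.2], "fc1": [0.0, 0.3]})
target_bias = unit.select("q_generation")   # bias used to complete the current task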
According to the hardware accelerator provided by the invention, in the case that the matrix multiplication calculating part performs the calculation tasks of any one of the Q generation layer, the K generation layer, the V generation layer and the first fully-connected layer of the neural network model, the inputs of the matrix multiplication calculating part are feature matrix data and weight matrix data; the step of performing the computing task by the matrix multiplication computing component includes: determining the feature matrix data as a first to-be-multiplied matrix, and determining the weight matrix data as a second to-be-multiplied matrix; traversing each column of the second matrix to be multiplied for a first row of matrix blocks of the first matrix to be multiplied to obtain a product between the first row of matrix blocks and the corresponding column of the second matrix to be multiplied; traversing each column of the second matrix to be multiplied for the next row of matrix blocks of the first matrix to be multiplied until the product between the last row of matrix blocks of the first matrix to be multiplied and the last column of matrix blocks of the second matrix to be multiplied is obtained, so that a task processing result of the current computing task is obtained; the data source of the feature matrix data is the feature cache unit, and the data source of the weight matrix data is the weight cache unit.
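For illustration only, the traversal order described above (fixing one row of matrix blocks of the first to-be-multiplied matrix and sweeping every column of the second before moving to the next row) can be sketched as follows; the block size T and the NumPy formulation are assumptions used only to make the ordering concrete:

# Sketch of the traversal order described above: for each row of blocks of the
# feature matrix X, sweep every column of blocks of the weight matrix W before
# moving to the next row of blocks.
import numpy as np

def blocked_matmul(X, W, T=16):
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    Y = np.zeros((M, N))
    for i in range(0, M, T):                 # one row of matrix blocks of X
        for j in range(0, N, T):             # traverse every column of blocks of W
            acc = np.zeros((min(T, M - i), min(T, N - j)))
            for k in range(0, K, T):         # accumulate along the shared dimension
                acc += X[i:i+T, k:k+T] @ W[k:k+T, j:j+T]
            Y[i:i+T, j:j+T] = acc
    return Y

X = np.random.rand(64, 48)
W = np.random.rand(48, 32)
assert np.allclose(blocked_matmul(X, W), X @ W)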
According to the hardware accelerator provided by the invention, under the condition that the matrix multiplication calculating part executes the calculation task of any one layer of the projection layer and the second full connection layer of the neural network model, the input of the matrix multiplication calculating part is feature matrix data, weight matrix data and residual matrix; the step of performing the computing task by the matrix multiplication computing component includes: determining the feature matrix data as a first to-be-multiplied matrix, and determining the weight matrix data as a second to-be-multiplied matrix; multiplying the first to-be-multiplied matrix with the second to-be-multiplied matrix, and adding the obtained product with the residual matrix to obtain a task processing result of the current computing task; the data source of the feature matrix data is the feature cache unit, the data source of the weight matrix data is the weight cache unit, and the data source of the residual matrix is the feature cache unit.
According to the hardware accelerator provided by the invention, in the case that the matrix multiplication calculating part executes the calculation task of the dot product attention layer of the neural network model, the input of the matrix multiplication calculating part is a Q matrix and a K matrix, and the step of executing the calculation task by the matrix multiplication calculating part comprises the following steps: partitioning the Q matrix to obtain a plurality of Q submatrices; the K matrixes are subjected to blocking operation to obtain a plurality of K submatrices, the number of the Q submatrices is the same as that of the K submatrices, and the Q submatrices are in one-to-one correspondence with the K submatrices; multiplying each Q submatrix with the corresponding K submatrix respectively to obtain a task processing result of the current computing task; and the data sources of the Q matrix and the K matrix are the characteristic cache units.
According to the hardware accelerator provided by the invention, in the case that the matrix multiplication calculating part executes the calculation task of the attention multiplication value layer of the neural network model, the inputs of the matrix multiplication calculating part are attention score matrix data and a V matrix, and the step of executing the calculation task by the matrix multiplication calculating part comprises the following steps: performing partitioning operation on the attention score matrix data to obtain a plurality of attention score sub-matrices; the V matrix is subjected to blocking operation to obtain a plurality of V sub-matrices, the number of the attention score sub-matrices is the same as that of the V sub-matrices, and the attention score sub-matrices are in one-to-one correspondence with the V sub-matrices; obtaining a task processing result of the current computing task by performing matrix multiplication on the attention score sub-matrix and the transposed V sub-matrix, wherein performing matrix multiplication on the attention score sub-matrix and the transposed V sub-matrix comprises: the method comprises an inter-block transposition step and an intra-block transposition step, wherein the inter-block transposition refers to obtaining a V sub-matrix corresponding to the attention score sub-matrix in a jumping address mode in the matrix multiplication process, and the intra-block transposition refers to completing transposition of elements in each V sub-matrix block by adjusting the positions of bus elements; and the attention score matrix data and the data sources of the V matrix are the characteristic cache units.
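For illustration only, the two-level transposition described above can be modeled functionally: the inter-block step fetches block (j, i) instead of block (i, j) by jumping the read address, and the intra-block step reorders the elements inside the fetched block. The block size and the NumPy formulation are assumptions, not the hardware implementation:

# Functional model of the blocked transpose of V (square matrix, size a multiple of T).
import numpy as np

def read_block_transposed(V_blocks, i, j):
    block = V_blocks[j][i]   # inter-block transpose: address jump picks the mirrored block
    return block.T           # intra-block transpose: reorder elements within the block

def transpose_via_blocks(V, T=4):
    n = V.shape[0] // T
    V_blocks = [[V[r*T:(r+1)*T, c*T:(c+1)*T] for c in range(n)] for r in range(n)]
    rows = []
    for i in range(n):
        rows.append(np.hstack([read_block_transposed(V_blocks, i, j) for j in range(n)]))
    return np.vstack(rows)

V = np.random.rand(8, 8)
assert np.allclose(transpose_via_blocks(V), V.T)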
According to the present invention, there is provided a hardware accelerator, further comprising: the memory unit is used for receiving the weight matrix data transmitted by the external equipment and loading the weight matrix data to the weight caching unit;
and the weight caching unit performs ping-pong caching on the weight matrix data by utilizing the caching depth.
According to the hardware accelerator provided by the invention, based on the cache depth, the weight cache unit is divided into a first cache subunit and a second cache subunit, wherein the first cache subunit and the second cache subunit are used for ping-pong weight matrix data to be used by each calculation layer of the neural network model, and the data source of the weight matrix data is the external equipment;
And the buffer depths of the first buffer subunit and the second buffer subunit are the scale of the weight matrix data of the first full-connection layer or the second full-connection layer of the neural network model.
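For illustration only, the ping-pong scheme described above can be modeled as follows: while the computing units read weights from one sub-buffer, the other sub-buffer is filled with the weights of an upcoming layer, and the roles are then swapped. All names below are illustrative:

# Minimal sketch of ping-pong buffering of the weight cache unit.
class PingPongWeightBuffer:
    def __init__(self):
        self.banks = [None, None]   # two sub-buffers, each sized for a fully-connected weight matrix
        self.active = 0             # sub-buffer currently being read by the computing units

    def preload(self, weights):
        # load upcoming weights into the sub-buffer that is NOT being read
        self.banks[1 - self.active] = weights

    def swap(self):
        self.active = 1 - self.active

    def read(self):
        return self.banks[self.active]

buf = PingPongWeightBuffer()
buf.preload("W_q / W_k / W_v / W_proj")   # preloaded while earlier layers compute
buf.swap()
current_weights = buf.read()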
According to the hardware accelerator provided by the invention, all weight matrix data comprise: first, second, third, fourth, fifth and sixth matrix data, the first matrix data being weight data of a Q-generation layer of the neural network model, the second matrix data being weight matrix data of a K-generation layer of the neural network model, the third matrix data being weight matrix data of a V-generation layer of the neural network model, the fourth matrix data being weight matrix data of a projection layer of the neural network model, the fifth matrix data being weight matrix data of a first fully-connected layer of the neural network model, the sixth matrix data being weight matrix data of a second fully-connected layer of the neural network model;
When executing a calculation task of an input layer of the neural network model, the first buffer subunit is configured to preload the first matrix data, the second matrix data, the third matrix data, and the fourth matrix data; the second cache subunit preloads the fifth matrix data; the weight data preloaded in the first cache subunit and the second cache subunit are used for reading or loading the weight data by each computing component when the computing components perform computing task processing;
The first buffer subunit is configured to preload the sixth matrix data when performing a computation task of a projection layer of the neural network model;
When executing the calculation task of the first full connection layer of the neural network model, the second buffer subunit is configured to preload the first matrix data, the second matrix data, the third matrix data, and the fourth matrix data;
The first cache subunit is configured to preload the fifth matrix data when performing a computational task of a second fully-connected layer of the neural network model.
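For illustration only, the preload points described above can be written down as a schedule; bank1 and bank2 stand for the first and second cache subunits, and W_q, W_k, W_v, W_proj, W_fc1, W_fc2 stand for the first to sixth weight matrix data:

# Preload schedule described above: while a given layer is being computed, the
# idle weight-cache subunit is filled with weights that later layers will need.
PRELOAD_SCHEDULE = {
    # layer being computed : list of (subunit, weights preloaded into it)
    "input"      : [("bank1", ["W_q", "W_k", "W_v", "W_proj"]),
                    ("bank2", ["W_fc1"])],
    "projection" : [("bank1", ["W_fc2"])],
    "fc1"        : [("bank2", ["W_q", "W_k", "W_v", "W_proj"])],  # for the next Block
    "fc2"        : [("bank1", ["W_fc1"])],                        # for the next Block
}

def preloads_during(layer):
    # which subunit is filled with which weights while this layer computes
    return PRELOAD_SCHEDULE.get(layer, [])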
According to the hardware accelerator provided by the invention, the memory unit is further used for receiving the feature matrix data transmitted by the external device and loading the feature matrix data to the feature cache unit.
According to the present invention, there is provided a hardware accelerator, further comprising: the input end of the multiplexer is respectively connected with the scheduling core unit, the matrix multiplication computing unit, the Softmax computing unit and the layer normalization computing unit so as to receive a result write-back control signal sent by the scheduling core unit and the task processing result sent by the matrix multiplication computing unit, the Softmax computing unit and the layer normalization computing unit; the result write-back control signal is used for indicating the multiplexer to write all the task processing results into a preset block random access memory, and the block random access memory is used for feeding all the task processing results back to the external equipment under the condition that a read command of the external equipment is received.
According to the present invention, there is provided a hardware accelerator, further comprising: and the multiplexer writes the task processing result into the block random access memory through the queue unit.
The invention also provides a server, comprising: the hardware accelerator as described in any of the above, configured to accelerate the computation of the neural network model.
The invention also provides a scheduling method of the hardware accelerator, which comprises the following steps:
receiving and caching starting association data transmitted by external equipment, wherein the starting association data comprises the following steps: memory access parameters, calculation control parameters, quantization parameters and start signals of the calculation components;
Receiving a starting calculation command transmitted by the external equipment, and controlling and scheduling all the calculation components based on the starting associated data under the condition of receiving the starting calculation command so as to enable the calculation components to execute calculation tasks of each calculation layer of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading data cached in the characteristic cache unit and the weight cache unit;
and receiving task completion signals returned by all the computing components, and prompting the external equipment to read task processing results of all the computing components based on the task completion signals.
According to the scheduling method of the hardware accelerator, under the condition that the buffering of the starting associated data is completed, the corresponding parameters of each computing component in the starting associated data are issued to the corresponding computing components; and each computing component caches the issued parameters on-chip under the condition of receiving the issued parameters.
According to the scheduling method of the hardware accelerator provided by the invention, under the condition that the starting calculation command is received, all the calculation components are controlled and scheduled based on the starting associated data, so that the calculation components execute the calculation tasks of each calculation layer of the neural network model, and the method comprises the following steps:
Under the condition that the starting calculation command is received, controlling the weight caching unit to preload weight matrix data, controlling the layer normalization calculation component to load feature matrix data from the feature caching unit, loading weight matrix data from the weight caching unit, and executing the calculation task of the first layer normalization layer of the neural network model based on the loaded feature matrix data and weight matrix data; the pre-loading of the weight matrix data refers to pre-loading the weight matrix data to be used by each computing component for each computing component to read or load;
controlling the matrix multiplication calculation component to sequentially execute the calculation tasks of the Q generation layer, the K generation layer, the V generation layer and the dot product attention layer of the neural network model under the condition that the calculation tasks of the first normalization layer are executed;
controlling the Softmax computing component to execute a computing task of a Softmax layer of the neural network model;
Controlling the matrix multiplication calculating part to execute the calculation task of the attention multiplication layer of the neural network model;
Under the condition that not all attention heads of the current Block of the neural network model have been calculated, re-executing the calculation tasks of the dot product attention layer, the Softmax layer and the attention multiplication layer, wherein each attention head involves the calculation tasks of the dot product attention layer, the Softmax layer and the attention multiplication layer;
Under the condition that all attention heads of the current Block are calculated, controlling the matrix multiplication calculation component to execute a calculation task of a projection layer of the neural network model, and controlling the weight caching unit to preload weight matrix data;
Under the condition that the calculation task of the projection layer is determined to be executed, controlling the layer normalization calculation component to execute the calculation task of a second layer normalization layer of the neural network model;
Under the condition that the execution of the calculation task of the second normalization layer is finished, controlling the matrix multiplication calculation component to execute the calculation task of the first full-connection layer of the neural network model, and controlling the weight caching unit to preload weight matrix data;
Under the condition that the completion of the calculation task of the first full-connection layer is determined, controlling the matrix multiplication calculation component to execute the calculation task of the second full-connection layer of the neural network model;
Under the condition that the execution of the computing task of the second full-connection layer is determined to be finished and it is determined that not all Blocks have been calculated, controlling the weight caching unit to preload weight matrix data, and repeatedly executing the computing task processing operations;
And ending the calculation under the condition that the completion of the calculation task of the second full connection layer is determined and the completion of all Block calculation is determined.
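For illustration only, the control flow of the above scheduling steps, including the per-head inner loop and the per-Block outer loop, can be sketched as follows. Each helper function stands for "issue parameters and a start signal to a computing component and wait for its task completion signal"; all names are placeholders rather than hardware interfaces:

# Control-flow sketch of the scheduling method above.
def run(unit, task):
    print(f"{unit}: {task}")

def preload(weights):
    print("weight buffer preload:", weights)

def schedule(num_blocks, num_heads):
    preload("W_q W_k W_v W_proj + W_fc1")            # on the start calculation command
    for block in range(num_blocks):
        run("LN", "LN1")
        run("GEMM", "Q_gen"); run("GEMM", "K_gen"); run("GEMM", "V_gen")
        for head in range(num_heads):                # repeat per attention head
            run("GEMM", "Q x K^T")                   # dot product attention
            run("Softmax", "attention scores")
            run("GEMM", "Att x V")                   # attention multiplication
        run("GEMM", "projection"); preload("W_fc2")
        run("LN", "LN2")
        run("GEMM", "FC1");        preload("W_q W_k W_v W_proj")  # next Block
        run("GEMM", "FC2");        preload("W_fc1")               # next Block
    print("prompt the external device to read the task processing results")

schedule(num_blocks=2, num_heads=2)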
The present invention also provides a scheduling core unit, comprising:
the data receiving subunit is configured to receive and cache start-up association data transmitted by an external device, where the start-up association data includes: access parameters, calculation control parameters, quantization parameters, and start signals for each calculation unit, the calculation unit comprising: a matrix multiplication calculating section, a Softmax calculating section, and a layer normalization calculating section;
the scheduling subunit is used for receiving a starting calculation command transmitted by the external equipment, and controlling and scheduling all the calculation components based on the starting associated data under the condition of receiving the starting calculation command so as to enable the calculation components to execute calculation tasks of each calculation layer of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading data cached in the characteristic cache unit and the weight cache unit;
and the prompting subunit is used for receiving the task completion signals returned by all the computing components and prompting the external equipment to read the task processing results of all the computing components based on the task completion signals.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the scheduling method of the hardware accelerator when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a scheduling method for a hardware accelerator as described in any of the above.
The invention has the beneficial effects that: the invention provides a hardware accelerator and a scheduling method thereof, wherein a scheduling core unit, a feature cache unit, a weight cache unit and a plurality of computing components are arranged, and the computing components comprise: a matrix multiplication calculating section, a Softmax calculating section, and a layer normalization calculating section; the scheduling core unit is used for receiving and caching starting association data transmitted by the external equipment, and the starting association data comprises: memory access parameters, calculation control parameters, quantization parameters and start signals of all calculation components; and receiving a starting calculation command transmitted by the external equipment, and controlling and scheduling all calculation components based on starting associated data under the condition of receiving the starting calculation command so as to enable the calculation components to execute calculation tasks of all calculation layers of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading data cached in the feature cache unit and the weight cache unit; and receiving task completion signals returned by all the computing components, and prompting the external equipment to read task processing results of all the computing components based on the task completion signals. The method can well achieve calculation acceleration of each calculation layer of the neural network model, and achieve efficient data storage and scheduling, and is high in feasibility and low in cost.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a hardware accelerator according to the present invention;
FIG. 2 is a schematic diagram showing a loading sequence of an input matrix by a matrix multiplication calculating unit in a hardware accelerator according to the present invention;
FIG. 3 is a schematic diagram II of the loading sequence of the matrix multiplication computation unit to the input matrix in the hardware accelerator according to the present invention;
FIG. 4 is a schematic diagram III of the loading sequence of the matrix multiplication computation unit to the input matrix in the hardware accelerator according to the present invention;
FIG. 5 is a schematic diagram showing a loading sequence of the matrix multiplication calculating unit to the input matrix in the hardware accelerator according to the present invention;
FIG. 6 is a flow chart of a method for scheduling hardware accelerators provided by the present invention;
FIG. 7 is a flow chart of a specific embodiment of a method for scheduling a hardware accelerator according to the present invention;
FIG. 8 is a schematic diagram of a dispatch system for a hardware accelerator according to the present invention;
Fig. 9 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following describes, by way of example, a hardware accelerator and a scheduling method thereof according to the present invention with reference to fig. 1 to 9.
Referring to fig. 1, the hardware accelerator provided in this embodiment includes:
A Scheduling Core unit, a Feature Buffer unit, a Weight Buffer unit, and a plurality of computing units, the plurality of computing units including: a matrix multiplication computing unit (GEMM Unit), a Softmax computing unit (Softmax Unit), and a layer normalization computing unit (LN Unit).
It should be noted that, through analysis, the neural network model involves a large number of matrix multiplication, Softmax, Layernorm, residual and activation operations. The data sources of the two matrices involved in a matrix multiplication fall mainly into two types. The first is a multiplication between feature matrix data and weight matrix data: for example, the generation of Q (the Q matrix) uses feature matrix data X and weight matrix data W_Q; the generation of K (the K matrix) uses feature matrix data X and weight matrix data W_K; and the generation of V (the V matrix) uses feature matrix data X and weight matrix data W_V. The second is a multiplication between feature matrix data and feature matrix data, such as Q·K^T (i.e. the calculation task performed by the dot product attention layer) and Att·V (i.e. the calculation task performed by the attention multiplication layer) in the attention calculation. Therefore, in this embodiment, by providing an independent feature buffer unit and an independent weight buffer unit, and by independently providing the matrix multiplication calculating section, the Softmax calculating section, and the layer normalization calculating section, calculation acceleration can be achieved. It can be understood that, compared with a unified memory, the hardware accelerator in this embodiment allows each computing unit, when executing its corresponding computing task, to obtain the required data, i.e. the data to be processed or the matrices to be multiplied, directly from the feature buffer unit or the weight buffer unit, thereby effectively improving the computing speed.
It should be mentioned that the matrix multiplication calculation unit, according to the different situations of each calculation layer in the neural network model, also selectively completes the residual operation and the activation operation where they exist.
In addition, in the multi-head attention calculation, the matrix needs to be split and combined before and after the matrix calculation, namely, a calculation form of a block matrix is adopted. The hardware accelerator in the embodiment is also suitable for the application scenario of the block matrix calculation, namely, the hardware accelerator in the embodiment is adopted on the basis of the block matrix calculation to realize calculation acceleration.
It should also be mentioned that the external device may be an upper computer or a PCIE device.
It should be mentioned that the hardware accelerator in this embodiment is used to accelerate the computation of neural network models such as the Transformer.
The scheduling core unit is used for receiving and caching starting association data transmitted by external equipment, and the starting association data comprises: memory access parameters, calculation control parameters, quantization parameters (Quantized parameters) and start signals of the respective calculation means; and receiving a start calculation command transmitted by the external device, and controlling and scheduling all the calculation components based on the start association data under the condition of receiving the start calculation command so as to enable the calculation components to execute calculation tasks of each calculation layer of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading data cached in the feature cache unit and the weight cache unit; and receiving task completion signals returned by all the computing components, and prompting the external equipment to read task processing results of all the computing components based on the task completion signals.
Specifically, each computing unit corresponds to a set of access parameters, a computing control parameter, a quantization parameter, and a start signal. The access parameters refer to parameters required by the computing components in the process of executing the computing tasks of the computing layers of the neural network model when performing read-write operation on the feature cache unit or the weight cache unit, and the parameters comprise: an access Address (Address), a data length (i.e., a length of data required to be specified when accessing a cache), an access manner, a transmission rate, and the like. The calculation control parameters are parameters for controlling and guiding the calculation process of each calculation means, the calculation control parameters describing the operation and behavior of the calculation process, including: an opcode (the opcode is used to indicate the type of operation to be performed, the opcode defining a particular operation to be performed by the current computing element, such as addition, multiplication, and logic operations), a data source (i.e., the source location of the input data), and a data destination (the destination location for specifying the output task processing result), etc. The quantization parameter is used for quantitatively converting the size of the parameter in the process of executing the calculation task by each calculation component, thereby being beneficial to reducing the parameter quantity, improving the calculation speed and reducing the storage space and the calculation amount required by calculation. The start signal is used to instruct the computing means to start running, i.e. to start the computation.
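For illustration only, the starting association data held for one computing unit can be modeled as a small record; the field names below are descriptive placeholders and do not correspond to actual register names:

# Illustrative record of the starting association data for one computing unit.
from dataclasses import dataclass

@dataclass
class StartAssociationData:
    # memory access parameters
    address: int = 0            # access address in the feature or weight cache
    length: int = 0             # length of the data to read or write
    # calculation control parameters
    opcode: str = "gemm"        # type of operation the unit should perform
    data_source: str = "URAM1"  # source location of the input data
    data_dest: str = "URAM1"    # destination of the task processing result
    # quantization parameters
    scale: float = 1.0
    zero_point: int = 0
    # start signal
    start: bool = False

gemm_cfg = StartAssociationData(address=0x0, length=4096, opcode="gemm",
                                data_source="URAM1", data_dest="URAM3", start=True)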
By adopting the above hardware accelerator framework, the calculation acceleration of each calculation layer of the neural network model can be well realized. By providing the scheduling core unit, the other components or units in the hardware accelerator can be uniformly controlled and scheduled, so that the offline, efficient calculation of the whole Transformer network is completed. After the external device issues the starting association data, the feature matrix data, the weight matrix data and the start calculation command to the hardware accelerator of this embodiment, the components and units can be uniformly controlled and scheduled through the scheduling core, no further external interaction is needed, the offline processing of the whole Transformer flow is realized, and a feature processing closed loop is formed.
It should be noted that since the data amounts of the quantization parameters used by the matrix multiplication calculating section, the Softmax calculating section and the layer normalization calculating section, the bias parameters, and the weight matrix data are small, they are stored using their respective on-chip RAMs (Random Access Memory), such as the Quan Buf (quantization parameter buffer), Bias Buf (bias parameter buffer), and Weight Buf (weight matrix data buffer) in fig. 1.
The scheduling core unit and the external device communicate through AXI-Lite (Advanced eXtensible Interface Lite, a lightweight AXI bus protocol). The scheduling core unit includes a parameter buffer (i.e. Para Buf in fig. 1) for buffering the starting association data and the like. AxiWrite in fig. 1 represents a write-data operation using the AXI bus protocol, and AxiRead represents a read-data operation using the AXI bus protocol. 512b and 2048b in fig. 1 each represent the size of the transmitted data. In addition, the scheduling core is connected to the matrix multiplication calculating section, the Softmax calculating section, and the layer normalization calculating section, respectively, and each computing component feeds back its running state and task completion signal to the scheduling core unit in real time.
As shown in fig. 1, the input of the matrix multiplication calculating unit includes two to-be-multiplied matrices, which are exemplarily defined as an a matrix and a B matrix in fig. 1, where the a matrix is feature matrix data, obtained from the feature buffer unit, and the B matrix is weight matrix data or feature matrix data, obtained from the weight buffer unit or the feature buffer unit, respectively.
In order to better meet the storage and circulation requirements of the feature matrix data, in some embodiments, the feature cache unit includes multiple groups of shared cache units; the shared buffer unit is used for buffering the feature matrix data transmitted by the external device and intermediate results generated in the process of executing the calculation tasks by the calculation components, wherein the intermediate results refer to task processing results obtained in the process of executing the calculation tasks of the calculation components of the calculation layers before final calculation results of the current feature matrix data are obtained, and the final calculation results refer to results output after the current feature matrix data are processed by the neural network model. It should be noted that, by setting a plurality of groups of shared buffer units, the storage and circulation requirements of the feature matrix data and the intermediate results generated by each computing component can be better satisfied.
In some embodiments, the neural network model comprises, connected in sequence: an input layer (Input), a first layer normalization layer (LN1), a Q generation layer (Q), a K generation layer (K), a V generation layer (V), a dot product attention layer (Q·K^T), a Softmax layer (Softmax), an attention multiplication layer (Att·V), a projection layer (Proj), a second layer normalization layer (LN2), a first fully-connected layer (FC1), and a second fully-connected layer (FC2); the Q generation layer is used for obtaining a Q matrix, the K generation layer is used for obtaining a K matrix, and the V generation layer is used for obtaining a V matrix.
It should be noted that, the matrix multiplication is mainly divided into the following four cases:
1) Y = X·W + b  Formula (1)
2) Y = X·W + b + E  Formula (2)
3) Y = X1·X2  Formula (3)
4) Y = f(X·W + b)  Formula (4)
Wherein X represents a feature matrix, i.e. feature matrix data, W represents a weight matrix, i.e. weight matrix data, b represents a bias parameter, E represents a residual matrix, X1 and X2 represent two different feature matrices, and f(·) represents the activation function. Case 1) corresponds to the calculation manner of the Q generation layer (Q), the K generation layer (K), and the V generation layer (V) in the neural network model. Case 2) corresponds to the calculation manner of the projection layer (Proj) and the second fully connected layer (FC2). Case 3) corresponds to the calculation manner of the dot product attention layer (Q·K^T) and the attention multiplication layer (Att·V). Case 4) corresponds to the calculation manner of the first fully connected layer (FC1). Since the bias parameter b is small in data size, it does not require a large cache, and the weight matrix data uses an independent cache. In the feature buffer unit, the residual matrix occupies 1 group of caches; the other calculations that involve weight matrix data, i.e. cases 1), 2) and 4), require only 1 group of input-feature caches and 1 group of output-feature caches, whereas case 3) requires 2 groups of input-feature caches and 1 group of output-feature caches. Thus, at least 4 groups of feature caches are required for the overall operation of the neural network model: 1 group of residual caches (i.e. the residual matrix corresponds to 1 group of cache demand), 2 groups of input-feature caches (i.e. the feature matrix data corresponds to 2 groups of cache demand), and 1 group of output-feature caches (i.e. the task processing result corresponds to 1 group of cache demand), so that the feature matrix data flow can be realized. In particular, when calculating the dot product attention layer (Q·K^T), the Softmax layer (Softmax), and the attention multiplication layer (Att·V), since there are multiple heads (attention heads) and each head needs to use part of the data in the matrix, the Q, K and V matrices need to be buffered simultaneously; adding the residual matrix and the output result matrix (i.e. the task processing result), four groups of caches would be insufficient. However, the Q and V matrices do not participate in calculation at the same time, so they can be placed in the same group of caches and stored in two areas.
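For illustration only, the four cases can be restated functionally as follows; the concrete activation used in case 4 is not named in this description, so f is left as a generic function:

# Functional restatement of the four matrix-multiplication cases above.
import numpy as np

def case1(X, W, b):              # Q / K / V generation
    return X @ W + b

def case2(X, W, b, E):           # projection and FC2: the residual matrix E is added
    return X @ W + b + E

def case3(X1, X2):               # Q x K^T and Att x V: feature times feature
    return X1 @ X2

def case4(X, W, b, f=np.tanh):   # FC1: matrix multiply, bias, then an activation f
    return f(X @ W + b)

X = np.ones((2, 3)); W = np.ones((3, 4)); b = np.zeros(4); E = np.ones((2, 4))
print(case2(X, W, b, E))         # each element equals 4.0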
On this basis, this embodiment proposes that the plurality of groups of shared cache units comprise: a first shared cache unit (URAM1), a second shared cache unit (URAM2), a third shared cache unit (URAM3) and a fourth shared cache unit (URAM4), wherein the cache depth of the first shared cache unit is the scale of one group of target feature matrix data, the target feature matrix data refers to feature matrix data output by each target calculation layer of the neural network model, and a target calculation layer refers to any one of the first layer normalization layer, the Q generation layer, the K generation layer, the V generation layer, the dot product attention layer, the Softmax layer, the attention multiplication layer, the projection layer, the second layer normalization layer and the second fully connected layer; the cache depth of the second shared cache unit is the same as that of the first shared cache unit, and the cache depth of the third shared cache unit is 2 times the cache depth of the first or second shared cache unit so as to store the Q matrix and the V matrix simultaneously; the depth of the fourth shared cache unit is 4 times the depth of the first or second shared cache unit, and the fourth shared cache unit is used for storing the data to be cached during the first fully connected layer's task processing.
It should be noted that, the first shared cache unit, the second shared cache unit, the third shared cache unit, and the fourth shared cache unit are all general memories, which are not limited to storing a specific type of data, and all types of data to be stored may share the storage spaces of the first shared cache unit, the second shared cache unit, the third shared cache unit, and the fourth shared cache unit. In the process of performing calculation task processing by each calculation component, the stored data in the first shared cache unit, the second shared cache unit, the third shared cache unit and the fourth shared cache unit can dynamically change along with the calculation process. In this embodiment, by setting the first shared cache unit, the second shared cache unit, the third shared cache unit, and the fourth shared cache unit, and setting a corresponding depth for each shared cache unit, the data storage requirement can be met more accurately, the storage cost is reduced, and the overall load of the hardware accelerator is reduced.
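For illustration only, the 1x/1x/2x/4x depth ratios described above can be turned into a back-of-the-envelope sizing; the sequence length and embedding width below are hypothetical example values and are not taken from this patent:

# Example sizing of the four shared cache units from the stated depth ratios.
SEQ_LEN, EMB_DIM = 384, 768                 # hypothetical Block dimensions
one_feature_matrix = SEQ_LEN * EMB_DIM      # elements in one target feature matrix

depths = {
    "URAM1": 1 * one_feature_matrix,        # current feature matrix
    "URAM2": 1 * one_feature_matrix,        # residual backup
    "URAM3": 2 * one_feature_matrix,        # holds the Q and V matrices side by side
    "URAM4": 4 * one_feature_matrix,        # data buffered during the FC1 task
}
for name, depth in depths.items():
    print(name, depth, "elements")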
The following is an exemplary description of the feature buffer locations of a single encoder block in a neural network model, please refer to table 1 below:
TABLE 1 feature buffer locations for a single encoder Block
Where C is the number of blocks in the row direction of the block matrix (i.e. the matrix to be multiplied or the matrix to be processed), R is the number of rows of matrix blocks, and H is the number of attention heads. ① denotes the first shared cache unit, ② the second shared cache unit, ③ the third shared cache unit, ④ the fourth shared cache unit, ③up the upper half of the third shared cache unit, and ③down the lower half of the third shared cache unit. The Weight cache denotes the weight cache unit. In addition, since the calculations after number 1 require the residual matrix, the calculation result, that is, the task processing result, is stored in the first shared cache unit and also backed up in the second shared cache unit; the task processing result stored in the second shared cache unit is then used by the subsequent matrix multiplication calculation unit to compute the corresponding residual matrix, and the computed residual matrix is stored in the feature cache unit for subsequent calculation. The task processing results of numbers 9 and 12 are handled in the same way.
It should be mentioned that every Block in the Transformer network, or neural network model, is identical. According to the data buffering requirement, in this embodiment the Q matrix and the V matrix are stored in ③, that is, the third shared cache unit.
Specifically, when executing the computing task of the input layer, the first shared buffer unit is used for buffering the currently input feature matrix data, and the second shared buffer unit is used for carrying out residual backup on the currently input feature matrix data.
When executing the calculation task of the first layer normalization layer, the first shared buffer unit is a data source for executing the calculation task for the layer normalization calculation unit, and the task processing result of the layer normalization calculation unit is stored in the first shared buffer unit.
When executing the computing task of the Q generation layer, the first shared buffer unit is one of the data sources of the computing task executed by the matrix multiplication computing unit, the weight buffer unit is the second data source of the computing task executed by the matrix multiplication computing unit, the task processing result of the matrix multiplication computing unit is stored in the first buffer space of the third shared buffer unit, the third shared buffer unit comprises the first buffer space and the second buffer space, and the depths of the first buffer space and the second buffer space are the same.
When executing the calculation task of the K generation layer, the first shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the weight buffer unit is the second data source of the calculation task executed by the matrix multiplication calculation unit, and the task processing result of the matrix multiplication calculation unit is stored in the fourth shared buffer unit.
When executing the calculation task of the V generation layer, the first shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the weight buffer unit is the second data source of the calculation task executed by the matrix multiplication calculation unit, and the task processing result of the matrix multiplication calculation unit is stored in the second buffer space.
When the computing task of the dot product attention layer is executed, the first buffer space is one of the data sources of the computing task executed by the matrix multiplication computing component, the fourth shared buffer unit is the second data source of the computing task executed by the matrix multiplication computing component, and the task processing result of the matrix multiplication computing component is stored in the first shared buffer unit.
When executing the calculation task of the Softmax layer, the first shared cache unit is a data source for executing the calculation task by the Softmax calculation unit, and the task processing result of the Softmax calculation unit is stored in the first shared cache unit.
When executing the calculation task of the attention multiplication layer, the first shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the second buffer space is the second data source of the calculation task executed by the matrix multiplication calculation unit, and the task processing result of the matrix multiplication calculation unit is stored in the fourth shared buffer unit.
When executing the calculation task of the projection layer, the fourth shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the weight buffer unit is the second data source of the calculation task executed by the matrix multiplication calculation unit, the second shared buffer unit is the third data source of the calculation task executed by the matrix multiplication calculation unit, the task processing result of the matrix multiplication calculation unit is stored in the first shared buffer unit, and the third shared buffer unit is used for carrying out residual backup on the task processing result of the matrix multiplication calculation unit.
When executing the calculation task of the second layer normalization layer, the first shared buffer unit is a data source for executing the calculation task for the layer normalization calculation unit, and the task processing result of the layer normalization calculation unit is stored in the first shared buffer unit.
When executing the computing task of the first full connection layer, the first shared buffer unit is one of the data sources of the computing task executed by the matrix multiplication computing unit, the weight buffer unit is the second data source of the computing task executed by the matrix multiplication computing unit, and the task processing result of the matrix multiplication computing unit is stored in the fourth shared buffer unit.
When executing the computing task of the second full connection layer, the fourth shared buffer unit is one of the data sources of the computing task executed by the matrix multiplication computing unit, the weight buffer unit is the second data source of the computing task executed by the matrix multiplication computing unit, the third shared buffer unit is the third data source of the computing task executed by the matrix multiplication computing unit, the task processing result of the matrix multiplication computing unit is stored in the first shared buffer unit, and the second shared buffer unit is used for carrying out residual backup on the task processing result of the matrix multiplication computing unit.
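For readability, the buffer routing just described can be summarized as a lookup table. The following Python sketch is illustrative only; the shorthand names URAM1 to URAM4 (with URAM3_up and URAM3_down for the first and second buffer spaces of the third shared cache unit) and weight_buf are assumptions introduced here for clarity, not identifiers defined by the embodiment.

# Illustrative routing table for one encoder Block, assuming the shorthand
# buffer names URAM1..URAM4; URAM3_up / URAM3_down denote the first and
# second buffer spaces of the third shared cache unit.
BUFFER_ROUTING = {
    # layer:              (data sources,                      destination,   residual backup)
    "input":              (["external"],                      "URAM1",       "URAM2"),
    "layer_norm_1":       (["URAM1"],                         "URAM1",       None),
    "Q_generation":       (["URAM1", "weight_buf"],           "URAM3_up",    None),
    "K_generation":       (["URAM1", "weight_buf"],           "URAM4",       None),
    "V_generation":       (["URAM1", "weight_buf"],           "URAM3_down",  None),
    "dot_product_attn":   (["URAM3_up", "URAM4"],             "URAM1",       None),
    "softmax":            (["URAM1"],                         "URAM1",       None),
    "attention_multiply": (["URAM1", "URAM3_down"],           "URAM4",       None),
    "projection":         (["URAM4", "weight_buf", "URAM2"],  "URAM1",       "URAM3"),
    "layer_norm_2":       (["URAM1"],                         "URAM1",       None),
    "fully_connected_1":  (["URAM1", "weight_buf"],           "URAM4",       None),
    "fully_connected_2":  (["URAM4", "weight_buf", "URAM3"],  "URAM1",       "URAM2"),
}

def lookup(layer):
    """Return (sources, destination, residual_backup) for a layer name."""
    return BUFFER_ROUTING[layer]

print(lookup("projection"))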
In some embodiments, when writing or storing data to a target shared storage, the original data in the target shared storage is replaced with current data to be stored, where the target shared storage is any one of the first shared cache unit, the second shared cache unit, the third shared cache unit, and the fourth shared cache unit. By performing the data replacement operation, the dynamic cache of the feature cache unit can be realized, the flexibility is high, and the cache load is reduced.
Since the computation of each computation layer of the neural network model needs to be performed continuously, in order to avoid inter-layer waiting, in some embodiments the weight buffering unit includes: a first buffer subunit (URAM in fig. 1) and a second buffer subunit (URAM in fig. 1), where the first buffer subunit and the second buffer subunit are used to ping-pong buffer the weight matrix data to be used by each calculation layer of the neural network model, and the data source of the weight matrix data is the external device. Specifically, ping-pong buffering stores the weight matrix data to be used by each calculation layer alternately, which helps accelerate the calculation, greatly improves resource utilization, improves the stability of the accelerator, and effectively avoids waiting for data loading between layers. It should be noted that the ping-pong buffer is implemented within a single cache by splitting its depth, which reduces the use of cache resources; it will be appreciated that, because a single URAM unit has a large capacity, directly using two separate units would cause considerable waste.
In some embodiments, the buffer depths of the first buffer subunit and the second buffer subunit are each the size of the input weight matrix data of the first fully-connected layer or the second fully-connected layer.
In some embodiments, all of the weight matrix data comprises: first, second, third, fourth, fifth and sixth matrix data, the first matrix data being weight data of a Q-generation layer of the neural network model, the second matrix data being weight matrix data of a K-generation layer of the neural network model, the third matrix data being weight matrix data of a V-generation layer of the neural network model, the fourth matrix data being weight matrix data of a projection layer of the neural network model, the fifth matrix data being weight matrix data of a first fully-connected layer of the neural network model, the sixth matrix data being weight matrix data of a second fully-connected layer of the neural network model;
When executing a calculation task of an input layer of the neural network model, the first buffer subunit is configured to preload the first matrix data, the second matrix data, the third matrix data, and the fourth matrix data; the second cache subunit preloads the fifth matrix data; the weight data preloaded in the first cache subunit and the second cache subunit are used for reading or loading the weight data by each computing component when the computing components perform computing task processing;
The first buffer subunit is configured to preload the sixth matrix data when performing a computation task of a projection layer of the neural network model;
When executing the calculation task of the first full connection layer of the neural network model, the second buffer subunit is configured to preload the first matrix data, the second matrix data, the third matrix data, and the fourth matrix data;
The first cache subunit is configured to preload the fifth matrix data when performing a computational task of a second fully-connected layer of the neural network model.
For an example of the loading timing of the weight (Weight) matrix data, please refer to table 2 below:
TABLE 2 load timing examples of Weight (Weight) data or Weight matrix data
As shown in table 2, ping denotes the first buffer subunit of the weight buffer unit and pong denotes the second buffer subunit. W_Q denotes the first matrix data, i.e., the weight matrix data of the Q generation layer; W_K denotes the second matrix data, i.e., the weight matrix data of the K generation layer; W_V denotes the third matrix data, i.e., the weight matrix data of the V generation layer; W_proj denotes the fourth matrix data, i.e., the weight matrix data of the projection layer; W_FC1 denotes the fifth matrix data, i.e., the weight matrix data of the first full connection layer; and W_FC2 denotes the sixth matrix data, i.e., the weight matrix data of the second full connection layer. ping(W_Q, W_K, W_V, W_proj) indicates that the first buffer subunit is loaded with the four weight matrices W_Q, W_K, W_V and W_proj; pong(W_FC1) indicates that the second buffer subunit is loaded with W_FC1; the remaining ping(...) and pong(...) entries are interpreted in the same way. Since the weight matrix data required by the first and second full connection layers is 4 times the size of that of the other calculation layers, the buffer depths of the first and second buffer subunits in the above embodiment are set to the size of the input weight matrix data of the first or second full connection layer. To avoid data-loading waits between different computing layers, the above embodiment adopts a ping-pong (alternating) buffering mode for the first and second buffer subunits and loads the weight matrix data to be used in advance, so that no layer waits for data.
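The ping-pong preloading sequence reflected in table 2 can be sketched as follows. The schedule entries mirror the loading points described above (input layer, projection layer, first and second full connection layers); the function and variable names are assumptions introduced for illustration.

# A minimal sketch of the ping-pong weight preloading described above.
PRELOAD_SCHEDULE = [
    # (layer whose computation overlaps the load, target half, weights loaded)
    ("input",             "ping", ["W_Q", "W_K", "W_V", "W_proj"]),
    ("input",             "pong", ["W_FC1"]),
    ("projection",        "ping", ["W_FC2"]),
    ("fully_connected_1", "pong", ["W_Q", "W_K", "W_V", "W_proj"]),
    ("fully_connected_2", "ping", ["W_FC1"]),
]

def preload(weight_buffer, current_layer):
    """Load, into the idle half, the weights needed by upcoming layers."""
    for layer, half, weights in PRELOAD_SCHEDULE:
        if layer == current_layer:
            weight_buffer[half] = list(weights)   # overwrite the idle half

weight_buffer = {"ping": [], "pong": []}
preload(weight_buffer, "input")
print(weight_buffer)   # ping holds W_Q..W_proj, pong holds W_FC1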
In some embodiments, the scheduling core unit is an offline operation unit, and the scheduling core unit is specifically configured to, when receiving the start-up calculation command, issue, to the corresponding calculation unit, a corresponding parameter of each calculation unit in the start-up association data, so that the calculation unit performs a calculation task of each calculation layer of the neural network model.
The matrix multiplication calculation component is configured to execute a calculation task of the calculation layer in the neural network model, where matrix multiplication calculation is required, under the condition that corresponding data issued by the scheduling core unit is received, and an execution process includes: loading a first to-be-multiplied matrix from the feature buffer unit, loading a second to-be-multiplied matrix from the feature buffer unit or the weight buffer unit, and performing matrix multiplication operation based on the first to-be-multiplied matrix and the second to-be-multiplied matrix to obtain a matrix multiplication result; and under the condition that the current matrix multiplication result is an intermediate result, caching the current matrix multiplication result to the feature caching unit so that the Softmax computing component and the layer normalization computing component obtain data to be computed from the feature caching unit when executing computing tasks, wherein the intermediate result refers to task processing results obtained by the computing component in the process of executing computing tasks of each computing layer before obtaining final computing results of the current feature matrix data, and the final computing results refer to results output after the current feature matrix data are processed by a neural network model. By the method, the dispatching and control of each computing component by the dispatching core unit can be better realized, and the computing speed is accelerated. It can be understood that the matrix multiplication result is the task processing result of the current stage.
In some embodiments, when the current matrix multiplication result is the final calculation result of the current feature matrix data, the current matrix multiplication result is cached to the feature caching unit, and the current matrix multiplication result is transmitted to a preset block random access memory (i.e. axi_ram in fig. 1) for storage; transmitting a calculation completion signal to the dispatching core unit, wherein the calculation completion signal is used for indicating the dispatching core unit to prompt the external equipment to read task processing results of all the calculation components; the block random access memory is used for feeding back all the task processing results to the external equipment under the condition that the reading command of the external equipment is received. It should be noted that, when the scheduling core unit receives the calculation completion signal, the external device is prompted to read the task processing results of all the calculation components in an interrupt manner. By the method, the buffer storage of the task processing results including the matrix multiplication results can be well realized, and the dynamic closed loop of task processing, result buffer storage and result reading is realized.
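A minimal sketch of this write-back decision is shown below; the stub class and the method names are hypothetical and only illustrate the intermediate-versus-final handling described above.

class SchedulerStub:
    """Stand-in for the scheduling core unit."""
    def notify_completion(self):
        print("interrupt: results ready for host read-out")

def write_back(result, is_final, feature_cache, block_ram, scheduler):
    feature_cache.append(result)       # intermediate results stay in the feature cache
    if is_final:
        block_ram.append(result)       # final results are also written to the block RAM
        scheduler.notify_completion()  # the host is then prompted via interrupt

feature_cache, block_ram = [], []
write_back(1.0, False, feature_cache, block_ram, SchedulerStub())
write_back(2.0, True, feature_cache, block_ram, SchedulerStub())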
In some embodiments, the scheduling core unit is further configured to receive bias parameters of each of the computing layers transmitted by the external device; and respectively issuing all the bias parameters to the matrix multiplication calculation part and the layer normalization calculation part.
The matrix multiplication calculation component and the layer normalization calculation component are both used for carrying out on-chip caching on the bias parameters.
The scheduling core unit is further configured to send bias parameter marking information corresponding to a current computing task to the matrix multiplication computing unit or the layer normalization computing unit in a task execution process, where the bias parameter marking information is used to instruct the matrix multiplication computing unit or the layer normalization computing unit to determine a target bias parameter from all bias parameters cached on a self chip, and the target bias parameter is used to complete execution of the current computing task. Since the data size of the offset parameter is small, the offset parameter is issued to the matrix multiplication calculation section and the layer normalization calculation section, which need to use the offset parameter, through the scheduling core. The matrix multiplication computation unit and the layer normalization computation unit use respective on-chip RAMs to cache Bias parameters, namely Bias Buf shown in fig. 1. When the matrix multiplication computing component and the layer normalization computing component execute corresponding computing tasks, the scheduling core transmits bias parameter marking information corresponding to the current computing task to the matrix multiplication computing component or the layer normalization computing component, so that the matrix multiplication computing component or the layer normalization computing component finds required bias parameters from all bias parameters cached on a chip of the matrix multiplication computing component or the layer normalization computing component, and the calculation is completed, so that the flexibility is high.
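The bias handling described above can be sketched roughly as follows; the class name, the marker strings, and the bias values are illustrative assumptions rather than interfaces defined by the embodiment.

class ComputeUnitBiasCache:
    """On-chip bias cache of a matrix multiplication or layer normalization unit."""
    def __init__(self):
        self.bias_table = {}

    def load_all(self, biases):
        """One-time download of every layer's bias from the scheduling core."""
        self.bias_table = dict(biases)

    def select(self, bias_marker):
        """Pick the bias needed by the current task using the issued marker."""
        return self.bias_table[bias_marker]

unit = ComputeUnitBiasCache()
unit.load_all({"Q_gen": [0.1, 0.2], "FC1": [0.0, 0.3]})
print(unit.select("Q_gen"))   # marker issued together with the current task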
In some embodiments, in a case where the matrix multiplication calculating means performs a calculation task of any one of the Q generation layer, the K generation layer, the V generation layer, and the first fully connected layer of the neural network model, inputs of the matrix multiplication calculating means are feature matrix data and weight matrix data; the step of performing the computing task by the matrix multiplication computing component includes: determining the feature matrix data as a first to-be-multiplied matrix, and determining the weight matrix data as a second to-be-multiplied matrix; traversing each column of the second matrix to be multiplied for a first row of matrix blocks of the first matrix to be multiplied to obtain a product between the first row of matrix blocks and the corresponding column of the second matrix to be multiplied; traversing each column of the second matrix to be multiplied for the next row of matrix blocks of the first matrix to be multiplied until the product between the last row of matrix blocks of the first matrix to be multiplied and the last column of matrix blocks of the second matrix to be multiplied is obtained, so that a task processing result of the current computing task is obtained; the data source of the feature matrix data is the feature cache unit, and the data source of the weight matrix data is the weight cache unit.
For example, referring to fig. 2, if the inputs of the matrix multiplication calculating unit are the feature matrix data and the weight matrix data, assume that the feature matrix data is the A matrix, which serves as the first to-be-multiplied matrix, and the weight matrix data is the B matrix, which serves as the second to-be-multiplied matrix. For the first row of matrix blocks of the first to-be-multiplied matrix, each column of the second to-be-multiplied matrix is traversed to obtain the product between that first row of matrix blocks and the corresponding columns of the second to-be-multiplied matrix (i.e., the first row of the A matrix is read repeatedly, and on each read it is traversed across all columns of the B matrix); then, for the next row of matrix blocks of the first to-be-multiplied matrix, each column of the second to-be-multiplied matrix is traversed (i.e., the second row of the A matrix is read repeatedly and traversed across all columns of the B matrix, and likewise for the third row, the fourth row, and so on), until the product between the last row of matrix blocks of the first to-be-multiplied matrix and the last column of matrix blocks of the second to-be-multiplied matrix is obtained, thereby obtaining the task processing result of the current calculation task. In this calculation process, the order of the elements within a matrix block does not need to be adjusted.
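A rough NumPy sketch of this traversal order is given below, assuming square matrix blocks of size block that evenly divide the matrix dimensions; it only illustrates the row-of-A-repeated, all-columns-of-B ordering, not the hardware implementation.

import numpy as np

def blocked_matmul(A, B, block):
    """Traverse one row of A blocks at a time across all columns of B blocks."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, block):              # one row of A blocks, read repeatedly
        for j in range(0, N, block):          # traverse every column of B blocks
            acc = np.zeros((block, block))
            for k in range(0, K, block):      # accumulate along the shared dimension
                acc += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
            C[i:i+block, j:j+block] = acc
    return C

A = np.arange(16, dtype=float).reshape(4, 4)
B = np.eye(4)
assert np.allclose(blocked_matmul(A, B, 2), A @ B)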
In some embodiments, in the case where the matrix multiplication computation section performs a computation task of any one of a projection layer and a second fully-connected layer of the neural network model, the inputs of the matrix multiplication computation section are feature matrix data, weight matrix data, and a residual matrix; the step of performing the computing task by the matrix multiplication computing component includes: determining the feature matrix data as a first to-be-multiplied matrix, and determining the weight matrix data as a second to-be-multiplied matrix; multiplying the first to-be-multiplied matrix with the second to-be-multiplied matrix, and adding the obtained product with the residual matrix to obtain a task processing result of the current computing task; the data source of the feature matrix data is the feature cache unit, the data source of the weight matrix data is the weight cache unit, and the data source of the residual matrix is the feature cache unit.
For example, referring to fig. 3, in the case where the inputs of the matrix multiplication calculating unit are the feature matrix data, the weight matrix data and the residual matrix, assume that the A matrix is the first to-be-multiplied matrix, the B matrix is the second to-be-multiplied matrix, and the C matrix is the residual matrix; the task processing result of the current calculation task is obtained by multiplying the A matrix with the B matrix and adding the product to the C matrix. In the calculation process, the reading order of each matrix is shown in fig. 3: first, a row of matrix blocks of the A matrix is read and the corresponding matrix blocks of the B matrix are read for the product operation, and then a matrix block of the residual matrix is read in row order and added to the current product matrix. Next, the same row of the A matrix is read repeatedly to traverse all blocks of the B matrix, while the residual matrix blocks corresponding to that row are read in turn. The A matrix then moves on to the next row, the B matrix is traversed completely, and the residual matrix blocks for that next row are loaded in turn, and so on until all computations are completed (i.e., row 1 of the A matrix in fig. 3 is repeated, all columns of the B matrix are traversed and the C matrix blocks for the first row are loaded in turn, and so on). In the calculation process, the order of the elements within a matrix block does not need to be adjusted.
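The same traversal with the residual addition (used by the projection and second full connection layers) can be sketched as follows; block size and shapes are illustrative assumptions.

import numpy as np

def blocked_matmul_residual(A, B, C_res, block):
    """Blocked A x B where the residual block for the current position is also loaded and added."""
    M, K = A.shape
    _, N = B.shape
    out = np.zeros((M, N))
    for i in range(0, M, block):
        for j in range(0, N, block):
            acc = C_res[i:i+block, j:j+block].copy()   # residual block read in row order
            for k in range(0, K, block):
                acc += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
            out[i:i+block, j:j+block] = acc
    return out

A = np.random.rand(4, 4); B = np.random.rand(4, 4); R = np.random.rand(4, 4)
assert np.allclose(blocked_matmul_residual(A, B, R, 2), A @ B + R)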
In some embodiments, in the case where the matrix multiplication computation means performs a computation task of a dot product attention layer of the neural network model, the inputs of the matrix multiplication computation means are a Q matrix and a K matrix, the step of the matrix multiplication computation means performing the computation task includes: partitioning the Q matrix to obtain a plurality of Q submatrices; the K matrixes are subjected to blocking operation to obtain a plurality of K submatrices, the number of the Q submatrices is the same as that of the K submatrices, and the Q submatrices are in one-to-one correspondence with the K submatrices; multiplying each Q submatrix with the corresponding K submatrix respectively to obtain a task processing result of the current computing task; and the data sources of the Q matrix and the K matrix are the characteristic cache units.
For example, please refer to fig. 4: in the case where the inputs of the matrix multiplication computation section are the Q matrix and the K matrix, Q×K^T needs to be computed. Assume that the Q matrix and the K matrix are both 4×6 matrices; by performing the blocking operation on the Q matrix and the K matrix respectively, 3 Q sub-matrices and 3 K sub-matrices are obtained (as shown in fig. 4). Each Q sub-matrix is multiplied with its corresponding K sub-matrix to obtain the corresponding result matrix shown in fig. 4. The starting address of each sub-matrix, i.e., of each head, is different when multiplied, but the multiplication itself proceeds in the same way as the matrix multiplication in the above embodiments: row 1 of the A matrix is repeated while all columns of the B matrix are traversed, then row 2 of the A matrix is repeated while all columns of the B matrix are traversed, and so on until the calculation is completed. For each head, the current Q sub-matrix serves as the A matrix and the corresponding K sub-matrix serves as the B matrix. It should be noted that, since the matrix blocks of the Q matrix and the K matrix are stored in the feature cache unit in the same order, the K matrix in Q×K^T does not need to be transposed and can be accessed sequentially.
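A small sketch of the per-head Q×K^T computation under these assumptions (4×6 matrices, 3 heads) is given below. The .T in the code only expresses the mathematical operation; in the accelerator the identical storage order of the Q and K blocks makes this a sequential read rather than an explicit transpose.

import numpy as np

def per_head_qkt(Q, K, num_heads):
    """Each head occupies a slice of the hidden dimension, i.e. a different start address."""
    seq, hidden = Q.shape
    head_dim = hidden // num_heads
    scores = []
    for h in range(num_heads):
        q_h = Q[:, h*head_dim:(h+1)*head_dim]   # A matrix of this head
        k_h = K[:, h*head_dim:(h+1)*head_dim]   # B matrix of this head
        scores.append(q_h @ k_h.T)              # one (seq x seq) score block per head
    return scores

Q = np.random.rand(4, 6); K = np.random.rand(4, 6)
out = per_head_qkt(Q, K, num_heads=3)
assert len(out) == 3 and out[0].shape == (4, 4)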
In some embodiments, in the case where the matrix multiplication computation section performs a computation task of an attention multiplier layer of the neural network model, inputs of the matrix multiplication computation section are attention score matrix data and a V matrix, the step of the matrix multiplication computation section performing the computation task includes: performing partitioning operation on the attention score matrix data to obtain a plurality of attention score sub-matrices; the V matrix is subjected to blocking operation to obtain a plurality of V sub-matrices, the number of the attention score sub-matrices is the same as that of the V sub-matrices, and the attention score sub-matrices are in one-to-one correspondence with the V sub-matrices; obtaining a task processing result of the current computing task by performing matrix multiplication on the attention score sub-matrix and the transposed V sub-matrix, wherein performing matrix multiplication on the attention score sub-matrix and the transposed V sub-matrix comprises: the method comprises an inter-block transposition step and an intra-block transposition step, wherein the inter-block transposition refers to obtaining a V sub-matrix corresponding to the attention score sub-matrix in a jumping address mode in the matrix multiplication process, and the intra-block transposition refers to completing transposition of elements in each V sub-matrix block by adjusting the positions of bus elements; and the attention score matrix data and the data sources of the V matrix are the characteristic cache units. The bus refers to a connection among each computing unit, each cache unit and the scheduling core unit, and each unit in the hardware accelerator in the embodiment performs data transmission through the bus.
For example, referring to fig. 5, in the case where the inputs of the matrix multiplication calculating section are the attention score matrix data (Att matrix) and the V matrix, the attention score matrix data is determined as the A matrix and the V matrix is determined as the B matrix. The A matrix is first partitioned to obtain a plurality of attention score sub-matrices, and the B matrix is partitioned to obtain a plurality of V sub-matrices (both shown in fig. 5). Since the dimensions of the A matrix and the B matrix do not correspond, the matrix multiplication cannot be performed directly and the B matrix must first be transposed. The conventional method is to transpose the whole B matrix, as shown on the right side of fig. 5, and then perform the corresponding matrix operation; however, this introduces additional intermediate buffering and time overhead. Therefore, in the above embodiment, the inter-block transpose is implemented by address jumping: the transposed B matrix data block corresponding to each row of the A matrix is acquired in turn by jumping addresses, and at the same time the elements inside each B matrix block are rearranged in rows and columns in the manner shown in fig. 5, i.e., the positions of the bus elements are adjusted to implement the intra-block transpose. In this way the matrix multiplication between matrices of mismatched dimensions is completed and the result matrices shown in fig. 5 are obtained. It should be noted that, after the jump-address reads and the row-column adjustment of the B matrix blocks, the traversal is the same as in the above embodiments: row 1 of the A matrix is repeated while all columns of the B matrix are traversed, then row 2 of the A matrix is repeated, and so on. Because one matrix block corresponds to one clock cycle of the data bus bit width in the hardware accelerator, the position adjustment of the elements within a matrix block is achieved by adjusting the bit positions on the bus. This avoids the intermediate buffering and time overhead caused by an additional matrix transposition, simplifies the operation, and improves the calculation efficiency.
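The transpose-free access pattern can be sketched as follows. The block size and matrix shapes are illustrative assumptions, and the "address jump" and intra-block swap are expressed here as ordinary index selection and a NumPy transpose, whereas on the accelerator they happen through address generation and bus bit reordering.

import numpy as np

def attn_times_v_blockwise(Att, V, block):
    """Blocked Att x V^T without materializing a transposed copy of V."""
    M, K = Att.shape
    N = V.shape[0]                # V is stored as (N, K); the result is (M, N)
    out = np.zeros((M, N))
    kb, nb = K // block, N // block
    for i in range(0, M, block):
        for j in range(nb):                         # traverse the result columns
            acc = np.zeros((block, block))
            for k in range(kb):
                a_blk = Att[i:i+block, k*block:(k+1)*block]
                # inter-block transpose: jump to V block (j, k) instead of (k, j)
                v_blk = V[j*block:(j+1)*block, k*block:(k+1)*block]
                # intra-block transpose: rearrange element positions inside the block
                acc += a_blk @ v_blk.T
            out[i:i+block, j*block:(j+1)*block] = acc
    return out

Att = np.random.rand(4, 6); V = np.random.rand(6, 6)
assert np.allclose(attn_times_v_blockwise(Att, V, 2), Att @ V.T)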
In some embodiments, the hardware accelerator further comprises: the memory unit is used for receiving the weight matrix data transmitted by the external equipment, loading the weight matrix data to the weight caching unit, and performing ping-pong caching on the weight matrix data by the weight caching unit. It should be noted that the memory unit in this embodiment is DDR4 (a type of computer memory) which provides higher bandwidth and lower power consumption.
In some embodiments, the memory unit is further configured to receive feature matrix data transmitted by the external device, and load the feature matrix data into the feature cache unit. It should be noted that, by setting the memory unit and transmitting the feature matrix data to the corresponding buffer unit by using the memory unit, the control of the feature buffer unit and the weight buffer unit can be conveniently realized.
In some embodiments, the hardware accelerator further comprises: a multiplexer (i.e., MUX in fig. 1), whose input is respectively connected to the scheduling core unit, the matrix multiplication calculating unit, the Softmax calculating unit, and the layer normalization calculating unit, so as to receive a result write-back control signal sent by the scheduling core unit, and the task processing results sent by the matrix multiplication calculating unit, the Softmax calculating unit, and the layer normalization calculating unit; the result write-back control signal is used for indicating the multiplexer to write all the task processing results into a preset block random access memory, and the block random access memory is used for feeding all the task processing results back to the external equipment under the condition that a read command of the external equipment is received. By using the multiplexer, it is possible to collect the task processing results of the respective computing units. It should be mentioned that the multiplexer is connected to the scheduling core unit, and the scheduling core unit performs result writing back control on the multiplexer to control the multiplexer to write back the task processing result of each computing unit to the block random access memory.
In some embodiments, the hardware accelerator further comprises: a queue unit, and the multiplexer writes the task processing result into the block random access memory through the queue unit. Specifically, as shown in fig. 1, by setting two FIFO queues, FIFO1 and FIFO2, a plurality of task processing results are sequentially transmitted to the block random access memory (axi_ram), realizing the ordered transmission of data and maintaining the stability of the data transmission process.
The present embodiment also provides a server, including: a hardware accelerator as described in any of the above embodiments, configured to accelerate the computation of the neural network model.
The scheduling method of the hardware accelerator provided by the invention is described below, and the scheduling method of the hardware accelerator described below and the hardware accelerator described above can be referred to correspondingly.
Referring to fig. 6, the scheduling method of the hardware accelerator provided in this embodiment includes:
S610: receiving and caching starting association data transmitted by external equipment, wherein the starting association data comprises the following steps: memory access parameters, calculation control parameters, quantization parameters and start signals of the calculation components.
S620: and receiving a starting calculation command transmitted by the external equipment, and controlling and scheduling all the calculation components based on the starting associated data under the condition of receiving the starting calculation command so as to enable the calculation components to execute calculation tasks of each calculation layer of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading data cached in the characteristic caching unit and the weight caching unit.
S630: and receiving task completion signals returned by all the computing components, and prompting the external equipment to read task processing results of all the computing components based on the task completion signals. The scheduling method of the hardware accelerator in the embodiment can well achieve the calculation acceleration of each calculation layer of the neural network model, and achieve efficient data storage and scheduling, and has high feasibility.
It should be noted that, in this embodiment, the scheduling core unit adopts an offline scheduling manner, the external device loads a set of feature matrix data, and sends a calculation starting command to the scheduling core unit, and the scheduling core unit controls each calculation component to sequentially complete the calculation tasks of each calculation layer, and then notifies the external device through an interrupt. After the external device reads the task processing result, the next group of feature matrix data is loaded, a starting calculation command is sent out, and the process is circulated until all feature matrix data calculation is completed. The calculation process of the single cycle is controlled by the scheduling core unit according to the control parameters loaded in advance, and external equipment is not needed to participate, so that the processing efficiency is improved, and the interaction time delay is avoided.
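A minimal host-side sketch of this offline scheduling loop is given below; the stub class stands in for the real device and its method names are hypothetical, serving only to illustrate the interaction pattern described above.

class AcceleratorStub:
    """Stand-in for the hardware accelerator so the loop below can run as-is."""
    def load_features(self, features): self._features = features
    def start(self): self._result = sum(self._features)   # placeholder "computation"
    def wait_interrupt(self): pass                         # the real device raises an interrupt
    def read_results(self): return self._result

def run_all(feature_batches, accelerator):
    results = []
    for features in feature_batches:
        accelerator.load_features(features)   # load one group of feature matrix data
        accelerator.start()                   # issue the start-calculation command
        accelerator.wait_interrupt()          # scheduling core signals completion
        results.append(accelerator.read_results())
        # the next group is loaded only after the previous results are read out
    return results

print(run_all([[1, 2], [3, 4]], AcceleratorStub()))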
In some embodiments, in the event that the buffering of the startup association data is completed, issuing corresponding parameters of each of the computing components in the startup association data to the respective computing components; and each computing component caches the issued parameters on-chip under the condition of receiving the issued parameters.
In some embodiments, when the start-up calculation command is received, the step of controlling and scheduling all the calculation components based on the start-up association data, so that the calculation components perform the calculation tasks of each calculation layer of the neural network model, includes the following operations (a condensed sketch of this flow is given after the list):
Under the condition that the starting calculation command is received, controlling the weight caching unit to preload weight matrix data, controlling the layer normalization calculation component to load feature matrix data from the feature caching unit, loading weight matrix data from the weight caching unit, and executing the calculation task of the first layer normalization layer of the neural network model based on the loaded feature matrix data and weight matrix data; the pre-loading of the weight matrix data refers to pre-loading the weight matrix data to be used by each computing component for each computing component to read or load;
controlling the matrix multiplication calculation component to sequentially execute the calculation tasks of the Q generation layer, the K generation layer, the V generation layer and the dot product attention layer of the neural network model under the condition that the calculation tasks of the first normalization layer are executed;
controlling the Softmax computing component to execute a computing task of a Softmax layer of the neural network model;
Controlling the matrix multiplication calculating part to execute the calculation task of the attention multiplication layer of the neural network model;
Under the condition that the attention heads of the current Block of the neural network model are not completely calculated, re-executing the calculation tasks of the dot product attention layer, the Softmax layer and the attention multiplier layer, wherein all the attention heads comprise the calculation tasks of the dot product attention layer, the Softmax layer and the attention multiplier layer;
Under the condition that all attention heads of the current Block are calculated, controlling the matrix multiplication calculation component to execute a calculation task of a projection layer of the neural network model, and controlling the weight caching unit to preload weight matrix data;
Under the condition that the calculation task of the projection layer is determined to be executed, controlling the layer normalization calculation component to execute the calculation task of a second layer normalization layer of the neural network model;
Under the condition that the execution of the calculation task of the second normalization layer is finished, controlling the matrix multiplication calculation component to execute the calculation task of the first full-connection layer of the neural network model, and controlling the weight caching unit to preload weight matrix data;
Under the condition that the completion of the calculation task of the first full-connection layer is determined, controlling the matrix multiplication calculation component to execute the calculation task of the second full-connection layer of the neural network model;
controlling the weight caching unit to preload weight matrix data when it is determined that the computing task of the second full connection layer has finished and that not all Blocks have been calculated, and repeatedly executing the computing task processing operations;
And ending the calculation under the condition that the completion of the calculation task of the second full connection layer is determined and the completion of all Block calculation is determined.
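A condensed sketch of the control flow listed above is given below; compute and preload_weights are placeholder callables standing in for the scheduling of the computing components and of the weight cache, and the trigger points follow the list above.

def run_block(num_heads, compute, preload_weights):
    compute("layer_norm_1")
    compute("Q_generation")
    compute("K_generation")
    compute("V_generation")
    for _ in range(num_heads):                 # attention is repeated for every head
        compute("dot_product_attention")
        compute("softmax")
        compute("attention_multiply")
    compute("projection");        preload_weights()   # preload while projecting
    compute("layer_norm_2")
    compute("fully_connected_1"); preload_weights()   # preload while computing FC1
    compute("fully_connected_2")

def run_model(num_blocks, num_heads, compute, preload_weights):
    preload_weights()                          # initial preload on the start command
    for b in range(num_blocks):
        run_block(num_heads, compute, preload_weights)
        if b + 1 < num_blocks:                 # more Blocks remain: preload and repeat
            preload_weights()

run_model(2, 3, lambda layer: print("compute", layer),
          lambda: print("preload weight matrix data"))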
In some embodiments, further comprising: and controlling the shared cache unit to cache the feature matrix data transmitted by the external equipment and intermediate results generated in the process of executing the calculation tasks by each calculation component, wherein the intermediate results refer to task processing results obtained in the process of executing the calculation tasks of each calculation layer by the calculation component before obtaining final calculation results of the current feature matrix data, and the final calculation results refer to results output after the current feature matrix data are processed by the neural network model.
In some embodiments, further comprising: and controlling the first cache subunit and the second cache subunit to perform ping-pong cache on weight matrix data to be used by each calculation layer of the neural network model, wherein the data source of the weight matrix data is the external equipment.
In some embodiments, further comprising: controlling the third shared buffer unit to store the Q matrix and the V matrix simultaneously; and controlling the fourth shared cache unit to store data to be cached in the task processing process of the first full connection layer.
In some embodiments, further comprising: when the computing task of the input layer is executed, the first shared cache unit is controlled to cache the currently input feature matrix data, and the second shared cache unit is controlled to carry out residual backup on the currently input feature matrix data.
In some embodiments, further comprising: when executing the calculation task of the first layer normalization layer, controlling the layer normalization calculation component to acquire data to be processed from the first shared cache unit, and storing a task processing result in the first shared cache unit.
In some embodiments, further comprising: when executing the calculation task of the Q generation layer, controlling the matrix multiplication calculation component to acquire a matrix to be multiplied from the first shared buffer unit and the weight buffer unit respectively, and storing a task processing result in a first buffer space of the third shared buffer unit.
In some embodiments, further comprising: when the calculation task of the K generation layer is executed, the matrix multiplication calculation component is controlled to obtain a matrix to be multiplied from the first shared cache unit and the weight cache unit respectively, and a task processing result is stored in the fourth shared cache unit.
In some embodiments, further comprising: when the calculation task of the V generation layer is executed, the matrix multiplication calculation component is controlled to obtain a matrix to be multiplied from the first shared buffer unit and the weight buffer unit respectively, and task processing results are stored in the second buffer space.
In some embodiments, further comprising: when the calculation task of the dot product attention layer is executed, the matrix multiplication calculation component is controlled to acquire a matrix to be multiplied from the first cache space and the fourth shared cache unit respectively, and a task processing result is stored in the first shared cache unit.
In some embodiments, further comprising: when executing the calculation task of the Softmax layer, controlling the Softmax calculation component to acquire data to be processed from the first shared cache unit, and storing a task processing result in the first shared cache unit;
In some embodiments, further comprising: when the calculation task of the attention multiplication layer is executed, the matrix multiplication calculation component is controlled to acquire a matrix to be multiplied from the first shared cache unit and the second cache space respectively, and a task processing result is stored in the fourth shared cache unit.
In some embodiments, further comprising: when the calculation task of the projection layer is executed, controlling the matrix multiplication calculation component to acquire a matrix to be multiplied from the fourth shared cache unit and the weight cache unit respectively, and acquiring a residual matrix from the second shared cache unit; and storing the task processing result in the first shared cache unit, and carrying out residual backup on the task processing result of the matrix multiplication calculating unit by utilizing the third shared cache unit.
In some embodiments, further comprising: when the calculation task of the second layer normalization layer is executed, the layer normalization calculation component is controlled to acquire data to be processed from the first shared cache unit, and a task processing result is stored in the first shared cache unit.
In some embodiments, further comprising: when executing the calculation task of the first full connection layer, controlling the matrix multiplication calculation component to obtain a matrix to be multiplied by the first shared buffer unit and the weight buffer unit respectively, and storing a task processing result in the fourth shared buffer unit;
In some embodiments, further comprising: when the calculation task of the second full connection layer is executed, controlling the matrix multiplication calculation component to acquire a matrix to be multiplied from the fourth shared cache unit and the weight cache unit respectively, and acquiring a residual matrix from the third shared cache unit; and storing the task processing result in the first shared cache unit, and carrying out residual backup on the task processing result of the matrix multiplication calculating unit by utilizing the second shared cache unit.
In some embodiments, further comprising: when writing or storing data to a target shared storage, replacing original data in the target shared storage with current data to be stored, wherein the target shared storage is any one of the first shared cache unit, the second shared cache unit, the third shared cache unit and the fourth shared cache unit.
In some embodiments, further comprising: under the condition that the starting calculation command is received, corresponding data of each calculation component in the starting associated data are issued to the corresponding calculation component, so that the calculation component executes calculation tasks of each calculation layer of the neural network model;
and controlling the matrix multiplication calculation component to execute the calculation task of the calculation layer needing matrix multiplication calculation in the neural network model under the condition of receiving the corresponding data issued by the dispatching core unit, wherein the execution process comprises the following steps: loading a first to-be-multiplied matrix from the feature buffer unit, loading a second to-be-multiplied matrix from the feature buffer unit or the weight buffer unit, and performing matrix multiplication operation based on the first to-be-multiplied matrix and the second to-be-multiplied matrix to obtain a matrix multiplication result; and under the condition that the current matrix multiplication result is an intermediate result, caching the current matrix multiplication result to the feature caching unit so that the Softmax computing component and the layer normalization computing component obtain data to be computed from the feature caching unit when executing computing tasks, wherein the intermediate result refers to task processing results obtained by the computing component in the process of executing computing tasks of each computing layer before obtaining final computing results of the current feature matrix data, and the final computing results refer to results output after the current feature matrix data are processed by a neural network model.
In some embodiments, further comprising: under the condition that the current matrix multiplication result is the final calculation result of the current feature matrix data, caching the current matrix multiplication result to the feature caching unit, and transmitting the current matrix multiplication result to a preset block random access memory for storage; transmitting a calculation completion signal to the dispatching core unit, wherein the calculation completion signal is used for indicating the dispatching core unit to prompt the external equipment to read task processing results of all the calculation components; the block random access memory is used for feeding back all the task processing results to the external equipment under the condition that the reading command of the external equipment is received.
In some embodiments, further comprising: before a calculation task starts, receiving bias parameters of each calculation layer transmitted by the external equipment; respectively issuing all the bias parameters to the matrix multiplication calculation part and the layer normalization calculation part;
controlling the matrix multiplication calculating part and the layer normalization calculating part to cache the bias parameters;
In some embodiments, further comprising: in the task execution process, bias parameter marking information corresponding to a current calculation task is issued to the matrix multiplication calculation component or the layer normalization calculation component, wherein the bias parameter marking information is used for indicating the matrix multiplication calculation component or the layer normalization calculation component to determine target bias parameters from all bias parameters cached on a chip of the matrix multiplication calculation component or the layer normalization calculation component, and the target bias parameters are used for completing the execution of the current calculation task.
In some embodiments, further comprising: in a case where the matrix multiplication calculating section performs a calculation task of any one of a Q generation layer, a K generation layer, a V generation layer, and a first full connection layer of the neural network model, inputs of the matrix multiplication calculating section are feature matrix data and weight matrix data; the step of performing the computing task by the matrix multiplication computing component includes: determining the feature matrix data as a first to-be-multiplied matrix, and determining the weight matrix data as a second to-be-multiplied matrix; traversing each column of the second matrix to be multiplied for a first row of matrix blocks of the first matrix to be multiplied to obtain a product between the first row of matrix blocks and the corresponding column of the second matrix to be multiplied; traversing each column of the second matrix to be multiplied for the next row of matrix blocks of the first matrix to be multiplied until the product between the last row of matrix blocks of the first matrix to be multiplied and the last column of matrix blocks of the second matrix to be multiplied is obtained, so that a task processing result of the current computing task is obtained; the data source of the feature matrix data is the feature cache unit, and the data source of the weight matrix data is the weight cache unit.
In some embodiments, further comprising: in the case where the matrix multiplication calculating section performs a calculation task of any one of a projection layer and a second full-connection layer of the neural network model, the inputs of the matrix multiplication calculating section are feature matrix data, weight matrix data, and a residual matrix; the step of performing the computing task by the matrix multiplication computing component includes: determining the feature matrix data as a first to-be-multiplied matrix, and determining the weight matrix data as a second to-be-multiplied matrix; multiplying the first to-be-multiplied matrix with the second to-be-multiplied matrix, and adding the obtained product with the residual matrix to obtain a task processing result of the current computing task; the data source of the feature matrix data is the feature cache unit, the data source of the weight matrix data is the weight cache unit, and the data source of the residual matrix is the feature cache unit.
In some embodiments, further comprising: in the case where the matrix multiplication computation section performs a computation task of a dot product attention layer of the neural network model, inputs of the matrix multiplication computation section are a Q matrix and a K matrix, the step of the matrix multiplication computation section performing the computation task includes: partitioning the Q matrix to obtain a plurality of Q submatrices; the K matrixes are subjected to blocking operation to obtain a plurality of K submatrices, the number of the Q submatrices is the same as that of the K submatrices, and the Q submatrices are in one-to-one correspondence with the K submatrices; multiplying each Q submatrix with the corresponding K submatrix respectively to obtain a task processing result of the current computing task; and the data sources of the Q matrix and the K matrix are the characteristic cache units.
In some embodiments, further comprising: in the case where the matrix multiplication computation section performs a computation task of an attention multiplier layer of the neural network model, inputs of the matrix multiplication computation section are attention score matrix data and a V matrix, the matrix multiplication computation section performing the computation task includes: performing partitioning operation on the attention score matrix data to obtain a plurality of attention score sub-matrices; the V matrix is subjected to blocking operation to obtain a plurality of V sub-matrices, the number of the attention score sub-matrices is the same as that of the V sub-matrices, and the attention score sub-matrices are in one-to-one correspondence with the V sub-matrices; obtaining a task processing result of the current computing task by performing matrix multiplication on the attention score sub-matrix and the transposed V sub-matrix, wherein performing matrix multiplication on the attention score sub-matrix and the transposed V sub-matrix comprises: the method comprises an inter-block transposition step and an intra-block transposition step, wherein the inter-block transposition refers to obtaining a V sub-matrix corresponding to the attention score sub-matrix in a jumping address mode in the matrix multiplication process, and the intra-block transposition refers to completing transposition of elements in each V sub-matrix block by adjusting the positions of bus elements; and the attention score matrix data and the data sources of the V matrix are the characteristic cache units.
In some embodiments, further comprising: and controlling the memory unit to receive the weight matrix data transmitted by the external equipment, loading the weight matrix data to the weight caching unit, and performing ping-pong caching on the weight matrix data by the weight caching unit.
In some embodiments, further comprising: and controlling the memory unit to receive the feature matrix data transmitted by the external equipment, and loading the feature matrix data to the feature cache unit.
In some embodiments, further comprising: controlling the multiplexer to receive a result write-back control signal sent by the scheduling core unit and the task processing results sent by the matrix multiplication calculating part, the Softmax calculating part and the layer normalization calculating part; the result write-back control signal is used for indicating the multiplexer to write all the task processing results into a preset block random access memory, and the block random access memory is used for feeding all the task processing results back to the external equipment under the condition that a read command of the external equipment is received.
In some embodiments, further comprising: and controlling the multiplexer to write the task processing result into the block random access memory through the queue unit.
In some embodiments, further comprising: and controlling the multiplexer to write the task processing results of each computing component into corresponding positions in a preset task processing table in the block random access memory based on the layer number preset by each computing layer, wherein the task processing table comprises the layer numbers and the corresponding task processing results. The task processing table is used for being read by external equipment. Further, the task processing table is used for indicating the external device to compare the task processing result corresponding to each layer number with a preset real result so as to complete training or verification of the current neural network model. It should be noted that, in this embodiment, by storing the task processing results according to the layer numbers, it is possible to facilitate the subsequent external device to quickly determine the task processing results corresponding to each calculation layer based on the corresponding layer numbers, and further compare the task processing results with the corresponding real results, thereby completing training or checking of the neural network model.
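A small sketch of the layer-numbered result table and of the host-side comparison is given below; the numbering scheme, tolerance, and values are illustrative assumptions.

def write_result(task_table, layer_number, result):
    task_table[layer_number] = result          # multiplexer writes by preset layer number

def verify(task_table, reference_results, tol=1e-3):
    """Host-side check: compare each layer's result with its preset real result."""
    return {n: abs(task_table[n] - ref) <= tol
            for n, ref in reference_results.items()}

task_table = {}
write_result(task_table, 1, 0.501)
write_result(task_table, 2, 1.250)
print(verify(task_table, {1: 0.5, 2: 1.25}))   # {1: True, 2: True}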
The scheduling method of the hardware accelerator in the above embodiment is further explained in the following with a specific embodiment.
Referring to fig. 7, the scheduling method includes:
s710: the calculation is started. Specifically, the external device sends start-up associated data to the scheduling core unit. And the external device sends the weight matrix data of all the calculation layers in the neural network model to the memory unit for preloading. The scheduling core unit preloads the access parameters, the calculation control parameters and the quantization parameters of each calculation component to the corresponding calculation component for caching, and issues a start signal to each calculation component, wherein the start signal is used for indicating the calculation component to start running.
S720: and loading the feature matrix data. Specifically, the external device loads the feature matrix data to the DDR, and then the DDR writes the feature matrix data into the feature cache unit based on a preset first on-chip logic.
S730: and loading weight matrix data. Specifically, the external device loads the weight matrix data to the DDR, and then the DDR writes the feature matrix data into the weight cache unit based on a preset second on-chip logic.
S740: Calculate LN1. S750: Calculate the Q matrix. S760: Calculate the K matrix. S770: Calculate the V matrix. S780: Calculate Q×K^T. S790: Calculate SF (the computation task of the Softmax layer). S7100: Calculate Att×V (the computation task of the attention multiplier layer). S7110: Determine whether all head calculations are complete. S7120: Calculate Proj. S7130: Load weight matrix data. S7140: Calculate LN2. S7150: Calculate FC1. S7160: Load weight matrix data. S7170: Calculate FC2. S7180: Determine whether all Block calculations are complete. S7190: Load weight matrix data. Specifically, each computing component is controlled to perform data loading, computation and result write-back.
Referring to fig. 8, the present embodiment further provides a scheduling core unit, including:
the data receiving subunit 810 is configured to receive and buffer startup association data transmitted by an external device, where the startup association data includes: access parameters, calculation control parameters, quantization parameters, and start signals for each calculation unit, the calculation unit comprising: a matrix multiplication calculating section, a Softmax calculating section, and a layer normalization calculating section;
A scheduling subunit 820, configured to receive a start-up calculation command transmitted by the external device, and control and schedule all the calculation units based on the start-up association data when the start-up calculation command is received, so that the calculation units execute calculation tasks of each calculation layer of the neural network model, where the matrix multiplication calculation units complete execution of the calculation tasks by loading data buffered in the feature buffer unit and the weight buffer unit;
and a prompting subunit 830, configured to receive task completion signals returned by all the computing components, and prompt the external device to read task processing results of all the computing components based on the task completion signals. The scheduling core unit in the embodiment can better realize scheduling of the hardware accelerator.
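As an informal illustration of Fig. 8, the scheduling core unit can be pictured as the following Python class with three subunits; the method names and the component interface (load_params, run, done) are invented for the sketch and do not reflect the actual hardware interfaces.

class SchedulingCore:
    """Illustrative model of Fig. 8: a data receiving subunit, a scheduling
    subunit and a prompting subunit (names and interfaces are invented)."""

    def __init__(self, components):
        self.components = components      # matrix-mult, Softmax, layer-norm parts
        self.startup_data = None

    # Data receiving subunit (810): cache the start-up associated data.
    def receive_startup(self, startup_data):
        self.startup_data = startup_data

    # Scheduling subunit (820): on a start calculation command, push each
    # component's parameters down and let it run its layer's computation task.
    def on_start_command(self):
        for name, component in self.components.items():
            component.load_params(self.startup_data[name])
            component.run()

    # Prompting subunit (830): once every component reports completion,
    # prompt the external device to read the task processing results.
    def on_all_done(self, notify_external):
        if all(component.done for component in self.components.values()):
            notify_external("task processing results ready")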
Fig. 9 illustrates a physical schematic diagram of an electronic device, as shown in fig. 9, which may include: processor 910, communication interface (Communications Interface) 920, memory 930, and communication bus 940, wherein processor 910, communication interface 920, and memory 930 communicate with each other via communication bus 940. Processor 910 may call logic instructions in memory 930 to perform a method of scheduling a hardware accelerator, the method comprising: receiving and caching starting association data transmitted by external equipment, wherein the starting association data comprises: memory access parameters, calculation control parameters, quantization parameters and start signals of all calculation components; receiving a starting calculation command transmitted by external equipment, and controlling and scheduling all calculation components based on starting associated data under the condition of receiving the starting calculation command so as to enable the calculation components to execute calculation tasks of all calculation layers of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading cache data in a characteristic cache unit and a weight cache unit; and receiving task completion signals returned by all the computing components, and prompting the external equipment to read task processing results of all the computing components based on the task completion signals.
Further, the logic instructions in the memory 930 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method of scheduling a hardware accelerator provided by the above methods, the method comprising: receiving and caching starting association data transmitted by external equipment, wherein the starting association data comprises: memory access parameters, calculation control parameters, quantization parameters and start signals of all calculation components; receiving a starting calculation command transmitted by external equipment, and controlling and scheduling all calculation components based on starting associated data under the condition of receiving the starting calculation command so as to enable the calculation components to execute calculation tasks of all calculation layers of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading cache data in a characteristic cache unit and a weight cache unit; and receiving task completion signals returned by all the computing components, and prompting the external equipment to read task processing results of all the computing components based on the task completion signals.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (25)

1. A hardware accelerator, comprising:
A scheduling core unit, a feature buffer unit, a weight buffer unit, and a plurality of computing units, wherein the computing units comprise: a matrix multiplication calculating section, a Softmax calculating section, and a layer normalization calculating section;
The scheduling core unit is used for receiving and caching starting association data transmitted by external equipment, and the starting association data comprises: memory access parameters, calculation control parameters, quantization parameters and start signals of the calculation components; and receiving a start calculation command transmitted by the external device, and controlling and scheduling all the calculation components based on the start association data under the condition of receiving the start calculation command so as to enable the calculation components to execute calculation tasks of each calculation layer of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading data cached in the feature cache unit and the weight cache unit; the scheduling core unit is also used for receiving task completion signals returned by all the computing components and prompting the external equipment to read task processing results of all the computing components based on the task completion signals;
The neural network model includes: the input layer, the first layer normalization layer, the Q generation layer, the K generation layer, the V generation layer, the dot product attention layer, the Softmax layer, the attention multiplication layer, the projection layer, the second layer normalization layer, the first full connection layer and the second full connection layer, which are sequentially connected; the Q generation layer is used for obtaining a Q matrix, the K generation layer is used for obtaining a K matrix, and the V generation layer is used for obtaining a V matrix.
2. The hardware accelerator of claim 1, wherein the feature cache unit comprises a plurality of sets of shared cache units; the shared cache units are used for caching the feature matrix data transmitted by the external device and intermediate results generated in the process of executing the calculation tasks by the calculation components, wherein the intermediate results refer to task processing results obtained in the process of executing the calculation tasks of the calculation components of the calculation layers before final calculation results of the current feature matrix data are obtained, and the final calculation results refer to results output after the current feature matrix data are processed by the neural network model.
3. The hardware accelerator of claim 2 wherein the plurality of sets of shared cache units comprise: the system comprises a first shared cache unit, a second shared cache unit, a third shared cache unit and a fourth shared cache unit, wherein the cache depth of the first shared cache unit is the scale of a group of target feature matrix data, the target feature matrix data refer to feature matrix data output by each target calculation layer of the neural network model, and the target calculation layer refers to any one layer of the first layer normalization layer, the Q generation layer, the K generation layer, the V generation layer, the dot product attention layer, the Softmax layer, the attention multiplication layer, the projection layer, the second layer normalization layer and the second full connection layer; the buffer depths of the second shared buffer unit and the first shared buffer unit are the same, and the buffer depth of the third shared buffer unit is 2 times of the buffer depth of the first shared buffer unit or the second shared buffer unit so as to store the Q matrix and the V matrix simultaneously; the depth of the fourth shared cache unit is 4 times of the depth of the first shared cache unit or the second shared cache unit, and the fourth shared cache unit is used for storing data to be cached in the first full-connection layer task processing process.
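The relationship between the four buffer depths recited in claim 3 can be summarized by the following minimal sketch; the sequence length and embedding width used to form one group of target feature matrix data are assumed example values, not parameters taken from this disclosure.

# Illustrative depth budget for the four shared cache units of claim 3. BASE is
# the size of one group of target feature matrix data; the sequence length and
# embedding width below are assumed example values only.
SEQ_LEN, EMBED_DIM = 384, 768
BASE = SEQ_LEN * EMBED_DIM

BUFFER_DEPTHS = {
    "shared_cache_1": BASE,       # current feature matrix, LN and Softmax results
    "shared_cache_2": BASE,       # residual backup
    "shared_cache_3": 2 * BASE,   # holds the Q matrix and the V matrix at once
    "shared_cache_4": 4 * BASE,   # data cached for the first full connection layer
}
print(BUFFER_DEPTHS)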
4. The hardware accelerator of claim 3, wherein:
When the computing task of the input layer is executed, the first shared cache unit is used for caching the currently input feature matrix data, and the second shared cache unit is used for carrying out residual backup on the currently input feature matrix data;
when executing the calculation task of the first layer normalization calculation unit, the first shared cache unit is used for providing a data source for executing the calculation task for the layer normalization calculation unit, and the task processing result of the layer normalization calculation unit is stored in the first shared cache unit;
When executing the computing task of the Q generation layer, the first shared buffer unit is one of the data sources of the computing task executed by the matrix multiplication computing unit, the weight buffer unit is the second data source of the computing task executed by the matrix multiplication computing unit, the task processing result of the matrix multiplication computing unit is stored in the first buffer space of the third shared buffer unit, the third shared buffer unit comprises the first buffer space and the second buffer space, and the depths of the first buffer space and the second buffer space are the same;
When executing the calculation task of the K generation layer, the first shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the weight buffer unit is the second data source of the calculation task executed by the matrix multiplication calculation unit, and the task processing result of the matrix multiplication calculation unit is stored in the fourth shared buffer unit;
When executing the calculation task of the V generation layer, the first shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the weight buffer unit is the second data source of the calculation task executed by the matrix multiplication calculation unit, and the task processing result of the matrix multiplication calculation unit is stored in the second buffer space;
When the computing task of the dot product attention layer is executed, the first buffer space is one of data sources of the computing task executed by the matrix multiplication computing component, the fourth shared buffer unit is two of data sources of the computing task executed by the matrix multiplication computing component, and the task processing result of the matrix multiplication computing component is stored in the first shared buffer unit;
When executing the calculation task of the Softmax layer, the first shared cache unit is a data source for executing the calculation task by the Softmax calculation unit, and a task processing result of the Softmax calculation unit is stored in the first shared cache unit;
when executing the calculation task of the attention multiplication layer, the first shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the second buffer space is the second data source of the calculation task executed by the matrix multiplication calculation unit, and the task processing result of the matrix multiplication calculation unit is stored in the fourth shared buffer unit;
When executing the calculation task of the projection layer, the fourth shared buffer unit is one of the data sources of the calculation task executed by the matrix multiplication calculation unit, the weight buffer unit is the second data source of the calculation task executed by the matrix multiplication calculation unit, the second shared buffer unit is the third data source of the calculation task executed by the matrix multiplication calculation unit, the task processing result of the matrix multiplication calculation unit is stored in the first shared buffer unit, and the third shared buffer unit is used for carrying out residual backup on the task processing result of the matrix multiplication calculation unit;
when executing the calculation task of the second layer normalization layer, the first shared cache unit is used for providing a data source for executing the calculation task for the layer normalization calculation component, and the task processing result of the layer normalization calculation component is stored in the first shared cache unit;
When executing the computing task of the first full connection layer, the first shared buffer unit is one of the data sources of the computing task executed by the matrix multiplication computing unit, the weight buffer unit is two of the data sources of the computing task executed by the matrix multiplication computing unit, and the task processing result of the matrix multiplication computing unit is stored in the fourth shared buffer unit;
when executing the computing task of the second full connection layer, the fourth shared buffer unit is one of the data sources of the computing task executed by the matrix multiplication computing unit, the weight buffer unit is two of the data sources of the computing task executed by the matrix multiplication computing unit, the third shared buffer unit is three of the data sources of the computing task executed by the matrix multiplication computing unit, the task processing result of the matrix multiplication computing unit is stored in the first shared buffer unit, and the second shared buffer unit is used for carrying out residual backup on the task processing result of the matrix multiplication computing unit.
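For readability, the data-source and destination routing recited in claim 4 can be collected into a single table; the short labels buf1 to buf4 (the four shared cache units), buf3a/buf3b (the two buffer spaces of the third unit) and weights (the weight cache unit) are introduced only for this summary.

# Readability aid only: the per-layer data-source / destination routing of claim 4.
# buf1..buf4 are the four shared cache units, buf3a / buf3b are the first and
# second buffer spaces of the third unit, and "weights" is the weight cache unit.
LAYER_ROUTING = {
    "input":            {"sources": ["external"],                "destination": "buf1", "residual_backup": "buf2"},
    "layer_norm_1":     {"sources": ["buf1"],                    "destination": "buf1"},
    "q_generation":     {"sources": ["buf1", "weights"],         "destination": "buf3a"},
    "k_generation":     {"sources": ["buf1", "weights"],         "destination": "buf4"},
    "v_generation":     {"sources": ["buf1", "weights"],         "destination": "buf3b"},
    "dot_product_attn": {"sources": ["buf3a", "buf4"],           "destination": "buf1"},
    "softmax":          {"sources": ["buf1"],                    "destination": "buf1"},
    "attention_x_v":    {"sources": ["buf1", "buf3b"],           "destination": "buf4"},
    "projection":       {"sources": ["buf4", "weights", "buf2"], "destination": "buf1", "residual_backup": "buf3"},
    "layer_norm_2":     {"sources": ["buf1"],                    "destination": "buf1"},
    "fc1":              {"sources": ["buf1", "weights"],         "destination": "buf4"},
    "fc2":              {"sources": ["buf4", "weights", "buf3"], "destination": "buf1", "residual_backup": "buf2"},
}
for layer, route in LAYER_ROUTING.items():
    print(layer, route)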
5. The hardware accelerator of claim 4, wherein:
When writing or storing data to a target shared storage, replacing original data in the target shared storage with current data to be stored, wherein the target shared storage is any one of the first shared cache unit, the second shared cache unit, the third shared cache unit and the fourth shared cache unit.
6. The hardware accelerator of claim 1, wherein:
The scheduling core unit is an offline operation unit, and is specifically configured to, when receiving the start calculation command, issue a corresponding parameter of each calculation component in the start association data to the corresponding calculation component, so that the calculation component executes a calculation task of each calculation layer of the neural network model;
The matrix multiplication calculation component is configured to execute a calculation task of the calculation layer in the neural network model, where matrix multiplication calculation is required, under the condition that corresponding data issued by the scheduling core unit is received, and an execution process includes: loading a first to-be-multiplied matrix from the feature buffer unit, loading a second to-be-multiplied matrix from the feature buffer unit or the weight buffer unit, and performing matrix multiplication operation based on the first to-be-multiplied matrix and the second to-be-multiplied matrix to obtain a matrix multiplication result; and under the condition that the current matrix multiplication result is an intermediate result, caching the current matrix multiplication result to the feature caching unit so that the Softmax computing component and the layer normalization computing component obtain data to be computed from the feature caching unit when executing computing tasks, wherein the intermediate result refers to task processing results obtained by the computing component in the process of executing computing tasks of each computing layer before obtaining final computing results of the current feature matrix data, and the final computing results refer to results output after the current feature matrix data are processed by a neural network model.
7. The hardware accelerator of claim 6, wherein:
Under the condition that the current matrix multiplication result is the final calculation result of the current feature matrix data, caching the current matrix multiplication result to the feature caching unit, and transmitting the current matrix multiplication result to a preset block random access memory for storage; transmitting a calculation completion signal to the dispatching core unit, wherein the calculation completion signal is used for indicating the dispatching core unit to prompt the external equipment to read task processing results of all the calculation components; the block random access memory is used for feeding back all the task processing results to the external equipment under the condition that the reading command of the external equipment is received.
8. The hardware accelerator of claim 1, wherein the scheduling core unit is further configured to receive bias parameters of each of the computing layers transmitted by the external device before a computing task begins; respectively issuing all the bias parameters to the matrix multiplication calculation part and the layer normalization calculation part;
the matrix multiplication calculation component and the layer normalization calculation component are both used for carrying out on-chip caching on the bias parameters;
The scheduling core unit is further configured to send bias parameter marking information corresponding to a current computing task to the matrix multiplication computing unit or the layer normalization computing unit in a task execution process, where the bias parameter marking information is used to instruct the matrix multiplication computing unit or the layer normalization computing unit to determine a target bias parameter from all bias parameters cached on a self chip, and the target bias parameter is used to complete execution of the current computing task.
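The bias handling of claim 8 amounts to an indexed lookup into an on-chip table; the sketch below uses an invented marking-to-bias mapping and invented numbers only to show the idea.

# Sketch of claim 8: all bias vectors are cached on-chip up front, and the bias
# parameter marking information sent with each task selects the one to apply.
# The marking values and bias numbers below are invented for the example.
BIAS_TABLE = {
    "q_gen": [0.01, -0.02, 0.00, 0.03],
    "k_gen": [0.00, 0.01, 0.02, -0.01],
    "fc1":   [0.10, -0.05, 0.07, 0.00],
}

def apply_bias(row, marking_info):
    """Look up the target bias parameters for the current computing task and add them."""
    bias = BIAS_TABLE[marking_info]
    return [x + b for x, b in zip(row, bias)]

print(apply_bias([1.0, 2.0, 3.0, 4.0], marking_info="q_gen"))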
9. The hardware accelerator according to claim 1, wherein in a case where the matrix multiplication calculating section performs a calculation task of any one of a Q generation layer, a K generation layer, a V generation layer, and a first fully-connected layer of the neural network model, inputs of the matrix multiplication calculating section are feature matrix data and weight matrix data; the step of performing the computing task by the matrix multiplication computing component includes: determining the feature matrix data as a first to-be-multiplied matrix, and determining the weight matrix data as a second to-be-multiplied matrix; traversing each column of the second matrix to be multiplied for a first row of matrix blocks of the first matrix to be multiplied to obtain a product between the first row of matrix blocks and the corresponding column of the second matrix to be multiplied; traversing each column of the second matrix to be multiplied for the next row of matrix blocks of the first matrix to be multiplied until the product between the last row of matrix blocks of the first matrix to be multiplied and the last column of matrix blocks of the second matrix to be multiplied is obtained, so that a task processing result of the current computing task is obtained; the data source of the feature matrix data is the feature cache unit, and the data source of the weight matrix data is the weight cache unit.
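The traversal order of claim 9 corresponds to the following blocked matrix multiplication, sketched here in Python with NumPy; the matrix shapes and block size are assumptions chosen only so the example runs.

import numpy as np

def blocked_matmul(feat, weight, blk):
    """Traverse one row of blocks of the feature matrix at a time, sweeping every
    column of blocks of the weight matrix, as recited in claim 9."""
    rows, inner = feat.shape
    inner2, cols = weight.shape
    assert inner == inner2 and rows % blk == 0 and cols % blk == 0
    out = np.zeros((rows, cols), dtype=feat.dtype)
    for r in range(0, rows, blk):            # first (then next) row of matrix blocks
        for c in range(0, cols, blk):        # traverse every column of blocks
            out[r:r + blk, c:c + blk] = feat[r:r + blk, :] @ weight[:, c:c + blk]
    return out

feat = np.random.rand(8, 16).astype(np.float32)    # from the feature cache unit
weight = np.random.rand(16, 8).astype(np.float32)  # from the weight cache unit
assert np.allclose(blocked_matmul(feat, weight, blk=4), feat @ weight, atol=1e-5)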
10. The hardware accelerator according to claim 1, wherein in a case where the matrix multiplication calculating section performs a calculation task of any one of a projection layer and a second fully-connected layer of the neural network model, inputs of the matrix multiplication calculating section are feature matrix data, weight matrix data, and residual matrix; the step of performing the computing task by the matrix multiplication computing component includes: determining the feature matrix data as a first to-be-multiplied matrix, and determining the weight matrix data as a second to-be-multiplied matrix; multiplying the first to-be-multiplied matrix with the second to-be-multiplied matrix, and adding the obtained product with the residual matrix to obtain a task processing result of the current computing task; the data source of the feature matrix data is the feature cache unit, the data source of the weight matrix data is the weight cache unit, and the data source of the residual matrix is the feature cache unit.
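Claim 10 reduces to a multiply followed by a residual addition; the short sketch below shows only the arithmetic, with arbitrary shapes.

import numpy as np

def matmul_with_residual(feat, weight, residual):
    # Multiply the feature matrix by the weight matrix, then add the residual
    # matrix backed up earlier in the feature cache unit.
    return feat @ weight + residual

feat = np.random.rand(4, 8).astype(np.float32)
weight = np.random.rand(8, 8).astype(np.float32)
residual = np.random.rand(4, 8).astype(np.float32)
print(matmul_with_residual(feat, weight, residual).shape)   # (4, 8)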
11. The hardware accelerator according to claim 1, wherein in a case where the matrix multiplication computation section performs a computation task of a dot product attention layer of the neural network model, inputs of the matrix multiplication computation section are a Q matrix and a K matrix, the matrix multiplication computation section performing the computation task includes: partitioning the Q matrix to obtain a plurality of Q submatrices; the K matrixes are subjected to blocking operation to obtain a plurality of K submatrices, the number of the Q submatrices is the same as that of the K submatrices, and the Q submatrices are in one-to-one correspondence with the K submatrices; multiplying each Q submatrix with the corresponding K submatrix respectively to obtain a task processing result of the current computing task; and the data sources of the Q matrix and the K matrix are the characteristic cache units.
12. The hardware accelerator according to claim 1 or 11, wherein in a case where the matrix multiplication calculating section performs a calculation task of an attention multiplier layer of the neural network model, inputs of the matrix multiplication calculating section are attention score matrix data and a V matrix, the matrix multiplication calculating section performing the calculation task includes: performing partitioning operation on the attention score matrix data to obtain a plurality of attention score sub-matrices; the V matrix is subjected to blocking operation to obtain a plurality of V sub-matrices, the number of the attention score sub-matrices is the same as that of the V sub-matrices, and the attention score sub-matrices are in one-to-one correspondence with the V sub-matrices; obtaining a task processing result of the current computing task by performing matrix multiplication on the attention score sub-matrix and the transposed V sub-matrix, wherein performing matrix multiplication on the attention score sub-matrix and the transposed V sub-matrix comprises: the method comprises an inter-block transposition step and an intra-block transposition step, wherein the inter-block transposition refers to obtaining a V sub-matrix corresponding to the attention score sub-matrix in a jumping address mode in the matrix multiplication process, and the intra-block transposition refers to completing transposition of elements in each V sub-matrix block by adjusting the positions of bus elements; and the attention score matrix data and the data sources of the V matrix are the characteristic cache units.
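For claims 11 and 12, the following sketch mimics in software an attention-times-V product in which the V matrix is held in transposed layout: blocks of V are fetched in jumped order between blocks and re-transposed inside each block. The block size is an assumption, and NumPy transposition stands in for the bus-level element reordering.

import numpy as np

BLK = 4  # assumed block edge length

def att_times_v(att, v_stored_transposed):
    """Compute Att @ V when V is held in transposed layout: V blocks are fetched
    in jumped (inter-block transposed) order and re-transposed inside each block."""
    n = att.shape[0]                       # rows of the attention score matrix
    d = v_stored_transposed.shape[0]       # stored V^T has shape (d, n)
    out = np.zeros((n, d), dtype=att.dtype)
    for i in range(0, n, BLK):
        for j in range(0, d, BLK):
            acc = np.zeros((BLK, BLK), dtype=att.dtype)
            for k in range(0, n, BLK):
                a_blk = att[i:i + BLK, k:k + BLK]
                # inter-block transposition: jump to block (j, k) of the stored V^T
                vt_blk = v_stored_transposed[j:j + BLK, k:k + BLK]
                # intra-block transposition: reorder the elements inside the block
                acc += a_blk @ vt_blk.T
            out[i:i + BLK, j:j + BLK] = acc
    return out

att = np.random.rand(8, 8).astype(np.float32)   # attention score matrix (one head)
v = np.random.rand(8, 8).astype(np.float32)     # V matrix (one head)
assert np.allclose(att_times_v(att, v.T), att @ v, atol=1e-5)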
13. The hardware accelerator of claim 1, further comprising: the memory unit is used for receiving the weight matrix data transmitted by the external equipment and loading the weight matrix data to the weight caching unit;
and the weight caching unit performs ping-pong caching on the weight matrix data by utilizing the caching depth.
14. The hardware accelerator of claim 13, wherein the weight buffer unit is divided into a first buffer subunit and a second buffer subunit based on a buffer depth, the first buffer subunit and the second buffer subunit are configured to ping-pong weight matrix data to be used by each computing layer of the neural network model, and a data source of the weight matrix data is the external device;
And the buffer depths of the first buffer subunit and the second buffer subunit are the scale of the weight matrix data of the first full-connection layer or the second full-connection layer of the neural network model.
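A minimal sketch, with invented names, of the ping-pong discipline of claims 13 and 14: while one sub-buffer feeds the current computation, the other is refilled with the next layer's weights.

class PingPongWeightBuffer:
    """Two sub-buffers of equal depth alternate roles: one feeds the current
    computation while the other is refilled with the next layer's weights."""

    def __init__(self, depth):
        self.banks = [[None] * depth, [None] * depth]   # first / second cache subunit
        self.active = 0                                 # bank currently read by compute

    def preload(self, weights):
        """Fill the idle bank with the next weight matrix data from memory."""
        idle = 1 - self.active
        self.banks[idle][:len(weights)] = weights

    def swap(self):
        """Switch banks once the current layer has finished reading its weights."""
        self.active = 1 - self.active

    def read(self):
        return self.banks[self.active]


buf = PingPongWeightBuffer(depth=8)
buf.preload([0.1, 0.2, 0.3, 0.4])   # e.g. weights of the next computation layer
buf.swap()
print(buf.read()[:4])                # [0.1, 0.2, 0.3, 0.4]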
15. The hardware accelerator of claim 14 wherein all of the weight matrix data comprises: first, second, third, fourth, fifth and sixth matrix data, the first matrix data being weight data of a Q-generation layer of the neural network model, the second matrix data being weight matrix data of a K-generation layer of the neural network model, the third matrix data being weight matrix data of a V-generation layer of the neural network model, the fourth matrix data being weight matrix data of a projection layer of the neural network model, the fifth matrix data being weight matrix data of a first fully-connected layer of the neural network model, the sixth matrix data being weight matrix data of a second fully-connected layer of the neural network model;
When executing a calculation task of an input layer of the neural network model, the first buffer subunit is configured to preload the first matrix data, the second matrix data, the third matrix data, and the fourth matrix data; the second cache subunit preloads the fifth matrix data; the weight data preloaded in the first cache subunit and the second cache subunit are used for reading or loading the weight data by each computing component when the computing components perform computing task processing;
The first buffer subunit is configured to preload the sixth matrix data when performing a computation task of a projection layer of the neural network model;
When executing the calculation task of the first full connection layer of the neural network model, the second buffer subunit is configured to preload the first matrix data, the second matrix data, the third matrix data, and the fourth matrix data;
The first cache subunit is configured to preload the fifth matrix data when performing a computational task of a second fully-connected layer of the neural network model.
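Purely as a restatement of the preload schedule in claim 15, the dictionary below lists which weight matrices each sub-buffer holds at each stage; the stage labels and the W_* names are shorthand introduced for this summary, not identifiers used by the accelerator.

# Shorthand view of the weight preload schedule in claim 15. W_q, W_k, W_v,
# W_proj, W_fc1 and W_fc2 are the six weight matrices; bank_a / bank_b are the
# first / second cache subunits of the weight cache unit.
PRELOAD_SCHEDULE = {
    "during_input_layer": {"bank_a": ["W_q", "W_k", "W_v", "W_proj"],
                           "bank_b": ["W_fc1"]},
    "during_projection":  {"bank_a": ["W_fc2"]},
    "during_fc1":         {"bank_b": ["W_q", "W_k", "W_v", "W_proj"]},
    "during_fc2":         {"bank_a": ["W_fc1"]},
}
print(PRELOAD_SCHEDULE["during_input_layer"]["bank_a"])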
16. The hardware accelerator of claim 13, wherein the memory unit is further configured to receive feature matrix data transmitted by the external device, and load the feature matrix data into the feature cache unit.
17. The hardware accelerator of claim 1, further comprising: the input end of the multiplexer is respectively connected with the scheduling core unit, the matrix multiplication computing unit, the Softmax computing unit and the layer normalization computing unit so as to receive a result write-back control signal sent by the scheduling core unit and the task processing result sent by the matrix multiplication computing unit, the Softmax computing unit and the layer normalization computing unit; the result write-back control signal is used for indicating the multiplexer to write all the task processing results into a preset block random access memory, and the block random access memory is used for feeding all the task processing results back to the external equipment under the condition that a read command of the external equipment is received.
18. The hardware accelerator of claim 17, further comprising: and the multiplexer writes the task processing result into the block random access memory through the queue unit.
19. A server, comprising: the hardware accelerator of any one of claims 1 to 18, configured to computationally accelerate the neural network model.
20. A scheduling method of the hardware accelerator of any one of claims 1 to 18, comprising:
receiving and caching starting association data transmitted by external equipment, wherein the starting association data comprises the following steps: memory access parameters, calculation control parameters, quantization parameters and start signals of the calculation components;
Receiving a starting calculation command transmitted by the external equipment, and controlling and scheduling all the calculation components based on the starting associated data under the condition of receiving the starting calculation command so as to enable the calculation components to execute calculation tasks of each calculation layer of the neural network model, wherein the matrix multiplication calculation components complete execution of the calculation tasks by loading data cached in the characteristic cache unit and the weight cache unit;
and receiving task completion signals returned by all the computing components, and prompting the external equipment to read task processing results of all the computing components based on the task completion signals.
21. The scheduling method of a hardware accelerator according to claim 20, wherein in the case of completing the buffering of the startup-associated data, the corresponding parameter of each of the computing components in the startup-associated data is issued to the corresponding computing component; and each computing component caches the issued parameters on-chip under the condition of receiving the issued parameters.
22. The scheduling method of a hardware accelerator according to claim 20 or 21, wherein the step of controlling and scheduling all the computing units based on the start-up association data to cause the computing units to execute the computing tasks of the respective computing layers of the neural network model in the case where the start-up computing command is received comprises:
Under the condition that the starting calculation command is received, controlling the weight caching unit to preload weight matrix data, controlling the layer normalization calculation component to load feature matrix data from the feature caching unit, loading weight matrix data from the weight caching unit, and executing the calculation task of the first layer normalization layer of the neural network model based on the loaded feature matrix data and weight matrix data; the pre-loading of the weight matrix data refers to pre-loading the weight matrix data to be used by each computing component for each computing component to read or load;
controlling the matrix multiplication calculation component to sequentially execute the calculation tasks of the Q generation layer, the K generation layer, the V generation layer and the dot product attention layer of the neural network model under the condition that the calculation task of the first layer normalization layer has been completed;
controlling the Softmax computing component to execute a computing task of a Softmax layer of the neural network model;
Controlling the matrix multiplication calculating part to execute the calculation task of the attention multiplication layer of the neural network model;
Under the condition that not all attention heads of the current Block of the neural network model have been calculated, re-executing the calculation tasks of the dot product attention layer, the Softmax layer and the attention multiplier layer, wherein the calculation of each attention head comprises the calculation tasks of the dot product attention layer, the Softmax layer and the attention multiplier layer;
Under the condition that all attention heads of the current Block are calculated, controlling the matrix multiplication calculation component to execute a calculation task of a projection layer of the neural network model, and controlling the weight caching unit to preload weight matrix data;
Under the condition that the calculation task of the projection layer is determined to be executed, controlling the layer normalization calculation component to execute the calculation task of a second layer normalization layer of the neural network model;
Under the condition that the execution of the calculation task of the second layer normalization layer is finished, controlling the matrix multiplication calculation component to execute the calculation task of the first full-connection layer of the neural network model, and controlling the weight caching unit to preload weight matrix data;
Under the condition that the completion of the calculation task of the first full-connection layer is determined, controlling the matrix multiplication calculation component to execute the calculation task of the second full-connection layer of the neural network model;
controlling the weight caching unit to preload weight matrix data under the condition that the computing task of the second full-connection layer is determined to be finished and it is determined that not all Blocks have been calculated, and repeatedly executing the computing task processing operations;
And ending the calculation under the condition that the completion of the calculation task of the second full connection layer is determined and the completion of all Block calculation is determined.
23. A dispatch core unit comprising:
the data receiving subunit is configured to receive and cache start-up association data transmitted by an external device, where the start-up association data includes: access parameters, calculation control parameters, quantization parameters, and start signals for each calculation unit, the calculation unit comprising: a matrix multiplication calculating section, a Softmax calculating section, and a layer normalization calculating section;
the scheduling subunit is configured to receive a start computing command transmitted by the external device, and control and schedule all the computing units based on the start association data under the condition that the start computing command is received, so that the computing units execute computing tasks of each computing layer of a neural network model, where the matrix multiplication computing units complete execution of the computing tasks by loading data cached in the feature cache unit and the weight cache unit, and the neural network model includes: the input layer, the first normalization layer, the Q generation layer, the K generation layer, the V generation layer, the dot product attention layer, the Softmax layer, the attention multiplication layer, the projection layer, the second normalization layer, the first full connection layer and the second full connection layer are sequentially connected; the Q generation layer is used for obtaining a Q matrix, the K generation layer is used for obtaining a K matrix, and the V generation layer is used for obtaining a V matrix;
and the prompting subunit is used for receiving the task completion signals returned by all the computing components and prompting the external equipment to read the task processing results of all the computing components based on the task completion signals.
24. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the scheduling method of the hardware accelerator of any one of claims 20 to 22.
25. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the scheduling method of the hardware accelerator of any of claims 20 to 22.
CN202410219345.XA 2024-02-28 2024-02-28 Hardware accelerator and scheduling method thereof Active CN117787366B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410219345.XA CN117787366B (en) 2024-02-28 2024-02-28 Hardware accelerator and scheduling method thereof

Publications (2)

Publication Number Publication Date
CN117787366A CN117787366A (en) 2024-03-29
CN117787366B true CN117787366B (en) 2024-05-10

Family

ID=90385718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410219345.XA Active CN117787366B (en) 2024-02-28 2024-02-28 Hardware accelerator and scheduling method thereof

Country Status (1)

Country Link
CN (1) CN117787366B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113159309A (en) * 2021-03-31 2021-07-23 华南理工大学 NAND flash memory-based low-power-consumption neural network accelerator storage architecture
CN114118344A (en) * 2020-08-31 2022-03-01 南京大学 Hardware accelerator applied to Transformer neural network and calculation method thereof
CN114897133A (en) * 2022-04-22 2022-08-12 中山大学 Universal configurable Transformer hardware accelerator and implementation method thereof
CN116882468A (en) * 2023-09-05 2023-10-13 苏州浪潮智能科技有限公司 Hardware accelerator, hardware acceleration method and electronic equipment
CN117391096A (en) * 2023-10-17 2024-01-12 山东大学 Transformer encoder hardware accelerator based on FPGA and calculation method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11429848B2 (en) * 2017-10-17 2022-08-30 Xilinx, Inc. Host-directed multi-layer neural network processing via per-layer work requests
US11182159B2 (en) * 2020-02-26 2021-11-23 Google Llc Vector reductions using shared scratchpad memory

Also Published As

Publication number Publication date
CN117787366A (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN110348574B (en) ZYNQ-based universal convolutional neural network acceleration structure and design method
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
JP2021532437A (en) Improving machine learning models to improve locality
US11493985B2 (en) Selectively controlling memory power for scheduled computations
KR20220038148A (en) Vector reductions using shared scratchpad memory
CN115516450A (en) Inference engine circuit architecture
CN117787366B (en) Hardware accelerator and scheduling method thereof
CN115687229A (en) AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card
CN110989970B (en) Double-precision floating-point matrix operation processor and method
CN117574970A (en) Inference acceleration method, system, terminal and medium for large-scale language model
US11886347B2 (en) Large-scale data processing computer architecture
US20220374348A1 (en) Hardware Acceleration
KR20220142059A (en) In-memory Decoding Cache and Its Management Scheme for Accelerating Deep Learning Batching Process
CN110766133A (en) Data processing method, device, equipment and storage medium in embedded equipment
CN112905954A (en) CNN model convolution operation accelerated calculation method using FPGA BRAM
CN113888390A (en) Feature map processing method and device, electronic equipment and computer readable medium
CN220983883U (en) Matrix computing device, chiplet apparatus and artificial intelligence accelerator device
Song et al. DNN training acceleration via exploring GPGPU friendly sparsity
CN113065647B (en) Calculation-storage communication system and communication method for accelerating neural network
US11922306B2 (en) Tensor controller architecture
CN112329910B (en) Deep convolution neural network compression method for structure pruning combined quantization
CN110738310B (en) Sparse neural network accelerator and implementation method thereof
CN113313251B (en) Depth separable convolution fusion method and system based on data flow architecture
US11836082B2 (en) Neural processing device and load/store method of neural processing device
KR102507461B1 (en) In-memory accelerator for layer-wise quantized neural networks and operation method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant