CN113469327B - Integrated circuit device for performing rotation number advance - Google Patents

Integrated circuit device for performing rotation number advance

Info

Publication number
CN113469327B
CN113469327B
Authority
CN
China
Prior art keywords
phase
computing
data
integrated circuit
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110703451.1A
Other languages
Chinese (zh)
Other versions
CN113469327A (en)
Inventor
Name withheld upon request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110703451.1A
Publication of CN113469327A
Application granted
Publication of CN113469327B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 - System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/7817 - Specially adapted for signal processing, e.g. Harvard architectures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Signal Processing (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Advance Control (AREA)

Abstract

The present invention relates to an integrated circuit device for performing rotation number advance. The computing device of the invention is included in an integrated circuit device that further comprises a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by the user. The integrated circuit device may further comprise a storage device, coupled to the computing device and to the other processing devices respectively, for storing data of the computing device and of the other processing devices.

Description

Integrated circuit device for performing rotation number advance
Technical Field
The present invention relates generally to the field of neural networks, and more particularly to an integrated circuit device that performs rotation number advance.
Background
In recent years, neural network algorithms, a branch of artificial intelligence algorithms, have shown good adaptability and superior performance in an increasing number of fields, such as image recognition, object detection, and natural language processing, and have become a research hotspot in both academia and industry.
However, neural network algorithms involve a huge amount of computation (on the order of tens of billions of operations), and model training requires a back-propagation process that consumes a large amount of hardware resources. Conventional general-purpose processors, built for generality, cannot meet the requirements of intelligent application scenarios, so high-performance, low-power neural network accelerators have become one of the research hotspots in the computer architecture field in recent years.
Because different accelerators have different architectures, they impose different constraints on data placement, tiling, movement, computation, and so on, and the corresponding programming system must take the underlying hardware implementation details into account when generating instructions. In particular, the convolution and fully connected operators in a neural network model occupy most of the computing resources, and insufficient hardware computing power easily reduces operating efficiency.
Therefore, a compilation optimization scheme for neural network models is urgently needed.
Disclosure of Invention
In order to at least partially solve the technical problems mentioned in the background, the present invention provides an integrated circuit device that performs rotation number advance.
In one aspect, the present invention discloses an integrated circuit device for performing rotation number advance in a neural network model, the neural network model comprising a first layer and a second layer, each of which includes a loading phase, a computing phase, and a storage phase, the computing phase of the second layer including a quantization operation and a low-precision arithmetic operation. The integrated circuit device includes a processing device and a computing device. The processing device advances the quantization operation to between the computing phase and the storage phase of the first layer; the computing device executes, in the first layer, the loading phase, the computing phase, the quantization operation, and the storage phase of the first layer, and executes, in the second layer, the loading phase, the low-precision arithmetic operation, and the storage phase of the second layer.
In another aspect, an integrated circuit device is disclosed that performs rotation number advance in an operator of a neural network model, the operator comprising a loading phase and a computing phase, each of which includes a preprocessing operation and a post-processing operation, the preprocessing operation of the computing phase including a quantization operation. The integrated circuit device includes: a processing device for advancing the quantization operation to the post-processing operation of the loading phase; and a computing device for performing the quantization operation in the post-processing operation of the loading phase.
The rotation number advance scheme provided by the present invention satisfies both the loop partitioning of computation logic according to the algorithm within a neural network operator and the loop partitioning of data slices at the computation level, saves inter-layer data movement, reduces the bandwidth occupied on the hardware, and improves performance.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. In the drawings, several embodiments of the invention are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
fig. 1 is a block diagram showing a board of an embodiment of the present invention;
fig. 2 is a block diagram showing an integrated circuit device of an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the internal architecture of a computing device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the internal architecture of a processor core according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating an execution tree of an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a parse-traversing execution tree in accordance with an embodiment of the present invention;
FIG. 7 is a flow chart illustrating quantization based on an execution tree according to an embodiment of the present invention; and
fig. 8 is a schematic diagram showing the convolution and full connection layer rotation number advance according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort fall within the scope of the invention.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification and drawings of the present invention are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present invention are taken to specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the invention. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in the cloud intelligence field; a notable characteristic of cloud intelligence applications is the large amount of input data, which places high requirements on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, large on-chip storage, and strong computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed may be transferred from the external device 103 to the chip 101 through the external interface device 102, and the calculation results of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may take different interface forms, such as a PCIe interface, depending on the application scenario.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computation; it may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it to an on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into an on-chip control cache of the computing device 201. Alternatively or additionally, the interface device 202 may also read data from the storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors, including but not limited to a central processing unit (CPU), a graphics processing unit (GPU), or another general-purpose and/or special-purpose processor such as a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and their number may be determined according to actual needs. As mentioned above, the computing device 201 of the present invention, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The DRAM 204 is used to store the data to be processed. It is a DDR memory, typically 16 GB or larger, that stores data for the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 201 is used to process input data such as computer vision, speech, natural language, and data mining, and adopts a multi-core hierarchical structure: it is a system on chip that includes a plurality of clusters, and each cluster in turn includes a plurality of processor cores. In other words, the computing device 201 is organized in a system-on-chip / cluster / processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 3, computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be a plurality of external memory controllers 301 (2 are shown by way of example) for accessing an external memory device, such as the DRAM 204 in FIG. 2, so as to read data from or write data to off-chip memory in response to access requests issued by the processor cores. The peripheral communication module 302 is configured to receive control signals from the processing device 203 through the interface device 202 and to start the computing device 201 to perform a task. The on-chip interconnect module 303 connects the external memory controllers 301, the peripheral communication module 302, and the plurality of clusters 305, and transfers data and control signals between these modules. The synchronization module 304 is a global synchronization barrier controller (GBC) that coordinates the working progress of each cluster to ensure synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201; 4 are shown by way of example, and as hardware evolves the computing device 201 of the present invention may also include 8, 16, 64, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
At the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU cores) 306 and a memory core (MEM core) 307.
Four processor cores 306 are illustratively shown; the present invention does not limit their number. The internal architecture of a processor core is shown in fig. 4. Each processor core 306 includes three major modules: a control module 41, an operation module 42, and a storage module 43.
The control module 41 is used for coordinating and controlling the operation of the operation module 42 and the storage module 43 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 411 and an Instruction Decode Unit (IDU) 412. The instruction fetching unit 411 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 412 decodes the fetched instruction and sends the decoded result to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation, etc.; the matrix operation unit 422 is responsible for the core computation of the deep learning algorithm, i.e. matrix multiplication and convolution.
The storage module 43 is used to store or move the relevant data and includes a neuron storage unit (NRAM) 431, a weight storage unit (WRAM) 432, an input/output direct memory access module (IODMA) 433, and a move direct memory access module (MVDMA) 434. NRAM 431 stores the feature maps to be computed by the processor core 306 and the intermediate results after computation; WRAM 432 stores the weights of the deep learning network; IODMA 433 controls access between NRAM 431/WRAM 432 and DRAM 204 over the broadcast bus 309; MVDMA 434 controls access between NRAM 431/WRAM 432 and SRAM 308.
Returning to FIG. 3, the memory core 307 is primarily used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 306 and handling communication between the cluster 305 and DRAM 204, between clusters 305, between processor cores 306, and so on. In other embodiments, the memory core 307 has scalar operation capability to perform scalar operations.
The memory core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 acts as a high-performance data transfer station: data reused by different processor cores 306 within the same cluster 305 need not be fetched from DRAM 204 by each processor core 306 individually, but is relayed among the processor cores 306 through the SRAM 308. The memory core 307 only needs to distribute the reused data from SRAM 308 to the processor cores 306 quickly, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, CDMA 310, and GDMA 311 are used, respectively, for communication among the processor cores 306, communication among the clusters 305, and data transfer between a cluster 305 and DRAM 204. Each is described below.
The broadcast bus 309 is used to perform high-speed communication between the processor cores 306 in the cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to the transmission of data from point to point (i.e., single processor core to single processor core), multicast is a communication scheme that transfers a piece of data from SRAM 308 to a specific number of processor cores 306, and broadcast is a communication scheme that transfers a piece of data from SRAM 308 to all processor cores 306, a special case of multicast.
CDMA 310 is used to control access to SRAM 308 between different clusters 305 within the same computing device 201.
The GDMA 311 cooperates with the external memory controller 301 to control access of the SRAM 308 of the cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308. From the foregoing, communication between DRAM 204 and NRAM 431 or WRAM 432 can be accomplished via 2 channels. The first channel connects DRAM 204 directly with NRAM 431 or WRAM 432 through IODMA 433; the second channel transfers data between DRAM 204 and SRAM 308 via GDMA 311, and then between SRAM 308 and NRAM 431 or WRAM 432 via MVDMA 434. Although the second channel seemingly requires more components and a longer data path, in practice, in some embodiments, its bandwidth is much greater than that of the first channel, so communication between DRAM 204 and NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present invention may select a data transfer channel according to their own hardware conditions.
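As a rough illustration of this channel choice, the following Python sketch compares the two channels by bandwidth; the function name, parameter names, and bandwidth figures are illustrative assumptions and are not taken from the patent.

    def pick_dram_nram_channel(iodma_bw_gbps: float, gdma_mvdma_bw_gbps: float) -> str:
        """Select the DRAM <-> NRAM/WRAM transfer path by comparing channel bandwidths.

        Channel 1: DRAM <-> NRAM/WRAM directly through IODMA 433.
        Channel 2: DRAM <-> SRAM via GDMA 311, then SRAM <-> NRAM/WRAM via MVDMA 434.
        """
        # Channel 2 involves more components, but in some embodiments its bandwidth
        # is much greater, making it the more efficient choice.
        return "channel 1 (IODMA)" if iodma_bw_gbps >= gdma_mvdma_bw_gbps else "channel 2 (GDMA + MVDMA)"

    print(pick_dram_nram_channel(iodma_bw_gbps=50.0, gdma_mvdma_bw_gbps=200.0))  # -> channel 2 (GDMA + MVDMA)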
In other embodiments, the functionality of GDMA 311 and the functionality of IODMA 433 may be integrated into the same component. For convenience of description, the present invention treats GDMA 311 and IODMA 433 as different components; as long as the functions and technical effects achieved are similar to those of the present invention, such variants fall within the scope of the invention. Further, the functions of GDMA 311, IODMA 433, CDMA 310, and MVDMA 434 may also be implemented by the same component.
The neural network framework to which this embodiment applies predefines a series of neural network layer or operator interfaces. A developer sets the layer parameters of each layer by calling the application programming interface (API) of the neural network framework and links the dependency relationships between data and layers to build the neural network model structure. After the model is trained, the model parameters and weight data are saved in a structured model file stored in DRAM 204. At deployment time, the processing device 203 calls the framework API, loads the trained network model, and executes the forward inference process of the network model on the computing device 201 with the actual input data to obtain the final output of the network. Since both the model structure and the parameters are known during forward inference, this embodiment uses this information to accelerate the computation.
This embodiment proposes a tree-like neural network operator programming method, called an execution tree. Fig. 5 shows a schematic diagram of the execution tree of this embodiment. The execution tree is a recursive data structure formed by connecting a root node 501 to subtrees; a subtree may contain any number of levels and any number of child nodes, and the child nodes are divided into non-leaf nodes and leaf nodes. Non-leaf nodes are located in the middle levels of a subtree; 2 non-leaf nodes 502 and 503 are illustratively shown in FIG. 5. Leaf nodes are located at the last level of a subtree; 2 leaf nodes 504 and 505 are illustratively shown in FIG. 5. The number of levels and the number of child nodes of a subtree are determined by the operator's requirements and are not limited by this embodiment.
The execution logic of the operations of the root node and the child node is the same, including: an initial operation, a preprocessing operation, a main body operation, a post-processing operation and an ending operation. The root node and child nodes also include a loop operation (not shown) to keep track of the number of times the node needs to execute repeatedly.
The initial operation is the first part executed within an execution tree of the same level; it is executed once, is not repeated with the loop, and consists of one-time initialization instructions, such as register initialization and activation operation configuration. The preprocessing operation is executed after the initial operation, is repeated at least once according to the loop operation, and is responsible for preparation before the body operation, for example, in a Scale operator, fetching the loop-segment data corresponding to the short-vector right operand. The body operation is executed after the preprocessing operation, is likewise repeated at least once according to the loop operation, and is responsible for the computation of the operator's main loop. If the node is a root node or a non-leaf node, the body operation only splits data and dispatches tasks to the child nodes of the next level; if the node is a leaf node, the body operation is the computational core of the execution tree, such as accumulation. The post-processing operation is repeated at least once according to the loop operation after the body operation and is responsible for post-processing after the computation, including moving reused data, shifting registers, and the like. The ending operation is executed only once to output the computation result.
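For concreteness, the following minimal Python sketch models such an execution-tree node with the five operations and the loop count described above; the class name, field names, and callable slots are illustrative assumptions rather than part of the embodiment.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class ExecTreeNode:
        """One node of an execution tree: five operations plus a loop count."""
        name: str
        loop_count: int = 1                              # recorded by the node's loop operation
        initial: Callable[[], None] = lambda: None       # one-time initialization (e.g. registers)
        preprocess: Callable[[], None] = lambda: None    # per-iteration preparation (e.g. operand fetch)
        body: Callable[[], None] = lambda: None          # non-leaf: split/dispatch; leaf: core compute
        postprocess: Callable[[], None] = lambda: None   # per-iteration cleanup (e.g. move reused data)
        ending: Callable[[], None] = lambda: None        # one-time output of the result
        children: List["ExecTreeNode"] = field(default_factory=list)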
The number of times and the timing with which the above operations are executed are determined by the processing device 203 based on a loop analysis of the instructions that the neural network operator executes on the computing device 201, rather than by functional limitations of the execution tree itself. When a loop is required, the looped part consists of the preprocessing operation, the body operation, and the post-processing operation.
In this embodiment, the execution of a neural network operator is generally divided into 3 phases: a loading phase, a computing phase, and a storage phase. Accordingly, the processing device 203 divides the execution tree of a neural network operator into three trees: a load tree, a compute tree, and a store tree, each consisting of a root node and its subtrees. In other words, every execution tree of an operator belongs to one of these 3 trees, and each tree has the structure of fig. 5.
When running the neural network model, the 3 execution trees of one operator implement all the instructions required for that neural network operator to run on the computing device 201. The computing device 201 first executes all instructions of the operations, in the corresponding execution order, of one leaf node of the load tree, then executes one leaf node of the compute tree, and finally executes one leaf node of the store tree, looping back and forth until all nodes have been executed.
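A minimal sketch of this leaf-level interleaving is given below, assuming each of the three trees supplies its leaf nodes in execution order and that run_leaf executes all instructions of one leaf node's operations; both names are assumptions for illustration.

    from itertools import zip_longest

    def run_operator(load_leaves, compute_leaves, store_leaves, run_leaf):
        """Alternate load, compute, and store leaf nodes until every leaf has run."""
        for load_leaf, compute_leaf, store_leaf in zip_longest(load_leaves,
                                                               compute_leaves,
                                                               store_leaves):
            if load_leaf is not None:
                run_leaf(load_leaf)      # bring one data slice on chip
            if compute_leaf is not None:
                run_leaf(compute_leaf)   # compute on the slice just loaded
            if store_leaf is not None:
                run_leaf(store_leaf)     # write that slice's results back off chip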
More specifically, in the compilation stage, when parsing and traversing an execution tree, the processing device 203 performs the initial and preprocessing operations of the root node in depth-first traversal order, then traverses all the nodes of the subtrees within the body operation, and finally performs the post-processing and ending operations of the root node. The preprocessing, body, and post-processing operations are executed repeatedly when looping.
To implement the loop operation, when repeated execution is required, a synchronization instruction is inserted after the post-processing operation of the node that needs to be repeated. At run time, when the computing device 201 encounters a synchronization instruction, it returns to the preprocessing operation of that node and executes the preprocessing, body, and post-processing operations again, until the loop count of the loop operation is satisfied, and then executes the ending operation of the node.
FIG. 6 shows a schematic diagram of parsing and traversing an execution tree in this embodiment. The simplified execution tree includes a root node 601, a first leaf node 602, and a second leaf node 603. Assume that the loop operation of the root node 601 records a loop count of 3, the loop operation of the first leaf node 602 records a loop count of 5, and the loop operation of the second leaf node 603 records a loop count of 1. When traversing the execution tree, the processing device 203 first executes the initial and preprocessing operations of the root node 601 and then, following the linking order of the subtrees, executes the initial, preprocessing, body, and post-processing operations of the first leaf node 602. At this point the synchronization instruction 604 is encountered, whose loop information records that 5 repetitions are required. Since the first leaf node 602 has so far been executed only once, its preprocessing, body, and post-processing operations are executed repeatedly until 5 loops have been completed, and finally the ending operation of the first leaf node 602 is executed. All operations of the subtree of the first leaf node 602 have now been traversed.
The processing device 203 then traverses the subtree of the second leaf node 603. Since the second leaf node 603 needs to loop only once, no synchronization instruction is inserted; the second leaf node 603 directly performs its initial, preprocessing, body, post-processing, and ending operations, and traversal returns to the root node 601.
Traversal of the root node 601 then continues, i.e., the post-processing operation of the root node 601 is performed. Since the root node 601 needs to be executed 3 times, the synchronization instruction 605 follows the post-processing operation of the root node 601, and its loop information records that 3 repetitions are required. Execution therefore returns to the preprocessing operation of the root node 601, all operation flows of all its subtrees are executed again, and the post-processing operation of the root node 601 is performed once more, until 3 loops have been completed; finally the ending operation of the root node 601 is executed, completing all operations in the tree of the root node 601.
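Using the ExecTreeNode sketch above, the traversal just described can be approximated as follows; placing the preprocessing, body, and post-processing operations inside the loop, with the initial and ending operations outside it, is an illustrative reading of FIGS. 5 and 6 rather than the embodiment's actual instruction sequence.

    def run_node(node: ExecTreeNode) -> None:
        """Traverse one execution-tree node; loop_count > 1 plays the role of the
        synchronization instruction inserted after the node's post-processing."""
        node.initial()
        for _ in range(node.loop_count):
            node.preprocess()
            node.body()                      # non-leaf: split data / dispatch; leaf: core compute
            for child in node.children:      # traverse subtrees in their linking order
                run_node(child)
            node.postprocess()
        node.ending()

    # Reproducing the FIG. 6 example: the root loops 3 times, the first leaf 5 times, the second leaf once.
    leaf1 = ExecTreeNode("first_leaf", loop_count=5)
    leaf2 = ExecTreeNode("second_leaf", loop_count=1)
    root = ExecTreeNode("root", loop_count=3, children=[leaf1, leaf2])
    run_node(root)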
As can be seen from the foregoing, when performing computation, the computing device 201 repeatedly traverses the execution trees according to the load-compute-store chain of their nodes; the example of fig. 6 shows the traversal order of a single execution tree.
When compiling an execution tree, the processing device 203 analyzes the specific algorithm of the neural network operator to obtain the loop levels of the computation, constructs the corresponding execution tree levels, and links the subtree relationships. It then derives the maximum amount of input (or output) data per computation loop from the occupation ratio or actual size of the on-chip data blocks, such as the input, output, and constant blocks (mainly in the NRAM 431 memory space); dividing the input data amount of a specific computation loop level by the maximum input data amount of a single loop yields the loop level of the data slices, whose subtree relationships are then linked. Within each subtree, memory is allocated and released in the appropriate operations according to the data amount of the actual loop. Finally, the corresponding instructions, such as loading off-chip data, moving reused data, computing, and storing output data, are filled into the appropriate operations of each subtree to complete the operator compilation.
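As a worked example of the data-slice division, the sketch below assumes the NRAM capacity is split among input, output, and constant blocks by a fixed occupation ratio; all sizes, names, and ratios are illustrative assumptions.

    import math

    def data_slice_loop_count(total_input_bytes: int, nram_bytes: int, input_ratio: float) -> int:
        """Loop level of the data slices for one computation loop level.

        The maximum input data per single loop is the share of NRAM reserved for
        the input block; dividing the level's total input data by that maximum
        gives the number of data-slice loops."""
        max_input_per_loop = int(nram_bytes * input_ratio)
        return math.ceil(total_input_bytes / max_input_per_loop)

    # Illustrative numbers: a 512 KB NRAM with half reserved for inputs, and
    # 1.5 MB of input data at this computation loop level.
    print(data_slice_loop_count(total_input_bytes=1536 * 1024,
                                nram_bytes=512 * 1024,
                                input_ratio=0.5))  # -> 6 data slices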
Since the convolutional layer and the fully connected layer account for most of the computation in the whole network, their computation needs to be optimized to improve full-network performance. This embodiment takes into account that the large number of weight parameters in the convolutional and fully connected layers contains some redundancy, and therefore adopts a low-precision computation mode under the condition that precision is completely lossless or the precision loss is within an allowable range. In other words, to save hardware resources, this embodiment uses quantization to convert high-precision floating-point numbers into low-precision fixed-point numbers to speed up the neural network operation. For example, if the matrix operation unit 422 supports only multiply-accumulate operations on 8-bit fixed-point numbers (INT8), both the input data and the weight data are converted to fixed-point numbers of the INT8 data type before the matrix operation, and the fixed-point numbers are then fed into the matrix operation unit 422 for computation.
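The embodiment does not prescribe a particular quantization formula; the sketch below uses a common symmetric per-tensor scaling scheme purely to illustrate converting floating-point inputs and weights to INT8 before the multiply-accumulate of the matrix operation unit.

    import numpy as np

    def quantize_to_int8(x: np.ndarray) -> tuple:
        """Symmetric per-tensor quantization: float array -> (INT8 array, scale)."""
        scale = float(np.max(np.abs(x))) / 127.0 or 1.0              # avoid a zero scale
        q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
        return q, scale

    # Usage: quantize inputs and weights, run the INT8 matrix multiply, rescale the result.
    x_q, sx = quantize_to_int8(np.random.randn(4, 8).astype(np.float16))
    w_q, sw = quantize_to_int8(np.random.randn(8, 3).astype(np.float16))
    y = (x_q.astype(np.int32) @ w_q.astype(np.int32)) * (sx * sw)    # approximate floating-point result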
For both layers, the weight data may be converted in advance using an offline preprocessing quantization method. The weight data are stored offline in the model and can be preprocessed at compile time: they are converted according to the corresponding data type and stored in a new network model file, the corresponding network model structure description file is modified, the operation data type of the corresponding neural network layer is marked, and the parameters required for quantization are added. At compile time, the instruction sequence for the computing device 201 is generated according to the quantized network model parameters. At run time, through the generated instruction sequence, the computing device 201 loads the required weight data onto WRAM 432 according to the bit width of the corresponding computation data type and performs the convolution and fully connected layer operations to achieve network acceleration.
However, the input data of the convolution and fully connected operators may be the output of other neural network layers in the network, whose data type cannot be converted in advance at compile time; a corresponding instruction sequence must be used on chip to complete the data type conversion. These instructions are all computation-class instructions and would normally be executed during the computation phase of the operator.
Fig. 7 shows a flowchart of quantization based on the execution tree of this embodiment. As previously described, the leaf nodes of the execution trees are executed in load-compute-store order; since the initial and ending operations are not critical for quantization, they are omitted from the description of FIG. 7. In step 701, the preprocessing operation of the load leaf node is performed; in step 702, the body operation of the load leaf node is performed, at which point the input data and weights are loaded into NRAM 431 and WRAM 432; in step 703, the post-processing operation of the load leaf node is performed; in step 704, the preprocessing operation of the compute leaf node is performed; in step 705, the body operation of the compute leaf node is performed; in step 706, the post-processing operation of the compute leaf node is performed. After the computation phase there follow the operations of the storage phase, which are likewise omitted from the description of FIG. 7.
The data type conversion instruction (quantization operation) vectorizes the contiguous data in NRAM 431 according to the corresponding data type and would ordinarily be implemented in the preprocessing operation of the leaf node of the compute tree of the corresponding operator, i.e., quantization would occur in step 704. In this embodiment, however, the data type conversion (quantization operation) is completed in advance while the data is being moved, becoming an input/output-class instruction: it is completed in the post-processing operation of the leaf node of the load tree, i.e., quantization is advanced to step 703. In this way, the amount of data moved in the ending operation of the loading phase and in the preprocessing operation of the computation phase is reduced.
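The effect of the advance on the instruction schedule can be sketched as follows, representing each leaf node's operations as simple lists; the operation names are illustrative assumptions.

    def advance_quantization(load_leaf_ops: dict, compute_leaf_ops: dict) -> None:
        """Move the quantization op from the compute leaf's preprocessing (step 704)
        into the load leaf's post-processing (step 703), turning it into an
        input/output-class instruction issued while the data is being moved."""
        if "quantize" in compute_leaf_ops["preprocess"]:
            compute_leaf_ops["preprocess"].remove("quantize")
            load_leaf_ops["postprocess"].append("quantize")

    load_leaf = {"preprocess": [], "body": ["load_inputs", "load_weights"], "postprocess": []}
    compute_leaf = {"preprocess": ["quantize"], "body": ["matmul_int8"], "postprocess": []}
    advance_quantization(load_leaf, compute_leaf)
    # load_leaf["postprocess"] is now ["quantize"]; the computation phase starts
    # directly with the low-precision body operation.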
Furthermore, this embodiment also provides a compilation optimization method that schedules data type conversion in advance for the data type conversion operations in operators such as convolution and fully connected layers.
In addition to the rotation number advance shown in fig. 7, in full-network operation the matrix operations in operator layers such as convolution and fully connected layers involve low-precision computation whose input data type bit width (INT8) is smaller than that of the other operator layers, which use high-precision computation (FP16). When the source of an input data block is the output of another layer, the processing device 203 schedules the data type conversion operation into the computation phase of the corresponding preceding operator layer. Fig. 8 shows a schematic diagram of the convolution and fully connected layer rotation number advance of this embodiment, in which 2 layers are shown by way of example: a first layer and a second layer. Before the rotation number advance is applied at compile time, the first layer performs the operations of loading 801, computing 802, and storing 803; the second layer is a convolution or fully connected layer and performs loading 804, quantization 805, convolution/fully connected 806, and storing 807, where quantization 805 and convolution/fully connected 806 constitute the computation phase of the second layer. Since convolution/fully connected 806 accepts only fixed-point numbers, the floating-point numbers are converted into fixed-point numbers in quantization 805.
When performing the rotation number advance at compile time, the processing device 203 advances the quantization 805 to between the computation 802 and the storage 803 of the first layer. That is, after the rotation number advance, the computing device 201 performs loading 801 → computation 802 → quantization 805 → storage 803 in the first layer, and loading 804 → convolution/fully connected 806 → storage 807 in the second layer. In this way, the amount of data in the output data operation 803 of the first layer and the input data operation 804 of the second layer is only half of the original, which saves bandwidth and improves performance.
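The halving follows directly from the data-type widths: after the advance, the first layer stores and the second layer loads INT8 elements (1 byte each) instead of FP16 elements (2 bytes each). A quick check with an assumed activation tensor size:

    elements = 1 * 256 * 56 * 56          # assumed activation tensor passed between the two layers
    bytes_fp16 = elements * 2             # traffic of store 803 / load 804 before the advance (FP16)
    bytes_int8 = elements * 1             # traffic of store 803 / load 804 after the advance (INT8)
    print(bytes_int8 / bytes_fp16)        # -> 0.5, i.e. half the original data movement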
The rotation number advance scheme provided by the present invention satisfies both the loop partitioning of computation logic according to the algorithm within a neural network operator and the loop partitioning of data slices at the computation level, saves inter-layer data movement, reduces the bandwidth occupied on the hardware, and improves performance.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present invention may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, etc. Furthermore, the electronic equipment or the electronic device can be used in cloud end, edge end, terminal and other application scenes related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the high-power electronic device or apparatus according to the present invention may be applied to a cloud device (e.g., a cloud server), and the low-power electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of simplicity, the present invention represents some methods and embodiments thereof as a series of acts and combinations thereof, but it will be understood by those skilled in the art that the aspects of the present invention are not limited by the order of acts described. Thus, those skilled in the art will appreciate, in light of the present disclosure or teachings, that certain steps thereof may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described herein may be considered as alternative embodiments, i.e., wherein the acts or modules involved are not necessarily required for the implementation of some or all aspects of the present invention. In addition, the description of some embodiments of the present invention is also focused on according to the different schemes. In view of this, those skilled in the art will appreciate that portions of one embodiment of the invention that are not described in detail may be referred to in connection with other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, those skilled in the art will appreciate that several embodiments of the disclosed invention may be implemented in other ways not disclosed by this embodiment. For example, in terms of each unit in the foregoing embodiment of the electronic device or apparatus, this embodiment splits the unit in consideration of the logic function, and another splitting manner may be implemented in practice. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separated, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solution according to the embodiments of the present invention. In addition, in some scenarios, multiple units in embodiments of the invention may be integrated into one unit or each unit may physically reside separately.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e. as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices (e.g., computing devices or other processing devices) described in this embodiment may be implemented by appropriate hardware processors, such as central processing units, GPU, FPGA, DSP, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which may be, for example, variable resistance memory (Resistive Random Access Memory, RRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory, HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the invention. The above description of the embodiments is only intended to help in understanding the method of the invention and its core ideas. Meanwhile, those skilled in the art may, based on the ideas of the invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (9)

1. An integrated circuit device that performs rotation number advance in a neural network model, the neural network model comprising a first layer and a second layer, the first layer and the second layer each comprising a loading phase, a computing phase, and a storage phase that are performed sequentially, the computing phase of the second layer comprising a quantization operation and a low-precision arithmetic operation, the integrated circuit device comprising:
a processing device for advancing the quantization operation to between the computing phase and the storage phase of the first layer; and
a computing device for processing input data of computer vision, speech, or natural language, executing, in the first layer, the loading phase, the computing phase, the quantization operation, and the storage phase of the first layer, and executing, in the second layer, the loading phase, the low-precision arithmetic operation, and the storage phase of the second layer.
2. The integrated circuit device according to claim 1, wherein the computing device comprises a neuron storage unit, the loading phase, the computing phase, and the storage phase each have the structure of an execution tree, and, each time the execution tree is compiled, the processing device calculates the maximum amount of data input and output within a loop from the occupation ratio or the actual size of an input data block, an output data block, and a constant data block in the neuron storage unit.
3. The integrated circuit device according to claim 2, wherein the processing device divides the amount of input data of a particular computation loop level by the maximum amount of data to derive a loop level of data slices, so as to link subtree relationships in the execution tree.
4. The integrated circuit device according to claim 1, wherein the computing device comprises a vector operation unit to vectorize the quantization operation.
5. The integrated circuit device according to claim 1, wherein the low precision arithmetic operation is one of convolution and full-join operation.
6. An integrated circuit device that performs rotation number advance in an operator of a neural network model, the operator comprising a loading phase and a computing phase, the loading phase and the computing phase each comprising a preprocessing operation and a post-processing operation, the preprocessing operation of the computing phase comprising a quantization operation, the integrated circuit device comprising:
a processing device for advancing the quantization operation to the post-processing operation of the loading phase, so that quantization is completed in advance when the data is moved; and
a computing device for processing input data of computer vision, speech, or natural language, and for performing the quantization operation in the post-processing operation of the loading phase.
7. The integrated circuit device according to claim 6, wherein the computing device comprises a neuron storage unit, the loading phase and the computing phase each have the structure of an execution tree, and, when compiling the execution tree, the processing device calculates the maximum amount of data input and output within a loop from the occupation ratio or the actual size of an input data block, an output data block, and a constant data block in the neuron storage unit.
8. The integrated circuit device according to claim 7, wherein the processing device divides the amount of input data of a particular computation loop level by the maximum amount of data to derive a loop level of data slices, so as to link subtree relationships in the execution tree.
9. The integrated circuit device according to claim 6, wherein the computing device comprises a vector operation unit to vectorize the quantization operation.
CN202110703451.1A 2021-06-24 2021-06-24 Integrated circuit device for performing rotation number advance Active CN113469327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703451.1A CN113469327B (en) 2021-06-24 2021-06-24 Integrated circuit device for performing rotation number advance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110703451.1A CN113469327B (en) 2021-06-24 2021-06-24 Integrated circuit device for performing rotation number advance

Publications (2)

Publication Number Publication Date
CN113469327A CN113469327A (en) 2021-10-01
CN113469327B (en) 2024-04-05

Family

ID=77872677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703451.1A Active CN113469327B (en) 2021-06-24 2021-06-24 Integrated circuit device for performing rotation number advance

Country Status (1)

Country Link
CN (1) CN113469327B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110889503A (en) * 2019-11-26 2020-03-17 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111144564A (en) * 2019-12-25 2020-05-12 上海寒武纪信息科技有限公司 Device for training neural network and integrated circuit board card thereof
CN111832719A (en) * 2020-07-28 2020-10-27 电子科技大学 Fixed point quantization convolution neural network accelerator calculation circuit
WO2020220935A1 (en) * 2019-04-27 2020-11-05 中科寒武纪科技股份有限公司 Operation apparatus
CN111898865A (en) * 2020-07-02 2020-11-06 常州市第二人民医院 Smart campus data dynamic management method
CN112799726A (en) * 2021-01-26 2021-05-14 上海寒武纪信息科技有限公司 Data processing device, method and related product
CN113015984A (en) * 2018-01-08 2021-06-22 达莉娅·弗罗洛瓦 Error correction in convolutional neural networks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200226444A1 (en) * 2019-01-15 2020-07-16 BigStream Solutions, Inc. Systems, apparatus, methods, and architecture for precision heterogeneity in accelerating neural networks for inference and training
CN111488963B (en) * 2019-01-28 2023-11-24 中科寒武纪科技股份有限公司 Neural network computing device and method
US20200364552A1 (en) * 2019-05-13 2020-11-19 Baidu Usa Llc Quantization method of improving the model inference accuracy

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113015984A (en) * 2018-01-08 2021-06-22 达莉娅·弗罗洛瓦 Error correction in convolutional neural networks
WO2020220935A1 (en) * 2019-04-27 2020-11-05 中科寒武纪科技股份有限公司 Operation apparatus
CN110889503A (en) * 2019-11-26 2020-03-17 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111144564A (en) * 2019-12-25 2020-05-12 上海寒武纪信息科技有限公司 Device for training neural network and integrated circuit board card thereof
CN111898865A (en) * 2020-07-02 2020-11-06 常州市第二人民医院 Smart campus data dynamic management method
CN111832719A (en) * 2020-07-28 2020-10-27 电子科技大学 Fixed point quantization convolution neural network accelerator calculation circuit
CN112799726A (en) * 2021-01-26 2021-05-14 上海寒武纪信息科技有限公司 Data processing device, method and related product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Quantization and compression methods for convolutional neural networks oriented to "edge" applications; Cai Ruichu; Zhong Chunrong; Yu Yang; Chen Bingfeng; Lu Ye; Chen Yao; Journal of Computer Applications (09); full text *

Also Published As

Publication number Publication date
CN113469327A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN112799726B (en) Data processing device, method and related product
CN113220630B (en) Reconfigurable array optimization method and automatic optimization method for hardware accelerator
CN114035916A (en) Method for compiling and scheduling calculation graph and related product
CN116185942A (en) Data processing method, device, storage medium and electronic equipment
CN113469326B (en) Integrated circuit device and board for executing pruning optimization in neural network model
CN113469327B (en) Integrated circuit device for performing rotation number advance
CN113469328B (en) Device, board, method and readable storage medium for executing revolution passing
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN116090519A (en) Compiling method of convolution operator and related product
CN114595813A (en) Heterogeneous acceleration processor and data calculation method
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN113742266B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
CN113791996B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113792867B (en) Arithmetic circuit, chip and board card
CN110647984B (en) Chip, integrated processing device and operation method thereof
WO2022063183A1 (en) Device and method for neural network computing, and board and readable storage medium
CN114298292A (en) Equipment and method for acquiring operator data and performing offline model operation
CN116402091A (en) Hybrid engine intelligent computing method and device for artificial intelligent chip
CN116400926A (en) Scalar engine processing method and device oriented to artificial intelligent chip
CN117667198A (en) Instruction synchronous control method, synchronous controller, processor, chip and board card
CN114692811A (en) Device and board card for executing Winograd convolution
CN117931430A (en) Method for realizing DFT performance optimization by processing device and data processing system
CN115543328A (en) Compiling method for converting neural network model running on artificial intelligence chip and related product thereof
CN114692846A (en) Data processing device, data processing method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant