CN113469333A - Artificial intelligence processor, method and related product for executing neural network model - Google Patents

Artificial intelligence processor, method and related product for executing neural network model

Info

Publication number
CN113469333A
CN113469333A (application CN202110721919.XA)
Authority
CN
China
Prior art keywords
vector
index
result
artificial intelligence
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110721919.XA
Other languages
Chinese (zh)
Other versions
CN113469333B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110721919.XA priority Critical patent/CN113469333B/en
Publication of CN113469333A publication Critical patent/CN113469333A/en
Application granted granted Critical
Publication of CN113469333B publication Critical patent/CN113469333B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817Specially adapted for signal processing, e.g. Harvard architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Signal Processing (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Image Analysis (AREA)

Abstract

An artificial intelligence processor, a processing method, and related products for executing a neural network model are disclosed. The artificial intelligence processor may be implemented as a computing device included in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing their data. The disclosed scheme provides a fused processing scheme for the upper pooling layer and the depth convolution layer in a neural network model, which can effectively reduce off-chip memory access bandwidth, relieve memory access pressure, and improve the processing efficiency of the machine.

Description

Artificial intelligence processor, method and related product for executing neural network model
Technical Field
The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to artificial intelligence processors, chips, boards, and methods of using artificial intelligence processors to execute neural network models.
Background
Deep learning has become an important branch of machine learning and has greatly promoted the development of Artificial Intelligence (AI). Its core technology, the Deep Neural Network (DNN), has been widely used in many industries.
To improve the expressive power of neural network models, DNNs keep evolving towards deeper or wider network scales. However, the increase in network depth also brings problems such as a large data I/O volume and insufficient memory access bandwidth. Therefore, to fully exploit the advantages of neural network models, the memory access bottleneck of the artificial intelligence processor needs to be addressed.
Disclosure of Invention
To at least partially solve one or more technical problems mentioned in the background, the present disclosure provides an artificial intelligence processor, a chip, a board, and a method for executing a neural network model using the artificial intelligence processor.
In a first aspect, the present disclosure discloses an artificial intelligence processor executing a neural network model comprising control circuitry, operational circuitry, and on-chip storage circuitry, the neural network model comprising an upper pooling layer and a depth convolution layer, wherein: the control circuit is used for controlling the on-chip storage circuit to load the input data of the upper pooling layer and the convolution kernel of the depth convolution layer from the off-chip storage circuit to the on-chip storage circuit; the operation circuit is used for executing the fusion operation of the upper pooling layer and the depth convolution layer aiming at the input data and the convolution kernel, and writing the fusion operation result back to the on-chip storage circuit; and the control circuit is further used for controlling the output of the fusion operation result from the on-chip storage circuit to the off-chip storage circuit.
In a second aspect, the present disclosure provides a chip comprising the artificial intelligence processor of any of the embodiments of the first aspect described above.
In a third aspect, the present disclosure provides a board card comprising the chip of any of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides a method of executing a neural network model using the artificial intelligence processor of any of the embodiments of the first aspect described above.
Through the artificial intelligence processor for executing the neural network model, the method for executing the neural network model by using the artificial intelligence processor, the chip and the board card, the disclosed embodiment provides a fusion optimization processing scheme of an upper pooling layer and a deep convolution layer in the neural network model, which can effectively reduce off-chip memory access bandwidth, relieve memory access pressure and improve the processing efficiency of a machine.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
fig. 1 shows a block diagram of a board card of an embodiment of the present disclosure;
FIG. 2 shows a block diagram of a combined processing device of an embodiment of the disclosure;
FIG. 3 illustrates an internal structural diagram of a processor core of a single or multi-core computing device of an embodiment of the present disclosure;
FIG. 4 shows an exemplary diagram of a neural network model to which embodiments of the present disclosure may be applied;
FIG. 5 shows a schematic of the operation of the upper pooling layer;
FIG. 6 shows an operational schematic of a depth convolution layer;
FIG. 7 illustrates an exemplary operational procedure of the upper pooling layer before fusion;
FIG. 8 illustrates an exemplary operational procedure of the depth convolution layer before fusion;
FIG. 9 illustrates a fusion operation process of an upper pooling layer and a depth convolution layer of an embodiment of the present disclosure;
FIG. 10 illustrates a schematic diagram of an index mapping relationship of an embodiment of the present disclosure;
FIG. 11 illustrates an exemplary block diagram of an artificial intelligence processor of an embodiment of the disclosure; and
FIG. 12 illustrates an exemplary flow chart of a method of executing a neural network model by an artificial intelligence processor of an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in fig. 1, the board card 10 includes a chip 101, which is a System on Chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial intelligence arithmetic unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of fields such as computer vision, speech, natural language processing, and data mining in complex scenes. Deep learning technology is widely applied in the field of cloud intelligence in particular, and one remarkable characteristic of cloud intelligence applications is the large input data size, which places high demands on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a storage device 104 for storing data, which includes one or more memory units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus and transfers data with them. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. For this purpose, in one application scenario, the control device 106 may include a micro controller unit (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or another general purpose and/or special purpose processor, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and the number of processors may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered alone, may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM (DDR memory), is typically 16 GB or larger in size, and stores data of the computing device 201 and/or the processing device 203.
Fig. 3 shows an internal structure diagram of a processor core when the computing device 201 is a single-core or multi-core device. The computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, and the like, and the computing device 301 includes three major modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM)331, a parameter storage unit (weight RAM, WRAM)332, and a Direct Memory Access (DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; WRAM 332 is used to store the convolution kernel of the deep learning network, i.e. the weight; the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
The embodiment of the disclosure provides an artificial intelligence processor executing a neural network model based on the foregoing hardware environment, and supports the fusion optimization processing of the upper pooling layer and the deep convolutional layer in the neural network model.
FIG. 4 shows an exemplary diagram of a neural network model to which embodiments of the present disclosure may be applied.
As shown, the neural network model 400 includes two parts: a convolutional network portion 410 and a deconvolution network portion 420. The convolutional network part acts as a feature extractor, converting the input picture into a multi-dimensional feature representation. The deconvolution network part is equivalent to a shape generator, and the features extracted by the convolution network are used for generating a target segmentation result.
Convolutional network portion 410 includes a plurality of layers, which may include alternating convolutional layers 411, pooling layers 412, and so on. A convolutional layer performs feature extraction by applying several filters to the input data. A pooling layer is mainly used to scale down the input data and reduce overfitting. There are many ways to implement pooling, the most common being max pooling, average pooling, and stochastic pooling. A linear layer may be present between a convolutional layer and a pooling layer.
The deconvolution network portion 420 is a mirror image of the convolution network portion and may include spaced-apart upper pooling layers 421, deconvolution layers 422, and so on. In the neural network, since the input image is characterized by the convolutional layer, the output size tends to be small, and sometimes the image needs to be restored to the original size for further calculation, such as semantic segmentation of the image, which can be realized by the upper pooling layer 421. The upper pooling layer 421 may be paired with the pooling layer 412 in the convolutional network portion. The deconvolution layer is the inverse of the convolution layer and is used to restore the size of the picture before convolution. The operation of the deconvolution layer may be a deep convolution operation, and is therefore sometimes referred to as a deep convolution layer. A linear layer may exist between the upper pooling layer and the deconvolution layer.
It is to be understood that the above description about the neural network model is merely exemplary, and the structure of the neural network model is not limited to the structure shown in the drawings, and those skilled in the art can make modifications to the structure shown in the drawings as necessary. The operation of the relevant layer is described in detail below.
FIG. 5 shows a schematic of the operation of a pooling layer and its paired upper pooling layer. In this example, max pooling is employed. Note that pooling is typically not performed in the channel direction (also referred to as the depth direction), so the figure shows pooling and upper pooling on only a single channel.
As shown, the left side is an operation diagram of maximum pooling, where the size of the input image 510 is 4 × 4, the pooling window size is 2 × 2, and the pooling step size is (2, 2). That is, in steps of 2 in both XY directions, the maximum value is selected in each 2 × 2 pooling window as the value of the window, as selected by the dark squares in the figure. The maximum value selected in each pooling window constitutes the output image 520, which is 2 x 2 in size. At the same time, the position of the maximum in the input image, i.e. the pooling index 530, needs to be recorded.
The right side of the figure shows the operation of the paired upper pooling. The input information of the upper pooling layer comprises a 2 × 2 input image 540 and the pooling index 530. The upper pooling layer uses the pooling indices 530 to restore the elements of the input image 540 to the positions indicated by the indices in an output of the pre-pooling size 4 × 4, while the remaining positions are zero-filled, resulting in the upper pooled output image 550.
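For illustration only, the paired pooling and upper pooling operations described above might be sketched in software as follows (NumPy-style Python; the helper names are hypothetical and do not represent the disclosed hardware implementation):

```python
import numpy as np

def max_pool_with_index(x, k=2, s=2):
    """Max pooling that also records the argmax (pooling index) of each window,
    as on the left of FIG. 5; single channel, hypothetical helper."""
    H, W = x.shape
    out = np.zeros((H // s, W // s), dtype=x.dtype)
    idx = np.zeros((H // s, W // s), dtype=np.int64)   # flat index inside each window
    for i in range(H // s):
        for j in range(W // s):
            win = x[i * s:i * s + k, j * s:j * s + k]
            idx[i, j] = np.argmax(win)                 # 0..3 inside a 2x2 window
            out[i, j] = win.flat[idx[i, j]]
    return out, idx

def upper_pool(y, idx, k=2, s=2):
    """Upper pooling (right of FIG. 5): restore each element of y to the position
    recorded in its pooling index and zero-fill the remaining positions."""
    H, W = y.shape
    out = np.zeros((H * s, W * s), dtype=y.dtype)
    for i in range(H):
        for j in range(W):
            di, dj = divmod(int(idx[i, j]), k)         # row/col offset inside the window
            out[i * s + di, j * s + dj] = y[i, j]
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
pooled, indices = max_pool_with_index(x)               # 2x2 values and 2x2 indices
restored = upper_pool(pooled, indices)                 # 4x4, zeros except at recorded positions
```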
FIG. 6 shows an operational schematic of a depth convolution layer. Depth convolution differs from standard convolution in that no accumulation is performed along the depth direction, i.e., across input channels. In standard convolution, each convolution kernel is computed against and accumulated over all layers (input channels) of the input feature map, so the number of input channels of each convolution kernel equals that of the input feature map. In depth convolution, each convolution kernel is a single channel, one convolution kernel is responsible for one channel, and each channel is convolved by only one convolution kernel.
As shown, the dimension of the input feature map 610 is 12 × 12 × 3, i.e., it includes three channels, each channel being a 12 × 12 image. In this depth convolution, 3 convolution kernels 620 are used, each of which is a single channel, for example of size 5 × 5 × 1. Each convolution kernel convolves only one channel of the input feature map 610; each such convolution yields an output of size 8 × 8 × 1, and these outputs are then stacked together, resulting in an output feature map 630 of size 8 × 8 × 3. As can be seen from the figure, the depth (number of channels) of the output feature map remains the same as that of the input feature map.
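Likewise, a minimal software sketch of the depth convolution of FIG. 6 (stride 1, no padding; array names are illustrative only) might read:

```python
import numpy as np

def depth_conv(x, w):
    """Depth convolution: each single-channel kernel convolves only its own input
    channel and nothing is accumulated across channels (FIG. 6).
    x: (H, W, C) input feature map, w: (Kh, Kw, C) kernels, stride 1, no padding."""
    H, W, C = x.shape
    Kh, Kw, _ = w.shape
    out = np.zeros((H - Kh + 1, W - Kw + 1, C), dtype=x.dtype)
    for c in range(C):                                  # one kernel per channel
        for i in range(H - Kh + 1):
            for j in range(W - Kw + 1):
                out[i, j, c] = np.sum(x[i:i + Kh, j:j + Kw, c] * w[:, :, c])
    return out

x = np.random.rand(12, 12, 3)
w = np.random.rand(5, 5, 3)
y = depth_conv(x, w)        # shape (8, 8, 3): the channel count is preserved
```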
The principles of operation of the upper pooling layer and the depth convolution layer are described above. In a conventional neural network model, the operations of the network layers are relatively independent and executed sequentially: after the operation of one layer is completed, the operation result is stored back to the off-chip storage circuit, and when the next layer executes its operation, the data is read from the off-chip storage circuit again before the corresponding operation is performed.
FIG. 7 illustrates an exemplary operational procedure of the upper pooling layer before fusion. In the example of fig. 7, the parameters of the upper pooling are assumed to be as follows: the pooling window size is 2 × 2, the pooling step size equals the pooling window size (also 2 × 2), and the upper pooling result is padded with (1,0,1,0), i.e., one row is padded on top, one column on the left, and none on the bottom or right. In addition, the pooling indices are consistent across the various pooling windows, all pointing to the upper-left position of each window, as shown in the figure.
As shown, the input feature map 710 is 2 × 2 in size, the pooling windows are also 2 × 2, the corresponding pooling index 720 is 4 × 4, and the index in each window is fixed to 0 (i.e., the upper-left corner of the window). After the upper pooling operation is performed on the input feature map 710 according to the pooling index 720, an output feature map 730 of dimension size 5 × 5 is obtained, in which the gray part is the padded region; the data in the input feature map are placed at the positions indicated by the corresponding pooling indices in the output feature map, and the remaining positions are filled with zeros.
When the upper pooling layer is executed without fusion, the inputs of the upper pooling layer (e.g., input feature map 710 and pooling index 720) are first read from the off-chip storage circuit (e.g., storage device 204 of fig. 2, a DRAM) to the on-chip storage circuit; for example, NRAM 331 of fig. 3 stores the input feature map 710 and WRAM 332 of fig. 3 stores the pooling index 720.
Subsequently, an arithmetic circuit (e.g., vector arithmetic unit 321 or matrix arithmetic unit 322 in fig. 3) fetches the data from NRAM 331 and WRAM 332 to complete the operation, and writes the upper pooling result (e.g., output feature map 730) back to NRAM 331.
Finally, the processor writes the upper pooling result from NRAM 331 back to the off-chip storage circuit (e.g., storage device 204 of fig. 2, a DRAM) as the input of the next neural network layer.
FIG. 8 illustrates an exemplary operational procedure of the depth convolution layer before fusion. In the example of fig. 8, the depth convolution layer is the layer following the upper pooling layer of fig. 7, and it performs depth convolution on the upper pooled feature map. The parameters of the depth convolution are assumed to be as follows: the convolution kernel size is 2 × 2, the convolution step size is 1 × 1, and the padding is (0,0,0,0).
As shown, the input feature map 810, i.e., the output feature map 730 of fig. 7, has a size of 5 × 5 and the convolution kernel 820 is 2 × 2. After performing depth convolution on the input feature map 810 using the convolution kernel 820, an output feature map 830 of dimension size 4 × 4 is obtained.
When the depth convolution layer is executed without fusion, the output of the previous layer (e.g., the upper pooling layer of fig. 7) is first read from the off-chip storage circuit (e.g., storage device 204 of fig. 2, a DRAM) as the input of the depth convolution layer (e.g., input feature map 810) into the on-chip storage circuit; for example, NRAM 331 of fig. 3 stores the input feature map 810. In addition, the convolution kernel 820 of the depth convolution layer is also read from the off-chip storage circuit into WRAM 332 of fig. 3.
Then, an arithmetic circuit (e.g., vector arithmetic unit 321 or matrix arithmetic unit 322 in fig. 3) fetches from NRAM 331 and WRAM 332 to complete the operation, and writes back the result of the deep convolution (e.g., output feature map 830) to NRAM 331.
Finally, the processor again writes the result of the deep convolution from NRAM 331 back to off-chip storage circuitry (e.g., storage 204 of fig. 2, DRAM) as input to the next neural network layer.
As can be seen from the operational procedure described in connection with fig. 7 and 8, sequential execution of the network layers unnecessarily transfers the same block of data (in this example, the output feature map 730 of the upper pooling layer, which is the input feature map 810 of the depth convolution layer) from the on-chip storage circuit to the off-chip storage circuit and then again from the off-chip storage circuit back to the on-chip storage circuit, thereby increasing the data access pressure.
In view of this, the present disclosure provides a fusion scheme of an upper pooling layer and a deep convolutional layer, which fuses operations of two network layers together, thereby avoiding the back-and-forth transport of intermediate data between off-chip and on-chip storage circuits, and reducing data access pressure. Further, in some embodiments, the operation process is adjusted and effectively fused by analyzing the operation characteristics of the two network layers, so that the operation performance can be further improved.
As can be seen from the operation processes of the upper pooling layer and the depth convolution layer described above, since zeros are padded into all positions other than those indicated by the pooling indices during upper pooling, the final result of the depth convolution operation depends only on the operation results of the input feature map of the upper pooling layer (i.e., the non-zero input data elements). Therefore, the multiply-add operation with the convolution kernel can be performed directly on the input feature map of the upper pooling layer. In addition, when only one non-zero data element exists in the receptive field corresponding to each convolution output point, the computation results do not need to be accumulated. Further, as can be seen from the output feature map 830 in fig. 8, the order of the multiplication results in the output feature map is related to the pooling index, and the final result can be obtained simply by rearranging the multiplication results according to the pooling index.
Accordingly, in the fusion operation scheme of the embodiments of the present disclosure, the fusion operation of the pooling-up and the depth convolution may include two steps: calculating the result of the operation (e.g., product result) of the convolution kernel and the pooled input data; and rearranging the operation result according to the pooling index to obtain a final fusion operation result.
FIG. 9 illustrates a fusion operation process of an upper pooling layer and a depth convolution layer of an embodiment of the present disclosure.
As shown in fig. 9, the input data of the upper pooling layer and the convolution kernel of the depth convolution layer may be first loaded from the off-chip storage circuit to the on-chip storage circuit. The size of the input feature map 910 of the upper pooling layer is 2 × 2 × C, where C represents the channel dimension, also referred to as the depth direction. The size of the convolution kernel 920 of the depth convolution layer is also 2 × 2 × C.
During operation, the computation between the convolution kernel and the input data of the upper pooling layer may be divided into multiple rounds, so that each weight vector in the depth direction of the convolution kernel 920 is multiplied element-wise with the input vectors in the depth direction of the input feature map 910 to obtain multiple result vectors in the depth direction.
For example, a weight vector along the channel/depth direction at a fixed height (H) and width (W) position may first be taken, e.g., the weight vector a at H = W = 1 represented by the dark portion of 920. The weight vector a is multiplied element-wise with each input vector in the depth direction of the input feature map 910, yielding the 4 result vectors shown at 930. Similarly, in the next round the weight vector b may be multiplied with each input vector to obtain the 4 result vectors at 940. The result vectors 950 and 960 corresponding to the weight vectors c and d are obtained in the same way.
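A minimal sketch of this first step of the fusion operation, assuming 2 × 2 × C tensors as in FIG. 9 (all array and variable names are illustrative), might read:

```python
import numpy as np

C = 8                              # assumed channel count
inp = np.random.rand(2, 2, C)      # input feature map 910 of the upper pooling layer
ker = np.random.rand(2, 2, C)      # convolution kernel 920 of the depth convolution layer

# (weight position, input position) -> C-long result vector in the depth direction
result_vectors = {}
for kx in range(2):                # traverse the weight vectors a, b, c, d
    for ky in range(2):
        for ix in range(2):        # traverse the input vectors 1, 2, 3, 4
            for iy in range(2):
                result_vectors[(kx, ky), (ix, iy)] = ker[kx, ky] * inp[ix, iy]
# 4 weight vectors x 4 input vectors = the 16 result vectors 930/940/950/960 of FIG. 9
```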
The result vectors are then rearranged, as indicated by the pooling indices of the upper pooling layer, to obtain the final fusion result 970.
As shown, for the 1st weight vector a in fig. 9: a × [1, 2, 3, 4] = [1×a, 2×a, 3×a, 4×a], which can be understood as the product results when 1, 2, 3, and 4 are located at the 1st position of the convolution kernel, respectively. Similarly, for the operation of the 2nd weight vector b in fig. 9: b × [1, 2, 3, 4] = [1×b, 2×b, 3×b, 4×b], which can be understood as the product results when 1, 2, 3, and 4 are located at the 2nd position of the convolution kernel, respectively. Thus, the final position of each result vector may be determined in combination with the pooling index.
Specifically, in some embodiments, the indices of the respective input vectors are first determined based on the pooling indices; the index of each corresponding result vector is then determined from the index of the input vector according to an index mapping relationship. Finally, the result vectors can be rearranged in the order of their indices to obtain the fusion operation result.
There may be multiple representations of the pooled index, which may be one-dimensional or two-dimensional, and the present disclosure is not limited in this respect. The indexes of different dimensions may be transformed into each other, for example, according to a predetermined traversal rule, a two-dimensional index is transformed into a one-dimensional index, and vice versa. Taking the pooling index 720 in fig. 7 as an example, for a total of 16 positions, the position indices with non-zero data (dark squares) are (1,1), (1,3), (3,1) and (3, 3), respectively, which may also be expressed as one-dimensional indices, 0, 2, 8 and 10, respectively. In some embodiments, the pooling indices of the various pooling windows in the upper pooling layer are consistent, e.g., the upper left corner in both the examples of fig. 7 and 9. In these embodiments, the pooled index may only need to be represented by one index in the pooled window, e.g., 0 represents the top left position in a 2 × 2 pooled window.
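By way of example, the conversion between the two-dimensional and one-dimensional index representations mentioned above might be sketched as follows (1-based rows and columns as in the text; the helper names are hypothetical):

```python
def to_flat(r, c, width=4):
    """2-D pooling index (row, col), 1-based, to the 1-D row-major index."""
    return (r - 1) * width + (c - 1)

def to_2d(flat, width=4):
    """Inverse conversion back to a 1-based (row, col) pair."""
    return flat // width + 1, flat % width + 1

# Non-zero positions of pooling index 720 in FIG. 7:
assert [to_flat(r, c) for r, c in [(1, 1), (1, 3), (3, 1), (3, 3)]] == [0, 2, 8, 10]
```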
The post-pooling indices of the respective input vectors may be correspondingly determined from the pooling indices, for example, the post-pooling index of input vector "1" in the input feature map 910 is 0, the post-pooling index of input vector "2" is 2, the post-pooling index of input vector "3" is 8, and the post-pooling index of input vector "4" is 10.
The index of the corresponding result vector may then be determined from the index of the input vector according to the index mapping relationship. Each result vector is obtained by multiplication of a weight vector and an input vector in the convolution kernel, so the index mapping indicates the relationship between the position of the weight vector and the index of the input vector and the corresponding result vector in the final deep convolution result, in other words, the index of the vector result multiplied by the weight vector can be determined based on the position of the weight vector in the convolution kernel and the index of the input vector.
In some embodiments, a padding operation is associated with the upper pooling operation and/or the depth convolution operation. For example, in a target detection algorithm based on point cloud data, "same" padding is required, that is, padding is performed so that the shape of the output data after the convolution operation is the same as the shape of the input data. It will be appreciated that in other application scenarios of convolution operations, different padding rules may exist. In some embodiments of the present disclosure, the padding of the upper pooling layer is (1,0,1,0), i.e., 1 is added to each of the top and left sides and the bottom and right sides are not padded, and the padding of the depth convolution layer is 0.
Padding, when present, has an effect on the index of the input vector. Referring again to fig. 8, the gray portion of the input feature map 810 represents the padded area. As can be seen from the figure, any data point (x, y) in the initial input data, whose coordinates in the input data after padding become (x + pad _ left, y + pad _ top), where pad _ left is the left padding amount and pad _ top is the top padding amount. This makes it possible to adjust the index of the input data by a simple addition operation and according to the padding rule.
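This index shift might be expressed, following the formula above, as a small helper (hypothetical name; for the upper pooling padding (1,0,1,0) of FIG. 7 both offsets are 1):

```python
def pad_adjust(x, y, pad_left, pad_top):
    """A point (x, y) of the unpadded data sits at (x + pad_left, y + pad_top)
    in the padded data, so the index is adjusted by a simple addition."""
    return x + pad_left, y + pad_top
```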
Thus, in these embodiments, the index of the input vector may be adjusted based on the padding rules of the upper pooling layer and the depth convolution layer before determining the index of the result vector from the index mapping relationship. It will be understood by those skilled in the art that the index adjustment process may also be performed after or during the index mapping, and only the influence of the padding rule needs to be considered, and the embodiments of the present disclosure are not limited in this respect.
Fig. 10 illustrates the index mapping relationship between the weight vector, the input vector, and the result vector. The padding rules have already been taken into account in this example, as shown by the gray portion. 1010, 1020, 1030 and 1040 represent the product operations of the 1st weight vector a with the corresponding input vectors, respectively. The arrows in the figure indicate the corresponding locations of the respective product results in the convolution operation result 1050.
Specifically, the product of the input vector "1" (index (2,2)) indicated by 1010 and the weight vector a corresponds to the (2,2) position in the 4 × 4 convolution result; the product of the input vector "2" (index (2,4)) indicated by 1020 and the weight vector a corresponds to the (2,4) position in the convolution result; the product of the input vector "3" (index (4,2)) indicated by 1030 and the weight vector a corresponds to the (4,2) position in the convolution result; and the product of the input vector "4" (index (4,4)) indicated by 1040 and the weight vector a corresponds to the (4,4) position in the convolution result.
As can be seen from the figure, when the input vector is multiplied by the 1 st weight vector of the convolution kernel, the following mapping relationship exists in the index: assuming that the index of the input vector is (x, y), the index of the resultant vector multiplied by the 1 st weight vector is also (x, y).
Similarly, the relationship between the index of the resultant vector and the index of the input vector when the input vector is multiplied by other weight vectors can be derived. For example, when the input vector is multiplied by the 2 nd weight vector of the convolution kernel, the following mapping relationship exists in the index: assuming that the index of the input vector is (x, y), the index of the resultant vector multiplied by the 2 nd weight vector is (x, y-1). When the input vector and the 3 rd weight vector of the convolution kernel carry out product operation, the indexes thereof have the following mapping relation: assuming that the index of the input vector is (x, y), the index of the resultant vector multiplied by the 3 rd weight vector is (x-1, y).
In summary, each input vector traverses the 2 × 2 convolution kernel in turn, so the offset of the input vector with respect to the center point of the convolution kernel is fixed. According to this feature, the index of the vector product result associated with each input vector may be determined directly based on the index of that input vector. That is, only the index of the input vector needs to be known, and the index of the product result of multiplying the input vector by all the weight vectors can be determined.
Therefore, from the index of the input vector and these fixed coordinate offsets, the index of the corresponding convolution kernel center point can be derived in turn for each position as the kernel is traversed. The index of each vector product result produced by the input vector can then be determined by mapping the center point index to the output point index.
In addition, in some embodiments, some vector product results may fall outside the convolution result range and thus be invalid. For example, with a pooling index of 3, i.e., when the input vector is in the lower right corner of the pooling window, some of the result vectors will be outside the convolution result range. For these cases, an index beyond the range of the convolution result (i.e., the range of the output feature map dimension sizes) may be set to a predetermined value, such as -1, so that these invalid results can be identified and skipped during the rearrangement process.
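Combining the fixed-offset index mapping with the invalid-index marking just described, a minimal sketch of the mapping might read (0-based row/column coordinates; the function name and argument layout are assumptions for illustration):

```python
def result_index(input_idx, weight_pos, out_h, out_w):
    """The product of the input vector at padded position (x, y) with the weight
    vector at kernel position (kx, ky) lands at output position (x - kx, y - ky),
    e.g. the 1st weight vector (0, 0) keeps (x, y) and the 2nd weight vector
    (0, 1) gives (x, y - 1). Positions outside the output range are marked
    (-1, -1) and skipped during rearrangement."""
    x, y = input_idx
    kx, ky = weight_pos
    ox, oy = x - kx, y - ky
    if 0 <= ox < out_h and 0 <= oy < out_w:
        return ox, oy
    return -1, -1
```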
After the indices of the respective result vectors are determined, the result vectors may be reordered in order of their indices to obtain a result of the fusion operation.
For example, according to the index mapping relationship of fig. 10, the indexes of the four result vectors in 930 of fig. 9 are (2,2), (2,4), (4,2) and (4,4), respectively, which are arranged at corresponding positions in the final fusion operation result 970 accordingly. Other result vectors are similarly rearranged.
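The rearrangement step itself might then be sketched as follows (illustrative only; the accumulation on colliding indices corresponds to the case of inconsistent pooling indices discussed below):

```python
import numpy as np

def rearrange(result_vectors, indices, out_h, out_w, C):
    """Scatter each depth-direction result vector to the output position given by
    its mapped index; (-1, -1) marks an invalid result and is skipped. The +=
    only matters when several result vectors map to the same output point,
    i.e. when the pooling indices differ between pooling windows."""
    out = np.zeros((out_h, out_w, C))
    for vec, (ox, oy) in zip(result_vectors, indices):
        if ox < 0 or oy < 0:
            continue                 # invalid result outside the convolution range
        out[ox, oy] += vec
    return out
```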
The operation circuit may then write the fused operation result 970 back to an on-chip storage circuit, such as NRAM 331 in fig. 3. Finally, the result of the fusion operation may be further output from the on-chip storage circuit to the off-chip storage circuit.
The fusion operation procedure of upper pooling and depth convolution according to the embodiments of the present disclosure has been described in detail above with reference to the drawings. It will be appreciated that in the above example, since the pooling indices within the respective pooling windows are identical and the convolution kernel is the same size as the pooling window, there is at most one input vector within the receptive field of each convolution output point and therefore no accumulation between the result vectors is required. When the pooling indices in the pooling windows are not consistent, multiple input vectors may appear in the receptive field of a convolution output point, and in that case the result vectors having the same index after index mapping need to be accumulated element-wise. The other operations are the same as those described above and are not repeated here.
The disclosed embodiments also provide an artificial intelligence processor for executing a neural network model, and a method of executing a neural network model implemented by the artificial intelligence processor. The neural network model includes at least an upper pooling layer and a depth convolution layer.
FIG. 11 illustrates a schematic block diagram of an artificial intelligence processor in which embodiments of the disclosure may be implemented. As shown in FIG. 11, artificial intelligence processor 1100 includes control circuitry 1110, arithmetic circuitry 1120, and on-chip storage circuitry 1130. The role and function of the control circuit, arithmetic circuit and on-chip memory circuit are similar to those of the control module, arithmetic module and memory module described in fig. 3 and will not be described in detail here.
In some embodiments, the control circuitry 1110 may be configured to control loading of input data for the upper pooling layer of the neural network model and convolution kernels for the depth convolution layer from the off-chip storage circuitry to the on-chip storage circuitry 1130. The operation circuit 1120 may be configured to perform a fusion operation of the above-pooling layer and the depth convolution layer of the embodiments of the present disclosure with respect to the input data and the convolution kernel, and write the fusion operation result back to the on-chip storage circuit 1130. The control circuit 1110 may further be used to control the output of the fused operation result from the on-chip memory circuit 1130 to the off-chip memory circuit.
In some embodiments, the arithmetic circuit 1120 may include a multiplication circuit 1122 and a rearrangement circuit 1124.
The multiplication circuit 1122 may be configured to multiply each weight vector in the depth direction of the convolution kernel element-wise with the input vectors in the depth direction of the input data, respectively, to obtain a plurality of result vectors in the depth direction.
In some embodiments, multiplication circuit 1122 may include a plurality of vector multipliers 1123. In operation, the operation circuit 1120 may distribute the respective input vectors in the depth direction of the input data to the plurality of vector multipliers 1123, for example, one input vector per vector multiplier 1123. The operation circuit 1120 may also broadcast the weight vectors in the depth direction of the convolution kernel to the plurality of vector multipliers 1123, for example, broadcasting the weight vector a, the weight vector b, and so on to all the vector multipliers in turn. Each vector multiplier 1123 may then perform an element-wise multiplication of the broadcast weight vector with its distributed input vector, producing a result vector.
The reordering circuit 1124 may be configured to reorder the plurality of result vectors obtained by the multiplication circuit 1122 according to the pooling index of the upper pooling layer to obtain the result of the fusion operation.
In some embodiments, the reordering circuitry 1124 may be further configured to: determining an index for each input vector based on the pooled indices; determining the index of the corresponding result vector according to the index of the input vector according to the index mapping relation; and rearranging the result vectors according to the sequence of the indexes of the result vectors to obtain a fusion operation result.
In some embodiments, the reordering circuitry may adjust the index of the input vector based on padding rules of the upper pooling layer and the depth convolution layer before determining the index of the result vector according to the index mapping.
FIG. 12 illustrates an exemplary flow chart of a method of executing a neural network model implemented by an artificial intelligence processor in accordance with an embodiment of the disclosure.
Specifically, in step 1210, the control circuitry controls loading of the input data of the upper pooling layer and the convolution kernel of the depth convolution layer from the off-chip storage circuitry to the on-chip storage circuitry.
Next, in step 1220, the arithmetic circuitry performs a fusion operation of the upper pooling layer and the depth convolution layer for the input data and the convolution kernel, and writes the fusion operation result back to the on-chip storage circuitry. The specific operation process of the operation circuit has been described in detail with reference to the drawings, and is not described herein again.
Finally, in step 1230, the control circuit further controls the output of the fused operation result from the on-chip memory circuit to the off-chip memory circuit.
Those skilled in the art will appreciate that the description of the fusion operation process with the upper pooling layer and the depth convolution layer of the embodiments of the present disclosure described above in connection with the figures may be equally applied to the artificial intelligence processor of fig. 11 and the method of fig. 12, and therefore, the description will not be repeated.
The present disclosure also provides a chip that may include the artificial intelligence processor of any of the embodiments described above in connection with the figures. Further, the disclosure also provides a board card, which may include the aforementioned chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include airplanes, ships and/or road vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; the medical devices include nuclear magnetic resonance apparatuses, B-mode ultrasound apparatuses and/or electrocardiographs. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and the terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal device and/or the edge device, to simulate the hardware resources of the terminal device and/or the edge device, thereby completing unified management, scheduling, and cooperative work of end-cloud integration or cloud-edge-end integration.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description and is intended to be exemplary only and is not intended to be exhaustive or to limit the invention to the precise forms disclosed; meanwhile, for the person skilled in the art, based on the idea of the present disclosure, there may be variations in the specific embodiments and the application scope, and in summary, the present disclosure should not be construed as limiting the present disclosure.

Claims (10)

1. An artificial intelligence processor that executes a neural network model, comprising control circuitry, operational circuitry, and on-chip storage circuitry, the neural network model comprising an upper pooling layer and a depth convolution layer, wherein:
the control circuit is used for controlling the on-chip storage circuit to load the input data of the upper pooling layer and the convolution kernel of the depth convolution layer from the off-chip storage circuit to the on-chip storage circuit;
the operation circuit is used for executing the fusion operation of the upper pooling layer and the depth convolution layer aiming at the input data and the convolution kernel, and writing the fusion operation result back to the on-chip storage circuit; and
the control circuit is further configured to control output of the fused operation result from the on-chip storage circuit to the off-chip storage circuit.
2. The artificial intelligence processor of claim 1, wherein the arithmetic circuitry comprises:
the multiplication circuit is used for executing counterpoint multiplication operation on each weight vector in the depth direction in the convolution kernel and the input vector in the depth direction of the input data respectively to obtain a plurality of result vectors in the depth direction; and
and the rearrangement circuit is used for rearranging the plurality of result vectors according to the pooling indexes of the upper pooling layer to obtain the fusion operation result.
3. The artificial intelligence processor of claim 2, wherein the reordering circuitry is further to:
determining an index for each of the input vectors based on the pooled indices;
determining the index of the corresponding result vector according to the index of the input vector according to the index mapping relation; and
and rearranging the result vectors according to the sequence of the indexes of the result vectors to obtain the fusion operation result.
4. The artificial intelligence processor of claim 3, wherein the index mapping further indicates that indices beyond a size range of output data dimensions are set to a predetermined value.
5. The artificial intelligence processor of any one of claims 3-4 wherein the reordering circuitry is further configured to:
adjusting the index of the input vector based on the padding rules of the upper pooling layer and the depth convolution layer before determining the index of the result vector according to the index mapping relationship.
6. The artificial intelligence processor of any of claims 2-5 wherein the pooling indices of the respective pooling windows in the upper pooling layer are consistent.
7. The artificial intelligence processor of any of claims 2-6 wherein the multiplication circuit comprises a plurality of vector multipliers, and
the operational circuit is further to: distributing each input vector in the depth direction of the input data to the vector multipliers, and broadcasting and transmitting the weight vector in the depth direction in the convolution kernel to the vector multipliers;
each vector multiplier is for: and performing para-position multiplication operation aiming at the broadcasted weight vector and the distributed input vector to obtain a result vector.
8. A chip comprising an artificial intelligence processor according to any of claims 1 to 7.
9. A board comprising the chip of claim 8.
10. A method of executing a neural network model using an artificial intelligence processor according to any one of claims 1 to 7.
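For orientation only, the following is a minimal Python sketch of the data movement described in claims 2 through 4: each depth-direction input vector is multiplied element-wise with a depth-direction weight vector of the convolution kernel, and the resulting vectors are rearranged (scattered) into the up-sampled output according to the pooling indices, with positions whose index falls outside the output range left at a predetermined value. All names, shapes, the 2x2 window, and the single-weight-vector simplification are assumptions for illustration; this is not the claimed hardware implementation.

```python
import numpy as np

def fused_uppool_depthwise(inputs, weight_vec, pool_index, window=2, fill_value=0.0):
    """Illustrative sketch (not the patented circuit).

    inputs:     (H, W, C) input data of the up-pooling layer.
    weight_vec: (C,) one depth-direction weight vector of the depthwise kernel.
    pool_index: (H, W) flat offset (0 .. window*window - 1) recorded by the
                preceding max-pooling stage for each input position.
    """
    H, W, C = inputs.shape
    out_h, out_w = H * window, W * window
    # Claim 4 analogue: out-of-range positions keep a predetermined value.
    output = np.full((out_h, out_w, C), fill_value, dtype=inputs.dtype)

    for h in range(H):
        for w in range(W):
            # Claim 2: element-wise multiply the depth-direction input vector
            # with the depth-direction weight vector -> one result vector.
            result_vec = inputs[h, w, :] * weight_vec

            # Claim 3: map the input-vector index to a result-vector index via
            # the pooling index, then rearrange (scatter) the result vector.
            dh, dw = divmod(int(pool_index[h, w]), window)
            oh, ow = h * window + dh, w * window + dw
            if 0 <= oh < out_h and 0 <= ow < out_w:
                output[oh, ow, :] = result_vec
    return output
```

With window=2 this yields a 2H x 2W x C map in which each 2x2 block holds one scaled copy of the corresponding input vector at the position selected by the earlier pooling, i.e., the combined effect of up-pooling followed by one depthwise multiplication, without materializing the sparse intermediate feature map.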
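The work split of claim 7 can likewise be pictured with a toy software model: the depth-direction weight vector is broadcast to every simulated vector multiplier, the depth-direction input vectors are distributed across them, and each lane produces one result vector per assigned input. The thread pool, lane count, and round-robin distribution below are illustrative assumptions only.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def simulate_vector_multipliers(input_vectors, weight_vec, num_lanes=4):
    """Toy model of claim 7: broadcast one weight vector, distribute the inputs.

    input_vectors: list of (C,) depth-direction input vectors.
    weight_vec:    (C,) depth-direction weight vector (broadcast to every lane).
    """
    def lane(vectors):
        # Each simulated vector multiplier performs element-wise multiplication
        # of the broadcast weight vector with its distributed input vectors.
        return [v * weight_vec for v in vectors]

    # Distribute the input vectors round-robin across the lanes.
    chunks = [input_vectors[i::num_lanes] for i in range(num_lanes)]
    with ThreadPoolExecutor(max_workers=num_lanes) as pool:
        per_lane = list(pool.map(lane, chunks))

    # Undo the round-robin distribution to restore the original vector order.
    results = [None] * len(input_vectors)
    for i, chunk in enumerate(per_lane):
        for j, vec in enumerate(chunk):
            results[i + j * num_lanes] = vec
    return results
```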
CN202110721919.XA 2021-06-28 2021-06-28 Artificial intelligence processor, method and related products for executing neural network model Active CN113469333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110721919.XA CN113469333B (en) 2021-06-28 2021-06-28 Artificial intelligence processor, method and related products for executing neural network model

Publications (2)

Publication Number Publication Date
CN113469333A (en) 2021-10-01
CN113469333B CN113469333B (en) 2023-11-10

Family

ID=77873407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110721919.XA Active CN113469333B (en) 2021-06-28 2021-06-28 Artificial intelligence processor, method and related products for executing neural network model

Country Status (1)

Country Link
CN (1) CN113469333B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364061A (en) * 2018-02-13 2018-08-03 北京旷视科技有限公司 Arithmetic unit, operation execute equipment and operation executes method
CN110858398A (en) * 2018-08-24 2020-03-03 深圳市前海安测信息技术有限公司 Tongue segmentation device and method based on deep learning and storage medium
WO2020051776A1 (en) * 2018-09-11 2020-03-19 Intel Corporation Method and system of deep supervision object detection for reducing resource usage
US20200151088A1 (en) * 2018-11-14 2020-05-14 The Mathworks, Inc. Systems and methods for configuring programmable logic devices for deep learning networks
CN110570431A (en) * 2019-09-18 2019-12-13 东北大学 Medical image segmentation method based on improved convolutional neural network
CN111160545A (en) * 2019-12-31 2020-05-15 北京三快在线科技有限公司 Artificial neural network processing system and data processing method thereof
KR102145374B1 (en) * 2020-01-17 2020-08-18 성균관대학교산학협력단 Artificial intelligence neural network apparatus coupled with visualized feature vector and data classification method using the same
CN111898419A (en) * 2020-06-17 2020-11-06 西安交通大学 Partition landslide detection system and method based on cascade deep convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG GANG; CHEN JINYONG; GAO FENG; WU JINLIANG: "Remote Sensing Object Detection and Feature Extraction Based on Deep Neural Network", RADIO ENGINEERING, no. 09 *

Also Published As

Publication number Publication date
CN113469333B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN109062611B (en) Neural network processing device and method for executing vector scaling instruction
CN112416433B (en) Data processing device, data processing method and related product
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN112799599B (en) Data storage method, computing core, chip and electronic equipment
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
CN112463160A (en) Compiling method, compiling device, electronic equipment and storage medium
CN113837922A (en) Computing device, data processing method and related product
CN109740730B (en) Operation method, device and related product
CN109740729B (en) Operation method, device and related product
CN109711538B (en) Operation method, device and related product
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN113469333B (en) Artificial intelligence processor, method and related products for executing neural network model
CN113112009B (en) Method, apparatus and computer-readable storage medium for neural network data quantization
CN112801276A (en) Data processing method, processor and electronic equipment
CN114692824A (en) Quantitative training method, device and equipment of neural network model
CN113867800A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN111125627A (en) Method for pooling multi-dimensional matrices and related products
CN113792867B (en) Arithmetic circuit, chip and board card
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
CN114692847B (en) Data processing circuit, data processing method and related products
CN113791754A (en) Arithmetic circuit, chip and board card
CN115221106A (en) Data processing circuit, data processing method and related product
CN115599738A (en) Method for optimizing neural network model and related product
CN114444677A (en) Device, board card and method for sparse training and readable storage medium
CN114691083A (en) Matrix multiplication circuit, method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant