CN111738429B - Computing device and related product - Google Patents


Info

Publication number
CN111738429B
Authority
CN
China
Prior art keywords
memory
data
address information
decoder
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910229823.4A
Other languages
Chinese (zh)
Other versions
CN111738429A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to CN201910229823.4A
Publication of CN111738429A
Application granted
Publication of CN111738429B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)

Abstract

The application provides a computing device and related products. The related products include a neural network chip and a board card; the board card includes a storage device, an interface device, a control device, and the neural network chip, and the neural network chip is connected to the storage device, the control device, and the interface device respectively. The storage device is used for storing data; the interface device is used for implementing data transmission between the neural network chip and external equipment; the control device is used for monitoring the state of the neural network chip. Embodiments of the application help solve the problems of high data transmission delay and high energy consumption during neural network operation and break the bottleneck of neural network computation, thereby meeting the actual needs of users and improving user experience.

Description

Computing device and related product
Technical Field
The application relates to the technical field of information processing, in particular to a computing device and related products.
Background
With the continuous development of information technology and people's ever-increasing demands, the requirements on data storage keep rising, especially during the operation of neural network algorithms.
In the prior art, however, storage devices suffer from low storage density, large area, high power consumption, and high access latency. These shortcomings cause high data transmission delay and high energy consumption during neural network operation, making data transmission the bottleneck of neural network computation, so that the actual needs of users cannot be met and user experience is poor.
Disclosure of Invention
The embodiments of the application provide a computing device and related products, which help solve the problems of high data transmission delay and high energy consumption during neural network operation and break the bottleneck of neural network computation, thereby meeting the actual needs of users and improving user experience.
In a first aspect, an embodiment of the present application provides a computing device, where the computing device includes a storage unit and a controller unit, where the storage unit includes: a 3D decoder and a 3D memory;
the controller unit is used for sending an access instruction to the 3D decoder;
the 3D decoder is used for decoding the access instruction transmitted by the controller unit to obtain address information of data to be accessed carried by the access instruction;
the 3D decoder is further configured to send the address information to the 3D memory;
The 3D memory is used for accessing the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
In a second aspect, an embodiment of the present application provides a neural network chip, where the neural network chip includes the computing device described in the first aspect.
In a third aspect, an embodiment of the present application provides a board card, where the board card includes the neural network chip package structure described in the second aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, where the electronic device includes the neural network chip described in the second aspect or the board card described in the third aspect.
In a fifth aspect, an embodiment of the present application further provides a computing method for executing a machine learning model, applied to a computing device that includes a storage unit and a controller unit, where the storage unit includes: a 3D decoder and a 3D memory; the method includes the following steps:
the controller unit is used for sending an access instruction to the 3D decoder;
the 3D decoder is used for decoding the access instruction transmitted by the controller unit to obtain address information of data to be accessed carried by the access instruction;
The 3D decoder is further configured to send the address information to the 3D memory;
the 3D memory is used for accessing the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
In some embodiments, the electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. It is obvious that the drawings in the following description show some embodiments of the present application, and that those skilled in the art may obtain other drawings from these drawings without inventive effort.
Fig. 1A is a schematic structural diagram of a computing device according to an embodiment of the present application.
Fig. 1B is a block diagram of a memory cell according to an embodiment of the present application.
Fig. 1C is a block diagram of a memory cell according to another embodiment of the present application.
Fig. 1D is a block diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1E is a block diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1F is a schematic diagram of a computing device according to an embodiment of the present application.
Fig. 1G is a block diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1H is a block diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1I is a block diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1J is a schematic structural diagram of a tree module according to an embodiment of the present application.
Fig. 1K is a block diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1L is a block diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1M is a block diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1N is a block diagram of yet another computing device provided by an embodiment of the present application.
Fig. 2A is a schematic structural diagram of a computing device according to an embodiment of the present application.
Fig. 3A is a schematic structural diagram of a board card according to an embodiment of the present application.
Fig. 3B is a schematic structural diagram of a board card according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, rather than all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims and drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
A computing device for use with the present application will first be described. Referring to fig. 1A, there is provided a computing device including a storage unit 10 and a controller unit 11, wherein the storage unit 10 includes: a 3D decoder 201 and a 3D memory 202;
the controller unit 11 is configured to send an access instruction to the 3D decoder.
In one alternative, when the access instruction is a read instruction, the read instruction carries the address information and the data size of the data to be accessed. The structure of the read instruction may be as shown in the following table:

| Layer_id | Cell_id | Row_id | Col_id | Data_size |

Here, Layer_id is the active-layer index address information, indicating the position of the target active layer in the 3D memory; Cell_id is the memory index address information, indicating the location of the target memory within the target active layer; Row_id is the row index address information, indicating the row address of the target storage space of the data to be accessed in the target memory; Col_id is the column index address information, indicating the column address of the target storage space of the data to be accessed in the target memory; Data_size indicates the size of the data to be accessed.
When the access instruction is a write instruction, the write instruction carries the address information and the data to be accessed. In one alternative, the structure of the write instruction may be as follows:

| Layer_id | Cell_id | Row_id | Col_id | Data |

Here, Layer_id, Cell_id, Row_id, and Col_id carry the same index address information as in the read instruction, and Data is the data to be accessed.
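To make the field layout above concrete, the following is a minimal sketch of how such an access instruction could be packed and decoded in software. The field widths (8/8/16/16/32 bits) and the Python representation are illustrative assumptions; the patent does not specify a binary encoding.

```python
from dataclasses import dataclass

@dataclass
class ReadInstruction:
    layer_id: int   # active-layer index: position of the target active layer in the 3D memory
    cell_id: int    # memory index: location of the target memory within the target active layer
    row_id: int     # row address of the target storage space
    col_id: int     # column address of the target storage space
    data_size: int  # size of the data to be accessed

    def encode(self) -> int:
        """Pack the fields into one instruction word (widths are assumed, not specified)."""
        return ((self.layer_id << 72) | (self.cell_id << 64) |
                (self.row_id << 48) | (self.col_id << 32) | self.data_size)

    @staticmethod
    def decode(word: int) -> "ReadInstruction":
        """What the 3D decoder conceptually does: recover the address fields and size."""
        return ReadInstruction(
            layer_id=(word >> 72) & 0xFF,
            cell_id=(word >> 64) & 0xFF,
            row_id=(word >> 48) & 0xFFFF,
            col_id=(word >> 32) & 0xFFFF,
            data_size=word & 0xFFFFFFFF,
        )

inst = ReadInstruction(layer_id=1, cell_id=0, row_id=3, col_id=7, data_size=64)
assert ReadInstruction.decode(inst.encode()) == inst  # round-trip check
```

A write instruction would carry a Data payload in place of Data_size but would otherwise be decoded the same way.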
The 3D decoder 201 is configured to decode the access instruction transmitted by the controller unit, and obtain address information of data to be accessed carried by the access instruction.
The 3D decoder 201 is further configured to send the address information to the 3D memory.
The 3D memory 202 is configured to access the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
In the technical scheme provided by the application, the storage unit is implemented as a 3D memory plus a 3D decoder, which satisfies the storage requirements of neural network operation while reducing data I/O. This helps solve the problems of high data transmission delay and high energy consumption during neural network operation and breaks the bottleneck of neural network computation, thereby meeting the actual needs of users and improving user experience.
Referring to fig. 1B, when the access instruction is a read instruction, the 3D decoder 201 is specifically configured to decode the read instruction transmitted by the controller unit, to obtain address information and a data size of data to be accessed carried by the read instruction; and sending the address information and the data size to the 3D memory.
The 3D memory 202 is specifically configured to read, from the 3D memory, the data to be accessed matching the data size according to the address information transmitted by the 3D decoder.
Referring to fig. 1C, when the above access instruction is a write instruction, the 3D decoder 201 is specifically configured to decode the write instruction transmitted by the controller unit, to obtain address information of data to be accessed and data to be accessed carried by the write instruction; and sending the address information and the data to be accessed to the 3D memory.
The 3D memory 202 is specifically configured to write the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
Optionally, the 3D memory 202 includes N active layers, where each of the N active layers includes a 2D memory, the 2D memory in the i-th active layer is connected to the 2D memory in the (i+1)-th active layer, 1 ≤ i < N, and i is an integer. Each 2D memory is formed by arranging M memory arrays of the same type, where N is a positive integer greater than 1 and M is a positive integer.
In one possible implementation of the scheme, the 2D memory in the i-th active layer and the 2D memory in the (i+1)-th active layer are connected through silicon vias (through-silicon vias).
For example, i may be 1, 2, 3, 4, 5, 7, 8, 9, or the like; M may be 1, 2, 3, 4, 5, 7, 8, 9, or the like; and N may be 2, 3, 4, 5, 7, 8, 9, or the like.
The types of the memory may include, for example: dynamic random access memory, static random access memory, registers, flash memory, etc.
For example, a 3D-register has 4 active layers, each active layer includes a 2D register, adjacent 2D registers are connected through silicon vias, and each 2D register is formed by arranging M register arrays; that is, m rows and n columns of registers form a 2D register, where m × n = M and m and n are positive integers greater than or equal to 1.
For example, a 3D-SRAM has 4 active layers, each active layer includes a 2D SRAM, adjacent 2D SRAMs are connected through silicon vias, and each 2D SRAM is formed by arranging M SRAM arrays; that is, m rows and n columns of SRAM cells form a 2D SRAM, where m × n = M and m and n are positive integers greater than or equal to 1.
For example, a 3D-DRAM has 4 active layers, each active layer includes a 2D DRAM, adjacent 2D DRAMs are connected through silicon vias, and each 2D DRAM is formed by arranging M DRAM arrays; that is, m rows and n columns of DRAM cells form a 2D DRAM, where m × n = M and m and n are positive integers greater than or equal to 1.
In one possible implementation manner of the present solution, the address information includes: active layer index address information, memory index address information, row index address information, and column index address information; when the data to be accessed is accessed in the 3D memory according to the address information transmitted by the 3D decoder, the 3D memory 202 is specifically configured to:
accessing the data to be accessed in a target storage space in the 3D memory, where the target storage space is the storage space corresponding to the row index address information and the column index address information in a target memory, the target memory is the memory corresponding to the memory index address information in a target active layer, and the target active layer is the active layer corresponding to the active-layer index address information in the 3D memory.
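As an illustration of the hierarchical lookup just described (active layer, then memory, then row and column), the following sketch models the 3D memory as nested arrays. The dimensions and the nested-list layout are assumptions for demonstration only, not the patent's physical organization.

```python
import numpy as np

# N active layers, each holding CELLS_PER_LAYER 2D memories of ROWS x COLS cells.
N_LAYERS, CELLS_PER_LAYER, ROWS, COLS = 4, 2, 64, 64

memory_3d = [
    [np.zeros((ROWS, COLS), dtype=np.int32) for _ in range(CELLS_PER_LAYER)]
    for _ in range(N_LAYERS)
]

def access(layer_id, cell_id, row_id, col_id, data=None):
    """Resolve the target storage space from the four index fields.

    Writes `data` when it is given (write access); otherwise returns the
    stored value (read access).
    """
    target_memory = memory_3d[layer_id][cell_id]   # target active layer, then target memory
    if data is not None:
        target_memory[row_id, col_id] = data       # write access
        return None
    return target_memory[row_id, col_id]           # read access

access(1, 0, 3, 7, data=42)       # write 42 into layer 1, memory 0, row 3, column 7
assert access(1, 0, 3, 7) == 42   # read it back from the same target storage space
```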
Optionally, in an embodiment of the present solution, the external device 13 and the 3D memory may be as shown in Fig. 1D and Fig. 1E, and the computing apparatus may further include an external device 13, where the external storage unit of the external device 13 is a 3D-dynamic random access memory or a 3D-static random access memory; the data to be accessed includes: input data and scalar data in the input data, where the input data includes: input neuron data and weight data;
when the external storage unit is the 3D-dynamic random access memory, the 3D memory 202 includes: a 3D-SRAM and a 3D-register;
the 3D-SRAM is used for storing the input data;
the 3D-register is used for storing the scalar data;
alternatively, when the external storage unit is the 3D-static random access memory, the 3D memory 202 is a 3D-register;
the 3D-register is used for storing the input data;
the 3D-register is further configured to store the scalar data.
Optionally, in another embodiment of the present solution, as shown in Fig. 1F, when the external storage unit 301 is the 3D-dynamic random access memory 3012, the 3D memory 202 enters a first operation mode, where the first operation mode includes:
the 3D-SRAM 2021 accesses the input data in the 3D-SRAM according to the address information transmitted by the 3D decoder; and
the 3D-register 2022 accesses the scalar data in the 3D-register according to the address information transmitted by the 3D decoder.
Alternatively, when the external storage unit is the 3D-static random access memory 3011, the 3D memory 202 enters a second operation mode, where the second operation mode includes:
the 3D-register 2022 accesses the input data and the scalar data in the 3D-register according to the address information transmitted by the 3D decoder.
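The two operation modes can be summarized with a small dispatch sketch. The names below are hypothetical; only the routing rules (which on-chip 3D memory serves which kind of data in each mode) come from the description above.

```python
from enum import Enum

class ExternalMemory(Enum):
    DRAM_3D = "3D-dynamic random access memory"
    SRAM_3D = "3D-static random access memory"

def route_access(external_memory: ExternalMemory, is_scalar: bool) -> str:
    """Pick the on-chip 3D memory that services an access in each operation mode."""
    if external_memory is ExternalMemory.DRAM_3D:
        # First operation mode: input data goes to the 3D-SRAM, scalar data to the 3D-register.
        return "3D-register" if is_scalar else "3D-SRAM"
    # Second operation mode: the 3D-register holds both the input data and the scalar data.
    return "3D-register"

assert route_access(ExternalMemory.DRAM_3D, is_scalar=False) == "3D-SRAM"
assert route_access(ExternalMemory.SRAM_3D, is_scalar=True) == "3D-register"
```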
In one embodiment of the present solution, the computing device is configured to perform a machine learning computation, and the computing device further includes: an arithmetic unit 12, wherein the controller unit 11 is connected with the arithmetic unit 12, the arithmetic unit 12 comprising: a master processing circuit and a plurality of slave processing circuits;
the controller unit 11 is further configured to obtain input data and a calculation instruction; in an alternative, the manner of acquiring the input data and calculating the instruction may be obtained through a data input/output unit, and the data input/output unit may be one or more data I/O interfaces or I/O pins.
The computation instructions include, but are not limited to, forward operation instructions, backward training instructions, and other neural network operation instructions such as convolution operation instructions; the embodiments of the present application do not limit the specific form of the computation instruction.
The controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
a master processing circuit 101, configured to perform preprocessing on the input data and to transmit data and operation instructions to and from the plurality of slave processing circuits;
a plurality of slave processing circuits 102, configured to execute intermediate operations in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
In the technical scheme provided by the application, the arithmetic unit is arranged in a one-master multi-slave structure. For a forward-operation computation instruction, the data can be split according to that instruction, so that the computation-intensive part can be processed in parallel by the plurality of slave processing circuits. This increases the operation speed, saves operation time, and in turn reduces power consumption.
Optionally, the machine learning computation may specifically include an artificial neural network operation, and the input data may specifically include input neuron data and weight data. The computation result may specifically be the result of the artificial neural network operation, i.e., output neuron data.
The operation in the neural network may be one layer of the neural network. For a multi-layer neural network, the implementation process is as follows. In the forward operation, after the previous layer of the artificial neural network has finished executing, the operation instruction of the next layer takes the output neurons computed in the arithmetic unit as the input neurons of the next layer (or performs certain operations on those output neurons before using them as the input neurons of the next layer), and meanwhile replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of the previous layer has finished, the operation instruction of the next layer takes the input neuron gradients computed in the arithmetic unit as the output neuron gradients of the next layer (or performs certain operations on those input neuron gradients before using them as the output neuron gradients of the next layer), and likewise replaces the weights with the weights of the next layer.
The machine learning computation may also include support vector machine operations, k-nearest neighbor (k-nn) operations, k-means (k-means) operations, principal component analysis operations, and the like. For convenience of description, a specific scheme of machine learning calculation is described below by taking an artificial neural network operation as an example.
For the artificial neural network operation, if it has multiple layers, the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and output layer of the whole network. For any two adjacent layers, the neurons in the lower layer of the network's forward operation are the input neurons, and the neurons in the upper layer are the output neurons. Taking a convolutional neural network as an example, let the network have L layers and k = 1, 2, ..., L-1. For the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer, whose neurons are the input neurons, and the (K+1)-th layer is called the output layer, whose neurons are the output neurons. That is, every layer except the topmost one can serve as an input layer, and the next layer is the corresponding output layer.
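This adjacent-layer convention can be illustrated with a short sketch in which each layer's output neurons become the next layer's input neurons. The tanh activation and the toy shapes are arbitrary choices for demonstration.

```python
import numpy as np

def forward(layers, x):
    """layers: list of (weight, bias) pairs, one per layer; x: network input."""
    for w, b in layers:
        x = np.tanh(w @ x + b)   # the output neurons of this layer ...
    return x                     # ... serve as the input neurons of the next layer

layers = [(np.eye(3), np.zeros(3)), (np.ones((2, 3)), np.zeros(2))]
print(forward(layers, np.array([0.1, 0.2, 0.3])))
```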
Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a store queue unit 113;
An instruction storage unit 110, configured to store a calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to parse the calculation instruction to obtain a plurality of operation instructions;
a store queue unit 113 for storing an instruction queue, the instruction queue comprising: a plurality of arithmetic instructions or calculation instructions to be executed in the order of the queue.
For example, in an alternative embodiment, the master processing circuit may also include a controller unit, which may include a master instruction processing unit specifically configured to decode instructions into microinstructions. In another alternative, the slave processing circuit may also include another controller unit, which includes a slave instruction processing unit specifically configured to receive and process microinstructions. A microinstruction may be the next level of an instruction; it can be obtained by splitting or decoding an instruction, and can be further decoded into control signals for the various components, units, or processing circuits.
In one alternative, the structure of the calculation instructions may be as shown in the following table.
| Operation code | Register or immediate | Register/immediate | ... |
The ellipses in the table above represent that multiple registers or immediate numbers may be included.
In another alternative, the computation instruction may include one or more operation domains and an operation code. The computation instruction may include a neural network operation instruction. Taking a neural network operation instruction as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains, where each register number may correspond to one or more registers.
In practical applications, the register may be an off-chip memory or an on-chip memory, and may be used to store data. The data may specifically be n-dimensional, where n is an integer greater than or equal to 1; for example, n = 1 gives 1-dimensional data (a vector), n = 2 gives 2-dimensional data (a matrix), and n ≥ 3 gives a multidimensional tensor.
Optionally, the controller unit may further include:
the dependency relationship processing unit 108 is configured to determine, when a plurality of operation instructions are provided, whether a first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction, if the first operation instruction has an association relationship with the zeroth operation instruction, cache the first operation instruction in the instruction storage unit, and extract the first operation instruction from the instruction storage unit and transmit the first operation instruction to the operation unit after the execution of the zeroth operation instruction is completed;
Determining whether the first operation instruction has an association relationship with the zeroth operation instruction preceding it includes:
extracting, according to the first operation instruction, a first storage address interval of the data (for example, a matrix) required by that instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by that instruction. If the first storage address interval and the zeroth storage address interval have an overlapping area, the first operation instruction and the zeroth operation instruction have an association relationship; if they have no overlapping area, they have no association relationship.
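The overlap test described above can be sketched as follows. Modelling the storage address intervals as half-open [start, end) ranges is an assumption; the patent does not fix a convention.

```python
def has_association(first_interval, zeroth_interval):
    """True when the first instruction's address interval overlaps the zeroth's."""
    f_start, f_end = first_interval
    z_start, z_end = zeroth_interval
    return f_start < z_end and z_start < f_end

# The first instruction touches [100, 200) and the zeroth touches [150, 250):
# the regions overlap, so the first must wait until the zeroth finishes.
assert has_association((100, 200), (150, 250))
assert not has_association((100, 200), (300, 400))
```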
In an alternative embodiment, the arithmetic unit 12 may include one master processing circuit 101 and a plurality of slave processing circuits 102, as shown in Figs. 1G and 1H. In one embodiment, the plurality of slave processing circuits are distributed in an array of m rows and n columns; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to k slave processing circuits among the plurality of slave processing circuits, the k slave processing circuits being: the n slave processing circuits in row 1, the n slave processing circuits in row m, and the m slave processing circuits in column 1. Note that the k slave processing circuits shown in Figs. 1G and 1H include only these circuits; in other words, the k slave processing circuits are the slave processing circuits that are directly connected to the master processing circuit.
The k slave processing circuits are configured to forward data and instructions between the master processing circuit and the remaining slave processing circuits.
In another embodiment, the operation instruction is a computation instruction such as a matrix-multiply-matrix instruction, an accumulate instruction, or an activation instruction.
The specific computation method of the computing device shown in Fig. 1A is described below using the neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be s = s(Σ(w × x_i) + b): multiply the weight w by the input data x_i and sum the products, add the bias b, and then perform the activation operation s(h) to obtain the final output result s.
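A worked numeric sketch of this formula follows, with a sigmoid chosen arbitrarily as the activation s (the patent does not fix the activation function) and illustrative example values.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

w = np.array([0.5, -1.0, 2.0])   # weights
x = np.array([1.0, 2.0, 0.5])    # input data x_i
b = 0.1                          # bias

s_out = sigmoid(np.sum(w * x) + b)   # sum(w * x) = -0.5; adding b gives -0.4
print(s_out)                         # sigmoid(-0.4) ≈ 0.401
```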
In an alternative embodiment, as shown in fig. 1I, the arithmetic unit includes: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit, and the plurality of branch ports of the tree module are respectively connected with one of a plurality of auxiliary processing circuits;
the above tree module has a transmitting and receiving function, for example, as shown in fig. 1I, and is a transmitting function, as shown in fig. 2A, and is a receiving function.
The tree module is used for forwarding the data blocks, the weights and the operation instructions between the master processing circuit and the plurality of slave processing circuits.
Optionally, the tree module is an optional component of the computing device. It may include at least one layer of nodes; the nodes form a line structure with forwarding functionality and may not themselves have computing functionality. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example the binary tree structure shown in Fig. 1J, or a ternary tree structure, where n may be an integer greater than or equal to 2. The embodiments of the present application do not limit the specific value of n. The number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example, the nodes of the last layer shown in Fig. 1J.
Alternatively, the above-mentioned operation unit may carry a separate cache, as shown in fig. 1K, and may include: a neuron buffering unit 63 which buffers the input neuron vector data and the output neuron value data of the slave processing circuit.
As shown in fig. 1L, the operation unit may further include: the weight buffer unit 64 is used for buffering the weight data required by the slave processing circuit in the calculation process.
In an alternative embodiment, the arithmetic unit 12 may include a branch processing circuit 103, with the specific connection structure shown in Figs. 1M and 1N, wherein,
the master processing circuit 101 is connected to the branch processing circuit(s) 103, and the branch processing circuit 103 is connected to the one or more slave processing circuits 102;
the branch processing circuit 103 is configured to forward data or instructions between the master processing circuit 101 and the slave processing circuits 102.
In an alternative embodiment, taking the fully connected operation in a neural network as an example, the process may be y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias scalar, and f is the activation function, which may specifically be a sigmoid, tanh, relu, or softmax function. Assuming here a binary tree structure with 8 slave processing circuits, the implemented method may be:
the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;
the master processing circuit determines the input neuron matrix x to be broadcast data and the weight matrix w to be distribution data, splits the weight matrix w into 8 sub-matrices, distributes the 8 sub-matrices to the 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;
the slave processing circuits perform, in parallel, the multiplication and accumulation of the 8 sub-matrices with the input neuron matrix x to obtain 8 intermediate results, and send the 8 intermediate results to the master processing circuit;
the master processing circuit arranges the 8 intermediate results in order to obtain the operation result of wx, performs the bias-b operation on that result, performs the activation operation to obtain the final result y, and sends y to the controller unit, which outputs it or stores it in the storage unit.
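The distribute/broadcast flow above can be simulated in a few lines of NumPy. The shapes, the relu activation, and the row-wise split are illustrative assumptions, with the 8 sub-matrix products standing in for the 8 slave processing circuits.

```python
import numpy as np

def relu(h):
    return np.maximum(h, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 1))    # input neuron matrix (broadcast data)
w = rng.normal(size=(32, 16))   # weight matrix (distribution data)
b = 0.1                         # bias scalar

sub_matrices = np.split(w, 8, axis=0)             # master: split w into 8 sub-matrices
intermediate = [sub @ x for sub in sub_matrices]  # slaves: multiply-and-accumulate in parallel
wx = np.vstack(intermediate)                      # master: arrange the 8 intermediate results
y = relu(wx + b)                                  # master: add bias, apply activation
assert np.allclose(y, relu(w @ x + b))            # matches the direct computation
```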
The method for executing the neural network forward operation instruction by the computing device shown in fig. 1A may specifically be:
the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the operation code to the operation unit.
The controller unit extracts the weight w and the bias b corresponding to the operation domain from the storage unit (when b is 0, the bias b does not need to be extracted) and transmits them to the master processing circuit of the arithmetic unit; the controller unit also extracts the input data Xi from the storage unit and transmits it to the master processing circuit.
The main processing circuit determines multiplication operation according to the at least one operation code, determines that input data Xi are broadcast data, determines weight data are distribution data, and splits the weight w into n data blocks;
The instruction processing unit of the controller unit determines, according to the at least one operation code, a multiplication instruction, a bias instruction, and an accumulation instruction, and sends them to the master processing circuit. The master processing circuit broadcasts the multiplication instruction and the input data Xi to the plurality of slave processing circuits and distributes the n data blocks among those slave processing circuits (for example, with n slave processing circuits, each slave processing circuit receives one data block); the slave processing circuits execute the multiplication instruction on the received data blocks and the input data Xi to obtain intermediate results, which are sent back to the master processing circuit. The master processing circuit performs, according to the accumulation instruction, an accumulation operation on the intermediate results sent by the plurality of slave processing circuits to obtain an accumulation result, adds the bias b to the accumulation result according to the bias instruction to obtain the final result, and sends the final result to the controller unit.
In addition, the order of addition and multiplication may be reversed.
In the technical scheme provided by the application, the multiplication operation and the bias operation of the neural network are realized by a single instruction, the neural network operation instruction, so the intermediate results of the neural network computation need not be stored or re-fetched. This reduces the storing and fetching of intermediate data, reduces the corresponding operation steps, and improves the computation efficiency of the neural network.
The application also discloses a machine learning operation device, which includes one or more of the above computing devices. It is configured to obtain data to be operated on and control information from other processing devices, execute specified machine learning operations, and transmit the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, wifi interfaces, and servers. When more than one computing device is included, the computing devices may be linked and transfer data through a specific structure, for example interconnected via a PCIE bus, to support larger-scale machine learning operations. In that case, the computing devices may share the same control system or have independent control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection may use any interconnection topology.
The machine learning operation device has higher compatibility and can be connected with various types of servers through PCIE interfaces.
In some embodiments, a chip is also disclosed, which includes the machine learning computing device.
In some embodiments, a board card is provided that includes the chip package structure described above. Referring to Figs. 3A and 3B, the board card may include, in addition to the chip 389, other mating components, including but not limited to: a storage device 390, an interface device 391, a control device 392, and an external device 394;
the external memory unit of the external device 394 may be a 3D-dynamic random access memory or a 3D-static random access memory;
The storage device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The storage device may include the storage unit 393 of Fig. 3A or 3B, which is connected to the chip through a bus. It will be appreciated that when the external storage unit of the external device 394 is a 3D-dynamic random access memory, the storage unit may include a 3D decoder, a 3D-static random access memory, and a 3D-register; when the external storage unit of the external device 394 is a 3D-static random access memory, the storage unit includes a 3D decoder and a 3D-register.
The interface device is electrically connected to the chip in the chip package structure and is used to implement data transmission between the chip and an external device, such as a server or a computer. For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transferred from the server to the chip through the standard PCIE interface. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another interface; the present application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function. In addition, the computation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.
The control device is electrically connected to the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (Micro Controller Unit, MCU). The chip may include a plurality of processing chips, processing cores, or processing circuits, and may drive a plurality of loads; it can therefore be in different working states such as multi-load and light-load. The control device can regulate the working states of the processing chips, processing cores, and/or processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of action combinations. Those skilled in the art should understand, however, that the present application is not limited by the order of actions described, since some steps may be performed in other orders or concurrently. Furthermore, the embodiments described in the specification are alternative embodiments, and the actions and modules involved are not necessarily required by the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a logical functional division, and in actual implementation there may be other manners of division. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units described above may be implemented either in hardware or in software program modules.
If the integrated units are implemented in the form of software program modules and sold or used as stand-alone products, they may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiments of the present application have been described in detail above; the principles and implementations of the present application are explained herein using specific examples, and the above descriptions are provided only to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and application scope in accordance with the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (13)

1. A computing device comprising a memory unit and a controller unit, wherein the memory unit comprises: a 3D decoder and a 3D memory;
the controller unit is used for sending an access instruction to the 3D decoder;
The 3D decoder is used for decoding the access instruction transmitted by the controller unit to obtain address information of data to be accessed carried by the access instruction;
the 3D decoder is further configured to send the address information to the 3D memory;
the 3D memory is configured to access the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder;
the 3D memory includes: n active layers, wherein each active layer in the N active layers comprises a 2D memory, the 2D memory in the ith active layer is connected with the 2D memory in the (i+1) th active layer, i is more than or equal to 1 and less than N, i is an integer, and N is a positive integer greater than 1.
2. The computing device of claim 1, wherein each 2D memory is derived from an arrangement of M memory arrays of the same type, where M is a positive integer.
3. The computing device of claim 1 or 2, wherein the address information comprises: active layer index address information, memory index address information, row index address information, and column index address information; when the data to be accessed is accessed in the 3D memory according to the address information transmitted by the 3D decoder, the 3D memory is specifically configured to:
accessing the data to be accessed in a target storage space in the 3D memory, wherein the target storage space is the storage space corresponding to the row index address information and the column index address information in a target memory, the target memory is the memory corresponding to the memory index address information in a target active layer, and the target active layer is the active layer corresponding to the active layer index address information in the 3D memory.
4. The computing apparatus of claim 1, wherein the computing apparatus comprises an external device, an external storage unit of the external device comprising: 3D-dynamic random access memory and 3D-static random access memory; the data to be accessed comprises: input data and scalar data in the input data, wherein the input data comprises: input neuron data and weight data;
when the external storage unit is the 3D-dynamic random access memory, the 3D memory includes: a 3D-SRAM and a 3D-register;
the 3D-SRAM is used for storing the input data;
the 3D-register is used for storing the scalar data;
Or when the external storage unit is the 3D-static random access memory, the 3D memory is a 3D-register;
the 3D-register is used for storing the input data;
the 3D-register is further configured to store the scalar data.
5. The computing device of claim 4, wherein when the external storage unit is the 3D-dynamic random access memory, the 3D memory enters a first operation mode, the first operation mode comprising:
the 3D-SRAM accesses the input data in the 3D-SRAM according to the address information transmitted by the 3D decoder; and
the 3D-register accessing the scalar data in the 3D-register according to the address information transmitted by the 3D decoder;
or when the external storage unit is the 3D-static random access memory, the 3D memory enters a second operation mode, where the second operation mode includes:
the 3D-register accesses the input data and the scalar data in the 3D-register according to the address information transmitted by the 3D decoder.
6. A neural network chip, characterized in that it comprises a computing device according to any one of claims 1 to 5.
7. An electronic device comprising the chip of claim 6.
8. A board card, characterized in that the board card includes: a storage device, an interface device, a control device, and a neural network chip as claimed in claim 6;
the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
the control device is used for monitoring the state of the chip.
9. A computing method of executing a machine learning model, the computing method being applied to a computing device comprising a storage unit and a controller unit, wherein the storage unit comprises: a 3D decoder and a 3D memory; the method comprises the following steps:
the controller unit is used for sending an access instruction to the 3D decoder;
the 3D decoder is used for decoding the access instruction transmitted by the controller unit to obtain address information of data to be accessed carried by the access instruction;
The 3D decoder is further configured to send the address information to the 3D memory;
the 3D memory is configured to access the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder;
the 3D memory includes: n active layers, wherein each active layer in the N active layers comprises a 2D memory, the 2D memory in the ith active layer is connected with the 2D memory in the (i+1) th active layer, i is more than or equal to 1 and less than N, i is an integer, and N is a positive integer greater than 1.
10. The method of claim 9, wherein each 2D memory is formed from an arrangement of M memory arrays of the same type, where M is a positive integer.
11. The method according to claim 9 or 10, wherein the address information comprises: active layer index address information, memory index address information, row index address information, and column index address information; when the data to be accessed is accessed in the 3D memory according to the address information transmitted by the 3D decoder, the 3D memory accesses the data to be accessed in a target storage space in the 3D memory, wherein the target storage space is the storage space corresponding to the row index address information and the column index address information in a target memory, the target memory is the memory corresponding to the memory index address information in a target active layer, and the target active layer is the active layer corresponding to the active layer index address information in the 3D memory.
12. The method of claim 9, wherein the computing device comprises an external device, an external storage unit of the external device comprising: 3D-dynamic random access memory and 3D-static random access memory; the data to be accessed comprises: input data and scalar data in the input data, wherein the input data comprises: input neuron data and weight data;
when the external storage unit is the 3D-dynamic random access memory, the 3D memory includes: a 3D-SRAM and a 3D-register;
the 3D-SRAM is used for storing the input data;
the 3D-register is used for storing the scalar data;
or when the external storage unit is the 3D-static random access memory, the 3D memory is a 3D-register;
the 3D-register is used for storing the input data;
the 3D-register is further configured to store the scalar data.
13. The method of claim 12, wherein when the external storage unit is the 3D-dynamic random access memory, the 3D memory enters a first operation mode, the first operation mode comprising:
the 3D-SRAM accesses the input data in the 3D-SRAM according to the address information transmitted by the 3D decoder; and
the 3D-register accessing the scalar data in the 3D-register according to the address information transmitted by the 3D decoder;
or when the external storage unit is the 3D-static random access memory, the 3D memory enters a second operation mode, where the second operation mode includes:
the 3D-register accesses the input data and the scalar data in the 3D-register according to the address information transmitted by the 3D decoder.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910229823.4A CN111738429B (en) 2019-03-25 2019-03-25 Computing device and related product


Publications (2)

Publication Number Publication Date
CN111738429A (en) 2020-10-02
CN111738429B (en) 2023-10-13

Family

ID=72646330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910229823.4A Active CN111738429B (en) 2019-03-25 2019-03-25 Computing device and related product

Country Status (1)

Country Link
CN (1) CN111738429B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5448682A (en) * 1992-05-30 1995-09-05 Gold Star Electron Co., Ltd. Programmable multilayer neural network
WO2015016640A1 (en) * 2013-08-02 2015-02-05 Ahn Byungik Neural network computing device, system and method
CN105404925A (en) * 2015-11-02 2016-03-16 上海新储集成电路有限公司 Three-dimensional nerve network chip
CN105789139A (en) * 2016-03-31 2016-07-20 上海新储集成电路有限公司 Method for preparing neural network chip
CN106991477A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of artificial neural network compression-encoding device and method
CN107862380A (en) * 2017-10-19 2018-03-30 珠海格力电器股份有限公司 Artificial neural network computing circuit
CN108154227A (en) * 2016-12-06 2018-06-12 上海磁宇信息科技有限公司 A kind of neural network chip calculated using simulation
CN109104876A (en) * 2017-04-20 2018-12-28 上海寒武纪信息科技有限公司 A kind of arithmetic unit and Related product
CN109478139A (en) * 2016-08-13 2019-03-15 英特尔公司 Device, method and system for the access synchronized in shared memory

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2968808A1 (en) * 2010-12-08 2012-06-15 Commissariat Energie Atomique ELECTRONIC CIRCUIT WITH NEUROMORPHIC ARCHITECTURE
US10599429B2 (en) * 2018-06-08 2020-03-24 Intel Corporation Variable format, variable sparsity matrix multiplication instruction


Also Published As

Publication number Publication date
CN111738429A (en) 2020-10-02

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
CN109522052B (en) Computing device and board card
CN110163363B (en) Computing device and method
CN111047022B (en) Computing device and related product
CN111488976B (en) Neural network computing device, neural network computing method and related products
CN109670581B (en) Computing device and board card
CN110059797B (en) Computing device and related product
CN110163349B (en) Network model calculation method and device
CN111353591A (en) Computing device and related product
CN111930681B (en) Computing device and related product
CN111488963B (en) Neural network computing device and method
CN110059809B (en) Computing device and related product
CN111079908B (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN109753319B (en) Device for releasing dynamic link library and related product
CN109711540B (en) Computing device and board card
CN111047021B (en) Computing device and related product
CN111738429B (en) Computing device and related product
CN111368967A (en) Neural network computing device and method
CN110472734B (en) Computing device and related product
CN111047024B (en) Computing device and related product
CN111368990B (en) Neural network computing device and method
CN111368987B (en) Neural network computing device and method
CN111368986B (en) Neural network computing device and method
CN111222632B (en) Computing device, computing method and related product
CN111078624A (en) Network-on-chip processing system and network-on-chip data processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant