WO2022001454A1 - Integrated computing device, integrated circuit chip, board, and computing method - Google Patents


Info

Publication number
WO2022001454A1
WO2022001454A1 (PCT/CN2021/094721; CN2021094721W)
Authority
WO
WIPO (PCT)
Prior art keywords
processing circuit, sub-processing circuit, data
Prior art date
Application number
PCT/CN2021/094721
Other languages
English (en)
French (fr)
Inventor
刘少礼
于涌
Original Assignee
上海寒武纪信息科技有限公司 (Shanghai Cambricon Information Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司
Publication of WO2022001454A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines

Definitions

  • This disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to an integrated computing device, an integrated circuit chip, a board, and a method for performing arithmetic operations using the aforementioned integrated computing device.
  • Existing artificial intelligence operations often involve a large number of data operations, such as convolution operations and image processing. As the size of the data grows, the amount of computation and storage involved in data operations such as matrix operations increases sharply.
  • a general-purpose processor such as a central processing unit ("CPU") or a graphics processing unit ("GPU") is usually used for such computing.
  • general-purpose processors often have high power consumption due to their general-purpose features and high device redundancy, thus resulting in limited performance.
  • the existing operation processing circuit usually adopts a single hardware architecture, which can only process operations under a certain type of architecture, and cannot flexibly select a suitable processing circuit according to actual needs.
  • for some fixed hardware architectures that use hard-wired connections, when the data scale expands or the data format changes, the architecture may not only fail to support a certain type of operation, but operation performance may also be severely limited, or the operation may not be executable at all.
  • the present disclosure provides a solution that supports multiple types of operations and operation modes, improves operation efficiency, and saves operation cost and overhead. Specifically, the present disclosure provides the aforementioned solutions in the following aspects.
  • the present disclosure provides an integrated computing device comprising a main control circuit, a first main processing circuit, and a second main processing circuit, wherein: the main control circuit is configured to acquire a calculation instruction, parse the calculation instruction to obtain an operation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit; the first main processing circuit includes one or more groups of pipeline operation circuits, wherein each group of pipeline operation circuits is configured to perform pipeline operations according to the received data and the operation instruction; and the second main processing circuit includes a plurality of sub-processing circuits, wherein each sub-processing circuit is configured to perform multi-threaded operations according to the received data and the operation instruction.
  • the present disclosure provides an integrated circuit chip comprising the integrated computing device of the various embodiments described above and below.
  • the present disclosure provides a board including the aforementioned integrated circuit chip.
  • the present disclosure provides a method of performing arithmetic operations using an integrated computing device, wherein the integrated computing device includes a main control circuit, a first main processing circuit, and a second main processing circuit, the method comprising: using the main control circuit to acquire and parse a calculation instruction to obtain an operation instruction, and to send the operation instruction to at least one of the first main processing circuit and the second main processing circuit; using one or more groups of pipeline operation circuits included in the first main processing circuit to perform pipeline operations according to the received data and the operation instruction; and using a plurality of sub-processing circuits included in the second main processing circuit to perform multi-threaded operations according to the received data and the operation instruction.
  • FIG. 1 is an overall architecture diagram illustrating an integrated computing device according to an embodiment of the present disclosure
  • FIG. 2 is an exemplary specific architecture diagram illustrating an integrated computing device according to an embodiment of the present disclosure
  • FIG. 3 is an example block diagram illustrating a first main processing circuit according to an embodiment of the present disclosure
  • FIGS. 4a, 4b and 4c are schematic diagrams illustrating matrix conversion performed by a data conversion circuit according to an embodiment of the present disclosure
  • FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of sub-processing circuits according to an embodiment of the present disclosure
  • FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connection relationships of a plurality of sub-processing circuits according to an embodiment of the present disclosure
  • FIGS. 7a and 7b are schematic diagrams respectively showing different loop structures of a sub-processing circuit according to an embodiment of the present disclosure
  • FIGS. 8a and 8b are schematic diagrams respectively showing further different loop structures of sub-processing circuits according to embodiments of the present disclosure.
  • FIG. 9 is a schematic architectural diagram illustrating an integrated computing device and a slave processing circuit according to an embodiment of the present disclosure
  • FIG. 10 is a simplified flow diagram illustrating a method of performing computational operations using an integrated computing device according to an embodiment of the present disclosure
  • FIG. 11 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 12 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • FIG. 1 is a general architectural diagram illustrating an integrated computing device 100 according to an embodiment of the present disclosure.
  • the integrated computing device 100 of the present disclosure may include a main control circuit 102 , a first main processing circuit 104 and a second main processing circuit 106 .
  • the main control circuit may be configured to acquire and parse the calculation instruction to obtain the operation instruction, and to send the operation instruction to at least one of the first main processing circuit and the second main processing circuit.
  • the calculation instruction may be a form of hardware instruction and include one or more opcodes, and each opcode may represent a specific operation to be executed by the first main processing circuit or the second main processing circuit.
  • These operations may include different types of operations according to different application scenarios, for example, may include arithmetic operations such as addition operations or multiplication operations, logical operations, comparison operations, or table lookup operations, or any combination of the foregoing types of operations.
  • the operation instruction, obtained by parsing the calculation instruction, may be one or more micro-instructions executed inside the processing circuit.
  • an operation instruction may include one or more micro-instructions corresponding to an operation code in the calculation instruction to complete one or more operations.
  • the main control circuit 102 may be configured to obtain instruction identification information in the calculation instruction, and to send the operation instruction, according to the instruction identification information, to at least one of the first main processing circuit and the second main processing circuit. It can be seen that, with the aid of the aforementioned instruction identification information, the main control circuit can send an operation instruction specifically to the first main processing circuit and/or the second main processing circuit identified in the instruction identification information. Further, according to different application scenarios, the operation instruction obtained by parsing the calculation instruction may or may not have been decoded by the main control circuit.
  • corresponding decoding circuits may be included in the first main processing circuit and the second main processing circuit to decode the operation instruction, so as to obtain, for example, a plurality of micro-instructions.
  • in the process of parsing the calculation instruction, the main control circuit may be configured to decode the acquired calculation instruction, and then, according to the decoding result and the working states of the first main processing circuit and the second main processing circuit, send the operation instruction to at least one of the first main processing circuit and the second main processing circuit.
  • in some cases, both the first main processing circuit and the second main processing circuit support the same type of operation, which is not specific to either circuit. Therefore, in order to improve the utilization rate of the main processing circuits and improve operation efficiency, the operation instruction can be sent to the main processing circuit whose usage rate is low or which is in an idle state.
  • the first main processing circuit 104 may include one or more groups of pipeline operation circuits, wherein each group of pipeline operation circuits may be configured to perform pipeline operations according to received data and operation instructions.
  • each group of pipeline operation circuits may include at least one operator (eg, one or more adders) to perform one-stage pipeline operations.
  • when the operators included in a group of pipeline operation circuits need to be staged, or include multiple types of operators, the group of pipeline operation circuits can form a multi-stage operation pipeline and can be configured to perform multi-stage pipeline operations.
  • the structure of a group of pipeline operation circuits may include a three-stage pipeline composed of a first-stage adder, a second-stage multiplier, and a third-stage adder to perform addition and multiplication operations.
  • the structure of a group of pipeline operation circuits may include a three-stage pipeline composed of a multiplier, an adder, and a nonlinear operator, for the pipeline to complete addition, multiplication and activation operations.
  • the second main processing circuit 106 may include a plurality of sub-processing circuits, wherein each sub-processing circuit may be configured to perform multi-threaded operations according to received data and operation instructions.
  • the connection between the multiple sub-processing circuits may be either a hard-connected manner through hard-wired arrangement, or a logical connection manner configured according to, for example, micro-instructions, so as to form a plurality of sub-processing circuit arrays. Topology.
  • the aforementioned multiple sub-processing circuits can be connected and arranged in a one-dimensional or multi-dimensional array topology (as shown in FIGS. 5a to 5d and 6a to 6d), whereby each sub-processing circuit is connected to the other sub-processing circuits in a specified direction and within a predetermined interval pattern.
  • a plurality of sub-processing circuits may be connected in series via such connections to form one or more closed loops (as shown in FIGS. 7a, 7b, 8a and 8b).
  • FIG. 2 is an example specific architectural diagram illustrating an integrated computing device 200 according to an embodiment of the present disclosure. It can be seen from FIG. 2 that the integrated computing device 200 not only includes the main control circuit 102, the first main processing circuit 104, and the second main processing circuit 106 of the integrated computing device 100 in FIG. 1, but also shows the plurality of circuits included in the first main processing circuit 104 and the second main processing circuit 106; the technical details described with respect to FIG. 1 therefore also apply to what is shown in FIG. 2. Since the functions of the main control circuit, the first main processing circuit, and the second main processing circuit have been described in detail above with reference to FIG. 1, they will not be repeated below.
  • the first main processing circuit 104 may include multiple groups of pipeline operation circuits 109, wherein each group of pipeline operation circuits may include one or more operators; when a group of pipeline operation circuits includes multiple operators, these operators may be configured to perform multi-stage pipeline operations, that is, to constitute a multi-stage operation pipeline.
  • for example, a group of three-stage pipeline operation circuits of the present disclosure, including a multiplier, an adder, and a nonlinear operator, can be applied to perform an operation such as result = relu(a*ina + b).
  • the multiplier of the first-stage pipeline can be used to calculate the product of the input data ina and a to obtain the first-stage pipeline operation result.
  • the adder of the second-stage pipeline can be used to perform an addition operation on the first-stage pipeline operation result (a*ina) and b to obtain the second-stage pipeline operation result.
  • the relu activation function of the third-stage pipeline can be used to activate the second-stage pipeline operation result (a*ina+b) to obtain the final operation result result.
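  • As a plain illustration of the three-stage pipeline just described, the following Python sketch models each stage as a function computing result = relu(a*ina + b); the function names and the scalar treatment are illustrative assumptions, not taken from the patent text:

```python
def stage1_multiply(ina, a):
    """First-stage pipeline: product of input data ina and coefficient a."""
    return a * ina

def stage2_add(partial, b):
    """Second-stage pipeline: add b to the first-stage result."""
    return partial + b

def stage3_relu(x):
    """Third-stage pipeline: relu activation of the second-stage result."""
    return x if x > 0 else 0

def three_stage_pipeline(ina, a, b):
    # Data flows through the three stages in series, as in the text.
    return stage3_relu(stage2_add(stage1_multiply(ina, a), b))

print(three_stage_pipeline(ina=2, a=3, b=-1))   # relu(3*2 - 1) = 5
print(three_stage_pipeline(ina=-4, a=3, b=2))   # relu(-12 + 2) = 0
```

In hardware the three stages would operate concurrently on successive data elements; the serial function composition above only captures the data dependency between stages.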
  • for the convolution operation expressed above, the two input data ina and inb can be, for example, neuron data. An addition operation may then be performed on the first-stage pipeline operation result "product" by using the addition tree in the second-stage pipeline operation circuit to obtain the second-stage pipeline operation result "sum", and the nonlinear operator of the third-stage pipeline operation circuit may be used to perform an activation operation on "sum" to obtain the final convolution operation result.
  • one or more operators included in each group of pipeline operation circuits can not only perform the above-mentioned four arithmetic operations, but also perform various operations such as table lookup or data type conversion.
  • regarding data type conversion: when the input data ina is floating-point 32-bit data (represented as float32), the operator can be used to convert it, according to actual operation requirements, into data types such as floating-point 16-bit data (represented as float16), fixed-point 32-bit data (represented as fix32), or integer 8-bit data (represented as int8).
  • the pipeline operation circuit of the present disclosure can not only support the above-mentioned conversion operations of various data types, but also support functions such as absolute value operation and hardening operation of various data types.
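  • As a hedged illustration of the conversions named above, the following Python sketch emulates a float32-to-float16 round trip with the standard-library struct module, a saturating int8 cast, and a fixed-point encoding. The patent does not specify the fix32 format, so the Q16.16 layout here is an assumption, as are the function names:

```python
import struct

def float32_to_float16(x):
    """Round-trip a value through IEEE-754 half precision ('e' format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def float32_to_int8(x):
    """Saturating conversion to signed 8-bit integer."""
    return max(-128, min(127, int(x)))

def float32_to_fix32_q16(x):
    """Encode as Q16.16 fixed point (assumed format): value * 2**16, truncated."""
    return int(x * (1 << 16))

print(float32_to_float16(1.5))    # 1.5 is exactly representable in float16
print(float32_to_int8(300.0))     # saturates to 127
print(float32_to_fix32_q16(1.5))  # 98304 == 1.5 * 65536
```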
  • the first main processing circuit 104 may further include an operation processing circuit 111, which may be configured, according to the operation instruction, to preprocess data (e.g., input neurons) before the operation performed by the pipeline operation circuit, or to post-process data (e.g., output neurons) after the operation.
  • the arithmetic processing circuit 111 may also be used in cooperation with the slave processing circuit 112 shown in FIG. 9 to complete the intended arithmetic operation.
  • the aforementioned preprocessing and postprocessing may include, for example, data splitting and/or data splicing operations.
  • in the scenario of performing a splitting operation on the data, the operation processing circuit can split the data N into even-numbered rows (denoted as N_2i, where i can be a natural number greater than or equal to 0) and odd-numbered rows (denoted as N_2i+1). Further, in the scenario of performing a splicing operation on the data, the lower 256 bits of the even-numbered row N_2i of the split data N in the previous example can be used as the low-order bits and the lower 256 bits of the odd-numbered row N_2i+1 as the high-order bits, concatenated according to predetermined requirements to form new data with 512 bits.
  • in another scenario, the operation processing circuit may split the lower 256 bits of the even-numbered rows of the data M with 8 bits as one unit of data, obtaining 32 even-row unit data (represented as M_2i_0 to M_2i_31, respectively). Similarly, the lower 256 bits of the odd-numbered rows of the data M can be split with 8 bits as one unit of data, obtaining 32 odd-row unit data (represented as M_(2i+1)_0 to M_(2i+1)_31). The 32 even-row unit data and 32 odd-row unit data after splitting are then arranged alternately, in order of data bits from low to high, even row first and then odd row: the even-row unit data 0 (M_2i_0) is placed in the lowest position, followed by the odd-row unit data 0 (M_(2i+1)_0), then the even-row unit data 1 (M_2i_1), and so on, until the 64 unit data are concatenated to form new data of 512 bits.
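  • The splitting-and-interleaving scheme above can be sketched as follows, modeling each 256-bit row as a Python integer; the function names and the integer representation are illustrative assumptions:

```python
def split_units(row_256bit, unit_bits=8, units=32):
    """Split the lower 256 bits of a row into 32 8-bit units, low bits first."""
    mask = (1 << unit_bits) - 1
    return [(row_256bit >> (unit_bits * k)) & mask for k in range(units)]

def interleave(even_row, odd_row):
    """Alternate even-row and odd-row units (even first) into one 512-bit value."""
    even_units = split_units(even_row)
    odd_units = split_units(odd_row)
    result = 0
    shift = 0
    for e, o in zip(even_units, odd_units):
        result |= e << shift          # even-row unit in the lower position
        result |= o << (shift + 8)    # odd-row unit just above it
        shift += 16
    return result

# Example with recognizable byte patterns: even units 0..31, odd units 100..131.
even = int.from_bytes(bytes(range(0, 32)), 'little')
odd = int.from_bytes(bytes(range(100, 132)), 'little')
out = interleave(even, odd)
print(out.to_bytes(64, 'little')[:4])  # first four units: 0, 100, 1, 101
```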
  • the first main processing circuit 104 may further include a data conversion circuit 113, which may be configured to perform a data conversion operation according to the operation instruction.
  • the data conversion operation may be a transformation performed on the arrangement position of the matrix elements.
  • the transformation may include, for example, matrix transposition and mirroring (described later in conjunction with Figures 4a-4c), matrix rotation according to a predetermined angle (eg, 90 degrees, 180 degrees or 270 degrees), and transformation of matrix dimensions.
  • the second main processing circuit 106 may include a plurality of sub-processing circuits 115 .
  • each of the sub-processing circuits may include a logic operation circuit 1151, which may be configured to perform logic operations according to the operation instruction and received data, such as NOR operations, shift operations, or comparison operations on the received data.
  • each sub-processing circuit may also include an arithmetic operation circuit 1153, which may be configured to perform arithmetic operation operations, such as linear operations such as addition, subtraction, or multiplication.
  • each sub-processing circuit may include storage circuitry 1152 including data storage circuitry and/or predicate storage circuitry, wherein the data storage circuitry may be configured to store operational data (eg, pixels) for the sub-processing circuitry and at least one of the intermediate operation results.
  • the predicate storage circuit may be configured to store, for each of the sub-processing circuits, the predicate storage circuit serial number and the predicate information obtained by using the operation instruction.
  • the storage circuit 1152 can be implemented by using a register or a memory such as a static random access memory "SRAM" according to actual needs.
  • the predicate storage circuit may include a 1-bit register for storing predicate information.
  • the predicate storage circuit in the sub-processing circuit may include 32 1-bit registers sequentially numbered from 00000 to 11111.
  • the sub-processing circuit can read the predicate information in the register corresponding to the serial number "00101" according to the register serial number "00101" specified in the received operation instruction.
  • the predicate storage circuit may be configured to update the predicate information according to the operation instruction.
  • the predicate information can be directly updated according to the configuration information in the operation instruction, or the configuration information can be acquired according to the configuration information storage address provided in the operation instruction, so as to update the predicate information.
  • the predicate storage circuit may also update the predicate information according to the comparison result of each of the sub-processing circuits (which is a form of operation result in the context of the present disclosure).
  • the predicate information may be updated by comparing the input data received by the sub-processing circuit with stored data in its data storage circuit. When the input data is greater than the stored data, the predicate information of the sub-processing circuit is set to 1. Conversely, when the input data is smaller than the stored data, the predicate information is set to 0, or its original value is kept unchanged.
  • before executing an arithmetic operation, each sub-processing circuit can judge, according to the information in the operation instruction, whether it should execute the operation of that instruction. Further, each of the sub-processing circuits may be configured to acquire the predicate information from the predicate storage circuit according to the serial number of the predicate storage circuit given in the operation instruction, and to determine from the predicate information whether the sub-processing circuit executes the operation instruction.
  • when the value obtained by reading the predicate information according to the serial number of the predicate storage circuit specified in the operation instruction is 1, the sub-processing circuit executes the operation instruction (for example, the sub-processing circuit can be made to read the data pointed to in the instruction and store the read data into its data storage circuit). Conversely, when the value read is 0, the sub-processing circuit does not execute the operation instruction.
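  • The predicate-gated execution described above can be sketched as follows; the class and method names are hypothetical, and the "execute" action is reduced to a simple data load into the data storage circuit:

```python
class SubProcessingCircuit:
    def __init__(self):
        self.predicates = [0] * 32   # 32 one-bit predicate registers
        self.data_store = []         # simplified data storage circuit

    def update_predicate_by_compare(self, reg, input_data, stored_data):
        """Set the predicate to 1 when input > stored, 0 when input < stored."""
        if input_data > stored_data:
            self.predicates[reg] = 1
        elif input_data < stored_data:
            self.predicates[reg] = 0  # or keep the original value unchanged

    def maybe_execute(self, reg, data):
        """Execute (here: load data) only if predicate register `reg` holds 1."""
        if self.predicates[reg] == 1:
            self.data_store.append(data)
            return True
        return False

spc = SubProcessingCircuit()
spc.update_predicate_by_compare(reg=0b00101, input_data=7, stored_data=3)
print(spc.maybe_execute(reg=0b00101, data="payload"))  # True: predicate is 1
```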
  • the second main processing circuit 106 may further include a data processing circuit 117, which may include at least one of a pre-processing circuit and a post-processing circuit.
  • the preprocessing circuit may be configured to perform a preprocessing operation (described later in conjunction with FIG. 7b ) on the operation data before the sub-processing circuit performs the operation, such as performing a data splicing or data placement operation.
  • the post-processing circuit may be configured to perform a post-processing operation on the operation result after the sub-processing circuit performs the operation, such as performing data restoration or data compression.
  • the integrated computing device 200 of the present disclosure may also include a main storage circuit 108, which may receive and store data from the main control circuit as input data to the first and/or second main processing circuit.
  • the main storage circuit may be further divided according to the storage mode or the characteristics of the stored data, and the main storage circuit 108 may include at least one of the main storage module 119 and the main cache module 121 .
  • the main storage module 119 may be configured to store the data used for operations to be performed in the first main processing circuit and/or the second main processing circuit (for example, neuron or pixel data in a neural network) and the operation results after the operations are performed (for example, the result of a convolution operation in the neural network).
  • the main cache module 121 may be configured to cache an intermediate operation result after at least one of the first main processing circuit and the second main processing circuit performs an operation.
  • the pipeline operation circuit in the first main processing circuit can also perform corresponding operations by means of the mask stored in the main storage circuit.
  • the pipeline operation circuit can read a mask from the main storage circuit, and can use the mask to indicate whether the data for the operation operation in the pipeline operation circuit is valid.
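  • The mask mechanism can be sketched as an element-wise gate over the operation data: each mask bit indicates whether the corresponding element participates in the operation. The function name and the placeholder operation (an increment) are illustrative assumptions, since the patent does not fix a specific masked operation:

```python
def masked_op(data, mask, op):
    """Apply `op` only to elements whose mask bit is 1; pass others through."""
    return [op(x) if m else x for x, m in zip(data, mask)]

result = masked_op([10, 20, 30, 40], mask=[1, 0, 1, 0], op=lambda x: x + 1)
print(result)  # [11, 20, 31, 40]: only elements marked valid were operated on
```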
  • the main storage circuit can not only perform internal storage applications, but also have the function of data interaction with storage devices outside the integrated computing device of the present disclosure, such as data exchange with external storage devices through direct memory access ("DMA").
  • the architecture and functions of the integrated computing device are described in detail above with reference to FIGS. 1 and 2; an exemplary description of the specific application of the first main processing circuit is given below with reference to FIGS. 3 and 4a to 4c.
  • FIG. 3 is an example block diagram illustrating a first main processing circuit according to an embodiment of the present disclosure.
  • the following will further describe the cooperative relationship between multiple groups of pipeline operation circuits in the first main processing circuit and between multi-stage pipelines.
  • the first main processing circuit 104 may include one or more sets of pipeline operation circuits 109 (two sets as shown in the figure).
  • Each group of pipeline operation circuits may include one or more stages of pipeline operation circuits (the first stage pipeline operation circuit to the Nth stage pipeline operation circuit shown in each group in the figure).
  • the one-stage or multi-stage pipeline operation circuit can perform one-stage or multi-stage pipeline operations according to the received data and operation instructions.
  • the structure of a group of pipeline operation circuits may include multiple operators of one or more types, such as counters, adders, multipliers, addition trees, accumulators, and nonlinear operators, used to perform multi-stage pipeline operations.
  • multi-stage pipeline operations can be performed serially or in parallel.
  • one operation instruction of the present disclosure can be executed by a set of multi-stage pipeline operation circuits.
  • the operation instruction includes a plurality of serial operations, and each of the first stage, the second stage or the Nth stage pipeline operation circuit in a group of pipeline operation circuits can execute one operation to complete the operation instruction.
  • the convolution operation performed by the three-stage pipeline operation circuit described above in conjunction with FIG. 2 is a serial pipeline operation.
  • when multiple groups of pipeline operation circuits 109 all perform operations, they can simultaneously execute multiple operation instructions, that is, parallel operation among multiple instructions.
  • in some application scenarios, a bypass operation can be performed on one or more stages of the pipeline operation circuit that will not be used in the operation; that is, one or more stages of the multi-stage pipeline operation circuit can be selectively used according to the needs of the operation, without requiring the operation to pass through all stages of the multi-stage pipeline. Only the pipeline stages needed for the operation are used to obtain the final operation result, and pipeline operation circuits that are not used can be bypassed before or during the pipeline operation.
  • each group of pipeline operation circuits can independently perform the pipeline operations.
  • each of the plurality of groups of pipelined circuits may also perform the pipelining operations cooperatively.
  • the output of the first stage and the second stage in the first group of pipeline operation circuits after performing serial pipeline operation can be used as the input of the third stage of pipeline operation of another group of pipeline operation circuits.
  • alternatively, the first stage and the second stage in the first group of pipeline operation circuits may perform parallel pipeline operations and respectively output their pipeline operation results as the input of the first and/or second stage pipeline operation of another group of pipeline operation circuits.
  • 4a, 4b and 4c are schematic diagrams illustrating matrix conversion performed by a data conversion circuit according to an embodiment of the present disclosure.
  • the following will further describe, as an example, the transposition operation and the horizontal mirror operation performed on the original matrix.
  • the original matrix is a matrix of (M+1) rows by (N+1) columns.
  • the data conversion circuit can perform a transposition operation on the original matrix shown in FIG. 4a to obtain the matrix shown in FIG. 4b.
  • the data conversion circuit can exchange the row numbers and column numbers of elements in the original matrix to form a transposed matrix.
  • for example, the element "10", whose coordinates in the original matrix shown in Fig. 4a are the 1st row, 0th column, has coordinates of the 0th row, 1st column in the transposed matrix shown in Fig. 4b. Similarly, the element "M0", whose coordinates in the original matrix are the Mth row, 0th column, has coordinates of the 0th row, Mth column in the transposed matrix shown in Fig. 4b.
  • the data conversion circuit may perform a horizontal mirror operation on the original matrix shown in FIG. 4a to form a horizontal mirror matrix.
  • the data conversion circuit can convert the arrangement order from the first row element to the last row element in the original matrix into the arrangement order from the last row element to the first row element through the horizontal mirror operation, and the elements in the original matrix
  • the column numbers remain the same.
  • for example, the elements "00" at row 0, column 0 and "10" at row 1, column 0 of the original matrix shown in Fig. 4a are located at row M, column 0 and row M-1, column 0 of the horizontal mirror matrix, respectively.
  • likewise, the element "M0" at row M, column 0 of the original matrix is located at row 0, column 0 of the horizontal mirror matrix.
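  • the index mappings described above (transpose: the element at row i, column j moves to row j, column i; horizontal mirror: row i moves to row M-i while columns stay fixed) can be sketched in Python; this is a hypothetical illustration of the mappings, not the circuit implementation:

```python
def transpose(matrix):
    """Swap the row and column indices of every element: (i, j) -> (j, i)."""
    rows, cols = len(matrix), len(matrix[0])
    return [[matrix[i][j] for i in range(rows)] for j in range(cols)]

def horizontal_mirror(matrix):
    """Reverse the row order; column numbers stay the same: (i, j) -> (rows-1-i, j)."""
    return matrix[::-1]

# Original matrix with elements labeled "rc" (row, column), as in Fig. 4a.
original = [["00", "01"], ["10", "11"], ["20", "21"]]
# After transpose, element "10" (row 1, col 0) sits at row 0, col 1.
# After the horizontal mirror, "00" (row 0) sits in the last row.
```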
  • 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of sub-processing circuits according to an embodiment of the present disclosure.
  • the multiple sub-processing circuits of the present disclosure may be connected in a one-dimensional or multi-dimensional array topology.
  • the multi-dimensional array may be a two-dimensional array, and each sub-processing circuit in the two-dimensional array may be connected, in at least one of its row direction, column direction or diagonal direction, with one or more of the remaining sub-processing circuits in the same row, the same column or the same diagonal according to a predetermined two-dimensional interval pattern.
  • Figures 5a to 5c exemplarily show the topology of various forms of two-dimensional arrays between a plurality of sub-processing circuits.
  • five sub-processing circuits are connected to form a simple two-dimensional array. Specifically, with one sub-processing circuit as the center of the two-dimensional array, one sub-processing circuit is connected in each of the four horizontal and vertical directions relative to it, thereby forming a two-dimensional array of three rows and three columns. Further, since the sub-processing circuit located at the center of the two-dimensional array is directly connected to the adjacent sub-processing circuits in the previous and next columns of the same row, and to the adjacent sub-processing circuits in the previous and next rows of the same column, the number of spaced sub-processing circuits ("interval number" for short) is zero.
  • each sub-processing circuit is connected to its adjacent sub-processing circuits in the previous row and the next row, and in the previous column and the next column, respectively; that is, the interval number of all adjacent sub-processing circuit connections is zero.
  • the first sub-processing circuit located in each row or column of the two-dimensional Torus array is also connected to the last sub-processing circuit of that row or column, and the interval number between these end-to-end connected sub-processing circuits of each row or column is 2.
  • the sub-processing circuits of four rows and four columns can also be connected to form a two-dimensional array in which the interval number between adjacent sub-processing circuits is 0 and the interval number between non-adjacent sub-processing circuits is 1. Further, in the two-dimensional array, sub-processing circuits adjacent in the same row or the same column are directly connected (interval number 0), while sub-processing circuits not adjacent in the same row or the same column are connected with an interval number of 1.
  • a three-dimensional Torus array is based on the two-dimensional Torus array and uses an interval pattern similar to that between rows and columns for the interlayer connections. For example, first, the sub-processing circuits in the same row and the same column of adjacent layers are directly connected, that is, the interval number is 0. Next, the sub-processing circuits of the first layer and the last layer in the same column are connected, that is, the interval number is 2. Finally, a three-dimensional Torus array with four layers, four rows and four columns can be formed.
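  • the end-to-end row and column connections of the Torus array described above can be sketched as neighbor computation with wraparound indices; a hypothetical Python illustration, assuming a four-row, four-column array:

```python
def torus_neighbors(r, c, rows=4, cols=4):
    """Return the four directly connected neighbors of the sub-processing
    circuit at (r, c) in a 2D Torus array. Adjacent circuits connect with
    interval number 0; the modulo wrap joins the first and last circuit of
    each row and column end to end."""
    return [
        ((r - 1) % rows, c),  # previous row, same column
        ((r + 1) % rows, c),  # next row, same column
        (r, (c - 1) % cols),  # previous column, same row
        (r, (c + 1) % cols),  # next column, same row
    ]

# circuit (0, 0) wraps around to the last row and the last column:
# torus_neighbors(0, 0) contains (3, 0) and (0, 3)
```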
  • connection relationship of other multi-dimensional arrays of sub-processing circuits can be formed on the basis of the two-dimensional array by adding new dimensions and increasing the number of sub-processing circuits.
  • the solutions of the present disclosure may also configure logical connections to sub-processing circuits by using configuration instructions.
  • the solution of the present disclosure can also selectively connect some sub-processing circuits or selectively bypass some sub-processing circuits through configuration instructions to form One or more logical connections.
  • the aforementioned logical connections can also be adjusted according to actual operation requirements (eg, data type conversion).
  • the solutions of the present disclosure can configure the connection of the sub-processing circuits, including, for example, configuring into a matrix or configuring into one or more closed computing loops.
  • FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connection relationships of a plurality of sub-processing circuits according to an embodiment of the present disclosure. As can be seen from the figures, FIGS. 6a to 6d are still another exemplary connection relationship of the multi-dimensional array formed by the plurality of sub-processing circuits shown in FIGS. 5a to 5d . The technical details also apply to what is shown in Figures 6a to 6d.
  • the sub-processing circuits of the two-dimensional array include a central sub-processing circuit located in the center of the two-dimensional array and three sub-processing circuits connected to the central sub-processing circuit in each of the four directions of the same row and the same column. Accordingly, the interval numbers of the connections between the central sub-processing circuit and the remaining sub-processing circuits are 0, 1 and 2, respectively.
  • the sub-processing circuits of the two-dimensional array include a central sub-processing circuit located in the center of the two-dimensional array, three sub-processing circuits distributed in the two opposite directions of the same row as the central sub-processing circuit, and one sub-processing circuit in each of the two opposite directions of the same column. Accordingly, the interval numbers of the connections between the central sub-processing circuit and the sub-processing circuits in the same row are 0 and 2, respectively, while the interval number for the sub-processing circuits in the same column is 0.
  • a multi-dimensional array formed by a plurality of sub-processing circuits can be a three-dimensional array formed by a plurality of layers, wherein each layer of the three-dimensional array may comprise a two-dimensional array of a plurality of the sub-processing circuits arranged along its row and column directions. Further, each sub-processing circuit in the three-dimensional array may be connected, in at least one of the row direction, the column direction, the diagonal direction and the layer direction, with one or more of the remaining sub-processing circuits in the same row, the same column, the same diagonal or on different layers according to a predetermined three-dimensional interval pattern.
  • the predetermined three-dimensional interval pattern and the number of mutually spaced sub-processing circuits in a connection may be related to the number of spaced layers.
  • the connection manner of the three-dimensional array will be further described below with reference to FIG. 6c and FIG. 6d.
  • FIG. 6c shows a multi-layer, multi-row and multi-column three-dimensional array formed by connecting a plurality of sub-processing circuits.
  • taking the sub-processing circuit located at the lth layer, the rth row and the cth column (denoted as (l, r, c)) as an example, it is located at the center of the array and is connected respectively to the sub-processing circuits at the previous column (l, r, c-1) and the next column (l, r, c+1) of the same layer and row, and to the sub-processing circuits at the previous row (l, r-1, c) and the next row (l, r+1, c) of the same layer and column. Further, the interval number of these connections is 0.
  • FIG. 6d shows a three-dimensional array when the number of spaces connected between a plurality of sub-processing circuits in the row direction, the column direction, and the layer direction is all one.
  • taking the sub-processing circuit located at the center of the array (l, r, c) as an example, it is connected to the sub-processing circuits at (l, r, c-2) and (l, r, c+2), which are spaced one column away in the same layer and row; to the sub-processing circuits at (l, r-2, c) and (l, r+2, c), which are spaced one row away in the same layer and column; and to the sub-processing circuits at (l-2, r, c) and (l+2, r, c), which are spaced one layer away in the same row and column.
  • similarly, the remaining sub-processing circuits are connected with an interval number of 1: in the same layer and row, (l, r, c-3) is connected to (l, r, c-1), and (l, r, c+1) is connected to (l, r, c+3);
  • in the same layer and column, (l, r-3, c) is connected to (l, r-1, c), and (l, r+1, c) is connected to (l, r+3, c);
  • in the same row and column, (l-3, r, c) is connected to (l-1, r, c), and (l+1, r, c) is connected to (l+3, r, c).
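  • the interval-1 connections described above link each circuit (l, r, c) to the circuits two positions away along the layer, row and column directions; a hypothetical Python sketch of this neighbor rule (boundary handling at the edges of the array is omitted):

```python
def interval_neighbors(l, r, c, interval=1):
    """Neighbors of the sub-processing circuit at (l, r, c) when the interval
    number along the layer, row and column directions is `interval`:
    connected circuits are separated by `interval` circuits, i.e. they sit
    interval + 1 positions apart along each axis."""
    step = interval + 1
    return [
        (l, r, c - step), (l, r, c + step),  # same layer and row
        (l, r - step, c), (l, r + step, c),  # same layer and column
        (l - step, r, c), (l + step, r, c),  # same row and column
    ]

# with interval number 1, circuit (3, 3, 3) connects to (3, 3, 1), (3, 3, 5), etc.
```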
  • the connection relationships of the multi-dimensional arrays formed by a plurality of sub-processing circuits have been exemplarily described above, and the loop structures formed by the sub-processing circuits will be further exemplarily described below with reference to FIG. 7 and FIG. 8.
  • FIG. 7a and 7b are schematic diagrams respectively illustrating different loop structures of a sub-processing circuit according to an embodiment of the present disclosure.
  • the four adjacent sub-processing circuits 115 are sequentially numbered "0, 1, 2 and 3".
  • the connection sequence shown in FIG. 7a is only an example and not a limitation. Those skilled in the art can also connect the four sub-processing circuits in series in a counterclockwise direction to form a closed loop according to actual calculation needs.
  • multiple sub-processing circuits may be combined into a sub-processing circuit group to represent one data item. For example, suppose each sub-processing circuit can process 8-bit data. When 32-bit data needs to be processed, four sub-processing circuits can be combined into a sub-processing circuit group, so that four 8-bit data items can be concatenated to form one 32-bit data item. Further, a sub-processing circuit group formed by the aforementioned four 8-bit sub-processing circuits can serve as one sub-processing circuit 115 shown in FIG. 7b, thereby supporting arithmetic operations of higher bit widths.
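  • the grouping described above (four 8-bit sub-processing circuits jointly representing one 32-bit value) amounts to byte concatenation; a hypothetical Python sketch, where the most-significant-first byte order is an assumption:

```python
def combine_8bit(parts):
    """Combine four 8-bit values (most significant first) into one 32-bit value,
    as when four sub-processing circuits form a sub-processing circuit group."""
    assert len(parts) == 4 and all(0 <= p < 256 for p in parts)
    value = 0
    for p in parts:
        value = (value << 8) | p  # shift previous bytes up, append the next byte
    return value

def split_32bit(value):
    """Split a 32-bit value back into four 8-bit parts (most significant first)."""
    return [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]

# combine_8bit([0x12, 0x34, 0x56, 0x78]) -> 0x12345678
```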
  • in FIG. 7b, the layout of the sub-processing circuits is similar to that shown in Fig. 7a, but the interval numbers of the connections between the sub-processing circuits in Fig. 7b differ from those in Fig. 7a.
  • the four sub-processing circuits numbered sequentially 0, 1, 2 and 3 start from sub-processing circuit 0 in a clockwise direction and connect sub-processing circuit 1, sub-processing circuit 3 and sub-processing circuit 2 in sequence, and sub-processing circuit 2 is finally connected back to sub-processing circuit 0, forming a closed loop in series.
  • the number of intervals of the sub-processing circuits shown in Figure 7b is 0 or 1, for example, the interval between sub-processing circuits 0 and 1 is 0, and the interval between sub-processing circuits 1 and 3 is 1.
  • the physical addresses of the four sub-processing circuits in the illustrated closed loop may be 0-1-2-3, while the logical addresses are 0-1-3-2. Therefore, when data of high bit width needs to be split and allocated to different sub-processing circuits, the data sequence can be rearranged and allocated according to the logical addresses of the sub-processing circuits.
  • the pre-processing circuit can rearrange the input data according to the physical addresses and logical addresses of the plurality of sub-processing circuits, so as to meet the requirements of the data operation. Assuming that the four sequentially arranged sub-processing circuits 0 to 3 are connected as shown in FIG. 7a, since both the physical and logical addresses of the connection are 0-1-2-3, the pre-processing circuit can transfer the input data (eg, pixel data) aa0, aa1, aa2 and aa3 sequentially to the corresponding sub-processing circuits.
  • in contrast, when the sub-processing circuits are connected as shown in FIG. 7b, with logical addresses 0-1-3-2, the pre-processing circuit needs to rearrange the input data aa0, aa1, aa2 and aa3 into aa0-aa1-aa3-aa2 for transmission to the corresponding sub-processing circuits.
  • the solution of the present disclosure can ensure the correctness of the data operation sequence.
  • the post-processing circuit described in conjunction with FIG. 2 can be used to restore and adjust the order of the operation output results to bb0-bb1-bb2-bb3, to ensure consistency of arrangement between the input data and the output result data.
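  • the rearrangement by logical address described above (physical order 0-1-2-3 versus logical order 0-1-3-2) can be sketched as a permutation applied before the operation and its inverse applied afterwards to restore the result order; a hypothetical Python illustration:

```python
def rearrange(data, logical_order):
    """Pre-processing step: place the element with logical address i at the
    physical position that carries that logical address, so each
    sub-processing circuit (in physical order) receives the right element."""
    out = [None] * len(data)
    for physical_pos, logical_addr in enumerate(logical_order):
        out[physical_pos] = data[logical_addr]
    return out

def restore(results, logical_order):
    """Post-processing step: the inverse permutation, putting the operation
    results back into the original data order."""
    out = [None] * len(results)
    for physical_pos, logical_addr in enumerate(logical_order):
        out[logical_addr] = results[physical_pos]
    return out

logical = [0, 1, 3, 2]  # logical addresses along the closed loop of Fig. 7b
sent = rearrange(["aa0", "aa1", "aa2", "aa3"], logical)
# sent == ["aa0", "aa1", "aa3", "aa2"], i.e. the aa0-aa1-aa3-aa2 order
```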
  • Figures 8a and 8b are schematic diagrams respectively showing further different loop structures of sub-processing circuits according to embodiments of the present disclosure, wherein more sub-processing circuits are shown arranged and connected in different ways to form a closed loop.
  • the 16 sub-processing circuits 115 numbered sequentially from 0, 1 . . . 15 start from sub-processing circuit 0, and sequentially connect and combine every two sub-processing circuits to form a sub-processing circuit group.
  • sub-processing circuit 0 is connected with sub-processing circuit 1 to form a sub-processing circuit group . . .
  • the sub-processing circuit 14 is connected with the sub-processing circuit 15 to form one sub-processing circuit group, and finally eight sub-processing circuit groups are formed.
  • the eight sub-processing circuit groups can also be connected in a manner similar to the aforementioned sub-processing circuits, including connection according to, for example, predetermined logical addresses, so as to form a closed loop of the sub-processing circuit groups.
  • a plurality of sub-processing circuits 115 are connected in an irregular or non-uniform manner to form a closed loop.
  • the interval number between the sub-processing circuits can be 0 or 3 to form a closed loop; for example, sub-processing circuit 0 can be connected with sub-processing circuit 1 (interval number 0) and with sub-processing circuit 4 (interval number 3), respectively.
  • the sub-processing circuits of the present disclosure may be spaced by different numbers of sub-processing circuits so as to be connected to form a closed loop.
  • any number of intermediate intervals can also be selected for dynamic configuration, so as to form a closed loop.
  • the connection of the plurality of sub-processing circuits may be a hard connection formed by hardware, or may be a soft connection configured by software.
  • FIG. 9 is a schematic architectural diagram illustrating an integrated computing device and a slave processing circuit according to an embodiment of the present disclosure. It should be noted that the architectural diagrams of the integrated computing device and the slave processing circuit of the present disclosure are only illustrative and not limiting. In addition to performing pipeline operations and multi-threading operations, the solution of the present disclosure can also perform other types of data operations in cooperation with slave processing circuits.
  • an integrated computing device with a similar architecture to that of FIGS. 1-2 includes a main control circuit 102 , a first main processing circuit 104 and a second main processing circuit 106 . Further, at least one of the first master processing circuit and the second master processing circuit may communicate with at least one slave processing circuit 112 through the interconnection circuit 110 .
  • the interconnecting circuit 110 may be used to forward data, operation instructions or intermediate operation results transmitted between the first master processing circuit or the second master processing circuit and the at least one slave processing circuit.
  • the at least one slave processing circuit may be configured to receive, through the interconnection circuit, data and operation instructions transmitted from at least one of the first master processing circuit and the second master processing circuit, so as to perform intermediate operations in parallel to obtain multiple intermediate operation results.
  • the plurality of intermediate operation results may be transmitted to at least one of the first main processing circuit or the second main processing circuit through an interconnection circuit.
  • the first main processing circuit may be configured to receive and execute the operation instruction in a single instruction multiple data (“SIMD") manner
  • the second main processing circuit may be configured to receive and execute the operation instruction in a single instruction multiple thread (Single Instruction Multiple Thread, "SIMT") manner.
  • FIG. 10 is a simplified flow diagram illustrating a method 1000 of performing computational operations using an integrated computing device according to an embodiment of the present disclosure.
  • the integrated computing device may apply the architecture shown in FIGS. 1-2 .
  • the method 1000 may utilize the main control circuit to obtain calculation instructions, and may parse the calculation instructions to obtain operation instructions. Also, an operation instruction may be sent to at least one of the first main processing circuit and the second main processing circuit.
  • the main control circuit may determine the first and/or second main processing circuit to perform the operation according to the instruction identification information in the calculation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit to perform the corresponding operation specified by the operation instruction.
  • the main control circuit may perform a decoding operation on the calculation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit according to the decoding result.
  • the main control circuit can, according to the load conditions of the first main processing circuit and the second main processing circuit, send the operation instruction to the main processing circuit with a lower utilization rate or in an idle state.
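  • the load-based dispatch described above can be sketched as choosing the main processing circuit with the lowest current load; a hypothetical Python illustration (the circuit names and load values are illustrative, not part of the disclosure):

```python
def dispatch(instruction, loads):
    """Send the operation instruction to the main processing circuit with the
    lowest load; an idle circuit has load 0. `loads` maps circuit name -> load."""
    target = min(loads, key=loads.get)
    return target, instruction

# the busy pipeline circuit (load 5) is skipped in favor of the idle SIMT circuit
target, _ = dispatch("conv_op", {"pipeline_circuit": 5, "simt_circuit": 0})
```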
  • the operation instruction obtained after parsing the calculation instruction may also be an operation instruction that has not been decoded by the main control circuit.
  • the first or second main processing circuit may include a corresponding decoding circuit to decode the received operation instruction, for example, to generate a plurality of micro-instructions, so that the first or second main processing circuit can perform subsequent operations according to the micro-instructions.
  • the flow may proceed to steps 1020 and/or 1030 according to at least one of the first main processing circuit and the second main processing circuit determined at step 1010 to perform the next operation.
  • the method 1000 may utilize one or more groups of pipeline operation circuits included in the first main processing circuit to perform pipeline operations according to the received data (eg, neuron data) and the operation instruction.
  • each of the multiple sets of pipelined circuits may perform the pipeline operations independently or cooperatively.
  • the multiple groups of pipeline operation circuits of the present disclosure support independent completion of respective pipeline operations, and these pipeline operations can be performed in parallel with each other. Further, these parallel pipeline operations may involve the same or different arithmetic operations.
  • each set of pipelined circuits may include one-stage pipeline operations (eg, may include one operator or multiple operators) or multiple stages of pipeline operations (eg, operations may be performed serially or in parallel).
  • the method 1000 may utilize a plurality of sub-processing circuits included in the second main processing circuit to perform multi-threading operations according to received data (eg, pixel data) and operation instructions.
  • the plurality of sub-processing circuits may be connected in a one-dimensional or multi-dimensional array topology, and the plurality of sub-processing circuit arrays connected in series through the connection may form one or more closed loops.
  • a plurality of sub-processing circuits may determine whether to execute the operation of the operation instruction according to the information in the received operation instruction (eg, predicate information).
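  • the predicate mechanism mentioned above, by which each sub-processing circuit decides whether to execute the operation of the instruction, can be sketched as mask-gated execution; a hypothetical Python illustration (the mask encoding is an assumption):

```python
def execute_predicated(sub_circuit_ids, predicate_mask, op, data):
    """Each sub-processing circuit applies `op` to its data element only if
    its bit in the predicate mask is set; otherwise the element passes
    through unchanged."""
    return [
        op(x) if (predicate_mask >> i) & 1 else x
        for i, x in zip(sub_circuit_ids, data)
    ]

# only circuits 0 and 2 (mask 0b0101) apply the doubling operation
out = execute_predicated([0, 1, 2, 3], 0b0101, lambda x: x * 2, [1, 2, 3, 4])
# out == [2, 2, 6, 4]
```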
  • FIG. 11 is a structural diagram illustrating a combined processing apparatus 1100 according to an embodiment of the present disclosure.
  • the combined processing device 1100 includes a computing processing device 1102 , an interface device 1104 , other processing devices 1106 and a storage device 1108 .
  • one or more computing devices 1110 may be included in the computing processing device, and the computing devices may be configured to perform the operations described herein in conjunction with FIGS. 1-10 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • when multiple computing devices are implemented as artificial intelligence processor cores or as parts of the hardware structure of an artificial intelligence processor core, the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • processors may include, but are not limited to, Digital Signal Processor (DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable Logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a related computing device for artificial intelligence such as neural network operations) and external data and control, performing basic controls including but not limited to data movement and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 1202 shown in FIG. 12).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 11 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 1206 shown in FIG. 12 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
  • the chip may also integrate other processing units (such as video codecs) and/or interface modules (such as DRAM interfaces).
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 12 .
  • FIG. 12 is a schematic structural diagram illustrating a board 1200 according to an embodiment of the present disclosure.
  • the board includes a storage device 1204 for storing data, which includes one or more storage units 1210 .
  • the storage device can be connected to the control device 1208 and the above-described chip 1202 through, for example, a bus for data transmission.
  • the board also includes an external interface device 1206, which is configured for data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 1212 (such as a server or a computer, etc.).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • an electronic device or device which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or a plurality of the above-mentioned combined processing devices.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby completing the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions . Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure, and can also refer to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. On this basis, when aspects of the present disclosure are embodied in the form of a software product (e.g., stored on a computer-readable storage medium), the software product may be stored in a memory and may include several instructions to cause a computer device (e.g., a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, a CD, or any other medium that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), for example a resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.
  • An integrated computing device comprising a main control circuit, a first main processing circuit, and a second main processing circuit, wherein:
  • the main control circuit configured to acquire and parse the calculation instruction to obtain an operation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit ;
  • the first main processing circuit comprising one or more groups of pipeline operation circuits, wherein each group of pipeline operation circuits is configured to perform pipeline operations according to the received data and operation instructions;
  • the second main processing circuit includes a plurality of sub-processing circuits, wherein each sub-processing circuit is configured to perform multi-threading operations according to received data and operation instructions.
  • the operation instruction is sent to at least one of the first main processing circuit and the second main processing circuit according to the instruction identification information.
  • each set of pipeline operation circuits includes one or more operators, and when each set of pipeline operation circuits includes a plurality of operators, the plurality of operators are configured to perform multi-stage pipeline operations.
  • Clause 6 The integrated computing device of clause 1, wherein the first main processing circuit further comprises an operation processing circuit configured to, according to an operation instruction, preprocess data before the operation performed by the pipeline operation circuit or post-process data after the operation.
  • Clause 7 The integrated computing device of clause 1, wherein the first main processing circuit further comprises a data conversion circuit configured to perform a data conversion operation in accordance with the arithmetic instruction.
  • Clause 8 The integrated computing device of clause 1, wherein the plurality of sub-processing circuits are connected in a topology of a one-dimensional or multi-dimensional array.
  • Clause 9 The integrated computing device of clause 8, wherein the multi-dimensional array is a two-dimensional array, and the sub-processing circuits located in the two-dimensional array are connected, in at least one of their row, column or diagonal directions, to the remaining one or more sub-processing circuits in the same row, same column or same diagonal in a predetermined two-dimensional spacing pattern.
  • Clause 10 The integrated computing device of clause 9, wherein the predetermined two-dimensional spacing pattern is associated with a number of sub-processing circuits spaced in the connection.
  • Clause 11 The integrated computing device of clause 8, wherein the multi-dimensional array is a three-dimensional array composed of a plurality of layers, wherein each layer includes a two-dimensional array of a plurality of the sub-processing circuits arranged in a row direction and a column direction, and wherein the sub-processing circuits located in the three-dimensional array are connected, in at least one of their row, column, diagonal and layer directions, to the remaining one or more sub-processing circuits in the same row, same column, same diagonal or a different layer in a predetermined three-dimensional spacing pattern.
  • Clause 12 The integrated computing device of clause 11, wherein the predetermined three-dimensional spacing pattern is related to a number of mutually spaced sub-processing circuits in the connection and a number of spaced layers.
  • Clause 13 The integrated computing device of any of clauses 8-12, wherein a plurality of sub-processing circuits connected in series via the connections form one or more closed loops.
  • Clause 14 The integrated computing device of Clause 1, wherein the plurality of sub-processing circuits are configured to determine whether to participate in an operation according to an operation instruction.
  • each of the sub-processing circuits comprises:
  • a logic operation circuit configured to perform a logic operation according to the operation instruction and data
  • a storage circuit including a data storage circuit, wherein the data storage circuit is configured to store at least one of operation data and an intermediate operation result of the sub-processing circuit.
  • Clause 16 The integrated computing device of clause 15, wherein the storage circuit further comprises a predicate storage circuit, wherein the predicate storage circuit is configured to store, for each of the sub-processing circuits, the predicate storage circuit number and predicate information obtained with the operation instruction.
  • the predicate information is updated according to the operation result of each of the sub-processing circuits.
  • each of the sub-processing circuits is configured to determine, according to the predicate information, whether that sub-processing circuit executes the operation instruction.
  • each of the sub-processing circuits includes an arithmetic operation circuit configured to perform arithmetic operations.
  • the second main processing circuit further comprises a data handling circuit, the data handling circuit comprising at least one of a pre-handling circuit and a post-handling circuit, wherein the pre-handling circuit is configured to preprocess the operation data before the sub-processing circuits perform operations, and the post-handling circuit is configured to post-process the operation results after the sub-processing circuits perform operations.
  • Clause 21 The integrated computing device of clause 1, wherein the integrated computing device further comprises a main storage circuit comprising at least one of a main storage module and a main cache module, wherein the main storage module is configured to store data used by the main processing circuits for performing operations and the operation results after the operations are performed, and the main cache module is configured to cache the intermediate operation results of at least one of the first main processing circuit and the second main processing circuit.
  • Clause 22 The integrated computing device of clause 1, further comprising at least one slave processing circuit configured to perform intermediate operations in parallel based on data and operation instructions transmitted from at least one of the first main processing circuit and the second main processing circuit to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to at least one of the first main processing circuit and the second main processing circuit.
  • Clause 23 The integrated computing device of clause 22, wherein the first main processing circuit is configured to receive and execute the operational instructions in a SIMD manner.
  • Clause 24 The integrated computing device of clause 22, wherein the second main processing circuit is configured to receive and execute the operational instructions in a SIMT manner.
  • Clause 27 A method of performing an arithmetic operation using an integrated computing device, wherein the integrated computing device includes a main control circuit, a first main processing circuit, and a second main processing circuit, the method comprising:
  • using the main control circuit to acquire a calculation instruction and parse it to obtain an operation instruction, and to send the operation instruction to at least one of the first main processing circuit and the second main processing circuit;
  • using one or more groups of pipeline operation circuits included in the first main processing circuit to perform pipeline operations according to received data and the operation instruction; and
  • using a plurality of sub-processing circuits included in the second main processing circuit to perform multi-threaded operations according to received data and the operation instruction.
  • Clause 28 The method of clause 27, wherein in parsing the computing instruction, the method utilizes the main control circuit to perform the following steps:
  • the operation instruction is sent to at least one of the first main processing circuit and the second main processing circuit according to the instruction identification information.
  • Clause 29 The method of clause 27, wherein in parsing the computing instructions, the method utilizes a main control circuit to perform the following steps:
  • Clause 30 The method of clause 27, wherein each group of the plurality of groups of pipeline operation circuits performs the pipeline operations independently or cooperatively.
  • each set of pipeline operation circuits includes one or more operators, and when each set of pipeline operation circuits includes multiple operators, the method uses the multiple operators to perform multi-stage pipeline operations.
  • Clause 32 The method of clause 27, wherein the first main processing circuit further comprises an operation processing circuit, the method further comprising using the operation processing circuit to, according to an operation instruction, preprocess data before the operation of the pipeline operation circuit or post-process data after the operation.
  • Clause 33 The method of clause 27, wherein the first main processing circuit further comprises a data conversion circuit, the method further comprising utilizing the data conversion circuit to perform a data conversion operation according to the arithmetic instruction.
  • Clause 34 The method of clause 27, wherein the plurality of sub-processing circuits are connected in a one-dimensional or multi-dimensional array topology.
  • Clause 35 The method of clause 34, wherein the multi-dimensional array is a two-dimensional array, and the sub-processing circuits located in the two-dimensional array are connected, in at least one of their row, column or diagonal directions, to the remaining one or more sub-processing circuits in the same row, same column or same diagonal in a predetermined two-dimensional spacing pattern.
  • Clause 36 The method of clause 35, wherein the predetermined two-dimensional spacing pattern is associated with a number of sub-processing circuits spaced in the connection.
  • Clause 37 The method of clause 34, wherein the multi-dimensional array is a three-dimensional array of a plurality of layers, wherein each layer includes a two-dimensional array of a plurality of the sub-processing circuits arranged in a row direction and a column direction, and wherein the method includes connecting the sub-processing circuits located in the three-dimensional array, in at least one of their row, column, diagonal and layer directions, to the remaining one or more sub-processing circuits in the same row, same column, same diagonal or a different layer in a predetermined three-dimensional spacing pattern.
  • Clause 38 The method of clause 37, wherein the predetermined three-dimensional spacing pattern is related to a number of mutually spaced sub-processing circuits in the connection and a number of spaced layers.
  • Clause 39 The method of any of clauses 34-38, wherein a plurality of sub-processing circuits connected in series via the connection form one or more closed loops.
  • Clause 40 The method of Clause 27, wherein whether the plurality of sub-processing circuits participate in an operation is determined according to the operation instruction.
  • each of the sub-processing circuits includes a logic operation circuit and a storage circuit, wherein the storage circuit includes a data storage circuit, the method comprising using the logic operation circuit to perform logical operations according to an operation instruction and data, and using the data storage circuit to store at least one of the operation data and the intermediate operation results of the sub-processing circuit.
  • Clause 42 The method of clause 41, wherein the storage circuit further comprises a predicate storage circuit, wherein the method comprises using the predicate storage circuit to store, for each of the sub-processing circuits, the predicate storage circuit number and predicate information obtained with the operation instruction.
  • the predicate information is updated according to the operation result of each of the sub-processing circuits.
  • Whether the sub-processing circuit executes the operation instruction is determined according to the predicate information.
  • each of the sub-processing circuits includes an arithmetic operation circuit, and the method uses the arithmetic operation circuit to perform arithmetic operations.
  • Clause 46 The method of clause 27, wherein the second main processing circuit further comprises a data handling circuit, the data handling circuit comprising at least one of a pre-handling circuit and a post-handling circuit, wherein the method comprises using the pre-handling circuit to preprocess the operation data before the sub-processing circuits perform operations, and using the post-handling circuit to post-process the operation results after the sub-processing circuits perform operations.
  • Clause 47 The method of clause 27, wherein the integrated computing device further comprises a main storage circuit comprising at least one of a main storage module and a main cache module, wherein the method comprises using the main storage module to store data used by the main processing circuits for performing operations and the operation results after the operations, and using the main cache module to cache the intermediate operation results of at least one of the first main processing circuit and the second main processing circuit.
  • Clause 48 The method of any of clauses 27-38 or 40-47, wherein the integrated computing device further comprises at least one slave processing circuit, the method comprising using the at least one slave processing circuit to perform intermediate operations in parallel based on data and operation instructions transmitted from at least one of the first main processing circuit and the second main processing circuit to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to at least one of the first main processing circuit and the second main processing circuit.
  • Clause 49 The method of clause 48, wherein the first main processing circuit is configured to receive and execute the operational instructions in a SIMD manner.
  • Clause 50 The method of clause 48, wherein the second main processing circuit is configured to receive and execute the operational instructions in a SIMT manner.


Abstract

An integrated computing device, an integrated circuit chip, a board card, and a method of performing operations using the aforementioned integrated computing device. The integrated computing device may be included in a combined processing device, which may further include a universal interconnect interface and other processing devices. The integrated computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device, which is connected to the device and the other processing devices respectively and is used to store data of the device and the other processing devices.

Description

Integrated computing device, integrated circuit chip, board card and computing method
Cross-reference to related applications
This application claims priority to Chinese patent application No. 2020106181487, filed on June 30, 2020, and entitled "Integrated computing device, integrated circuit chip, board card and computing method", the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure generally relates to the field of data processing. More specifically, the present disclosure relates to an integrated computing device, an integrated circuit chip, a board card, and a method of performing operations using the aforementioned integrated computing device.
Background
Existing artificial intelligence computations often involve large amounts of data operations, such as convolution operations and image processing. As the amount of data grows, the computation and storage involved in data operations such as matrix operations increase sharply with the scale of the data. In existing approaches, general-purpose processors such as central processing units ("CPUs") or graphics processing units ("GPUs") are typically used for these operations. However, due to their general-purpose nature and the high redundancy of the devices they use, general-purpose processors incur large power overheads, which limits their performance.
In addition, existing operation processing circuits usually adopt a single hardware architecture and can only handle operations under one class of architecture, without the flexibility to select a suitable processing circuit according to actual needs. Moreover, for some fixed hardware architectures that use hard-wired connections, when the data scale expands or the data format changes, not only may certain types of operations become unsupported, but computational performance may also be severely limited during operation, even to the point of becoming inoperable.
Summary
To address at least the above-mentioned deficiencies of the prior art, the present disclosure provides a solution that supports multiple types of operations and operating modes, improves computational efficiency, and saves computational cost and overhead. Specifically, the present disclosure provides the aforementioned solution in the following aspects.
In a first aspect, the present disclosure provides an integrated computing device comprising a main control circuit, a first main processing circuit and a second main processing circuit, wherein: the main control circuit is configured to acquire a calculation instruction and parse the calculation instruction to obtain an operation instruction, and to send the operation instruction to at least one of the first main processing circuit and the second main processing circuit; the first main processing circuit comprises one or more groups of pipeline operation circuits, wherein each group of pipeline operation circuits is configured to perform pipeline operations according to received data and the operation instruction; and the second main processing circuit comprises a plurality of sub-processing circuits, wherein each sub-processing circuit is configured to perform multi-threaded operations according to received data and the operation instruction.
In a second aspect, the present disclosure provides an integrated circuit chip comprising the integrated computing device of the foregoing and later-described embodiments.
In a third aspect, the present disclosure provides a board card comprising the aforementioned integrated circuit chip.
In a fourth aspect, the present disclosure provides a method of performing operations using an integrated computing device, wherein the integrated computing device includes a main control circuit, a first main processing circuit and a second main processing circuit, the method comprising: using the main control circuit to acquire a calculation instruction and parse it to obtain an operation instruction, and to send the operation instruction to at least one of the first main processing circuit and the second main processing circuit; using one or more groups of pipeline operation circuits included in the first main processing circuit to perform pipeline operations according to received data and the operation instruction; and using a plurality of sub-processing circuits included in the second main processing circuit to perform multi-threaded operations according to received data and the operation instruction.
By using the integrated computing device, integrated circuit chip, board card and method of the present disclosure, the operational limitations of a single type of hardware architecture can be overcome, the operating efficiency of various data processing fields, including for example the field of artificial intelligence, in data processing and computation can be improved, the time and power consumption of data operations can be reduced, and the overhead and cost of operations can be lowered.
Brief description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and identical or corresponding reference numerals indicate identical or corresponding parts, in which:
Fig. 1 is an overall architecture diagram of an integrated computing device according to an embodiment of the present disclosure;
Fig. 2 is an example detailed architecture diagram of an integrated computing device according to an embodiment of the present disclosure;
Fig. 3 is an example structural diagram of a first main processing circuit according to an embodiment of the present disclosure;
Figs. 4a, 4b and 4c are schematic diagrams of matrix transformations performed by a data conversion circuit according to an embodiment of the present disclosure;
Figs. 5a, 5b, 5c and 5d are schematic diagrams of various connection relationships of a plurality of sub-processing circuits according to an embodiment of the present disclosure;
Figs. 6a, 6b, 6c and 6d are schematic diagrams of further connection relationships of a plurality of sub-processing circuits according to an embodiment of the present disclosure;
Figs. 7a and 7b are schematic diagrams of different loop structures of sub-processing circuits according to an embodiment of the present disclosure;
Figs. 8a and 8b are schematic diagrams of further different loop structures of sub-processing circuits according to an embodiment of the present disclosure;
Fig. 9 is a schematic architecture diagram of an integrated computing device and slave processing circuits according to an embodiment of the present disclosure;
Fig. 10 is a simplified flowchart of a method of performing operations using an integrated computing device according to an embodiment of the present disclosure;
Fig. 11 is a structural diagram of a combined processing device according to an embodiment of the present disclosure; and
Fig. 12 is a schematic structural diagram of a board card according to an embodiment of the present disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, not all of them. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present disclosure.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 is an overall architecture diagram of an integrated computing device 100 according to an embodiment of the present disclosure. As shown in Fig. 1, the integrated computing device 100 of the present disclosure may include a main control circuit 102, a first main processing circuit 104 and a second main processing circuit 106. In performing various operations such as computing operations, the main control circuit may be configured to acquire a calculation instruction and parse it to obtain an operation instruction, and to send the operation instruction to at least one of the first main processing circuit and the second main processing circuit. According to the solution of the present disclosure, a calculation instruction may be a form of hardware instruction and include one or more opcodes, and each opcode may represent one or more specific operations to be performed by the first main processing circuit or the second main processing circuit. These operations may include different types of operations depending on the application scenario, for example arithmetic operations such as addition or multiplication, logical operations, comparison operations or table-lookup operations, or any combination of the foregoing. Correspondingly, in the present disclosure, an operation instruction may be one or more micro-instructions executed inside a processing circuit, obtained by parsing the calculation instruction. Specifically, one operation instruction may include one or more micro-instructions corresponding to one opcode in the calculation instruction, to complete one or more operations.
In one embodiment, in parsing the calculation instruction, the main control circuit 102 may be configured to obtain instruction identification information in the calculation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit according to the instruction identification information. It can be seen that, by means of the aforementioned instruction identification information, the main control circuit can send the operation instruction specifically to the first main processing circuit and/or the second main processing circuit identified in that information. Further, depending on the application scenario, the operation instruction obtained after parsing the calculation instruction may be an operation instruction decoded by the main control circuit, or an operation instruction not decoded by the main control circuit. When the operation instruction has not been decoded by the main control circuit, the first main processing circuit and the second main processing circuit may include corresponding decoding circuits to decode the operation instruction, for example to obtain a plurality of micro-instructions.
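The dispatch step just described can be sketched in software. The following Python sketch models a main control circuit routing micro-ops according to the instruction identification information; all names here (`parse_and_dispatch`, the `"target"` and `"ops"` fields) are illustrative assumptions, not names from the disclosure.

```python
# Hypothetical sketch: the main control circuit parses a calculation
# instruction and, based on its identification field, routes the resulting
# operation instruction to the first and/or second main processing circuit.

def parse_and_dispatch(compute_instr, first_mpc, second_mpc):
    """Route micro-ops to the processing circuit(s) named by the id field."""
    target = compute_instr["target"]   # e.g. "first", "second", or "both"
    micro_ops = compute_instr["ops"]   # one or more micro-instructions
    routed = []
    if target in ("first", "both"):
        routed.append(("first", first_mpc(micro_ops)))
    if target in ("second", "both"):
        routed.append(("second", second_mpc(micro_ops)))
    return routed
```

In this toy model, `first_mpc` and `second_mpc` stand in for the two processing circuits; in hardware they would decode the micro-ops themselves when the main control circuit has not already done so.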
In another embodiment, in parsing the calculation instruction, the main control circuit may be configured to decode the acquired calculation instruction, and then send the operation instruction to at least one of the first main processing circuit and the second main processing circuit according to the decoding result and the working states of the first and second main processing circuits. In this embodiment, both the first main processing circuit and the second main processing circuit support non-specific operations of the same type. Therefore, to improve the utilization of the main processing circuits and the efficiency of operations, the operation instruction may be sent to the main processing circuit whose occupancy is low or which is idle.
In one or more embodiments, the first main processing circuit 104 may include one or more groups of pipeline operation circuits, wherein each group of pipeline operation circuits may be configured to perform pipeline operations according to received data and operation instructions. In some application scenarios, each group of pipeline operation circuits may include at least one operator (for example one or more adders) to perform a single-stage pipeline operation. Further, when the operators included in a group of pipeline operation circuits need to be staged, or include multiple types of operators, the group may form a multi-stage operation pipeline and may be configured to perform multi-stage pipeline operations. For example, a group of pipeline operation circuits may be structured as a three-stage pipeline consisting of a first-stage adder, a second-stage multiplier and a third-stage adder, to perform addition and multiplication operations. As another example, a group of pipeline operation circuits may be structured as a three-stage pipeline consisting of a multiplier, an adder and a nonlinear operator, for completing addition, multiplication and activation operations in a pipelined manner.
In some embodiments, the second main processing circuit 106 may include a plurality of sub-processing circuits, wherein each sub-processing circuit may be configured to perform multi-threaded operations according to received data and operation instructions. In different application scenarios, the connections between the plurality of sub-processing circuits may be hard-wired connections, or logical connections configured according to, for example, micro-instructions, to form a variety of sub-processing circuit array topologies. For example, the aforementioned plurality of sub-processing circuits may be connected and arranged in a one-dimensional or multi-dimensional array topology (as shown in Figs. 5 and 6), and each sub-processing circuit may, within a certain range, be connected to other sub-processing circuits in specified directions and predetermined spacing patterns. Further, a plurality of sub-processing circuits may be connected in series via these connections to form one or more closed loops (as shown in Figs. 7 and 8).
Fig. 2 is an example detailed architecture diagram of an integrated computing device 200 according to an embodiment of the present disclosure. As can be seen from Fig. 2, the integrated computing device 200 not only includes the main control circuit 102, the first main processing circuit 104 and the second main processing circuit 106 of the integrated computing device 100 in Fig. 1, but also further shows a plurality of circuits included in the first main processing circuit 104 and the second main processing circuit 106; therefore, the technical details described with respect to Fig. 1 also apply to what is shown in Fig. 2. Since the functions of the main control circuit, the first main processing circuit and the second main processing circuit have been described in detail above in connection with Fig. 1, they will not be repeated below.
As shown in Fig. 2, the first main processing circuit 104 may include multiple groups of pipeline operation circuits 109, wherein each group of pipeline operation circuits may include one or more operators, and when each group includes a plurality of operators, the plurality of operators may be configured to perform multi-stage pipeline operations, that is, to form a multi-stage operation pipeline.
In some application scenarios, the pipeline operation circuit of the present disclosure may support unary operations (i.e., cases with only one input data item). Taking the operation at the scale layer + relu layer in a neural network as an example, assume that the calculation instruction to be executed is expressed as result=relu(a*ina+b), where ina is the input data (which may be, for example, a vector or a matrix) and a and b are operation constants. For this calculation instruction, a group of three-stage pipeline operation circuits of the present disclosure comprising a multiplier, an adder and a nonlinear operator may be applied to perform the operation. Specifically, the first-stage multiplier may compute the product of the input data ina and a to obtain the first-stage pipeline result. Then, the second-stage adder may add the first-stage result (a*ina) and b to obtain the second-stage pipeline result. Finally, the third-stage relu activation function may be applied to the second-stage result (a*ina+b) to obtain the final result.
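The scale + relu example above can be mimicked in software. This is a minimal Python sketch of the three pipeline stages (multiply, add, ReLU) operating on a vector input, not a model of the actual hardware:

```python
# Software sketch of the three-stage pipeline for result = relu(a * ina + b).
# Each list comprehension stands in for one hardware pipeline stage.

def three_stage_pipeline(ina, a, b):
    stage1 = [a * x for x in ina]          # first-stage multipliers
    stage2 = [y + b for y in stage1]       # second-stage adders
    return [max(0.0, z) for z in stage2]   # third-stage ReLU activation
```

In the real circuit the three stages work on different elements simultaneously; here they run sequentially, but the data dependency between stages is the same.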
In some application scenarios, the pipeline operation circuit of the present disclosure may support binary operations (for example the convolution instruction result=conv(ina,inb)) or ternary operations (for example the convolution instruction result=conv(ina,inb,bias)), where the input data ina, inb and bias may be vectors (for example integer, fixed-point or floating-point data) or matrices. Taking the convolution instruction result=conv(ina,inb) as an example, the convolution operation expressed by this instruction may be performed using the plurality of multipliers, at least one adder tree and at least one nonlinear operator included in a three-stage pipeline structure, where the two input data ina and inb may be, for example, neuron data. Specifically, the first-stage multipliers of the three-stage pipeline may first compute product=ina*inb (regarded as one micro-instruction of the operation instruction, corresponding to the multiplication operation). Then the adder tree in the second pipeline stage may sum the first-stage result "product" to obtain the second-stage result sum. Finally, the nonlinear operator of the third pipeline stage applies an activation operation to "sum" to obtain the final convolution result.
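The multiplier / adder-tree / activation pipeline just described can be sketched as follows. This assumes element-wise products summed pairwise in a binary adder tree and a ReLU-style activation (the disclosure leaves the activation function unspecified, so ReLU is an assumption here):

```python
# Sketch of the three-stage convolution pipeline: stage 1 multiplies
# element-wise, stage 2 reduces the products with a binary adder tree,
# stage 3 applies an activation (assumed ReLU for illustration).

def conv_pipeline(ina, inb):
    products = [x * y for x, y in zip(ina, inb)]   # stage 1: multipliers
    s = products[:]
    while len(s) > 1:                              # stage 2: adder tree
        s = [s[i] + s[i + 1] if i + 1 < len(s) else s[i]
             for i in range(0, len(s), 2)]
    return max(0.0, s[0])                          # stage 3: activation
```

The pairwise reduction mirrors how a hardware adder tree halves the number of partial sums at each level.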
In one application scenario, the one or more operators included in each group of pipeline operation circuits may perform not only the four arithmetic operations described above but also various operations such as table lookup or data type conversion. For example, in a data type conversion operation, when the input data ina is 32-bit floating-point data (denoted float32), the operator may convert it, according to actual computing needs, into data types such as 16-bit floating-point (float16), 32-bit fixed-point (fix32) or 8-bit integer (int8). Depending on the operational requirements, the pipeline operation circuit of the present disclosure may support not only the above conversions among multiple data types, but also functions such as absolute-value operations and hardening operations on multiple data types.
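The data-type conversions mentioned above can be illustrated with Python's standard `struct` module. The helper names `to_float16` and `to_int8` are hypothetical, and the saturating int8 behavior is an assumption for illustration rather than something the disclosure specifies:

```python
import struct

def to_float16(x):
    """Round a float to float16 precision by packing/unpacking IEEE-754
    half-precision ('e' format code)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_int8(x):
    """Saturating conversion into the signed 8-bit integer range."""
    return max(-128, min(127, int(x)))
```

A hardware converter would additionally define rounding and overflow modes; this sketch only shows the precision loss inherent in narrowing conversions.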
In one embodiment, the first main processing circuit 104 may further include an operation processing circuit 111, which may be configured to, according to the operation instruction, preprocess data before the operation of the pipeline operation circuit (for example input neurons) or post-process data after the operation (for example output neurons). In some embodiments, the operation processing circuit 111 may also cooperate with the slave processing circuit 112 shown in Fig. 9 to complete intended operations. In some application scenarios, the aforementioned preprocessing and post-processing may include, for example, data splitting and/or data splicing operations. In a data-splitting scenario, suppose that before performing an operation on data N of a specified bit width arranged in rows (for example in matrix form), the operation processing circuit may split the data N into even rows (denoted N_2i, where i may be a natural number greater than or equal to 0) and odd rows (denoted N_2i+1). Further, in a data-splicing scenario, according to predetermined requirements, the low 256 bits of the even row "N_2i" of the split data N may be spliced, as the low bits, with the low 256 bits of its odd row "N_2i+1" as the high bits, thereby forming a new 512-bit data item.
In other application scenarios, when processing data M obtained after an operation (for example in matrix form), the operation processing circuit may first split the low 256 bits of the even rows of M into units of 8 bits to obtain 32 even-row unit data items (denoted M_2i 0 to M_2i 31). Similarly, the low 256 bits of the odd rows of M may also be split into 8-bit units to obtain 32 odd-row unit data items (denoted M_(2i+1) 0 to M_(2i+1) 31). Further, the 32 odd-row unit data items and the 32 even-row unit data items after splitting are arranged alternately from low to high bit positions, even row first and then odd row. Specifically, even-row unit data 0 (M_2i 0) is placed at the low position, followed by odd-row unit data 0 (M_(2i+1) 0). Next, even-row unit data 1 (M_2i 1) is placed, and so on. When the placement of odd-row unit data 31 (M_(2i+1) 31) is completed, the 64 unit data items are spliced into one new 512-bit data item.
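The even/odd interleaving described above amounts to alternating 8-bit units from the two rows, low position first. A sketch with plain Python lists standing in for 8-bit units (the function name is illustrative):

```python
# Interleave even-row and odd-row 8-bit units: even unit k goes to position
# 2k, odd unit k to position 2k+1, building the 512-bit result low-to-high.

def interleave_rows(even_units, odd_units):
    assert len(even_units) == len(odd_units)
    out = []
    for e, o in zip(even_units, odd_units):
        out.append(e)   # even-row unit at the lower position
        out.append(o)   # odd-row unit at the next position
    return out
```

With 32 units per row, the result has 64 units, i.e. 512 bits at 8 bits per unit, matching the example in the text.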
In one embodiment, the first main processing circuit 104 may further include a data conversion circuit 113, which may be configured to perform data conversion operations according to the operation instruction. In some operations, when the data is a matrix, the data conversion operation may be a transformation of the arrangement positions of the matrix elements. The transformation may include, for example, matrix transposition and mirroring (described later in connection with Figs. 4a-4c), matrix rotation by a predetermined angle (for example 90, 180 or 270 degrees), and conversion of matrix dimensions.
Further, the second main processing circuit 106 may include a plurality of sub-processing circuits 115. Each sub-processing circuit may include a logic operation circuit 1151, which may be configured to perform logical operations according to the operation instruction and received data, for example logical operations such as AND, OR and NOT, shift operations, or comparison operations on the received data. Further, each sub-processing circuit may also include an arithmetic operation circuit 1153, which may be configured to perform arithmetic operations, for example linear operations such as addition, subtraction or multiplication.
In one embodiment, each sub-processing circuit may include a storage circuit 1152, which includes a data storage circuit and/or a predicate storage circuit, wherein the data storage circuit may be configured to store at least one of the operation data (for example pixels) and the intermediate operation results of the sub-processing circuit. Further, the predicate storage circuit may be configured to store the predicate storage circuit number and predicate information of each sub-processing circuit, obtained using the operation instruction. In specific storage applications, the storage circuit 1152 may be implemented with registers or memories such as static random access memory ("SRAM") according to actual needs.
In one application scenario, the predicate storage circuit may include a one-bit registers for storing predicate information. Further, the numbers of the a one-bit registers may be represented by a b-bit binary number, where b>=log2(a). For example, assume that the predicate storage circuit in a sub-processing circuit includes 32 one-bit registers sequentially numbered from 00000 to 11111. The sub-processing circuit can then read the predicate information in the register numbered "00101" according to the register number "00101" specified in the received operation instruction.
In one embodiment, the predicate storage circuit may be configured to update the predicate information according to the operation instruction. For example, the predicate information may be updated directly according to configuration information in the operation instruction, or configuration information may be obtained from a configuration-information storage address provided in the operation instruction in order to update the predicate information. During the operation of a sub-processing circuit, the predicate storage circuit may also update the predicate information according to the comparison result of each sub-processing circuit (which, in the context of this disclosure, is one form of operation result). For example, the predicate information may be updated by comparing the input data received by the sub-processing circuit with the data stored in its data storage circuit. When the input data is greater than the stored data, the predicate information of the sub-processing circuit is set to 1. Conversely, when the input data is less than the stored data, the predicate information is set to 0, or kept unchanged.
Before performing an operation, each sub-processing circuit may determine, according to information in the operation instruction, whether it should execute the operation of that instruction. Further, each sub-processing circuit may be configured to obtain the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit number in the operation instruction, and determine according to the predicate information whether that sub-processing circuit executes the operation instruction. For example, when the value of the predicate information read by the sub-processing circuit according to the predicate storage circuit number specified in the operation instruction is 1, the sub-processing circuit executes the operation instruction (for example, the sub-processing circuit may read the data pointed to by the instruction and store the read data into its data storage circuit). Conversely, when the value read is 0, the sub-processing circuit does not execute the operation instruction.
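The predicate-gated execution described above can be modeled as below. The class and field names are illustrative assumptions, the instruction is represented as a plain dictionary, and `index_bits` reflects the b >= log2(a) relation between register count and register-number width:

```python
import math

class SubProcessingCircuit:
    """Toy model: a one-bit predicate registers gate instruction execution."""
    def __init__(self, num_regs=32):
        self.predicates = [0] * num_regs              # a one-bit registers
        self.index_bits = math.ceil(math.log2(num_regs))  # b >= log2(a)

    def execute(self, instr):
        # Run the operation only if the named predicate register holds 1;
        # otherwise the instruction is skipped by this circuit.
        if self.predicates[instr["pred_reg"]] == 1:
            return instr["op"]()
        return None
```

With 32 registers, 5 bits suffice to name a register, matching the 00000-11111 numbering in the example above.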
In one embodiment, the second main processing circuit 106 may further include a data handling circuit 117, which may include at least one of a pre-handling circuit and a post-handling circuit. The pre-handling circuit may be configured to preprocess the operation data before the sub-processing circuits perform operations (described later in connection with Fig. 7b), for example performing data splicing or data placement operations. The post-handling circuit may be configured to post-process the operation results after the sub-processing circuits perform operations, for example performing data restoration or data compression.
To enable data transfer and storage, the integrated computing device 200 of the present disclosure may further include a main storage circuit 108, which may receive and store data from the main control circuit as input data for the first and/or second main processing circuit. Specifically, the main storage circuit may be further divided according to the storage mode or the characteristics of the stored data, and the main storage circuit 108 may include at least one of a main storage module 119 and a main cache module 121. The main storage module 119 may be configured to store data to be operated on in the first and/or second main processing circuit (for example neurons or pixel data in a neural network) and the operation results after the operations are performed (for example convolution results in a neural network). The main cache module 121 may be configured to cache the intermediate operation results of at least one of the first main processing circuit and the second main processing circuit.
In the interaction between the main storage circuit and the first main processing circuit, the pipeline operation circuits in the first main processing circuit may also perform corresponding operations with the aid of masks stored in the main storage circuit. For example, during an operation, a pipeline operation circuit may read a mask from the main storage circuit and may use the mask to indicate whether the data on which the operation is performed in that pipeline operation circuit is valid. The main storage circuit may serve not only internal storage applications, but may also exchange data with storage devices outside the integrated computing device of the present disclosure, for example through direct memory access ("DMA").
The architecture and functions of the integrated computing device have been described in detail above with reference to Figs. 1-2. Specific applications of the first main processing circuit will be illustrated below with reference to Fig. 3 and Figs. 4a to 4c.
Fig. 3 is an example structural diagram of the first main processing circuit according to an embodiment of the present disclosure. Given the detailed description of the architecture and functions of the first main processing circuit above, the cooperation between the multiple groups of pipeline operation circuits in the first main processing circuit, and between the multiple pipeline stages, is further explained below.
As shown in Fig. 3, the first main processing circuit 104 may include one or more groups of pipeline operation circuits 109 (two groups are shown in the figure). Each group of pipeline operation circuits may include one or more stages of pipeline operation circuits (the first-stage to Nth-stage pipeline operation circuits shown in each group in the figure). The one or more stages of pipeline operation circuits may perform single-stage or multi-stage pipeline operations according to received data and operation instructions. In multi-stage pipeline applications, a group of pipeline operation circuits may be structured with multiple operators of one or more types, such as counters, adders, multipliers, adder trees, accumulators and nonlinear operators, for performing multi-stage pipeline operations. Further, depending on the application scenario, multi-stage pipeline operations may be performed serially or in parallel. As those skilled in the art will understand, one operation instruction of the present disclosure may be executed by one group of multi-stage pipeline operation circuits. The operation instruction comprises multiple serial operations, each of which may be performed by the first-stage, second-stage or Nth-stage pipeline operation circuit in a group, so as to complete the operation instruction. For example, the convolution operation performed by the three-stage pipeline operation circuit described above in connection with Fig. 2 is a serial pipeline operation. Further, when multiple groups of pipeline operation circuits 109 all perform operations, they may execute multiple operation instructions simultaneously, i.e., operate on multiple instructions in parallel.
In some application scenarios, one or more stages of pipeline operation circuits that will not be used in an operation may be bypassed; that is, one or more stages of the multi-stage pipeline may be used selectively as required by the operation, without the operation having to pass through all stages. Taking the computation of the Euclidean distance as an example, assume the calculation instruction is expressed as dis=sum((ina-inb)^2); only several pipeline stages consisting of an adder, a multiplier, an adder tree and an accumulator need be used to obtain the final result, and the unused pipeline operation circuits may be bypassed before or during the pipeline operation.
In the aforementioned pipeline operations, each group of pipeline operation circuits may perform the pipeline operations independently. However, the groups of pipeline operation circuits may also perform the pipeline operations cooperatively. For example, the output of the first and second stages of one group of pipeline operation circuits after serial pipeline operation may serve as the input of the third pipeline stage of another group. As another example, the first and second stages of one group of pipeline operation circuits may perform parallel pipeline operations and output their respective results as the inputs of the first and/or second pipeline stages of another group.
Figs. 4a, 4b and 4c are schematic diagrams of matrix transformations performed by the data conversion circuit according to an embodiment of the present disclosure. To better understand the conversion operations performed by the data conversion circuit 113 in the first main processing circuit, the transpose operation and the horizontal mirror operation performed on an original matrix are further described below as examples.
As shown in Fig. 4a, the original matrix is a matrix of (M+1) rows × (N+1) columns. According to the requirements of the application scenario, the data conversion circuit may perform a transpose operation on the original matrix shown in Fig. 4a to obtain the matrix shown in Fig. 4b. Specifically, the data conversion circuit may exchange the row numbers and column numbers of the elements in the original matrix to form the transposed matrix. For example, the element "10" at row 1, column 0 in the original matrix of Fig. 4a is located at row 0, column 1 in the transposed matrix of Fig. 4b. By analogy, the element "M0" at row M+1, column 0 in the original matrix of Fig. 4a is located at row 0, column M+1 in the transposed matrix of Fig. 4b.
As shown in Fig. 4c, the data conversion circuit may perform a horizontal mirror operation on the original matrix of Fig. 4a to form a horizontally mirrored matrix. Specifically, the data conversion circuit may, through the horizontal mirror operation, convert the arrangement order of the original matrix from first row to last row into the order from last row to first row, while keeping the column numbers of the elements unchanged. For example, the elements "00" at row 0, column 0 and "10" at row 1, column 0 of the original matrix in Fig. 4a are located at row M+1, column 0 and row M, column 0, respectively, in the horizontally mirrored matrix of Fig. 4c. By analogy, the element "M0" at row M+1, column 0 in the original matrix of Fig. 4a is located at row 0, column 0 in the horizontally mirrored matrix of Fig. 4c.
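The transpose and horizontal-mirror transformations just described can be expressed compactly on nested lists (these two helpers are illustrative, not part of the disclosure):

```python
def transpose(m):
    """Swap row and column indices: element (i, j) moves to (j, i)."""
    return [list(col) for col in zip(*m)]

def horizontal_mirror(m):
    """Reverse the row order while keeping column positions unchanged."""
    return m[::-1]
```

These match the element movements in the Figs. 4a-4c examples: element "10" at (1, 0) goes to (0, 1) under transposition, and the first and last rows swap places under horizontal mirroring.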
Figs. 5a, 5b, 5c and 5d are schematic diagrams of various connection relationships of a plurality of sub-processing circuits according to an embodiment of the present disclosure. The plurality of sub-processing circuits of the present disclosure may be connected in a one-dimensional or multi-dimensional array topology. When the plurality of sub-processing circuits are connected in a multi-dimensional array, the multi-dimensional array may be a two-dimensional array, and the sub-processing circuits in the two-dimensional array may be connected, in at least one of their row, column or diagonal directions, to the remaining one or more sub-processing circuits in the same row, same column or same diagonal in a predetermined two-dimensional spacing pattern. The predetermined two-dimensional spacing pattern may be associated with the number of sub-processing circuits spaced apart in the connection. Figs. 5a to 5c illustrate various forms of two-dimensional array topologies among a plurality of sub-processing circuits.
As shown in Fig. 5a, five sub-processing circuits are connected to form a simple two-dimensional array. Specifically, with one sub-processing circuit as the center of the two-dimensional array, one sub-processing circuit is connected in each of the four horizontal and vertical directions relative to that circuit, forming a two-dimensional array of three rows and three columns. Further, since the central sub-processing circuit is directly connected to the adjacent sub-processing circuits in the preceding and following columns of the same row and in the preceding and following rows of the same column, the number of spaced sub-processing circuits (the "spacing number" for short) is 0.
As shown in Fig. 5b, sub-processing circuits in four rows and four columns may be connected to form a two-dimensional Torus array, where each sub-processing circuit is connected to its adjacent sub-processing circuits in the preceding and following rows and in the preceding and following columns, i.e., the spacing number of adjacent connections is 0. Further, the first sub-processing circuit in each row or column of the two-dimensional Torus array is also connected to the last sub-processing circuit of that row or column, and the spacing number between the head and tail sub-processing circuits of each row or column is 2.
As shown in Fig. 5c, sub-processing circuits in four rows and four columns may also be connected to form a two-dimensional array with a spacing number of 0 between adjacent sub-processing circuits and a spacing number of 1 between non-adjacent sub-processing circuits. Further, adjacent sub-processing circuits in the same row or column of the two-dimensional array are directly connected (spacing number 0), while non-adjacent sub-processing circuits in the same row or column are connected with a spacing number of 1. It can be seen that when multiple sub-processing circuits are connected to form a two-dimensional array, the sub-processing circuits in the same row or column shown in Figs. 5b and 5c may have different spacing numbers. Similarly, in some scenarios, sub-processing circuits in the diagonal direction may also be connected with different spacing numbers.
As shown in Fig. 5d, four two-dimensional Torus arrays as shown in Fig. 5b may be arranged at predetermined intervals into four layers of two-dimensional Torus arrays and connected to form a three-dimensional Torus array. On the basis of the two-dimensional Torus array, the three-dimensional Torus array performs inter-layer connections using a spacing pattern similar to that between rows and columns. For example, the sub-processing circuits in the same row and column of adjacent layers are first directly connected, i.e., the spacing number is 0. Then the sub-processing circuits in the same row and column of the first and last layers are connected, i.e., the spacing number is 2. Ultimately, a three-dimensional Torus array of four layers, four rows and four columns can be formed.
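The row/column wrap-around of the Torus arrays described above corresponds to modular neighbor indexing: the end of each row or column connects back to its start. A sketch (function name assumed) for the two-dimensional case:

```python
# Neighbors of the circuit at (r, c) in a rows x cols 2D Torus: the modulo
# wraps the first element of each row/column around to the last, which is
# the head-to-tail connection described for Fig. 5b.

def torus_neighbors(r, c, rows, cols):
    return [((r - 1) % rows, c), ((r + 1) % rows, c),
            (r, (c - 1) % cols), (r, (c + 1) % cols)]
```

The three-dimensional Torus of Fig. 5d adds an analogous `(l - 1) % layers` / `(l + 1) % layers` pair in the layer direction.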
Through the above examples, those skilled in the art will understand that other multi-dimensional array connection relationships of sub-processing circuits can be formed on the basis of the two-dimensional array by adding new dimensions and increasing the number of sub-processing circuits. In some application scenarios, the solution of the present disclosure may also configure logical connections for the sub-processing circuits by using configuration instructions. In other words, although hard-wired connections may exist between sub-processing circuits, the solution of the present disclosure may also selectively connect some sub-processing circuits, or selectively bypass some sub-processing circuits, through configuration instructions, to form one or more logical connections. In some embodiments, the aforementioned logical connections may also be adjusted according to actual computing needs (for example data type conversion). In short, for different computing scenarios, the solution of the present disclosure can configure the connections of the sub-processing circuits, including for example configuring them as a matrix or as one or more closed computing loops.
Figs. 6a, 6b, 6c and 6d are schematic diagrams of further connection relationships of a plurality of sub-processing circuits according to an embodiment of the present disclosure. As can be seen from the figures, Figs. 6a to 6d show further exemplary connection relationships of the multi-dimensional arrays formed by the plurality of sub-processing circuits in Figs. 5a to 5d; in view of this, the technical details described in connection with Figs. 5a to 5d also apply to what is shown in Figs. 6a to 6d.
As shown in Fig. 6a, the sub-processing circuits of the two-dimensional array include a central sub-processing circuit at the center of the array and three sub-processing circuits connected in each of the four directions along the same row and column as the central circuit. Therefore, the spacing numbers of the connections between the central sub-processing circuit and the remaining sub-processing circuits are 0, 1 and 2, respectively. As shown in Fig. 6b, the sub-processing circuits of the two-dimensional array include a central sub-processing circuit at the center of the array, three sub-processing circuits in each of the two opposite directions along the same row, and one sub-processing circuit in each of the two opposite directions along the same column. Therefore, the spacing numbers of the connections between the central sub-processing circuit and the circuits in the same row are 0 and 2, respectively, while the spacing numbers of the connections with the circuits in the same column are all 0.
As shown in Fig. 5d, the multi-dimensional array formed by a plurality of sub-processing circuits may be a three-dimensional array composed of multiple layers. Each layer of the three-dimensional array may include a two-dimensional array of a plurality of the sub-processing circuits arranged along its row and column directions. Further, the sub-processing circuits in the three-dimensional array may be connected, in at least one of their row, column, diagonal and layer directions, to the remaining one or more sub-processing circuits in the same row, same column, same diagonal or a different layer in a predetermined three-dimensional spacing pattern. Further, the predetermined three-dimensional spacing pattern may be related to the number of mutually spaced sub-processing circuits in the connection and to the number of spaced layers. The connection of three-dimensional arrays is further described below in connection with Figs. 6c and 6d.
Fig. 6c shows a three-dimensional array of multiple layers, rows and columns formed by connecting a plurality of sub-processing circuits. Taking the sub-processing circuit at layer l, row r, column c (denoted (l,r,c)) as an example, it is at the center of the array and is connected to the sub-processing circuits at the preceding column (l,r,c-1) and the following column (l,r,c+1) in the same layer and row, at the preceding row (l,r-1,c) and the following row (l,r+1,c) in the same layer and column, and at the preceding layer (l-1,r,c) and the following layer (l+1,r,c) in the same row and column of different layers. Further, the spacing numbers of the connections between the sub-processing circuit at (l,r,c) and the other sub-processing circuits in the row, column and layer directions are all 0.
Fig. 6d shows a three-dimensional array in which the spacing numbers of the connections between the sub-processing circuits in the row, column and layer directions are all 1. Taking the sub-processing circuit at the array center (l,r,c) as an example, it is connected to the sub-processing circuits at (l,r,c-2) and (l,r,c+2), one column apart in either direction in the same layer and row, and to those at (l,r-2,c) and (l,r+2,c), one row apart in either direction in the same layer and column. Further, it is connected to the sub-processing circuits at (l-2,r,c) and (l+2,r,c), one layer apart in either direction in the same row and column. Similarly, the remaining sub-processing circuits at (l,r,c-3) and (l,r,c-1), one column apart in the same layer and row, are connected to each other, while those at (l,r,c+1) and (l,r,c+3) are connected to each other. Then, the sub-processing circuits at (l,r-3,c) and (l,r-1,c), one row apart in the same layer and column, are connected to each other, and those at (l,r+1,c) and (l,r+3,c) are connected to each other. In addition, the sub-processing circuits at (l-3,r,c) and (l-1,r,c), one layer apart in the same row and column, are connected to each other, while those at (l+1,r,c) and (l+3,r,c) are connected to each other.
The connection relationships of the multi-dimensional arrays formed by a plurality of sub-processing circuits have been described by way of example above. The loop structures formed by the sub-processing circuits are further illustrated below with reference to Figs. 7 and 8.
Figs. 7a and 7b are schematic diagrams of different loop structures of the sub-processing circuits according to an embodiment of the present disclosure. As shown in Fig. 7a, four adjacent sub-processing circuits 115 are sequentially numbered "0, 1, 2 and 3". Then, starting from sub-processing circuit 0, the four circuits are connected in sequence in a clockwise direction, and sub-processing circuit 3 is connected to sub-processing circuit 0, so that the four circuits are connected in series to form a closed loop ("forming a ring" for short). As can be seen from this loop, the spacing numbers of the sub-processing circuits shown in Fig. 7a are 0 or 2; for example, the spacing number between sub-processing circuits 0 and 1 is 0, while that between circuits 3 and 0 is 2. Further, the physical addresses of the four sub-processing circuits in the loop may be 0-1-2-3, and their logical addresses are likewise 0-1-2-3. Note that the connection order shown in Fig. 7a is merely exemplary and not limiting; according to actual computing needs, those skilled in the art may also connect the four sub-processing circuits in series in a counter-clockwise direction to form a closed loop.
In some practical scenarios, when the data bit width supported by one sub-processing circuit cannot meet the bit-width requirement of the operation data, multiple sub-processing circuits may be combined into a sub-processing circuit group to represent one data item. For example, suppose one sub-processing circuit can process 8-bit data. When 32-bit data needs to be processed, four sub-processing circuits may be combined into one sub-processing circuit group, so that four 8-bit data items are concatenated to form one 32-bit data item. Further, one sub-processing circuit group formed by the aforementioned four 8-bit sub-processing circuits may act as one sub-processing circuit 115 shown in Fig. 7b, thereby supporting operations of higher bit widths.
As can be seen from Fig. 7b, the layout of the sub-processing circuits shown is similar to that of Fig. 7a, but the spacing numbers of the connections between sub-processing circuits in Fig. 7b differ from those of Fig. 7a. As shown in Fig. 7b, the four sub-processing circuits sequentially numbered 0, 1, 2 and 3 are connected clockwise starting from sub-processing circuit 0, in the order of sub-processing circuit 1, sub-processing circuit 3 and sub-processing circuit 2, and sub-processing circuit 2 is connected back to sub-processing circuit 0, thereby forming a closed loop in series. As can be seen from this loop, the spacing numbers of the sub-processing circuits shown in Fig. 7b are 0 or 1; for example, the spacing between circuits 0 and 1 is 0, while that between circuits 1 and 3 is 1. Further, the physical addresses of the four sub-processing circuits in the closed loop may be 0-1-2-3, while the logical addresses are 0-1-3-2. Therefore, when data of high bit width needs to be split and distributed to different sub-processing circuits, the data order may be rearranged and allocated according to the logical addresses of the sub-processing circuits.
The above splitting and rearrangement operations may be performed by the pre-handling circuit described in connection with Fig. 2. In particular, the pre-handling circuit may rearrange the input data according to the physical and logical addresses of the plurality of sub-processing circuits in order to meet the requirements of the data operation. Suppose the four sequentially arranged sub-processing circuits 0 to 3 are connected as shown in Fig. 7a; since both the physical and logical addresses of the connections are 0-1-2-3, the pre-handling circuit may transfer the input data (for example pixel data) aa0, aa1, aa2 and aa3 to the corresponding sub-processing circuits in order. However, when the aforementioned four sub-processing circuits are connected as shown in Fig. 7b, their physical addresses remain 0-1-2-3 while their logical addresses become 0-1-3-2; in this case the pre-handling circuit needs to rearrange the input data aa0, aa1, aa2 and aa3 as aa0-aa1-aa3-aa2 for transfer to the corresponding sub-processing circuits. Based on the above input-data rearrangement, the solution of the present disclosure can guarantee the correctness of the data operation order. Similarly, if the order of the four operation output results (for example pixel data) obtained above is bb0-bb1-bb3-bb2, the post-handling circuit described in connection with Fig. 2 may be used to restore the order of the output results to bb0-bb1-bb2-bb3, so as to ensure the consistency of arrangement between the input data and the output result data.
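The pre-handling rearrangement by logical address, and the post-handling restoration, amount to a permutation and its inverse. A sketch using the Fig. 7b logical order 0-1-3-2 (function names are illustrative):

```python
# Pre-handling: permute inputs so that the item destined for logical
# position k lands on the circuit at that position in the ring.

def reorder_for_ring(data, logical_order):
    return [data[i] for i in logical_order]

# Post-handling: apply the inverse permutation to restore the original
# arrangement of the outputs.

def restore_from_ring(results, logical_order):
    out = [None] * len(results)
    for pos, i in enumerate(logical_order):
        out[i] = results[pos]
    return out
```

Applying `reorder_for_ring` to aa0-aa1-aa2-aa3 with order 0-1-3-2 yields aa0-aa1-aa3-aa2, and `restore_from_ring` maps bb0-bb1-bb3-bb2 back to bb0-bb1-bb2-bb3, matching the example in the text.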
Figs. 8a and 8b are schematic diagrams of further different loop structures of the sub-processing circuits according to an embodiment of the present disclosure, in which more sub-processing circuits are arranged and connected in different ways to form closed loops.
As shown in Fig. 8a, the 16 sub-processing circuits 115 sequentially numbered 0, 1, ..., 15 are, starting from sub-processing circuit 0, sequentially connected and combined two by two to form sub-processing circuit groups. For example, as shown in the figure, sub-processing circuit 0 and sub-processing circuit 1 are connected to form one group, and so on; sub-processing circuits 14 and 15 are connected to form one group, ultimately forming eight sub-processing circuit groups. Further, the eight groups may also be connected in a manner similar to the aforementioned connection of sub-processing circuits, including connection according to, for example, predetermined logical addresses, to form a closed loop of sub-processing circuit groups.
As shown in Fig. 8b, a plurality of sub-processing circuits 115 are connected in an irregular, or non-uniform, manner to form a closed loop. Specifically, Fig. 8b shows that the sub-processing circuits may form a closed loop with spacing numbers of 0 or 3; for example, sub-processing circuit 0 may be connected to sub-processing circuit 1 (spacing number 0) and sub-processing circuit 4 (spacing number 3), respectively.
From the above description in connection with Figs. 7a, 7b, 8a and 8b, the sub-processing circuits of the present disclosure may be spaced by different numbers of sub-processing circuits so as to be connected into closed loops. When the total number of sub-processing circuits changes, any intermediate spacing number may also be selected for dynamic configuration, thereby connecting them into closed loops. Multiple sub-processing circuits may also be combined into sub-processing circuit groups and connected into closed loops of groups. In addition, the connections between the sub-processing circuits may be hard-wired in hardware, or soft connections configured by software.
Fig. 9 is a schematic architecture diagram of the integrated computing device and slave processing circuits according to an embodiment of the present disclosure. Note that this architecture diagram is merely schematic and not limiting. In addition to performing pipeline operations and multi-threaded operations, the solution of the present disclosure may also cooperate with slave processing circuits to perform other types of data operations.
As shown in Fig. 9, an integrated computing device with an architecture similar to Figs. 1-2 includes the main control circuit 102, the first main processing circuit 104 and the second main processing circuit 106. Further, at least one of the first and second main processing circuits may communicate with at least one slave processing circuit 112 through an interconnect circuit 110. The interconnect circuit 110 may be used to forward the data, operation instructions or intermediate operation results transferred between the first or second main processing circuit and the at least one slave processing circuit. In one embodiment, the at least one slave processing circuit may be configured to receive, through the interconnect circuit, data and operation instructions transmitted from at least one of the first and second main processing circuits, to perform intermediate operations in parallel to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results through the interconnect circuit to at least one of the first and second main processing circuits. In another embodiment, the first main processing circuit may be configured to receive and execute the operation instructions in a Single Instruction Multiple Data ("SIMD") manner, while the second main processing circuit may be configured to receive and execute the operation instructions in a Single Instruction Multiple Thread ("SIMT") manner.
Fig. 10 is a simplified flowchart showing a method 1000 of performing operations using an integrated computing device according to an embodiment of the present disclosure. The integrated computing device may employ the architecture shown in Figs. 1-2.
As shown in Fig. 10, at step 1010, method 1000 may use the main control circuit to obtain a compute instruction, parse the compute instruction to obtain an operation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit. In one embodiment, the main control circuit may determine, according to instruction identification information in the compute instruction, the first and/or second main processing circuit that is to perform the operation, and send the operation instruction to at least one of the first and second main processing circuits so that the corresponding operation specified by the operation instruction is executed.
In one or more embodiments, when parsing the compute instruction, the main control circuit may decode the compute instruction and, according to the decoding result, send the operation instruction to at least one of the first and second main processing circuits. When both the first and second main processing circuits support the same, non-specific type of operation, the main control circuit may send the operation instruction, according to the load conditions of the two circuits, to the main processing circuit whose utilization is low or that is idle. Further, depending on the application scenario, the operation instruction obtained after parsing the compute instruction may also be an operation instruction not yet decoded by the main control circuit; in that case the first or second main processing circuit may contain a corresponding decoding circuit that decodes the received operation instruction, for example to generate a plurality of micro-instructions, so that the first or second main processing circuit can execute subsequent operations according to the micro-instructions.
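The routing decision of step 1010 can be sketched as a small dispatch function. All names here are hypothetical illustrations of the behavior described above (route by instruction identifier when one circuit is specified; otherwise fall back to the less-loaded or idle circuit), not the disclosed control logic.

```python
def dispatch(target_id, load_first, load_second):
    """Pick a main processing circuit for an operation instruction.

    target_id: 'first', 'second', or None when the instruction's
    identification information does not name a specific circuit
    (both support the operation type).
    load_first / load_second: current utilization in [0, 1].
    """
    if target_id in ("first", "second"):
        return target_id          # identifier decides directly
    # Non-specific operation: prefer the idle / less-occupied circuit.
    return "first" if load_first <= load_second else "second"

assert dispatch("second", 0.9, 0.1) == "second"  # identifier wins
assert dispatch(None, 0.9, 0.1) == "second"      # second is less loaded
assert dispatch(None, 0.0, 0.5) == "first"       # first is idle
```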
Next, the flow may proceed to step 1020 and/or step 1030 according to the at least one of the first and second main processing circuits determined at step 1010 to perform the next operation. Specifically, when step 1020 is executed, method 1000 may use one or more groups of pipeline arithmetic circuits included in the first main processing circuit to perform pipeline operations according to received data (e.g., neuron data) and the operation instruction. In one embodiment, each of the groups of pipeline arithmetic circuits may perform the pipeline operations independently or cooperatively. Specifically, the groups of pipeline arithmetic circuits of the present disclosure can each complete their own pipeline operations independently and can execute those pipeline operations in parallel with one another; further, these parallel pipeline operations may involve the same or different arithmetic operations. By contrast, when performing the pipeline operations cooperatively, the groups of pipeline arithmetic circuits may, for example according to the compute instruction or a control signal, coordinate with one another, wait for one another, or pass intermediate or result data between groups in order to complete the computation. In another embodiment, each group of pipeline arithmetic circuits may comprise a single pipeline stage (which may include one arithmetic unit or several) or multiple pipeline stages (which may, for example, execute serially or in parallel).
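A group of pipeline arithmetic circuits chained into stages can be modeled as function composition. This is a minimal software sketch under assumed names, not the disclosed circuits; it shows one group with a two-stage pipeline and another with a single stage, each operating independently on its own input as described above.

```python
def make_pipeline(*stages):
    """Compose pipeline stages: the output of each stage feeds the next,
    as in a multi-stage pipeline operation."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

# Two groups of pipeline arithmetic circuits running independently:
group0 = make_pipeline(lambda x: x * 2, lambda x: x + 1)  # two stages
group1 = make_pipeline(lambda x: x - 3)                   # one stage

assert group0(10) == 21   # (10 * 2) + 1
assert group1(10) == 7    # 10 - 3
```

Cooperative execution would correspond to one group's result being passed as another group's input, e.g. `group1(group0(10))`.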
At step 1030, method 1000 may use a plurality of sub-processing circuits included in the second main processing circuit to perform multi-thread operations according to received data (e.g., pixel data) and the operation instruction. In one embodiment, the plurality of sub-processing circuits may be connected in a one-dimensional or multi-dimensional array topology, and the arrays of sub-processing circuits serially connected through such connections may form one or more closed loops. In another embodiment, the plurality of sub-processing circuits may determine, according to information in the received operation instruction (e.g., predicate information), whether to execute the operation of that instruction.
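The predicate check described above can be sketched as follows. The structure of the instruction and the predicate registers is assumed for illustration only: each sub-processing circuit looks up the predicate selected by the serial number carried in the operation instruction and executes the operation only when that predicate is true.

```python
def execute_on_subcircuits(instr, predicate_regs, data):
    """predicate_regs[i][reg_no] holds the predicate information of
    sub-processing circuit i; instr carries the predicate register
    serial number and the operation to apply."""
    results = []
    for i, value in enumerate(data):
        if predicate_regs[i][instr["pred_reg"]]:
            results.append(instr["op"](value))   # circuit participates
        else:
            results.append(value)                # circuit sits this one out
    return results

instr = {"pred_reg": 0, "op": lambda x: x + 100}
preds = [{0: True}, {0: False}, {0: True}]       # per-circuit predicates
assert execute_on_subcircuits(instr, preds, [1, 2, 3]) == [101, 2, 103]
```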
Fig. 11 is a structural diagram of a combined processing device 1100 according to an embodiment of the present disclosure. As shown in Fig. 11, the combined processing device 1100 includes a computing processing device 1102, an interface device 1104, another processing device 1106 and a storage device 1108. Depending on the application scenario, the computing processing device may include one or more computing devices 1110, which may be configured to perform the operations described herein in connection with Figs. 1-10.
In various embodiments, the computing processing device of the present disclosure may be configured to perform user-specified operations. In exemplary applications, the computing processing device may be implemented as a single-core or multi-core artificial intelligence processor. Similarly, the one or more computing devices included in the computing processing device may be implemented as an artificial intelligence processor core or as part of the hardware structure of such a core. When multiple computing devices are implemented as artificial intelligence processor cores or parts of their hardware structure, the computing processing device of the present disclosure may, on its own, be regarded as having a single-core structure or a homogeneous multi-core structure.
In exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to jointly complete user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU) or an artificial intelligence processor. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on, and their number may be determined according to actual needs. As noted above, considered on its own, the computing processing device of the present disclosure may be regarded as having a single-core or homogeneous multi-core structure; when the computing processing device and the other processing devices are considered together, however, the two may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing device may serve as the interface between the computing processing device of the present disclosure (which may be embodied as a device for artificial-intelligence-related operations, such as neural network operations) and external data and control, performing basic control including but not limited to data transfer and starting and/or stopping the computing device. In further embodiments, the other processing device may also cooperate with the computing processing device to jointly complete computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write it into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from the storage device of the computing processing device and transfer it to the other processing devices.
Additionally or optionally, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and to the other processing device, respectively. In one or more embodiments, the storage device may be used to hold data of the computing processing device and/or the other processing device, for example data that cannot be held in its entirety in the internal or on-chip storage of the computing processing device or the other processing device.
In some embodiments, the present disclosure also discloses a chip (e.g., chip 1202 shown in Fig. 12). In one implementation, the chip is a System on Chip (SoC) integrating one or more combined processing devices as shown in Fig. 11. The chip may be connected to other related components through an external interface device (such as external interface device 1206 shown in Fig. 12); such components may be, for example, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface. In some application scenarios, other processing units (e.g., a video codec) and/or interface modules (e.g., a DRAM interface) may also be integrated on the chip. In some embodiments, the present disclosure further discloses a chip package structure including the above chip. In some embodiments, the present disclosure further discloses a board card including the above chip package structure; the board card is described in detail below in connection with Fig. 12.
Fig. 12 is a structural schematic diagram of a board card 1200 according to an embodiment of the present disclosure. As shown in Fig. 12, the board card includes a storage device 1204 for storing data, which includes one or more storage units 1210. The storage device may be connected to, and exchange data with, a control device 1208 and the above-described chip 1202 via, for example, a bus. Further, the board card includes an external interface device 1206 configured for a data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 1212 (e.g., a server or a computer). For example, data to be processed may be passed from the external device to the chip through the external interface device; for another example, the computation results of the chip may be transmitted back to the external device via the external interface device. Depending on the application scenario, the external interface device may take different interface forms; for example, it may adopt a standard PCIe interface.
In one or more embodiments, the control device in the board card of the present disclosure may be configured to regulate the state of the chip. To that end, in one application scenario, the control device may include a micro controller unit (MCU) for regulating the working state of the chip.
From the above description in connection with Figs. 11 and 12, those skilled in the art will understand that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above board cards, one or more of the above chips and/or one or more of the above combined processing devices.
Depending on the application scenario, the electronic device or apparatus of the present disclosure may include a server, cloud server, server cluster, data processing apparatus, robot, computer, printer, scanner, tablet, intelligent terminal, PC device, Internet-of-Things terminal, mobile terminal, mobile phone, dashboard camera, navigator, sensor, webcam, camera, video camera, projector, watch, earphones, mobile storage, wearable device, visual terminal, autonomous-driving terminal, vehicle, household appliance and/or medical device. The vehicles include aircraft, ships and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs. The electronic device or apparatus of the present disclosure may also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and healthcare. Further, the electronic device or apparatus of the present disclosure may also be used at the cloud, the edge or the terminal in application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, electronic devices or apparatuses with high computing power according to the disclosed scheme may be applied to cloud devices (e.g., cloud servers), while electronic devices or apparatuses with low power consumption may be applied to terminal devices and/or edge devices (e.g., smartphones or webcams). In one or more embodiments, the hardware information of a cloud device and the hardware information of a terminal and/or edge device are mutually compatible, so that suitable hardware resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal and/or edge device, to simulate the hardware resources of the terminal and/or edge device, thereby achieving unified management, scheduling and collaborative work between device and cloud, or among cloud, edge and terminal.
It should be noted that, for brevity, the present disclosure presents some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the disclosed scheme is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching herein, those skilled in the art will understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the embodiments described in this disclosure may be regarded as optional embodiments, i.e., the actions or modules involved therein are not necessarily required for implementing one or more of the disclosed schemes. In addition, depending on the scheme, the descriptions of the various embodiments in this disclosure differ in emphasis; in view of this, for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In terms of concrete implementation, based on the disclosure and teaching herein, those skilled in the art will understand that several embodiments disclosed herein may also be implemented in other ways not disclosed in this document. For example, the units in the electronic device or apparatus embodiments described above are divided herein on the basis of logical function, whereas actual implementations may adopt other divisions. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connection relationships between different units or components, the connections discussed above in connection with the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the schemes described in the embodiments of this disclosure. Moreover, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit, or each unit may physically exist separately.
In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. When implemented as a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the scheme of the present disclosure is embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions for causing a computer device (e.g., a personal computer, server or network device) to perform some or all of the steps of the methods described in the embodiments of this disclosure. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, flash disk, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk or optical disc.
In other implementation scenarios, the above integrated units may also be implemented in the form of hardware, i.e., as concrete hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs and ASICs. Further, the aforementioned storage units or storage devices may be any appropriate storage medium (including magnetic or magneto-optical storage media), for example resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM and RAM.
The foregoing may be better understood in light of the following clauses:
Clause 1. An integrated computing device, comprising a main control circuit, a first main processing circuit and a second main processing circuit, wherein:
the main control circuit is configured to obtain a compute instruction, parse the compute instruction to obtain an operation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit;
the first main processing circuit includes one or more groups of pipeline arithmetic circuits, wherein each group of pipeline arithmetic circuits is configured to perform pipeline operations according to received data and the operation instruction; and
the second main processing circuit includes a plurality of sub-processing circuits, wherein each sub-processing circuit is configured to perform multi-thread operations according to received data and the operation instruction.
Clause 2. The integrated computing device of clause 1, wherein in parsing the compute instruction the main control circuit is configured to:
obtain instruction identification information in the compute instruction; and
send the operation instruction, according to the instruction identification information, to at least one of the first main processing circuit and the second main processing circuit.
Clause 3. The integrated computing device of clause 1, wherein in parsing the compute instruction the main control circuit is configured to:
decode the compute instruction; and
send the operation instruction, according to the decoding result and the operating states of the first and second main processing circuits, to at least one of the first main processing circuit and the second main processing circuit.
Clause 4. The integrated computing device of clause 1, wherein the groups of pipeline arithmetic circuits each perform the pipeline operations independently or cooperatively.
Clause 5. The integrated computing device of clause 4, wherein each group of pipeline arithmetic circuits includes one or more arithmetic units, and when each group includes a plurality of arithmetic units, the plurality of arithmetic units are configured to perform multi-stage pipeline operations.
Clause 6. The integrated computing device of clause 1, wherein the first main processing circuit further includes an arithmetic processing circuit configured to pre-process data before the pipeline arithmetic circuits perform operations, or to post-process data after the operations, according to the operation instruction.
Clause 7. The integrated computing device of clause 1, wherein the first main processing circuit further includes a data conversion circuit configured to perform data conversion operations according to the operation instruction.
Clause 8. The integrated computing device of clause 1, wherein the plurality of sub-processing circuits are connected in a one-dimensional or multi-dimensional array topology.
Clause 9. The integrated computing device of clause 8, wherein the multi-dimensional array is a two-dimensional array, and a sub-processing circuit in the two-dimensional array is connected, in at least one of its row direction, column direction or diagonal direction, to one or more of the remaining sub-processing circuits in the same row, same column or same diagonal in a predetermined two-dimensional interval pattern.
Clause 10. The integrated computing device of clause 9, wherein the predetermined two-dimensional interval pattern is associated with the number of sub-processing circuits skipped in the connection.
Clause 11. The integrated computing device of clause 8, wherein the multi-dimensional array is a three-dimensional array composed of a plurality of layers, each layer including a two-dimensional array of the sub-processing circuits arranged along the row and column directions, wherein:
a sub-processing circuit in the three-dimensional array is connected, in at least one of its row direction, column direction, diagonal direction and layer direction, to one or more of the remaining sub-processing circuits in the same row, same column, same diagonal or a different layer in a predetermined three-dimensional interval pattern.
Clause 12. The integrated computing device of clause 11, wherein the predetermined three-dimensional interval pattern relates to the number of mutually skipped sub-processing circuits and the number of skipped layers in the connection.
Clause 13. The integrated computing device of any of clauses 8-12, wherein the plurality of sub-processing circuits serially connected through the connections form one or more closed loops.
Clause 14. The integrated computing device of clause 1, wherein the plurality of sub-processing circuits are configured to determine, according to the operation instruction, whether to participate in the operation.
Clause 15. The integrated computing device of clause 1, wherein each sub-processing circuit includes:
a logic operation circuit configured to perform logic operations according to the operation instruction and data; and
a storage circuit including a data storage circuit, wherein the data storage circuit is configured to store at least one of the operation data and the intermediate operation results of the sub-processing circuit.
Clause 16. The integrated computing device of clause 15, wherein the storage circuit further includes a predicate storage circuit, wherein the predicate storage circuit is configured to store the predicate storage circuit serial number and the predicate information of each sub-processing circuit obtained using the operation instruction.
Clause 17. The integrated computing device of clause 16, wherein the predicate storage circuit is further configured to:
update the predicate information according to the operation instruction; or
update the predicate information according to the operation result of each sub-processing circuit.
Clause 18. The integrated computing device of clause 16, wherein each sub-processing circuit is configured to:
obtain the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the operation instruction; and
determine, according to the predicate information, whether that sub-processing circuit executes the operation instruction.
Clause 19. The integrated computing device of clause 1, wherein each sub-processing circuit includes an arithmetic operation circuit configured to perform arithmetic operations.
Clause 20. The integrated computing device of clause 1, wherein the second main processing circuit further includes a data handling circuit including at least one of a pre-processing circuit and a post-processing circuit, wherein the pre-processing circuit is configured to pre-process operation data before the sub-processing circuits perform operations, and the post-processing circuit is configured to post-process operation results after the sub-processing circuits perform operations.
Clause 21. The integrated computing device of clause 1, wherein the integrated computing device further includes a main storage circuit including at least one of a main storage module and a main cache module, wherein the main storage module is configured to store data used for operations in the main processing circuits and the operation results after execution, and the main cache module is configured to cache intermediate operation results of at least one of the first and second main processing circuits.
Clause 22. The integrated computing device of any of clauses 1-12 or 14-21, further comprising:
at least one slave processing circuit configured to perform intermediate operations in parallel according to data and operation instructions transmitted from at least one of the first and second main processing circuits so as to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to at least one of the first and second main processing circuits.
Clause 23. The integrated computing device of clause 22, wherein the first main processing circuit is configured to receive and execute the operation instruction in SIMD manner.
Clause 24. The integrated computing device of clause 22, wherein the second main processing circuit is configured to receive and execute the operation instruction in SIMT manner.
Clause 25. An integrated circuit chip, comprising the integrated computing device of any of clauses 1-24.
Clause 26. A board card, comprising the integrated circuit chip of clause 25.
Clause 27. A method of performing operations using an integrated computing device, wherein the integrated computing device includes a main control circuit, a first main processing circuit and a second main processing circuit, the method comprising:
using the main control circuit to obtain a compute instruction, parse the compute instruction to obtain an operation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit;
using one or more groups of pipeline arithmetic circuits included in the first main processing circuit to perform pipeline operations according to received data and the operation instruction; and
using a plurality of sub-processing circuits included in the second main processing circuit to perform multi-thread operations according to received data and the operation instruction.
Clause 28. The method of clause 27, wherein in parsing the compute instruction the method uses the main control circuit to perform the following steps:
obtaining instruction identification information in the compute instruction; and
sending the operation instruction, according to the instruction identification information, to at least one of the first main processing circuit and the second main processing circuit.
Clause 29. The method of clause 27, wherein in parsing the compute instruction the method uses the main control circuit to perform the following steps:
decoding the compute instruction; and
sending the operation instruction, according to the decoding result and the operating states of the first and second main processing circuits, to at least one of the first main processing circuit and the second main processing circuit.
Clause 30. The method of clause 27, wherein each of the groups of pipeline arithmetic circuits is used to perform the pipeline operations independently or cooperatively.
Clause 31. The method of clause 30, wherein each group of pipeline arithmetic circuits includes one or more arithmetic units, and when each group includes a plurality of arithmetic units, the method uses the plurality of arithmetic units to perform multi-stage pipeline operations.
Clause 32. The method of clause 27, wherein the first main processing circuit further includes an arithmetic processing circuit, and the method further includes using the arithmetic processing circuit to pre-process data before the pipeline arithmetic circuits perform operations, or to post-process data after the operations, according to the operation instruction.
Clause 33. The method of clause 27, wherein the first main processing circuit further includes a data conversion circuit, and the method further includes using the data conversion circuit to perform data conversion operations according to the operation instruction.
Clause 34. The method of clause 27, wherein the plurality of sub-processing circuits are connected in a one-dimensional or multi-dimensional array topology.
Clause 35. The method of clause 34, wherein the multi-dimensional array is a two-dimensional array, and the sub-processing circuits in the two-dimensional array are connected such that each is connected, in at least one of its row direction, column direction or diagonal direction, to one or more of the remaining sub-processing circuits in the same row, same column or same diagonal in a predetermined two-dimensional interval pattern.
Clause 36. The method of clause 35, wherein the predetermined two-dimensional interval pattern is associated with the number of sub-processing circuits skipped in the connection.
Clause 37. The method of clause 34, wherein the multi-dimensional array is a three-dimensional array composed of a plurality of layers, each layer including a two-dimensional array of the sub-processing circuits arranged along the row and column directions, wherein the method includes:
connecting the sub-processing circuits in the three-dimensional array such that each sub-processing circuit is connected, in at least one of its row direction, column direction, diagonal direction and layer direction, to one or more of the remaining sub-processing circuits in the same row, same column, same diagonal or a different layer in a predetermined three-dimensional interval pattern.
Clause 38. The method of clause 37, wherein the predetermined three-dimensional interval pattern relates to the number of mutually skipped sub-processing circuits and the number of skipped layers in the connection.
Clause 39. The method of any of clauses 34-38, wherein the plurality of sub-processing circuits serially connected through the connections form one or more closed loops.
Clause 40. The method of clause 27, wherein whether the plurality of sub-processing circuits participate in the operation is determined according to the operation instruction.
Clause 41. The method of clause 27, wherein each sub-processing circuit includes a logic operation circuit and a storage circuit, wherein the storage circuit includes a data storage circuit, and the method includes using the logic operation circuit to perform logic operations according to the operation instruction and data, and using the data storage circuit to store at least one of the operation data and the intermediate operation results of the sub-processing circuit.
Clause 42. The method of clause 41, wherein the storage circuit further includes a predicate storage circuit, and the method includes using the predicate storage circuit to store the predicate storage circuit serial number and the predicate information of each sub-processing circuit obtained using the operation instruction.
Clause 43. The method of clause 42, wherein the predicate storage circuit is used to perform the following steps:
updating the predicate information according to the operation instruction; or
updating the predicate information according to the operation result of each sub-processing circuit.
Clause 44. The method of clause 42, wherein each sub-processing circuit is used to perform the following steps:
obtaining the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the operation instruction; and
determining, according to the predicate information, whether that sub-processing circuit executes the operation instruction.
Clause 45. The method of clause 27, wherein each sub-processing circuit includes an arithmetic operation circuit, and the method uses the arithmetic operation circuit to perform arithmetic operations.
Clause 46. The method of clause 27, wherein the second main processing circuit further includes a data handling circuit including at least one of a pre-processing circuit and a post-processing circuit, wherein the method includes using the pre-processing circuit to pre-process operation data before the sub-processing circuits perform operations, and using the post-processing circuit to post-process operation results after the sub-processing circuits perform operations.
Clause 47. The method of clause 27, wherein the integrated computing device further includes a main storage circuit including at least one of a main storage module and a main cache module, wherein the method includes using the main storage module to store data used for operations in the main processing circuits and the operation results after execution, and using the main cache module to cache intermediate operation results of at least one of the first and second main processing circuits.
Clause 48. The method of any of clauses 27-38 or 40-47, wherein the integrated computing device further includes at least one slave processing circuit, and the method includes using the at least one slave processing circuit to perform intermediate operations in parallel according to data and operation instructions transmitted from at least one of the first and second main processing circuits so as to obtain a plurality of intermediate results, and transmitting the plurality of intermediate results to at least one of the first and second main processing circuits.
Clause 49. The method of clause 48, wherein the first main processing circuit is configured to receive and execute the operation instruction in SIMD manner.
Clause 50. The method of clause 48, wherein the second main processing circuit is configured to receive and execute the operation instruction in SIMT manner.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes and substitutions will occur to those skilled in the art without departing from the spirit of the present disclosure. It should be understood that various alternatives to the embodiments described herein may be employed in practicing the present disclosure. The appended claims are intended to define the scope of protection of the present disclosure and therefore to cover equivalents or alternatives within the scope of those claims.

Claims (34)

  1. An integrated computing device, comprising a main control circuit, a first main processing circuit and a second main processing circuit, wherein:
    the main control circuit is configured to obtain a compute instruction, parse the compute instruction to obtain an operation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit;
    the first main processing circuit includes one or more groups of pipeline arithmetic circuits, wherein each group of pipeline arithmetic circuits is configured to perform pipeline operations according to received data and the operation instruction; and
    the second main processing circuit includes a plurality of sub-processing circuits, wherein each sub-processing circuit is configured to perform multi-thread operations according to received data and the operation instruction.
  2. The integrated computing device of claim 1, wherein in parsing the compute instruction the main control circuit is configured to:
    obtain instruction identification information in the compute instruction; and
    send the operation instruction, according to the instruction identification information, to at least one of the first main processing circuit and the second main processing circuit;
    or
    the main control circuit is configured to:
    decode the compute instruction; and
    send the operation instruction, according to the decoding result and the operating states of the first and second main processing circuits, to at least one of the first main processing circuit and the second main processing circuit.
  3. The integrated computing device of claim 1, wherein the groups of pipeline arithmetic circuits are configured to perform the pipeline operations cooperatively.
  4. The integrated computing device of claim 1, wherein each group of pipeline arithmetic circuits includes one or more arithmetic units, and when each group includes a plurality of arithmetic units, the plurality of arithmetic units are configured to perform multi-stage pipeline operations.
  5. The integrated computing device of claim 1, wherein the first main processing circuit further includes an arithmetic processing circuit and/or a data conversion circuit, wherein the arithmetic processing circuit is configured to pre-process data before the pipeline arithmetic circuits perform operations, or to post-process data after the operations, according to the operation instruction, and the data conversion circuit is configured to perform data conversion operations according to the operation instruction.
  6. The integrated computing device of claim 1, wherein the plurality of sub-processing circuits are connected in a one-dimensional or multi-dimensional array structure.
  7. The integrated computing device of claim 6, wherein the multi-dimensional array is a two-dimensional array, and a sub-processing circuit in the two-dimensional array is connected, in at least one of its row direction, column direction or diagonal direction, to one or more of the remaining sub-processing circuits in the same row, same column or same diagonal in a predetermined two-dimensional interval pattern, wherein the predetermined two-dimensional interval pattern is associated with the number of sub-processing circuits skipped in the connection.
  8. The integrated computing device of claim 6, wherein the multi-dimensional array is a three-dimensional array composed of a plurality of layers, each layer including a two-dimensional array of the sub-processing circuits arranged along the row, column and diagonal directions, wherein:
    a sub-processing circuit in the three-dimensional array is connected, in at least one of its row direction, column direction, diagonal direction and layer direction, to one or more of the remaining sub-processing circuits in the same row, same column, same diagonal or a different layer in a predetermined three-dimensional interval pattern, wherein the predetermined three-dimensional interval pattern relates to the interval number and the number of intervening layers between the processing circuits to be connected.
  9. The integrated computing device of any of claims 6-8, wherein the plurality of sub-processing circuits serially connected through the connections form one or more closed loops.
  10. The integrated computing device of claim 1, wherein the plurality of sub-processing circuits are configured to determine, according to the operation instruction, whether to participate in the operation.
  11. The integrated computing device of claim 1, wherein each sub-processing circuit includes:
    a logic operation circuit configured to perform logic operations according to the operation instruction and data; and
    a storage circuit including a data storage circuit and a predicate storage circuit,
    wherein the data storage circuit is configured to store at least one of the operation data and the intermediate operation results of the sub-processing circuit,
    wherein the predicate storage circuit is configured to:
    store the predicate storage circuit serial number and the predicate information of each sub-processing circuit obtained using the operation instruction;
    update the predicate information according to the operation instruction; or
    update the predicate information according to the operation result of each sub-processing circuit.
  12. The integrated computing device of claim 11, wherein each sub-processing circuit is configured to:
    obtain the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the operation instruction; and
    determine, according to the predicate information, whether that sub-processing circuit executes the operation instruction.
  13. The integrated computing device of claim 1, wherein each sub-processing circuit includes an arithmetic operation circuit configured to perform arithmetic operations.
  14. The integrated computing device of claim 1, wherein the second main processing circuit further includes a data handling circuit including at least one of a pre-processing circuit and a post-processing circuit, wherein the pre-processing circuit is configured to pre-process operation data before the sub-processing circuits perform operations, and the post-processing circuit is configured to post-process operation results after the sub-processing circuits perform operations.
  15. The integrated computing device of claim 1, wherein the integrated computing device further includes a main storage circuit including at least one of a main storage module and a main cache module, wherein the main storage module is configured to store data used for operations in the main processing circuits and the operation results after execution, and the main cache module is configured to cache intermediate operation results of at least one of the first and second main processing circuits.
  16. The integrated computing device of any of claims 1-8 or 10-15, further comprising:
    at least one slave processing circuit configured to perform intermediate operations in parallel according to data and operation instructions transmitted from at least one of the first and second main processing circuits so as to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to at least one of the first and second main processing circuits.
  17. The integrated computing device of claim 16, wherein the first main processing circuit is configured to receive and execute the operation instruction in SIMD manner and the second main processing circuit is configured to receive and execute the operation instruction in SIMT manner.
  18. An integrated circuit chip, comprising the integrated computing device of any of claims 1-17.
  19. A board card, comprising the integrated circuit chip of claim 18.
  20. A method of performing operations using an integrated computing device, wherein the integrated computing device includes a main control circuit, a first main processing circuit and a second main processing circuit, the method comprising:
    using the main control circuit to obtain a compute instruction, parse the compute instruction to obtain an operation instruction, and send the operation instruction to at least one of the first main processing circuit and the second main processing circuit;
    using one or more groups of pipeline arithmetic circuits included in the first main processing circuit to perform pipeline operations according to received data and the operation instruction; and
    using a plurality of sub-processing circuits included in the second main processing circuit to perform multi-thread operations according to received data and the operation instruction.
  21. The method of claim 20, wherein in parsing the compute instruction the method uses the main control circuit to perform the following steps:
    obtaining instruction identification information in the compute instruction; and
    sending the operation instruction, according to the instruction identification information, to at least one of the first main processing circuit and the second main processing circuit;
    or
    the method uses the main control circuit to perform the following steps:
    decoding the compute instruction; and
    sending the operation instruction, according to the decoding result and the operating states of the first and second main processing circuits, to at least one of the first main processing circuit and the second main processing circuit.
  22. The method of claim 20, wherein the groups of pipeline arithmetic circuits are used to perform the pipeline operations cooperatively.
  23. The method of claim 20, wherein each group of pipeline arithmetic circuits includes one or more arithmetic units, and when each group includes a plurality of arithmetic units, the method uses the plurality of arithmetic units to perform multi-stage pipeline operations.
  24. The method of claim 20, wherein the first main processing circuit further includes an arithmetic processing circuit and/or a data conversion circuit, wherein the method uses the arithmetic processing circuit to pre-process data before the pipeline arithmetic circuits perform operations, or to post-process data after the operations, according to the operation instruction, and uses the data conversion circuit to perform data conversion operations according to the operation instruction.
  25. The method of claim 20, wherein the plurality of sub-processing circuits are connected in a one-dimensional or multi-dimensional array structure.
  26. The method of claim 25, wherein the multi-dimensional array is a two-dimensional array, and the sub-processing circuits in the two-dimensional array are connected such that each is connected, in at least one of its row direction, column direction or diagonal direction, to one or more of the remaining sub-processing circuits in the same row, same column or same diagonal in a predetermined two-dimensional interval pattern, wherein the predetermined two-dimensional interval pattern is associated with the number of sub-processing circuits skipped in the connection.
  27. The method of claim 25, wherein the multi-dimensional array is a three-dimensional array composed of a plurality of layers, each layer including a two-dimensional array of the sub-processing circuits arranged along the row, column and diagonal directions, wherein the method includes:
    connecting the sub-processing circuits in the three-dimensional array such that each sub-processing circuit is connected, in at least one of its row direction, column direction, diagonal direction and layer direction, to one or more of the remaining sub-processing circuits in the same row, same column, same diagonal or a different layer in a predetermined three-dimensional interval pattern, wherein the predetermined three-dimensional interval pattern relates to the interval number and the number of intervening layers between the processing circuits to be connected.
  28. The method of any of claims 25-27, wherein the plurality of sub-processing circuits serially connected through the connections form one or more closed loops.
  29. The method of claim 20, wherein whether the plurality of sub-processing circuits participate in the operation is determined according to the operation instruction.
  30. The method of claim 20, wherein each sub-processing circuit includes a logic operation circuit and a storage circuit, the storage circuit including a data storage circuit and a predicate storage circuit, wherein the method includes using the logic operation circuit to perform logic operations according to the operation instruction and data, using the data storage circuit to store at least one of the operation data and the intermediate operation results of the sub-processing circuit, and using the predicate storage circuit to perform the following steps:
    storing the predicate storage circuit serial number and the predicate information of each sub-processing circuit obtained using the operation instruction;
    updating the predicate information according to the operation instruction; or
    updating the predicate information according to the operation result of each sub-processing circuit.
  31. The method of claim 30, wherein each sub-processing circuit is used to perform the following steps:
    obtaining the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the operation instruction; and
    determining, according to the predicate information, whether that sub-processing circuit executes the operation instruction.
  32. The method of claim 20, wherein the second main processing circuit further includes a data handling circuit including at least one of a pre-processing circuit and a post-processing circuit, wherein the method includes using the pre-processing circuit to pre-process operation data before the sub-processing circuits perform operations, and using the post-processing circuit to post-process operation results after the sub-processing circuits perform operations.
  33. The method of any of claims 20-27 or 29-32, wherein the integrated computing device further includes at least one slave processing circuit, and the method further includes using the at least one slave processing circuit to perform intermediate operations in parallel according to data and operation instructions transmitted from at least one of the first and second main processing circuits so as to obtain a plurality of intermediate results, and transmitting the plurality of intermediate results to at least one of the first and second main processing circuits.
  34. The method of claim 33, wherein the first main processing circuit is configured to receive and execute the operation instruction in SIMD manner and the second main processing circuit is configured to receive and execute the operation instruction in SIMT manner.
PCT/CN2021/094721 2020-06-30 2021-05-19 Integrated computing device, integrated circuit chip, board card and computing method WO2022001454A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010618148.7 2020-06-30
CN202010618148.7A CN113867798A (zh) 2020-06-30 Integrated computing device, integrated circuit chip, board card and computing method

Publications (1)

Publication Number Publication Date
WO2022001454A1 true WO2022001454A1 (zh) 2022-01-06

Family

ID=78981594

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/094721 WO2022001454A1 (zh) 2020-06-30 2021-05-19 Integrated computing device, integrated circuit chip, board card and computing method

Country Status (2)

Country Link
CN (1) CN113867798A (zh)
WO (1) WO2022001454A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758213A (zh) * 2004-02-27 2006-04-12 Heterogeneous parallel multithread processor (HPMT) with shared contents
CN101952801A (zh) * 2008-01-16 2011-01-19 Co-processor for stream data processing
CN105144082A (zh) * 2012-12-28 2015-12-09 Optimal logical processor count and type selection for a given workload based on platform thermal and power budget constraints
CN109388428A (zh) * 2017-08-11 2019-02-26 Layer traversal method, control device and data processing system
CN110121698A (zh) * 2016-12-31 2019-08-13 Systems, methods and apparatus for heterogeneous computing


Also Published As

Publication number Publication date
CN113867798A (zh) 2021-12-31


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21834494

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21834494

Country of ref document: EP

Kind code of ref document: A1