WO2022001455A1 - Computing device, integrated circuit chip, board card, electronic device and computing method - Google Patents

Computing device, integrated circuit chip, board card, electronic device and computing method

Info

Publication number
WO2022001455A1
WO2022001455A1 PCT/CN2021/094722 CN2021094722W WO2022001455A1 WO 2022001455 A1 WO2022001455 A1 WO 2022001455A1 CN 2021094722 W CN2021094722 W CN 2021094722W WO 2022001455 A1 WO2022001455 A1 WO 2022001455A1
Authority
WO
WIPO (PCT)
Prior art keywords
stage
pipeline
circuit
data
arithmetic
Prior art date
Application number
PCT/CN2021/094722
Other languages
English (en)
French (fr)
Inventor
喻歆
刘少礼
陶劲桦
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司 (Shanghai Cambricon Information Technology Co., Ltd.)
Priority to JP2021576558A (patent JP7368512B2)
Priority to US18/013,589 (publication US20230297387A1)
Publication of WO2022001455A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003: Arrangements for executing specific machine instructions
    • G06F 9/30007: Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/3001: Arithmetic instructions
    • G06F 9/30014: Arithmetic instructions with variable precision
    • G06F 9/30025: Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • G06F 9/3004: Arrangements for executing specific machine instructions to perform operations on memory
    • G06F 9/30145: Instruction analysis, e.g. decoding, instruction word fields
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3824: Operand accessing
    • G06F 9/3826: Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • G06F 9/3828: Bypassing or forwarding of data results with global bypass, e.g. between pipelines, between clusters
    • G06F 9/3867: Concurrent instruction execution using instruction pipelines
    • G06F 9/3871: Asynchronous instruction pipeline, e.g. using handshake signals between stages

Definitions

  • the present disclosure relates generally to the field of computing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, an electronic device, and a computing method.
  • An instruction set is a set of instructions for performing computations and controlling the computing system, and plays a key role in improving the performance of computing chips (e.g., processors) in the computing system.
  • Various current computing chips can complete various general or specific control operations and data processing operations by using associated instruction sets.
  • However, current instruction sets still have many defects. For example, the existing instruction sets are constrained by the hardware architecture and offer limited flexibility.
  • Moreover, many instructions can each complete only a single operation, so executing multiple operations usually requires multiple instructions, potentially resulting in increased on-chip I/O data throughput.
  • In addition, current instructions have room for improvement in execution speed, execution efficiency, and on-chip power consumption.
  • the present disclosure provides a hardware architecture with one or more groups of pipeline operation circuits that support multi-stage pipeline operations.
  • the solution of the present disclosure can obtain technical advantages in various aspects including enhancing the processing performance of the hardware, reducing power consumption, improving the execution efficiency of computing operations, and avoiding computing overhead.
  • In one aspect, the present disclosure provides a computing device comprising: one or more groups of pipeline operation circuits configured to perform multi-stage pipeline operations, wherein each group of the pipeline operation circuits constitutes a multi-stage operation pipeline, and the multi-stage operation pipeline includes a plurality of operation circuits arranged stage by stage.
  • In response to receiving a plurality of operation instructions, each stage of operation circuit in the multi-stage operation pipeline is configured to execute a corresponding one of the plurality of operation instructions, wherein the plurality of operation instructions are obtained by parsing a calculation instruction received by the computing device.
  • the present disclosure provides an integrated circuit chip including a computing device as described above and described in various embodiments below.
  • the present disclosure provides a board including an integrated circuit chip as described above and described in various embodiments below.
  • the present disclosure provides an electronic device comprising an integrated circuit chip as described above and described in various embodiments below.
  • In yet another aspect, the present disclosure provides a method of performing computation using the aforementioned computing device, wherein the computing device includes one or more groups of pipeline operation circuits. The method comprises: configuring each of the one or more groups of pipeline operation circuits to perform multi-stage pipeline operations, wherein each group of the pipeline operation circuits constitutes a multi-stage operation pipeline, and the multi-stage operation pipeline includes a plurality of operation circuits arranged stage by stage; and, in response to receiving a plurality of operation instructions, configuring each stage of operation circuit in the multi-stage operation pipeline to execute a corresponding one of the plurality of operation instructions, wherein the plurality of operation instructions are obtained by parsing a calculation instruction received by the computing device.
  • With the computing device, integrated circuit chip, board, electronic device, and method of the present disclosure, pipeline operations, especially various multi-stage pipeline operations in the field of artificial intelligence, can be performed efficiently. Further, the solution of the present disclosure can realize efficient computing operations by means of a unique hardware architecture, thereby improving the overall performance of the hardware and reducing the computing overhead.
  • FIG. 1 is a block diagram illustrating a computing device according to one embodiment of the present disclosure.
  • FIG. 2 is a block diagram illustrating a computing device according to another embodiment of the present disclosure.
  • FIGS. 3a, 3b and 3c are schematic diagrams illustrating matrix conversion performed by a data conversion circuit according to an embodiment of the present disclosure.
  • FIG. 4 is a block diagram illustrating a computing system according to an embodiment of the present disclosure.
  • FIG. 5 is a simplified flowchart illustrating a method of using a computing device to perform an arithmetic operation in accordance with an embodiment of the present disclosure.
  • FIG. 6 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • the disclosed solution provides a hardware architecture that supports multi-stage pipeline operations.
  • the computing device includes at least one or more groups of pipeline operation circuits, wherein each group of pipeline operation circuits can constitute a multi-stage operation pipeline of the present disclosure.
  • In the multi-stage operation pipeline, a plurality of operation circuits can be arranged stage by stage.
  • each stage of the operation circuit in the aforementioned multi-stage operation pipeline may be configured to execute a corresponding one of the multiple operation instructions.
  • FIG. 1 is a block diagram illustrating a computing device 100 according to one embodiment of the present disclosure.
  • the computing device 100 may include one or more groups of pipeline operation circuits, such as the first group of pipeline operation circuits 102 , the second group of pipeline operation circuits 104 and the third group of pipeline operation circuits as shown in the figure.
  • Circuit 106 wherein each group of the pipelined arithmetic circuits may constitute a multi-stage arithmetic pipeline in the context of the present disclosure.
  • Taking the first group of pipeline operation circuits 102, which constitutes the first multi-stage operation pipeline, as an example: it can perform N stages of pipeline operations in total, including stage 1-1, stage 1-2, stage 1-3, and so on up to stage 1-N.
  • the second and third groups of pipeline operation circuits also have a structure that supports N-stage pipeline operations.
  • multiple groups of pipeline operation circuits of the present disclosure can constitute multiple multi-stage operation pipelines, and the multiple multi-stage operation pipelines can execute respective multiple operation instructions in parallel.
  • an arithmetic circuit including one or more arithmetic units may be arranged at each stage to execute corresponding arithmetic instructions, so as to realize the arithmetic operations at this stage.
  • In one or more embodiments, the one or more groups of pipeline operation circuits of the present disclosure may be configured to perform operations on multiple data, such as single instruction multiple data ("SIMD") instructions.
  • the foregoing multiple operation instructions may be obtained by parsing the calculation instructions received by the computing device 100 , and the operation codes of the calculation instructions may represent multiple operations performed by the multi-stage operation pipeline.
  • the operation code and the plurality of operations represented by the operation code are predetermined according to the functions supported by the plurality of operation circuits arranged stage by stage in the multi-stage operation pipeline.
  • In some embodiments, in addition to performing stage-by-stage operations within the multi-stage operation pipeline it forms, each group of pipeline operation circuits can be configured to be selectively connected according to the plurality of operation instructions, so as to complete the operations corresponding to those instructions.
  • In some embodiments, the multiple multi-stage operation pipelines of the present disclosure may include a first multi-stage operation pipeline and a second multi-stage operation pipeline, wherein the outputs of the operation circuits of one or more stages of the first multi-stage operation pipeline are configured to be connected, according to the operation instruction, to the inputs of the operation circuits of one or more stages of the second multi-stage operation pipeline.
  • For example, the stage 1-2 pipeline operation circuit in the first multi-stage operation pipeline shown in the figure can, according to the operation instruction, input its operation result into the stage 2-3 pipeline operation circuit in the second multi-stage operation pipeline.
  • Similarly, the stage 2-1 pipeline operation circuit in the second multi-stage operation pipeline shown in the figure can, according to the operation instruction, input its operation result into the stage 3-3 pipeline operation circuit in the third multi-stage operation pipeline.
  • Further, pipeline stages in different operation pipelines can transfer operation results bidirectionally according to different operation instructions, for example between the stage 2-2 pipeline operation circuit in the second multi-stage operation pipeline and the stage 3-2 pipeline operation circuit in the third multi-stage operation pipeline.
  • each stage of the operation circuit in the multiple groups of operation pipelines of the present disclosure may have an input terminal and an output terminal.
  • Through the input terminal, each operation circuit receives input data, and through the output terminal it outputs the operation result of its stage.
  • Within a single operation pipeline, the outputs of the operation circuits of one or more stages may also be configured to be connected, according to an operation instruction, to the inputs of the operation circuits of one or more other stages in order to execute the instruction.
  • For example, the result of the stage 1-1 pipeline operation can, according to the operation instruction, be input to the stage 1-3 pipeline operation in the same operation pipeline.
  • In this context, the aforementioned plurality of operation instructions may be micro-instructions or control signals running inside the computing device (or processing circuit, or processor), and may include (or instruct the computing device to perform) one or more operations.
  • These operations may include, but are not limited to, addition operations, multiplication operations, convolution operations, and pooling operations.
  • Each stage of operation circuit that performs a stage of the pipeline operation may include, but is not limited to, one or more of the following operators or circuits: a random number processing circuit, an addition and subtraction circuit, a subtraction circuit, a table look-up circuit, a parameter configuration circuit, a multiplier, a pooler, a comparator, an absolute value circuit, a logic operator, a position index circuit, or a filter.
  • Taking a pooler as an example, it can be constituted by operators such as an adder, a divider, and a comparator, so as to perform the pooling operation in a neural network.
  • the present disclosure can also provide corresponding calculation instructions according to the operation supported by the operation circuit in the multi-stage pipeline operation, so as to realize the multi-stage pipeline operation.
  • In formula (1), src0 to src4 are source operands, and op0 to op3 are opcodes.
  • the type, order, and number of opcodes for the computing instructions of the present disclosure may vary according to different pipelined circuit architectures and supported operations.
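  • To make the opcode/operand structure concrete, the sketch below parses a fused calculation instruction into one micro-operation per pipeline stage. The textual format "OP0.OP1.OP2 dst, src0, src1, ..." is an illustrative assumption for this example, not the encoding defined in the disclosure.

```python
def parse_calculation_instruction(text):
    """Split one fused calculation instruction into an ordered list of
    stage opcodes plus its operand list."""
    opcode_field, operand_field = text.split(None, 1)
    opcodes = opcode_field.split(".")          # e.g. ["FPMULT", "FPADD", "RELU"]
    operands = [o.strip() for o in operand_field.split(",")]
    # One micro-instruction per stage: stage i executes opcodes[i].
    micro_ops = [{"stage": i, "op": op} for i, op in enumerate(opcodes)]
    return micro_ops, operands

micro_ops, operands = parse_calculation_instruction(
    "FPMULT.FPADD.RELU result, srcA, srcB, srcC")
```

The point of the sketch is only that a single calculation instruction can carry several opcodes, each dispatched to a different stage of the operation pipeline.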
  • In one or more embodiments, the multi-stage pipeline operation of the present disclosure can support unary operations (i.e., the case where there is only one item of input data).
  • Taking the operation result = relu(a * ina + b) as an example, a group of three-stage pipeline operation circuits of the present disclosure, including a multiplier, an adder, and a nonlinear operator, can be applied to perform the operation.
  • the multiplier of the first-stage pipeline can be used to calculate the product of the input data ina and a to obtain the first-stage pipeline operation result.
  • the adder of the second-stage pipeline can be used to perform an addition operation on the first-stage pipeline operation result (a*ina) and b to obtain the second-stage pipeline operation result.
  • Finally, the relu activation function of the third-stage pipeline can be used to activate the second-stage pipeline operation result (a*ina+b) to obtain the final operation result.
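  • The three-stage flow described above can be sketched in software as follows; the stage functions here are illustrative stand-ins for the multiplier, adder, and nonlinear operator circuits, and the example values are chosen only for demonstration.

```python
def relu(x):
    """Stand-in for the nonlinear operator of the third pipeline stage."""
    return x if x > 0 else 0.0

def three_stage_pipeline(ina, a, b):
    stage1 = a * ina          # stage 1: multiplier computes a * ina
    stage2 = stage1 + b       # stage 2: adder computes a*ina + b
    return relu(stage2)       # stage 3: nonlinear operator applies relu

result = three_stage_pipeline(2.0, 3.0, -1.0)   # relu(3*2 - 1) = 5.0
```

In a hardware pipeline these three steps would overlap across successive inputs; the sequential model above only shows the data path of a single operation.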
  • The pipeline can likewise express a convolution operation, in which the two items of input data, ina and inb, can be, for example, neuron data.
  • an addition operation may be performed on the first-stage pipeline operation result "product” by using the addition tree in the second-stage pipeline operation circuit to obtain the second-stage pipeline operation result sum.
  • Finally, the nonlinear operator of the third-stage pipeline operation circuit can be used to perform an activation operation on "sum" to obtain the final convolution operation result.
  • In some application scenarios, the solution of the present disclosure can bypass one or more stages of pipeline operation circuits that are not used in a given operation. That is, one or more stages of the multi-stage pipeline can be used selectively according to the needs of the operation, without passing the data through every stage of the pipeline.
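  • The bypass idea can be modeled as a set of stage functions of which only the enabled ones are applied; the stage names and functions below are illustrative assumptions, not the circuits defined in the patent.

```python
# Each stage of the pipeline is modeled as a function; the operation
# instructions decide which stages process the data and which are bypassed.
STAGES = {
    "multiply": lambda x: x * 3.0,     # stand-in for a multiplier stage
    "add":      lambda x: x + 1.0,     # stand-in for an adder stage
    "activate": lambda x: max(x, 0.0), # stand-in for a nonlinear stage
}

def run_pipeline(x, enabled_stages):
    """Pass x through the enabled stages in order, bypassing the rest."""
    for name in ("multiply", "add", "activate"):
        if name in enabled_stages:
            x = STAGES[name](x)
    return x

full = run_pipeline(2.0, {"multiply", "add", "activate"})   # all stages used
partial = run_pipeline(2.0, {"add"})                        # other stages bypassed
```

Here `full` evaluates relu(2*3 + 1) = 7.0, while `partial` applies only the adder, giving 3.0.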
  • FIG. 2 is a block diagram illustrating a computing device 200 according to another embodiment of the present disclosure.
  • the computing device 200 additionally includes a control circuit 202 and a data processing circuit 204 .
  • In one embodiment, the control circuit 202 may be configured to obtain the calculation instruction described above and parse it to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the opcodes, such as those represented by formula (1).
  • Further, the data processing circuit 204 may include a data conversion circuit 206 and a data splicing circuit 208.
  • When the calculation instruction includes a preprocessing operation for the pipeline operation, such as a data conversion operation or a data splicing operation, the data conversion circuit 206 or the data splicing circuit 208 performs the corresponding conversion or splicing operation according to the calculation instruction.
  • the conversion operation and the splicing operation will be described below with an example.
  • Taking the data conversion circuit as an example, it can convert input data to a lower bit width according to the requirements of the operation; for example, the bit width of the output data may be 512 bits.
  • Accordingly, the data conversion circuit can support conversion between multiple data types with different bit widths, such as FP16 (16-bit floating point), FP32 (32-bit floating point), FIX8 (8-bit fixed point), FIX4 (4-bit fixed point), and FIX16 (16-bit fixed point).
  • the data conversion operation may be a conversion performed on the arrangement positions of the matrix elements.
  • The transformation may include, for example, matrix transposition and mirroring (described later in conjunction with FIGS. 3a-3c), matrix rotation by a predetermined angle (e.g., 90 degrees, 180 degrees or 270 degrees), and transformation of matrix dimensions.
  • Taking the data splicing circuit as an example, it can perform operations such as parity splicing on data blocks extracted from the data according to, for example, a bit length set in the instruction. For example, when the data is 32 bits wide, the data splicing circuit can divide it, using a 4-bit block width, into eight data blocks numbered 1 to 8, then splice data blocks 1, 3, 5 and 7 together and data blocks 2, 4, 6 and 8 together for the operation.
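  • The parity splicing just described can be sketched in plain Python on a 32-bit word. Numbering the 4-bit blocks 1 to 8 from the low end is an assumption made here for illustration; the actual block ordering is a hardware design choice.

```python
def split_blocks(word, block_bits=4, n_blocks=8):
    """Cut a word into n_blocks fields of block_bits each, low end first."""
    mask = (1 << block_bits) - 1
    return [(word >> (i * block_bits)) & mask for i in range(n_blocks)]

def parity_splice(word):
    blocks = split_blocks(word)   # blocks[0] is block 1, blocks[7] is block 8
    odd  = blocks[0::2]           # blocks 1, 3, 5, 7
    even = blocks[1::2]           # blocks 2, 4, 6, 8
    join = lambda bs: sum(b << (4 * i) for i, b in enumerate(bs))
    return join(odd), join(even)  # two spliced 16-bit halves

odd_half, even_half = parity_splice(0x87654321)
```

For the word 0x87654321 the 4-bit blocks from the low end are 1 through 8, so the odd-numbered blocks splice to 0x7531 and the even-numbered blocks to 0x8642.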
  • In another application scenario, the above data splicing operation may also be performed on data M (for example, a vector) obtained after an operation is performed. Suppose the data splicing circuit first splits the lower 256 bits of each even-numbered row of M, using an 8-bit width as one unit of data, to obtain 32 even-row unit data (denoted M_2i 0 to M_2i 31). Similarly, the lower 256 bits of each odd-numbered row of M can be split with an 8-bit unit width to obtain 32 odd-row unit data (denoted M_(2i+1) 0 to M_(2i+1) 31).
  • The 32 even-row unit data and 32 odd-row unit data thus obtained are then arranged alternately, even row first and then odd row.
  • Specifically, even-row unit data 0 (M_2i 0) is placed in the lowest position, followed in order by odd-row unit data 0 (M_(2i+1) 0), then even-row unit data 1 (M_2i 1), and so on.
  • In this way, the 64 units of data are spliced together to form a new piece of data with a bit width of 512 bits.
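  • The interleaving step above can be modeled directly: two lists of 32 byte-sized units (stand-ins for the even-row and odd-row unit data) are alternated, even unit first, to produce one 64-unit (512-bit) result. The numeric values used are placeholders.

```python
def interleave_units(even_units, odd_units):
    """Alternate even-row and odd-row unit data: M_2i k, then M_(2i+1) k."""
    assert len(even_units) == len(odd_units) == 32
    out = []
    for e, o in zip(even_units, odd_units):
        out.extend((e, o))
    return out  # 64 units of 8 bits each = 512 bits

even = list(range(32))         # stand-ins for M_2i 0 .. M_2i 31
odd  = list(range(100, 132))   # stand-ins for M_(2i+1) 0 .. M_(2i+1) 31
spliced = interleave_units(even, odd)
```

The result places even-row unit 0 lowest, then odd-row unit 0, then even-row unit 1, matching the ordering described in the text.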
  • the data conversion circuit and the data splicing circuit in the data processing unit can be used together to perform pre-processing or post-processing of data more flexibly.
  • the data processing unit may perform only data conversion without data stitching, only data stitching without data conversion, or both data conversion and data stitching.
  • In some application scenarios, the data processing unit may be configured to disable both the data conversion circuit and the data splicing circuit.
  • In other scenarios, the data processing unit may be configured to enable the data conversion circuit and the data splicing circuit to perform post-processing on the intermediate result data, so as to obtain the final operation result.
  • computing device 200 also includes storage circuitry 210 .
  • In one or more embodiments, the storage circuit of the present disclosure may include a main storage module and/or a main cache module, wherein the main storage module is configured to store the data used in the multi-stage pipeline operations and the operation results after the operations are performed, and the main cache module is configured to cache the intermediate operation results produced during the multi-stage pipeline operations.
  • the storage circuit may also have an interface for data transmission with an off-chip storage medium, so that data transfer between on-chip and off-chip systems can be realized.
  • FIGS. 3a, 3b and 3c are schematic diagrams illustrating matrix conversion performed by a data conversion circuit according to an embodiment of the present disclosure.
  • To illustrate the conversion operation performed by the data conversion circuit 206, the transposition operation and the horizontal mirroring operation performed on an original matrix are described further below as examples.
  • the original matrix is a matrix of (M+1) rows by (N+1) columns.
  • the data conversion circuit can perform a transposition operation on the original matrix shown in FIG. 3a to obtain the matrix shown in FIG. 3b.
  • the data conversion circuit can exchange the row numbers and column numbers of elements in the original matrix to form a transposed matrix.
  • For example, the element "10" located at row 1, column 0 of the original matrix has the coordinates row 0, column 1 in the transposed matrix shown in FIG. 3b.
  • Likewise, the element "M0" located at row M, column 0 of the original matrix has the coordinates row 0, column M in the transposed matrix shown in FIG. 3b.
  • the data conversion circuit may perform a horizontal mirror operation on the original matrix shown in FIG. 3a to form a horizontal mirror matrix.
  • Specifically, through the horizontal mirror operation the data conversion circuit reverses the row order of the original matrix, converting the arrangement from first row to last row into the arrangement from last row to first row, while the column numbers of the elements remain unchanged.
  • For example, the element "00" at row 0, column 0 and the element "10" at row 1, column 0 of the original matrix shown in FIG. 3a have the coordinates row M, column 0 and row M-1, column 0, respectively, in the horizontal mirror matrix shown in FIG. 3c.
  • Likewise, the element "M0" at row M, column 0 of the original matrix has the coordinates row 0, column 0 in the horizontal mirror matrix.
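  • The two conversions can be modeled in a few lines: transposition swaps each element's row and column indices, and horizontal mirroring reverses the row order while keeping column numbers unchanged. The small 3x2 matrix below is a toy instance of the (M+1)-row original matrix.

```python
def transpose(m):
    """Swap row and column indices of every element."""
    return [list(row) for row in zip(*m)]

def horizontal_mirror(m):
    """Reverse the row order; column numbers stay the same."""
    return [list(row) for row in reversed(m)]

original = [["00", "01"],
            ["10", "11"],
            ["M0", "M1"]]          # 3 rows x 2 columns, rows 0..M with M = 2

t = transpose(original)            # "10" moves from (1, 0) to (0, 1)
h = horizontal_mirror(original)    # "M0" moves from row M to row 0
```

This mirrors the coordinate examples in the text: after transposition "M0" sits at row 0, column M, and after horizontal mirroring it sits at row 0, column 0.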
  • From the above, it can be understood that the computing device of the present disclosure can execute calculation instructions that include the aforementioned preprocessing and postprocessing operations.
  • Two illustrative examples of computing instructions in accordance with the present disclosure will be given below:
  • The calculation instruction expressed in formula (2) above takes three input operands and produces one output operand, and it is completed through micro-instructions by a group of pipeline operation circuits of the present disclosure that performs a three-stage pipeline operation (i.e., multiply + add/subtract + activate).
  • Specifically, for the ternary operation A*B+C, the FPMULT micro-instruction completes the floating-point multiplication of operands A and B to obtain the product value, i.e., the first-stage pipeline operation. Next, the FPADD or FPSUB micro-instruction completes the floating-point addition or subtraction of the product value and C to obtain the sum or difference, i.e., the second-stage pipeline operation. The result then undergoes the activation operation RELU, i.e., the third-stage pipeline operation.
  • Finally, the CONVERTFP2FIX micro-instruction can be executed through the type conversion circuit described above, to convert the result data of the activation operation from floating point to fixed point, so that it can be output as the final result or fed as an intermediate result into a fixed-point operator for further operations.
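  • The formula (2) sequence can be walked through in software as below. The function names track the micro-instructions (FPMULT, FPADD, RELU, CONVERTFP2FIX), while the 4-fraction-bit fixed-point format chosen for the conversion is an assumption for illustration only.

```python
def fpmult(a, b):            # stage 1: floating-point multiply (FPMULT)
    return a * b

def fpadd(x, c):             # stage 2: floating-point add (FPADD)
    return x + c

def relu(x):                 # stage 3: activation (RELU)
    return x if x > 0 else 0.0

def convert_fp2fix(x, frac_bits=4):
    """Model of CONVERTFP2FIX: round to a fixed-point integer with
    frac_bits fractional bits (no saturation handling)."""
    return round(x * (1 << frac_bits))

a, b, c = 1.5, 2.0, -1.0
activated = relu(fpadd(fpmult(a, b), c))   # relu(1.5*2.0 - 1.0) = 2.0
fixed = convert_fp2fix(activated)          # 2.0 -> 32 in a Q4 format
```

Each function corresponds to one pipeline stage; in hardware the conversion result would either leave the pipeline as the final output or feed a fixed-point operator downstream.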
  • The calculation instruction expressed in formula (3) above likewise takes three input operands and produces one output operand, and it is completed through micro-instructions by a group of pipeline operation circuits of the present disclosure that performs a three-stage pipeline operation (i.e., table lookup + multiply + add).
  • Specifically, for the ternary operation ST(A)*B+C, the SEARCHC micro-instruction is completed by the table look-up circuit in the first-stage pipeline operation to obtain the table lookup result for A. The multiplication of the lookup result and operand B is then performed in the second-stage pipeline operation to obtain the product value. Finally, the ADD micro-instruction completes the addition of the product value and C in the third-stage pipeline operation to obtain the summation result.
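  • The formula (3) pipeline, ST(A)*B+C, can be sketched the same way; the lookup table contents here are purely illustrative assumptions standing in for whatever table the SEARCHC circuit holds.

```python
# Hypothetical table contents for the SEARCHC lookup stage.
LOOKUP_TABLE = {0: 1.0, 1: 0.5, 2: 0.25}

def lut_mul_add(a_index, b, c):
    """Three-stage model of ST(A)*B + C."""
    looked_up = LOOKUP_TABLE[a_index]   # stage 1: SEARCHC table lookup
    product = looked_up * b             # stage 2: multiplication
    return product + c                  # stage 3: ADD

result = lut_mul_add(1, 4.0, 3.0)       # 0.5*4.0 + 3.0 = 5.0
```

Compared with the formula (2) example, only the first stage differs: a table lookup replaces the floating-point multiply, showing how the same pipeline structure serves different calculation instructions.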
  • It can be seen that the calculation instructions of the present disclosure can be flexibly designed and determined according to the requirements of the computation, so that the hardware architecture of the present disclosure, which includes multiple operation pipelines, can be configured and connected according to the calculation instruction and the micro-instructions (or micro-operations) it includes. In this way, a variety of computing operations can be completed by a single calculation instruction, thereby improving instruction execution efficiency and reducing computing overhead.
  • FIG. 4 is a block diagram illustrating a computing system 400 according to an embodiment of the present disclosure.
  • As shown in the figure, in addition to the computing device 200, the computing system 400 also includes a plurality of slave processing circuits 402 and an interconnection unit 404 for connecting the computing device 200 and the plurality of slave processing circuits 402.
  • In one application scenario, the slave processing circuits of the present disclosure may, according to the computing instructions (implemented as, for example, one or more micro-instructions or control signals), perform operations on data that has undergone the preprocessing operations in the computing device, so as to obtain the expected operation results.
  • In some application scenarios, a slave processing circuit may send the intermediate result obtained from its operation (e.g., via the interconnection unit) to the data processing unit in the computing device, so that the data conversion circuit in the data processing unit performs data type conversion on the intermediate result, or the data splicing circuit performs data splitting and splicing operations on it, so as to obtain the final operation result.
  • FIG. 5 is a simplified flow diagram illustrating a method 500 of performing arithmetic operations using a computing device in accordance with an embodiment of the present disclosure. From the foregoing description, it can be understood that the computing device herein may be the computing device described in conjunction with FIGS. 1-4 , which has the illustrated internal connection relationship and supports additional types of operations.
  • the method 500 configures each of the one or more groups of pipeline operation circuits to perform multi-stage pipeline operations, wherein each group of the pipeline operation circuits constitutes a multi-stage operation pipeline, and the multi-stage operation pipeline includes a plurality of operation circuits arranged in stages.
  • the method 500, in response to receiving a plurality of operation instructions, configures each stage of the operation circuit in the multi-stage operation pipeline to execute a corresponding one of the plurality of operation instructions, wherein the plurality of operation instructions are obtained by parsing the calculation instructions received by the computing device.
  • FIG. 6 is a structural diagram illustrating a combined processing apparatus 600 according to an embodiment of the present disclosure.
  • the combined processing device 600 includes a computing processing device 602 , an interface device 604 , other processing devices 606 and a storage device 608 .
  • one or more computing devices 610 may be included in the computing processing device, and the computing devices may be configured to perform the operations described herein in conjunction with FIG. 1 to FIG. 5 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • when multiple computing devices are implemented as artificial intelligence processor cores or as parts of the hardware structure of an artificial intelligence processor core, the computing processing device of the present disclosure, considered by itself, can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • these processors may include, but are not limited to, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a related computing device for artificial intelligence such as neural network operations) and external data and control, performing basic controls including but not limited to data movement and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 702 shown in FIG. 7).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 6 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 706 shown in FIG. 7 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
  • other processing units (such as video codecs) and/or interface modules (such as DRAM interfaces) may also be integrated on the chip.
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 7 .
  • FIG. 7 is a schematic structural diagram illustrating a board 700 according to an embodiment of the present disclosure.
  • the board includes a storage device 704 for storing data, which includes one or more storage units 710 .
  • the storage device can be connected to the control device 708 and the chip 702 described above for connection and data transmission through, for example, a bus.
  • the board also includes an external interface device 706, which is configured for data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 712 (such as a server or a computer, etc.).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • the control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • an electronic device or apparatus, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or more of the above-mentioned combined processing devices.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, webcams, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure, and can also refer to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), ROM and RAM, etc.
  • a computing device comprising:
  • One or more groups of pipeline operation circuits configured to perform multi-stage pipeline operations, wherein each group of the pipeline operation circuits constitutes a multi-stage operation pipeline, and the multi-stage operation pipeline includes a plurality of operation circuits arranged in stages ,
  • each stage of the operation circuit in the multi-stage operation pipeline is configured to execute a corresponding one of the plurality of operation instructions
  • the plurality of operation instructions are obtained by parsing the calculation instructions received by the computing device.
  • Clause 2 The computing device of clause 1, wherein the opcodes of the computing instructions represent a plurality of operations performed by the multi-stage computing pipeline, the computing device further comprising control circuitry configured to obtain and parse the computing instructions to obtain the plurality of operation instructions corresponding to the plurality of operations.
  • Clause 3 The computing device of Clause 2, wherein the opcodes and the plurality of operations they represent are predetermined according to functions supported by a plurality of arithmetic circuits arranged stage by stage in a multi-stage arithmetic pipeline.
  • Clause 4 The computing device of Clause 1, wherein the operation circuits of each stage in the multi-stage operation pipeline are configured to be selectively connected according to the plurality of operation instructions so as to execute the plurality of operation instructions.
  • Clause 5 The computing device of Clause 1, wherein the plurality of sets of pipeline operation circuits constitute a plurality of multi-stage operation pipelines, and the plurality of multi-stage operation pipelines execute respective plurality of operation instructions in parallel.
  • each stage of the arithmetic circuit in the multi-stage arithmetic pipeline has an input and an output, for receiving input data at that stage's arithmetic circuit and outputting the result of that stage's arithmetic circuit operation.
  • Clause 7 The computing device of clause 6, wherein, within a multi-stage arithmetic pipeline, the outputs of the arithmetic circuits of one or more stages are configured to be connected, in accordance with an operation instruction, to the inputs of the arithmetic circuits of another stage or stages so as to execute the operation instruction.
  • Clause 8 The computing device of clause 6, wherein the plurality of multi-stage computation pipelines comprises a first multi-stage computation pipeline and a second multi-stage computation pipeline, wherein the output of the arithmetic circuit of one or more stages of the first multi-stage computation pipeline is configured to be connected, according to the arithmetic instruction, to the input of the arithmetic circuit of one or more stages of the second multi-stage computation pipeline.
  • each stage of arithmetic circuitry includes one or more of the following operators or circuits:
  • a random number processing circuit, an addition/subtraction circuit, a subtraction circuit, a look-up table circuit, a parameter configuration circuit, a multiplier, a pooler, a comparator, an absolute value circuit, a logic operator, a position index circuit or a filter.
  • Clause 10 The computing device of clause 1, further comprising a data processing circuit comprising a type conversion circuit for performing a data type conversion operation and/or a data stitching circuit for performing a data stitching operation.
  • Clause 11 The computing device of clause 10, wherein the type conversion circuit includes one or more converters for enabling conversion of computational data between a plurality of different data types.
  • Clause 12 The computing device of clause 10, wherein the data splicing circuit is configured to split the computing data by a predetermined bit length, and to splice a plurality of data blocks obtained after the splitting in a predetermined order.
  • Each of the one or more groups of pipeline operation circuits is configured to perform a multi-stage pipeline operation, wherein each group of the pipeline operation circuits constitutes a multi-stage operation pipeline, and the multi-stage operation pipeline includes a plurality of arithmetic circuits arranged stage by stage;
  • the plurality of operation instructions are obtained by parsing the calculation instructions received by the computing device.
  • Clause 17 The method of clause 16, wherein the opcodes of the computing instructions represent a plurality of operations performed by the multi-stage computing pipeline, the computing device further comprising control circuitry, the method comprising configuring the control circuitry to obtain and parse the calculation instructions to obtain the plurality of operation instructions corresponding to the plurality of operations.
  • Clause 18 The method of clause 17, wherein the opcode and the plurality of operations it represents are predetermined according to functions supported by a plurality of operational circuits arranged stage by stage in a multi-stage operational pipeline.
  • Clause 19 The method of clause 16, wherein the operational circuits of the stages in the multi-stage operational pipeline are configured to be selectively connected in accordance with the plurality of operational instructions to execute the plurality of operational instructions.
  • Clause 20 The method of Clause 16, wherein the plurality of sets of pipelined operation circuits constitute a plurality of multi-stage operation pipelines, and the plurality of multi-stage operation pipelines execute respective plurality of operation instructions in parallel.
  • each stage of the arithmetic circuit in the multi-stage arithmetic pipeline has an input and an output, for receiving input data at that stage's arithmetic circuit and outputting the result of that stage's arithmetic circuit operation.
  • Clause 22 The method of clause 21, wherein, within a multi-stage arithmetic pipeline, the outputs of the arithmetic circuits of one or more stages are configured to be connected, in accordance with an operation instruction, to the inputs of the arithmetic circuits of another stage or stages so as to execute the operation instruction.
  • Clause 23 The method of clause 21, wherein the plurality of multi-stage operation pipelines comprises a first multi-stage operation pipeline and a second multi-stage operation pipeline, wherein the method configures the output of the arithmetic circuit of one or more stages of the first multi-stage operation pipeline to be connected, according to the arithmetic instruction, to the input of the arithmetic circuit of one or more stages of the second multi-stage operation pipeline.
  • each stage of the operational circuit comprises one or more of the following operators or circuits:
  • a random number processing circuit, an addition/subtraction circuit, a subtraction circuit, a look-up table circuit, a parameter configuration circuit, a multiplier, a pooler, a comparator, an absolute value circuit, a logic operator, a position index circuit or a filter.
  • Clause 25 The method of clause 16, wherein the computing device further comprises a data processing circuit comprising a type conversion circuit for performing a data type conversion operation and/or a data splicing circuit for performing a data splicing operation.
  • Clause 26 The method of clause 25, wherein the type conversion circuit includes one or more converters for enabling conversion of computational data between a plurality of different data types.
  • Clause 27 The method of clause 25, wherein the data splicing circuit is configured to split the calculated data by a predetermined bit length, and to splice the plurality of data blocks obtained after the splitting in a predetermined order.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Advance Control (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

A computing device, an integrated circuit chip, a board card, and a method of performing arithmetic operations using the aforementioned computing device. The computing device may be included in a combined processing apparatus, which may further include a universal interconnection interface and other processing apparatuses. The computing device interacts with the other processing apparatuses to jointly complete a computing operation specified by a user. The combined processing apparatus may further include a storage apparatus, which is connected to the device and the other processing apparatuses respectively and is used to store data of the device and the other processing apparatuses.

Description

Computing device, integrated circuit chip, board card, electronic apparatus, and computing method
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202010619481X, filed on June 30, 2020 and entitled "Computing device, integrated circuit chip, board card, electronic apparatus, and computing method", the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates generally to the field of computing. More specifically, the present disclosure relates to a computing device, an integrated circuit chip, a board card, an electronic apparatus, and a computing method.
Background
In a computing system, an instruction set is the set of instructions used to perform computations and to control the computing system, and plays a key role in improving the performance of computing chips (e.g., processors) in the system. Using their associated instruction sets, current computing chips (especially chips in the artificial intelligence field) can complete various general-purpose or specialized control and data processing operations. However, current instruction sets still suffer from several shortcomings. For example, existing instruction sets are constrained by the hardware architecture and offer poor flexibility. Further, many instructions can only complete a single operation, while executing multiple operations typically requires several instructions, which potentially increases on-chip I/O data throughput. In addition, current instructions leave room for improvement in execution speed, execution efficiency, and the power consumption they impose on the chip.
Summary
To address at least the problems in the prior art described above, the present disclosure provides a hardware architecture with one or more groups of pipeline operation circuits supporting multi-stage pipeline operations. By using this hardware architecture to execute computation instructions, the solution of the present disclosure can obtain technical advantages in several respects, including enhancing the processing performance of the hardware, reducing power consumption, improving the execution efficiency of computing operations, and avoiding computational overhead.
In a first aspect, the present disclosure provides a computing device, comprising: one or more groups of pipeline operation circuits configured to perform multi-stage pipeline operations, wherein each group of the pipeline operation circuits constitutes a multi-stage operation pipeline that includes a plurality of operation circuits arranged stage by stage, wherein, in response to receiving a plurality of operation instructions, each stage of operation circuit in the multi-stage operation pipeline is configured to execute a corresponding one of the plurality of operation instructions, and wherein the plurality of operation instructions are obtained by parsing a computation instruction received by the computing device.
In a second aspect, the present disclosure provides an integrated circuit chip comprising the computing device described above and in the embodiments below.
In a third aspect, the present disclosure provides a board card comprising the integrated circuit chip described above and in the embodiments below.
In a fourth aspect, the present disclosure provides an electronic apparatus comprising the integrated circuit chip described above and in the embodiments below.
In a fifth aspect, the present disclosure provides a method of performing computations using the aforementioned computing device, wherein the computing device comprises one or more groups of pipeline operation circuits, and the method comprises: configuring each of the one or more groups of pipeline operation circuits to perform multi-stage pipeline operations, wherein each group of the pipeline operation circuits constitutes a multi-stage operation pipeline that includes a plurality of operation circuits arranged stage by stage; and, in response to receiving a plurality of operation instructions, configuring each stage of operation circuit in the multi-stage operation pipeline to execute a corresponding one of the plurality of operation instructions, wherein the plurality of operation instructions are obtained by parsing a computation instruction received by the computing device.
By using the computing device, integrated circuit chip, board card, electronic apparatus, and method of the present disclosure, pipeline operations, especially the various multi-stage pipeline operations in the artificial intelligence field, can be executed efficiently. Further, the solution of the present disclosure can achieve efficient arithmetic operations by means of a unique hardware architecture, thereby improving the overall performance of the hardware and reducing computational overhead.
Brief description of the drawings
The above and other objects, features and advantages of the exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown in an exemplary rather than limiting manner, and identical or corresponding reference numerals indicate identical or corresponding parts, in which:
FIG. 1 is a block diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a computing device according to another embodiment of the present disclosure;
FIGS. 3a, 3b and 3c are schematic diagrams illustrating matrix transformations performed by a data conversion circuit according to an embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating a computing system according to an embodiment of the present disclosure;
FIG. 5 is a simplified flowchart illustrating a method of performing arithmetic operations using a computing device according to an embodiment of the present disclosure;
FIG. 6 is a structural diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure; and
FIG. 7 is a schematic structural diagram illustrating a board card according to an embodiment of the present disclosure.
Detailed description
The solution of the present disclosure provides a hardware architecture that supports multi-stage pipeline operations. When this hardware architecture is implemented in a computing device, the computing device includes at least one or more groups of pipeline operation circuits, wherein each group of pipeline operation circuits may constitute one multi-stage operation pipeline of the present disclosure. In this multi-stage operation pipeline, a plurality of operation circuits may be arranged stage by stage. In one embodiment, when a plurality of operation instructions are received, each stage of operation circuit in the aforementioned multi-stage operation pipeline may be configured to execute a corresponding one of the plurality of operation instructions. By means of the hardware architecture and operation instructions of the present disclosure, parallel pipeline operations can be executed efficiently, which expands the application scenarios of computation and reduces computational overhead.
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
FIG. 1 is a block diagram illustrating a computing device 100 according to an embodiment of the present disclosure. As shown in FIG. 1, the computing device 100 may include one or more groups of pipeline operation circuits, such as the illustrated first group of pipeline operation circuits 102, second group of pipeline operation circuits 104, and third group of pipeline operation circuits 106, wherein each group of the pipeline operation circuits may constitute one multi-stage operation pipeline in the context of the present disclosure. Taking the first group of pipeline operation circuits 102, which constitutes the first multi-stage operation pipeline, as an example, it can execute N stages of pipeline operations in total, including the stage 1-1 pipeline operation, the stage 1-2 pipeline operation, the stage 1-3 pipeline operation, ..., and the stage 1-N pipeline operation. Similarly, the second and third groups of pipeline operation circuits also have structures supporting N stages of pipeline operations. Through such an exemplary architecture, those skilled in the art can understand that the multiple groups of pipeline operation circuits of the present disclosure may constitute multiple multi-stage operation pipelines, and the multiple multi-stage operation pipelines may execute their respective operation instructions in parallel.
To execute each stage of pipeline operation described above, an operation circuit including one or more operators may be arranged at each stage to execute the corresponding operation instruction, so as to realize the arithmetic operation at that stage. In one embodiment, in response to receiving a plurality of operation instructions, the one or more groups of pipeline operation circuits of the present disclosure may be configured to perform multi-data operations, for example, to execute single-instruction multiple-data ("SIMD") instructions. In one embodiment, the aforementioned plurality of operation instructions may be obtained by parsing a computation instruction received by the computing device 100, and the opcode of the computation instruction may represent the plurality of operations executed by the multi-stage operation pipeline. In another embodiment, the opcode and the plurality of operations it represents are predetermined according to the functions supported by the plurality of operation circuits arranged stage by stage in the multi-stage operation pipeline.
In the solution of the present disclosure, in addition to executing the stage-by-stage operations within the multi-stage operation pipeline it constitutes, each group of pipeline operation circuits may also be configured to be selectively connected according to a plurality of operation instructions, so as to complete the corresponding plurality of operation instructions. In one implementation scenario, the multiple multi-stage operation pipelines of the present disclosure may include a first multi-stage operation pipeline and a second multi-stage operation pipeline, wherein the output of the operation circuit of one or more stages of the first multi-stage operation pipeline is configured to be connected, according to the operation instruction, to the input of the operation circuit of one or more stages of the second multi-stage operation pipeline. For example, the stage 1-2 pipeline operation in the illustrated first multi-stage operation pipeline may, according to the operation instruction, feed its operation result into the stage 2-3 pipeline operation in the second multi-stage operation pipeline. Similarly, the stage 2-1 pipeline operation in the illustrated second multi-stage operation pipeline may, according to the operation instruction, feed its operation result into the stage 3-3 pipeline operation in the third multi-stage operation pipeline. In some scenarios, depending on the operation instruction, two pipeline stages in different pipelines may pass operation results bidirectionally, for example between the illustrated stage 2-2 pipeline operation in the second multi-stage operation pipeline and the stage 3-2 pipeline operation in the third multi-stage operation pipeline.
As can be seen from the above, in order to pass data within the same operation pipeline and between different operation pipelines, each stage of operation circuit in the multiple groups of operation pipelines of the present disclosure may have an input for receiving input data at that operation circuit and an output for outputting the result of that stage's operation. Within one multi-stage operation pipeline, the output of the operation circuit of one or more stages is configured to be connected, according to the operation instruction, to the input of the operation circuit of another stage or other stages, so as to execute the operation instruction. For example, within the first operation pipeline, the result of the stage 1-1 pipeline operation may, according to the operation instruction, be fed into the stage 1-3 pipeline operation of that pipeline.
In the context of the present disclosure, the aforementioned plurality of operation instructions may be microinstructions or control signals running inside the computing device (or processing circuit, or processor), and may include (or indicate) one or more arithmetic operations to be executed by the computing device. Depending on the operation scenario, the arithmetic operations may include, but are not limited to, addition, multiplication, convolution, pooling, and various other operations. To realize multi-stage pipeline operations, each stage of operation circuit executing each stage of pipeline operation may include, but is not limited to, one or more of the following operators or circuits: a random number processing circuit, an addition/subtraction circuit, a subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a pooler, a comparator, an absolute value circuit, a logic operator, a position index circuit, or a filter. Taking the pooler as an example, it may illustratively be composed of operators such as an adder, a divider and a comparator, so as to execute pooling operations in a neural network.
To realize multi-stage pipeline operations, the present disclosure may also provide corresponding computation instructions according to the operations supported by the operation circuits in the multi-stage pipeline. Depending on the operation scenario, the computation instruction of the present disclosure may include multiple opcodes, which may represent the multiple operations executed by the operation circuits. For example, when N=4 in FIG. 1 (i.e., when a 4-stage pipeline operation is executed), a computation instruction according to the solution of the present disclosure may be expressed as in the following formula (1):
Result = ((((src0 op0 src1) op1 src2) op2 src3) op3 src4)      (1)
where src0 to src4 are source operands and op0 to op3 are opcodes. Depending on the pipeline operation circuit architecture and the supported operations, the type, order and number of opcodes of the computation instruction of the present disclosure may vary.
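The nested instruction format of formula (1) can be illustrated with a minimal Python sketch: each pipeline stage applies one opcode to the previous stage's result and the next source operand. The operand values and the particular operators chosen below are illustrative assumptions, not part of the disclosure.

```python
import operator

def run_pipeline(srcs, ops):
    """Evaluate ((((src0 op0 src1) op1 src2) op2 src3) op3 src4):
    each stage consumes the previous stage's output and one new operand."""
    assert len(srcs) == len(ops) + 1
    acc = srcs[0]
    for op, src in zip(ops, srcs[1:]):
        acc = op(acc, src)  # one pipeline stage
    return acc

# Illustrative 4-stage example: ((((2 + 3) * 4) - 5) * 6) = 90
result = run_pipeline([2, 3, 4, 5, 6],
                      [operator.add, operator.mul, operator.sub, operator.mul])
```

In hardware, of course, the stages operate concurrently on a stream of data; the fold above only models the data dependency between stages.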
In some application scenarios, the multi-stage pipeline operation of the present disclosure can support unary operations (i.e., cases with only one item of input data). Taking the arithmetic operation at the scale layer + relu layer in a neural network as an example, suppose the computation instruction to be executed is expressed as result=relu(a*ina+b), where ina is the input data (which may, for example, be a vector or a matrix) and a and b are both operation constants. For this computation instruction, a group of three-stage pipeline operation circuits of the present disclosure, including a multiplier, an adder and a nonlinear operator, may be applied to perform the operation. Specifically, the multiplier of the first pipeline stage may be used to compute the product of the input data ina and a, to obtain the first-stage pipeline operation result. Next, the adder of the second pipeline stage may be used to perform an addition on this first-stage result (a*ina) and b to obtain the second-stage pipeline operation result. Finally, the relu activation function of the third pipeline stage may be used to perform an activation operation on this second-stage result (a*ina+b), to obtain the final operation result result.
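The scale+relu example above can be sketched as three chained stage functions. This is a software model only (the example input and constants are assumptions); the hardware pipeline would execute these stages concurrently on streaming data.

```python
# Stage 1: multiplier — compute a * ina elementwise
def stage1_mul(ina, a):
    return [x * a for x in ina]

# Stage 2: adder — add the constant b to the first-stage result
def stage2_add(prod, b):
    return [x + b for x in prod]

# Stage 3: nonlinear operator — ReLU activation
def stage3_relu(s):
    return [x if x > 0 else 0.0 for x in s]

ina = [-2.0, 0.5, 3.0]   # illustrative input vector
a, b = 2.0, 1.0          # illustrative operation constants
result = stage3_relu(stage2_add(stage1_mul(ina, a), b))
# result == [0.0, 2.0, 7.0]
```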
In some application scenarios, the multi-stage pipeline operation circuits of the present disclosure can support binary operations (for example the convolution computation instruction result=conv(ina,inb)) or ternary operations (for example the convolution computation instruction result=conv(ina,inb,bias)), where the input data ina, inb and bias may be vectors (for example integer, fixed-point or floating-point data) or matrices. Taking the convolution computation instruction result=conv(ina,inb) as an example, the convolution operation expressed by this computation instruction may be executed using a plurality of multipliers, at least one adder tree and at least one nonlinear operator included in a three-stage pipeline operation circuit structure, where the two input data ina and inb may, for example, be neuron data. Specifically, the first-stage pipeline multipliers in the three-stage pipeline operation circuits may first be used to compute the first-stage pipeline operation result product=ina*inb (regarded as one microinstruction in the operation instructions, corresponding to the multiplication operation). Then the adder tree in the second-stage pipeline operation circuit may be used to perform a summation on the first-stage result "product" to obtain the second-stage pipeline operation result sum. Finally, the nonlinear operator of the third-stage pipeline operation circuit performs an activation operation on "sum", thereby obtaining the final convolution operation result.
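A software model of this three-stage pipeline (multipliers, adder tree, nonlinear operator) might look as follows. The pairwise reduction in `adder_tree` and the ReLU-style `activate` are illustrative assumptions; the disclosure does not fix the tree topology or the activation function.

```python
def multiply_stage(ina, inb):
    # Stage 1: elementwise products, one multiplier per element pair
    return [x * y for x, y in zip(ina, inb)]

def adder_tree(vals):
    # Stage 2: pairwise reduction, as a hardware adder tree would sum
    vals = list(vals)
    while len(vals) > 1:
        pairs = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:           # odd element carries over to the next level
            pairs.append(vals[-1])
        vals = pairs
    return vals[0]

def activate(x):
    # Stage 3: nonlinear operator (ReLU assumed here)
    return max(x, 0.0)

ina = [1.0, 2.0, -1.0, 0.5]   # illustrative neuron data
inb = [0.5, 1.0, 2.0, 4.0]
out = activate(adder_tree(multiply_stage(ina, inb)))
# out == 2.5
```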
In some application scenarios, as mentioned above, the solution of the present disclosure may bypass one or more stages of pipeline operation circuits that will not be used in an arithmetic operation; that is, one or more stages of the multi-stage pipeline operation circuits may be used selectively according to the needs of the arithmetic operation, without requiring the operation to pass through all stages of the multi-stage pipeline. Taking the operation of computing a Euclidean distance as an example, suppose its computation instruction is expressed as dis=sum((ina-inb)^2); then only the several stages of pipeline operation circuits composed of an adder, a multiplier, an adder tree and an accumulator may be used to perform the operation and obtain the final result, while unused pipeline operation circuits may be bypassed before or during the pipeline operation.
FIG. 2 is a block diagram illustrating a computing device 200 according to another embodiment of the present disclosure. As can be seen from the figure, in addition to having the same two groups of pipeline operation circuits 102 and 104 as the computing device 100, the computing device 200 additionally includes a control circuit 202 and a data processing circuit 204. In one embodiment, the control circuit 202 may be configured to obtain the computation instruction described above and parse it to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the opcodes, for example as expressed by formula (1).
In one embodiment, the data processing unit 204 may include a data conversion circuit 206 and a data splicing circuit 208. When the computation instruction includes a pre-processing operation for the pipeline operation, such as a data conversion operation or a data splicing operation, the data conversion circuit 206 or the data splicing circuit 208 will execute the corresponding conversion or splicing operation according to the corresponding computation instruction. The conversion and splicing operations are illustrated by examples below.
As for the data conversion operation, when the bit width of the data input to the data conversion circuit is relatively high (for example, a data bit width of 1024 bits), the data conversion circuit may, according to the operation requirements, convert the input data into data of a lower bit width (for example, output data with a bit width of 512 bits). Depending on the application scenario, the data conversion circuit can support conversion between multiple data types, for example conversion between data types of different bit widths such as FP16 (16-bit floating point), FP32 (32-bit floating point), FIX8 (8-bit fixed point), FIX4 (4-bit fixed point) and FIX16 (16-bit fixed point). When the data input to the data conversion circuit is a matrix, the data conversion operation may be a transformation of the arrangement positions of the matrix elements. The transformation may include, for example, matrix transposition and mirroring (described later with reference to FIGS. 3a-3c), matrix rotation by a predetermined angle (for example 90, 180 or 270 degrees), and conversion of matrix dimensions.
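As one illustration of a floating-point-to-fixed-point conversion such a circuit might perform, the sketch below quantizes FP values to signed 8-bit fixed point (FIX8). The choice of 4 fractional bits and the round-and-saturate behavior are assumptions for illustration; the disclosure does not specify the fixed-point format.

```python
def fp_to_fix(values, frac_bits=4, width=8):
    """Quantize floats to signed fixed-point integers with `frac_bits`
    fractional bits, saturating to the representable range."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    scale = 1 << frac_bits
    return [max(lo, min(hi, round(v * scale))) for v in values]

# 1.5 -> 24, -0.25 -> -4, and 100.0 saturates to 127
fixed = fp_to_fix([1.5, -0.25, 100.0])
```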
As for the data splicing operation, the data splicing circuit may, for example, perform operations such as odd-even splicing on data blocks extracted from the data according to a bit length set in an instruction. For example, when the data bit length is 32 bits, the data splicing circuit may divide the data into 8 data blocks numbered 1 to 8 in units of 4-bit width, then splice together the four data blocks 1, 3, 5 and 7, and splice together the four data blocks 2, 4, 6 and 8 for use in the operation.
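The 32-bit odd/even splicing described above can be modeled as follows. Bit-level layout choices, such as block 1 occupying the lowest 4 bits of the word, are assumptions for illustration.

```python
def split_and_splice(word, block_bits=4, word_bits=32):
    """Split `word` into blocks (block 1 = lowest bits) and splice the
    odd-numbered and even-numbered blocks into two separate values."""
    n = word_bits // block_bits
    mask = (1 << block_bits) - 1
    blocks = [(word >> (i * block_bits)) & mask for i in range(n)]  # blocks 1..n

    def pack(sel):
        out = 0
        for j, blk in enumerate(sel):
            out |= blk << (j * block_bits)
        return out

    odd = pack(blocks[0::2])   # blocks 1, 3, 5, 7
    even = pack(blocks[1::2])  # blocks 2, 4, 6, 8
    return odd, even

# Nibbles of 0x87654321, low to high, are 1..8:
# odd blocks splice to 0x7531, even blocks to 0x8642.
odd, even = split_and_splice(0x87654321)
```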
In other application scenarios, the above data splicing operation may also be performed on data M (which may, for example, be a vector) obtained after performing an operation. Suppose the data splicing circuit first splits the lower 256 bits of the even rows of data M in units of 8 bits, to obtain 32 even-row unit data (denoted M_2i_0 to M_2i_31 respectively). Similarly, the lower 256 bits of the odd rows of data M may also be split in units of 8 bits, to obtain 32 odd-row unit data (denoted M_(2i+1)_0 to M_(2i+1)_31 respectively). Further, the split 32 odd-row unit data and 32 even-row unit data are arranged alternately in order from low bits to high bits, with the even row preceding the odd row. Specifically, even-row unit data 0 (M_2i_0) is placed at the low bits, followed in order by odd-row unit data 0 (M_(2i+1)_0). Next, even-row unit data 1 (M_2i_1) is placed, and so on. When the placement of odd-row unit data 31 (M_(2i+1)_31) is completed, the 64 unit data are spliced together to form one new 512-bit-wide data.
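The even/odd row interleaving into one 512-bit word can likewise be modeled on Python integers; the byte ordering in the result follows the low-to-high, even-before-odd description above, and the small example inputs are assumptions.

```python
def interleave_rows(even_low256, odd_low256):
    """Split each 256-bit value into 32 8-bit units and interleave them,
    lowest unit first, even-row unit before odd-row unit, into 512 bits."""
    def units(v):
        return [(v >> (8 * i)) & 0xFF for i in range(32)]

    out = 0
    for i, (e, o) in enumerate(zip(units(even_low256), units(odd_low256))):
        out |= e << (16 * i)       # even-row unit i at the lower byte of the pair
        out |= o << (16 * i + 8)   # odd-row unit i directly above it
    return out

# Even-row units [0x01, 0x02, ...], odd-row units [0x03, 0x04, ...]
# interleave to bytes 01, 03, 02, 04 (low to high) = 0x04020301.
word512 = interleave_rows(0x0201, 0x0403)
```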
Depending on the application scenario, the data conversion circuit and the data splicing circuit in the data processing unit may be used in cooperation, so as to perform pre-processing or post-processing of data more flexibly. For example, according to the different operations included in the computation instruction, the data processing unit may perform only data conversion without the data splicing operation, only the data splicing operation without data conversion, or both data conversion and the data splicing operation. In some scenarios, when the computation instruction does not include a pre-processing operation for the pipeline operation, the data processing unit may be configured to disable the data conversion circuit and the data splicing circuit. In other scenarios, when the computation instruction includes a post-processing operation for the pipeline operation, the data processing unit may be configured to enable the data conversion circuit and the data splicing circuit to perform post-processing on the intermediate result data, thereby obtaining the final operation result.
To implement data storage operations, the computing device 200 further includes a storage circuit 210. In one implementation scenario, the storage circuit of the present disclosure may include a main storage module and/or a main cache module, wherein the main storage module is configured to store the data used to perform the multi-stage pipeline operation and the operation result after the operation is performed, and the main cache module is configured to cache the intermediate operation results produced during the multi-stage pipeline operation. Further, the storage circuit may also have an interface for data transfer with an off-chip storage medium, so that data movement between the on-chip system and the off-chip system can be realized.
FIGS. 3a, 3b and 3c are schematic diagrams illustrating matrix transformations performed by a data conversion circuit according to an embodiment of the present disclosure. To better understand the conversion operations performed by the data conversion circuit 206, the transposition operation and horizontal mirroring operation performed on an original matrix are further described below as examples.
As shown in FIG. 3a, the original matrix is a matrix of (M+1) rows × (N+1) columns. According to the requirements of the application scenario, the data conversion circuit may perform a transposition on the original matrix shown in FIG. 3a to obtain the matrix shown in FIG. 3b. Specifically, the data conversion circuit may swap the row index and column index of each element in the original matrix to form the transposed matrix. For example, the element "10" whose coordinates are row 1, column 0 in the original matrix shown in FIG. 3a has coordinates of row 0, column 1 in the transposed matrix shown in FIG. 3b. By analogy, the element "M0" whose coordinates are row M+1, column 0 in the original matrix shown in FIG. 3a has coordinates of row 0, column M+1 in the transposed matrix shown in FIG. 3b.
As shown in FIG. 3c, the data conversion circuit may perform a horizontal mirroring operation on the original matrix shown in FIG. 3a to form a horizontally mirrored matrix. Specifically, the data conversion circuit may, through the horizontal mirroring operation, convert the arrangement order of the original matrix from first-row element to last-row element into the arrangement order from last-row element to first-row element, while the column indices of the elements in the original matrix remain unchanged. For example, the element "00" at row 0, column 0 and the element "10" at row 1, column 0 in the original matrix shown in FIG. 3a have coordinates of row M+1, column 0 and row M, column 0, respectively, in the horizontally mirrored matrix shown in FIG. 3c. By analogy, the element "M0" at row M+1, column 0 in the original matrix shown in FIG. 3a has coordinates of row 0, column 0 in the horizontally mirrored matrix shown in FIG. 3c.
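Both matrix transformations have direct software analogues; the sketch below models them on a small illustrative matrix in plain Python (the element values mimic the "row-column" labels used in FIGS. 3a-3c).

```python
def transpose(m):
    # Swap the row index and column index of every element
    return [list(row) for row in zip(*m)]

def horizontal_mirror(m):
    # Reverse the row order; column positions stay unchanged
    return m[::-1]

m = [[0, 1],     # element "00", "01"
     [10, 11],   # element "10", "11"
     [20, 21]]   # element "20", "21"
t = transpose(m)          # [[0, 10, 20], [1, 11, 21]]
h = horizontal_mirror(m)  # [[20, 21], [10, 11], [0, 1]]
```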
Based on the hardware architecture described above, the computing device of the present disclosure can execute computation instructions that include the aforementioned pre-processing and post-processing. Two illustrative examples of computation instructions according to the solution of the present disclosure are given below:
Example 1: MUAD=(FPMULT)+(FPADD/FPSUB)+(RELU)+(CONVERTFP2FIX)       (2)
The computation instruction expressed in formula (2) above takes three operands as input and outputs one operand, and it includes microinstructions that can be completed by one group of pipeline operation circuits of the present disclosure comprising a three-stage pipeline operation (i.e., multiply + add/subtract + activate). Specifically, the ternary operation is A*B+C, where the FPMULT microinstruction completes the floating-point multiplication between operands A and B to obtain the product value, i.e., the first-stage pipeline operation. Next, the FPADD or FPSUB microinstruction is executed to complete the floating-point addition or subtraction of the aforementioned product value and C, to obtain the sum or difference result, i.e., the second-stage pipeline operation. Then, the activation operation RELU may be performed on the preceding-stage result, i.e., the third-stage pipeline operation. After this three-stage pipeline operation, the microinstruction CONVERTFP2FIX may finally be executed by the type conversion circuit described above, so as to convert the type of the result data after the activation operation from floating point to fixed point, to be output as the final result or to be input as an intermediate result into a fixed-point operator for further computation.
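A software model of the MUAD instruction of formula (2) chains the four microinstructions (multiply, add, ReLU, then float-to-fixed conversion). The 4-fractional-bit fixed-point format in the final conversion step is an assumption for illustration.

```python
def muad(a, b, c, frac_bits=4):
    prod = a * b                        # FPMULT: floating-point multiply
    s = prod + c                        # FPADD: floating-point add (FPSUB would subtract)
    r = max(s, 0.0)                     # RELU: activation
    return round(r * (1 << frac_bits))  # CONVERTFP2FIX: float -> fixed point

# relu(1.5*2.0 - 1.0) = 2.0 -> fixed-point 32 (with 4 fractional bits)
out = muad(1.5, 2.0, -1.0)
```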
Example 2: SECMUADC=SEARCHC+MULT+ADD    (3)
The computation instruction expressed in formula (3) above takes three operands as input and outputs one operand, and it includes microinstructions that can be completed by one group of pipeline operation circuits of the present disclosure comprising a three-stage pipeline operation (i.e., table lookup + multiply + add). Specifically, the ternary operation is ST(A)*B+C, where the SEARCHC microinstruction may be completed by the table lookup circuit in the first-stage pipeline operation, to obtain the table lookup result A. Next, the multiplication between operands A and B is completed by the second-stage pipeline operation to obtain the product value. Then, the ADD microinstruction is executed to complete the addition of the aforementioned product value and C, to obtain the summation result, i.e., the third-stage pipeline operation.
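The SECMUADC instruction of formula (3) can be modeled the same way; the contents of the lookup table below are purely hypothetical stand-ins for the stage-1 table-lookup circuit.

```python
# Hypothetical lookup table standing in for the table lookup circuit
TABLE = {0: 1.0, 1: 0.5, 2: 0.25}

def secmuadc(index, b, c):
    a = TABLE[index]   # SEARCHC: stage-1 table lookup -> A
    prod = a * b       # stage-2 multiply: A * B
    return prod + c    # ADD: stage-3 add -> A*B + C

# secmuadc(1, 4.0, 1.0): TABLE[1]*4.0 + 1.0 = 0.5*4.0 + 1.0 = 3.0
res = secmuadc(1, 4.0, 1.0)
```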
As mentioned above, the computation instructions of the present disclosure can be flexibly designed and determined according to the requirements of the computation, so that the hardware architecture of the present disclosure including multiple operation pipelines can be designed and connected according to the computation instruction and the various microinstructions (or micro-operations) it includes. In this way, multiple arithmetic operations can be completed by a single computation instruction, thereby improving the execution efficiency of instructions and reducing computational overhead.
FIG. 4 is a block diagram illustrating a computing system 400 according to an embodiment of the present disclosure. As can be seen from the figure, in addition to including the computing device 200, the computing system also includes a plurality of slave processing circuits 402 and an interconnection unit 404 for connecting the computing device 200 and the plurality of slave processing circuits 402.
In one operation scenario, the slave processing circuit of the present disclosure may, according to a computation instruction (implemented, for example, as one or more microinstructions or control signals), perform operations on the data on which the computing device has performed pre-processing operations, so as to obtain the expected operation result. In another operation scenario, the slave processing circuit may send the intermediate result obtained after its operation (for example via the interconnection unit) to the data processing unit in the computing device, so that the data conversion circuit in the data processing unit performs data type conversion on the intermediate result, or so that the data splicing circuit in the data processing unit performs data splitting and splicing operations on the intermediate result, thereby obtaining the final operation result.
FIG. 5 is a simplified flowchart illustrating a method 500 of performing arithmetic operations using a computing device according to an embodiment of the present disclosure. From the foregoing description, it can be understood that the computing device here may be the computing device described in conjunction with FIGS. 1-4, which has the illustrated internal connection relationships and supports additional types of operations.
As shown in FIG. 5, at step 502, the method 500 configures each of the one or more groups of pipeline operation circuits to perform multi-stage pipeline operations, wherein each group of the pipeline operation circuits constitutes a multi-stage operation pipeline that includes a plurality of operation circuits arranged stage by stage. Next, at step 504, in response to receiving a plurality of operation instructions, the method 500 configures each stage of operation circuit in the multi-stage operation pipeline to execute a corresponding one of the plurality of operation instructions, wherein the plurality of operation instructions are obtained by parsing a computation instruction received by the computing device.
For the sake of brevity, the computing method of the present disclosure has been described above only in conjunction with FIG. 5. Based on the disclosure herein, those skilled in the art will also appreciate that the method may include more steps, and that the execution of these steps may realize the various operations of the present disclosure described above in conjunction with FIGS. 1-4, which will not be repeated here.
FIG. 6 is a structural diagram illustrating a combined processing apparatus 600 according to an embodiment of the present disclosure. As shown in FIG. 6, the combined processing apparatus 600 includes a computing processing apparatus 602, an interface apparatus 604, another processing apparatus 606 and a storage apparatus 608. Depending on the application scenario, the computing processing apparatus may include one or more computing devices 610, and the computing devices may be configured to perform the operations described herein in conjunction with FIGS. 1-5.
In different embodiments, the computing processing apparatus of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing apparatus may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included in the computing processing apparatus may be implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or parts of the hardware structure of an artificial intelligence processor core, the computing processing apparatus of the present disclosure, considered by itself, may be regarded as having a single-core structure or a homogeneous multi-core structure.
在示例性的操作中,本披露的计算处理装置可以通过接口装置与其他处理装置进行交互,以共同完成用户指定的操作。根据实现方式的不同,本披露的其他处理装置可以包括中央处理器(Central Processing Unit,CPU)、图形处理器(Graphics Processing Unit,GPU)、人工智能处理器等通用和/或专用处理器中的一种或多种类型的处理器。 这些处理器可以包括但不限于数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算处理装置和其他处理装置共同考虑时,二者可以视为形成异构多核结构。
在一个或多个实施例中,该其他处理装置可以作为本披露的计算处理装置(其可以具体化为人工智能例如神经网络运算的相关运算装置)与外部数据和控制的接口,执行包括但不限于数据搬运、对计算装置的开启和/或停止等基本控制。在另外的实施例中,其他处理装置也可以和该计算处理装置协作以共同完成运算任务。
在一个或多个实施例中,该接口装置可以用于在计算处理装置与其他处理装置间传输数据和控制指令。例如,该计算处理装置可以经由所述接口装置从其他处理装置中获取输入数据,写入该计算处理装置片上的存储装置(或称存储器)。进一步,该计算处理装置可以经由所述接口装置从其他处理装置中获取控制指令,写入计算处理装置片上的控制缓存中。替代地或可选地,接口装置也可以读取计算处理装置的存储装置中的数据并传输给其他处理装置。
附加地或可选地,本披露的组合处理装置还可以包括存储装置。如图中所示,该存储装置分别与所述计算处理装置和所述其他处理装置连接。在一个或多个实施例中,存储装置可以用于保存所述计算处理装置和/或所述其他处理装置的数据。例如,该数据可以是在计算处理装置或其他处理装置的内部或片上存储装置中无法全部保存的数据。
In some embodiments, the present disclosure also discloses a chip (e.g., chip 702 shown in FIG. 7). In one implementation, the chip is a system on chip (SoC) integrating one or more combined processing apparatuses as shown in FIG. 6. The chip may be connected to other related components through an external interface apparatus (e.g., external interface apparatus 706 shown in FIG. 7). The related components may be, for example, a camera, a display, a mouse, a keyboard, a network card or a WiFi interface. In some application scenarios, other processing units (e.g., a video codec) and/or interface modules (e.g., a DRAM interface) may also be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the above chip. In some embodiments, the present disclosure also discloses a board card including the above chip package structure. The board card is described in detail below in connection with FIG. 7.
FIG. 7 is a structural schematic diagram illustrating a board card 700 according to an embodiment of the present disclosure. As shown in FIG. 7, the board card includes a storage device 704 for storing data, which includes one or more storage units 710. The storage device may be connected to, and transfer data with, the control device 708 and the above-described chip 702 via, for example, a bus. Further, the board card includes an external interface apparatus 706 configured for data relay or forwarding between the chip (or a chip in a chip package structure) and an external device 712 (e.g., a server or a computer). For example, data to be processed may be transferred from the external device to the chip via the external interface apparatus. As another example, the computation result of the chip may be transferred back to the external device via the external interface apparatus. Depending on the application scenario, the external interface apparatus may take different interface forms, for example a standard PCIe interface.
In one or more embodiments, the control device in the board card of the present disclosure may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a micro controller unit (MCU) for regulating the working state of the chip.
From the above description in connection with FIGS. 6 and 7, those skilled in the art will understand that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above board cards, one or more of the above chips and/or one or more of the above combined processing apparatuses.
Depending on the application scenario, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance and/or a medical device. The vehicle includes an airplane, a ship and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; and the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound scanner and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may further be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and healthcare. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge and terminal application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to the solution of the present disclosure may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a webcam). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are mutually compatible, so that suitable hardware resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal device and/or edge device, to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling and collaborative work of end-cloud integration or cloud-edge-end integration.
It should be noted that, for brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations of actions, but those skilled in the art will understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present disclosure, those skilled in the art will understand that some of the steps may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the embodiments described in the present disclosure may be regarded as optional embodiments, i.e., the actions or modules involved therein are not necessarily required for the implementation of one or some solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in the present disclosure have different emphases. In view of this, those skilled in the art will understand that, for parts not described in detail in one embodiment of the present disclosure, reference may be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching of the present disclosure, those skilled in the art will understand that several embodiments disclosed herein may also be implemented in other ways not disclosed herein. For example, the units in the electronic device or apparatus embodiments described above are divided herein on the basis of logical functions, and there may be other ways of division in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above in connection with the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections using interfaces, where the communication interfaces may support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be located at the same position or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may physically exist separately.
In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solution of the present disclosure is embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions to cause a computer device (e.g., a personal computer, a server or a network device) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, various media that can store program code, such as a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
In some other implementation scenarios, the above integrated units may also be implemented in the form of hardware, i.e., concrete hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., the computing apparatus or the other processing apparatus) may be implemented by appropriate hardware processors, such as a CPU, GPU, FPGA, DSP and ASIC. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including magnetic or magneto-optical storage media), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), ROM or RAM.
The foregoing may be better understood in light of the following clauses:
Clause 1. A computing apparatus, comprising:
one or more groups of pipeline operation circuits configured to perform multi-stage pipeline operations, wherein each group of the pipeline operation circuits constitutes one multi-stage operation pipeline, and the multi-stage operation pipeline comprises a plurality of operation circuits arranged stage by stage,
wherein, in response to receiving a plurality of operation instructions, each stage of operation circuit in the multi-stage operation pipeline is configured to execute a corresponding one of the plurality of operation instructions,
wherein the plurality of operation instructions are obtained by parsing a computation instruction received by the computing apparatus.
Clause 2. The computing apparatus of clause 1, wherein an opcode of the computation instruction represents a plurality of operations to be performed by the multi-stage operation pipeline, and the computing apparatus further comprises a control circuit configured to obtain and parse the computation instruction so as to obtain the plurality of operation instructions corresponding to the plurality of operations.
Clause 3. The computing apparatus of clause 2, wherein the opcode and the plurality of operations it represents are predetermined according to the functions supported by the plurality of operation circuits arranged stage by stage in the multi-stage operation pipeline.
Clause 4. The computing apparatus of clause 1, wherein the stages of operation circuits in the multi-stage operation pipeline are configured to be selectively connected according to the plurality of operation instructions so as to execute the plurality of operation instructions.
Clause 5. The computing apparatus of clause 1, wherein the plurality of groups of pipeline operation circuits constitute a plurality of multi-stage operation pipelines, and the plurality of multi-stage operation pipelines execute their respective pluralities of operation instructions in parallel.
Clause 6. The computing apparatus of clause 1 or 5, wherein each stage of operation circuit in the multi-stage operation pipeline has an input and an output for receiving input data at that stage of operation circuit and outputting the result of that stage of operation circuit's operation.
Clause 7. The computing apparatus of clause 6, wherein, within one multi-stage operation pipeline, the outputs of the operation circuits of one or more stages are configured to be connected, according to an operation instruction, to the inputs of the operation circuits of one or more other stages so as to execute the operation instruction.
Clause 8. The computing apparatus of clause 6, wherein the plurality of multi-stage operation pipelines include a first multi-stage operation pipeline and a second multi-stage operation pipeline, and the outputs of the operation circuits of one or more stages of the first multi-stage operation pipeline are configured to be connected, according to the operation instruction, to the inputs of the operation circuits of one or more stages of the second multi-stage operation pipeline.
Clause 9. The computing apparatus of clause 1, wherein each stage of operation circuit includes one or more of the following operators or circuits:
a random number processing circuit, an addition-subtraction circuit, a subtraction circuit, a table-lookup circuit, a parameter configuration circuit, a multiplier, a pooler, a comparator, an absolute-value circuit, a logic operator, a position index circuit or a filter.
Clause 10. The computing apparatus of clause 1, further comprising a data processing circuit that includes a type conversion circuit for performing data type conversion operations and/or a data concatenation circuit for performing data concatenation operations.
Clause 11. The computing apparatus of clause 10, wherein the type conversion circuit includes one or more converters for converting computation data between a plurality of different data types.
Clause 12. The computing apparatus of clause 10, wherein the data concatenation circuit is configured to split computation data by a predetermined bit length and to concatenate the plurality of data blocks obtained from the splitting in a predetermined order.
Clause 13. An integrated circuit chip comprising the computing apparatus of any one of clauses 1-12.
Clause 14. A board card comprising the integrated circuit chip of clause 13.
Clause 15. An electronic device comprising the integrated circuit chip of clause 13.
Clause 16. A method of performing computation operations using a computing apparatus, wherein the computing apparatus includes one or more groups of pipeline operation circuits, the method comprising:
configuring each of the one or more groups of pipeline operation circuits to perform a multi-stage pipeline operation, wherein each group of the pipeline operation circuits constitutes one multi-stage operation pipeline, and the multi-stage operation pipeline comprises a plurality of operation circuits arranged stage by stage; and
in response to receiving a plurality of operation instructions, configuring each stage of operation circuit in the multi-stage operation pipeline to execute a corresponding one of the plurality of operation instructions,
wherein the plurality of operation instructions are obtained by parsing a computation instruction received by the computing apparatus.
Clause 17. The method of clause 16, wherein an opcode of the computation instruction represents a plurality of operations to be performed by the multi-stage operation pipeline, the computing apparatus further comprises a control circuit, and the method includes configuring the control circuit to obtain and parse the computation instruction so as to obtain the plurality of operation instructions corresponding to the plurality of operations.
Clause 18. The method of clause 17, wherein the opcode and the plurality of operations it represents are predetermined according to the functions supported by the plurality of operation circuits arranged stage by stage in the multi-stage operation pipeline.
Clause 19. The method of clause 16, wherein the stages of operation circuits in the multi-stage operation pipeline are configured to be selectively connected according to the plurality of operation instructions so as to execute the plurality of operation instructions.
Clause 20. The method of clause 16, wherein the plurality of groups of pipeline operation circuits constitute a plurality of multi-stage operation pipelines, and the plurality of multi-stage operation pipelines execute their respective pluralities of operation instructions in parallel.
Clause 21. The method of clause 16 or 20, wherein each stage of operation circuit in the multi-stage operation pipeline has an input and an output for receiving input data at that stage of operation circuit and outputting the result of that stage of operation circuit's operation.
Clause 22. The method of clause 21, wherein, within one multi-stage operation pipeline, the outputs of the operation circuits of one or more stages are configured to be connected, according to an operation instruction, to the inputs of the operation circuits of one or more other stages so as to execute the operation instruction.
Clause 23. The method of clause 21, wherein the plurality of multi-stage operation pipelines include a first multi-stage operation pipeline and a second multi-stage operation pipeline, and the method configures the outputs of the operation circuits of one or more stages of the first multi-stage operation pipeline to be connected, according to the operation instruction, to the inputs of the operation circuits of one or more stages of the second multi-stage operation pipeline.
Clause 24. The method of clause 16, wherein each stage of operation circuit includes one or more of the following operators or circuits:
a random number processing circuit, an addition-subtraction circuit, a subtraction circuit, a table-lookup circuit, a parameter configuration circuit, a multiplier, a pooler, a comparator, an absolute-value circuit, a logic operator, a position index circuit or a filter.
Clause 25. The method of clause 16, wherein the computing apparatus further comprises a data processing circuit that includes a type conversion circuit for performing data type conversion operations and/or a data concatenation circuit for performing data concatenation operations.
Clause 26. The method of clause 25, wherein the type conversion circuit includes one or more converters for converting computation data between a plurality of different data types.
Clause 27. The method of clause 25, wherein the data concatenation circuit is configured to split computation data by a predetermined bit length and to concatenate the plurality of data blocks obtained from the splitting in a predetermined order.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes and substitutions will occur to those skilled in the art without departing from the idea and spirit of the present disclosure. It should be understood that, in practicing the present disclosure, various alternatives to the embodiments described herein may be employed. The appended claims are intended to define the scope of protection of the present disclosure and thereby cover equivalents and alternatives within the scope of these claims.

Claims (27)

  1. A computing apparatus, comprising:
    one or more groups of pipeline operation circuits configured to perform multi-stage pipeline operations, wherein each group of the pipeline operation circuits constitutes one multi-stage operation pipeline, and the multi-stage operation pipeline comprises a plurality of operation circuits arranged stage by stage,
    wherein, in response to receiving a plurality of operation instructions, each stage of operation circuit in the multi-stage operation pipeline is configured to execute a corresponding one of the plurality of operation instructions,
    wherein the plurality of operation instructions are obtained by parsing a computation instruction received by the computing apparatus.
  2. The computing apparatus of claim 1, wherein an opcode of the computation instruction represents a plurality of operations to be performed by the multi-stage operation pipeline, and the computing apparatus further comprises a control circuit configured to obtain and parse the computation instruction so as to obtain the plurality of operation instructions corresponding to the plurality of operations.
  3. The computing apparatus of claim 2, wherein the opcode and the plurality of operations it represents are predetermined according to the functions supported by the plurality of operation circuits arranged stage by stage in the multi-stage operation pipeline.
  4. The computing apparatus of claim 1, wherein the stages of operation circuits in the multi-stage operation pipeline are configured to be selectively connected according to the plurality of operation instructions so as to execute the plurality of operation instructions.
  5. The computing apparatus of claim 1, wherein the plurality of groups of pipeline operation circuits constitute a plurality of multi-stage operation pipelines, and the plurality of multi-stage operation pipelines execute their respective pluralities of operation instructions in parallel.
  6. The computing apparatus of claim 1 or 5, wherein each stage of operation circuit in the multi-stage operation pipeline has an input and an output for receiving input data at that stage of operation circuit and outputting the result of that stage of operation circuit's operation.
  7. The computing apparatus of claim 6, wherein, within one multi-stage operation pipeline, the outputs of the operation circuits of one or more stages are configured to be connected, according to an operation instruction, to the inputs of the operation circuits of one or more other stages so as to execute the operation instruction.
  8. The computing apparatus of claim 6, wherein the plurality of multi-stage operation pipelines include a first multi-stage operation pipeline and a second multi-stage operation pipeline, and the outputs of the operation circuits of one or more stages of the first multi-stage operation pipeline are configured to be connected, according to the operation instruction, to the inputs of the operation circuits of one or more stages of the second multi-stage operation pipeline.
  9. The computing apparatus of claim 1, wherein each stage of operation circuit includes one or more of the following operators or circuits:
    a random number processing circuit, an addition-subtraction circuit, a subtraction circuit, a table-lookup circuit, a parameter configuration circuit, a multiplier, a pooler, a comparator, an absolute-value circuit, a logic operator, a position index circuit or a filter.
  10. The computing apparatus of claim 1, further comprising a data processing circuit that includes a type conversion circuit for performing data type conversion operations and/or a data concatenation circuit for performing data concatenation operations.
  11. The computing apparatus of claim 10, wherein the type conversion circuit includes one or more converters for converting computation data between a plurality of different data types.
  12. The computing apparatus of claim 10, wherein the data concatenation circuit is configured to split computation data by a predetermined bit length and to concatenate the plurality of data blocks obtained from the splitting in a predetermined order.
  13. An integrated circuit chip comprising the computing apparatus of any one of claims 1-12.
  14. A board card comprising the integrated circuit chip of claim 13.
  15. An electronic device comprising the integrated circuit chip of claim 13.
  16. A method of performing computation operations using a computing apparatus, wherein the computing apparatus includes one or more groups of pipeline operation circuits, the method comprising:
    configuring each of the one or more groups of pipeline operation circuits to perform a multi-stage pipeline operation, wherein each group of the pipeline operation circuits constitutes one multi-stage operation pipeline, and the multi-stage operation pipeline comprises a plurality of operation circuits arranged stage by stage; and
    in response to receiving a plurality of operation instructions, configuring each stage of operation circuit in the multi-stage operation pipeline to execute a corresponding one of the plurality of operation instructions,
    wherein the plurality of operation instructions are obtained by parsing a computation instruction received by the computing apparatus.
  17. The method of claim 16, wherein an opcode of the computation instruction represents a plurality of operations to be performed by the multi-stage operation pipeline, the computing apparatus further comprises a control circuit, and the method includes configuring the control circuit to obtain and parse the computation instruction so as to obtain the plurality of operation instructions corresponding to the plurality of operations.
  18. The method of claim 17, wherein the opcode and the plurality of operations it represents are predetermined according to the functions supported by the plurality of operation circuits arranged stage by stage in the multi-stage operation pipeline.
  19. The method of claim 16, wherein the stages of operation circuits in the multi-stage operation pipeline are configured to be selectively connected according to the plurality of operation instructions so as to execute the plurality of operation instructions.
  20. The method of claim 16, wherein the plurality of groups of pipeline operation circuits constitute a plurality of multi-stage operation pipelines, and the plurality of multi-stage operation pipelines execute their respective pluralities of operation instructions in parallel.
  21. The method of claim 16 or 20, wherein each stage of operation circuit in the multi-stage operation pipeline has an input and an output for receiving input data at that stage of operation circuit and outputting the result of that stage of operation circuit's operation.
  22. The method of claim 21, wherein, within one multi-stage operation pipeline, the outputs of the operation circuits of one or more stages are configured to be connected, according to an operation instruction, to the inputs of the operation circuits of one or more other stages so as to execute the operation instruction.
  23. The method of claim 21, wherein the plurality of multi-stage operation pipelines include a first multi-stage operation pipeline and a second multi-stage operation pipeline, and the method configures the outputs of the operation circuits of one or more stages of the first multi-stage operation pipeline to be connected, according to the operation instruction, to the inputs of the operation circuits of one or more stages of the second multi-stage operation pipeline.
  24. The method of claim 16, wherein each stage of operation circuit includes one or more of the following operators or circuits:
    a random number processing circuit, an addition-subtraction circuit, a subtraction circuit, a table-lookup circuit, a parameter configuration circuit, a multiplier, a pooler, a comparator, an absolute-value circuit, a logic operator, a position index circuit or a filter.
  25. The method of claim 16, wherein the computing apparatus further comprises a data processing circuit that includes a type conversion circuit for performing data type conversion operations and/or a data concatenation circuit for performing data concatenation operations.
  26. The method of claim 25, wherein the type conversion circuit includes one or more converters for converting computation data between a plurality of different data types.
  27. The method of claim 25, wherein the data concatenation circuit is configured to split computation data by a predetermined bit length and to concatenate the plurality of data blocks obtained from the splitting in a predetermined order.
PCT/CN2021/094722 2020-06-30 2021-05-19 Computing apparatus, integrated circuit chip, board card, electronic device and computing method WO2022001455A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021576558A JP7368512B2 (ja) 2020-06-30 2021-05-19 Computing apparatus, integrated circuit chip, board card, electronic device and computing method
US18/013,589 US20230297387A1 (en) 2020-06-30 2021-05-19 Calculation apparatus, integrated circuit chip, board card, electronic device and calculation method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010619481.X 2020-06-30
CN202010619481.XA CN113867793A (zh) 2020-06-30 2020-06-30 Computing apparatus, integrated circuit chip, board card, electronic device and computing method

Publications (1)

Publication Number Publication Date
WO2022001455A1 true WO2022001455A1 (zh) 2022-01-06

Family

ID=78981787

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/094722 WO2022001455A1 (zh) 2020-06-30 2021-05-19 Computing apparatus, integrated circuit chip, board card, electronic device and computing method

Country Status (4)

Country Link
US (1) US20230297387A1 (zh)
JP (1) JP7368512B2 (zh)
CN (1) CN113867793A (zh)
WO (1) WO2022001455A1 (zh)


Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0769824B2 (ja) * 1988-11-11 1995-07-31 Hitachi, Ltd. Simultaneous processing scheme for multiple instructions
US5787026A (en) * 1995-12-20 1998-07-28 Intel Corporation Method and apparatus for providing memory access in a processor pipeline
EP1199629A1 (en) * 2000-10-17 2002-04-24 STMicroelectronics S.r.l. Processor architecture with variable-stage pipeline
US10572824B2 (en) 2003-05-23 2020-02-25 Ip Reservoir, Llc System and method for low latency multi-functional pipeline with correlation logic and selectively activated/deactivated pipelined data processing engines
US8074056B1 (en) * 2005-02-02 2011-12-06 Marvell International Ltd. Variable length pipeline processor architecture
US7721071B2 (en) * 2006-02-28 2010-05-18 Mips Technologies, Inc. System and method for propagating operand availability prediction bits with instructions through a pipeline in an out-of-order processor
US7984269B2 (en) * 2007-06-12 2011-07-19 Arm Limited Data processing apparatus and method for reducing issue circuitry responsibility by using a predetermined pipeline stage to schedule a next operation in a sequence of operations defined by a complex instruction
JP5173782B2 (ja) 2008-05-26 2013-04-03 Shimizu Corporation Groundwater flow preservation construction method
US7941644B2 (en) * 2008-10-16 2011-05-10 International Business Machines Corporation Simultaneous multi-thread instructions issue to execution units while substitute injecting sequence of instructions for long latency sequencer instruction via multiplexer
US20140129805A1 (en) * 2012-11-08 2014-05-08 Nvidia Corporation Execution pipeline power reduction
US10223124B2 (en) 2013-01-11 2019-03-05 Advanced Micro Devices, Inc. Thread selection at a processor based on branch prediction confidence
US9354884B2 (en) 2013-03-13 2016-05-31 International Business Machines Corporation Processor with hybrid pipeline capable of operating in out-of-order and in-order modes
US11029997B2 (en) * 2013-07-15 2021-06-08 Texas Instruments Incorporated Entering protected pipeline mode without annulling pending instructions
US9690590B2 (en) * 2014-10-15 2017-06-27 Cavium, Inc. Flexible instruction execution in a processor pipeline
CN110990063B (zh) * 2019-11-28 2021-11-23 Institute of Computing Technology, Chinese Academy of Sciences Acceleration apparatus, method and computer device for gene similarity analysis

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020890A * 2012-12-17 2013-04-03 Institute of Semiconductors, Chinese Academy of Sciences Visual processing apparatus based on multi-level parallel processing
CN107729990A * 2017-07-20 2018-02-23 Shanghai Cambricon Information Technology Co., Ltd. Apparatus and method for performing artificial neural network forward operations supporting discrete data representation
CN107992329A * 2017-07-20 2018-05-04 Shanghai Cambricon Information Technology Co., Ltd. Computation method and related product
CN110858150A * 2018-08-22 2020-03-03 Shanghai Cambricon Information Technology Co., Ltd. Operation apparatus with locally real-time reconfigurable pipeline stages
US20200201932A1 (en) * 2019-12-28 2020-06-25 Intel Corporation Apparatuses, methods, and systems for instructions of a matrix operations accelerator

Also Published As

Publication number Publication date
JP2022542217A (ja) 2022-09-30
JP7368512B2 (ja) 2023-10-24
CN113867793A (zh) 2021-12-31
US20230297387A1 (en) 2023-09-21


Legal Events

Date Code Title Description
ENP Entry into the national phase (Ref document number: 2021576558; Country of ref document: JP; Kind code of ref document: A)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21832729; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21832729; Country of ref document: EP; Kind code of ref document: A1)