WO2022021073A1 - Multi-operator operation method and apparatus for a neural network model - Google Patents


Info

Publication number
WO2022021073A1
WO2022021073A1 · PCT/CN2020/105217 (CN2020105217W)
Authority
WO
WIPO (PCT)
Prior art keywords
image data, original image, data, read, operator
Prior art date
Application number
PCT/CN2020/105217
Other languages
English (en)
French (fr)
Inventor
刘敏丽
张楠赓
Original Assignee
嘉楠明芯(北京)科技有限公司
Priority date
Filing date
Publication date
Application filed by 嘉楠明芯(北京)科技有限公司 filed Critical 嘉楠明芯(北京)科技有限公司
Priority to PCT/CN2020/105217 priority Critical patent/WO2022021073A1/zh
Priority to CN202080102306.1A priority patent/CN116134446A/zh
Publication of WO2022021073A1 publication Critical patent/WO2022021073A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models

Definitions

  • the present application relates to the field of artificial intelligence, in particular to the field of multi-operator operations.
  • In addition to convolution, a convolutional neural network also includes operations such as activation, pooling, and batch normalization. Although these operations account for a small proportion of the entire convolutional neural network, they are very important. At present, there are two ways to implement them. The first way is to combine the hardware of multiple individual operation modules in various ways to complete the computing task. However, because each operation corresponds to the hardware of a single operation module, this not only increases the chip area but also increases the production cost. Moreover, hardware that implements only one type of operation module can realize only conventional simple calculations; complex operations cannot be realized.
  • The second way is to use general-purpose hardware such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), or a GPU (Graphics Processing Unit) to implement operations such as activation, pooling, and batch normalization.
  • However, CPUs, DSPs, and GPUs are not specially designed for operations such as activation, pooling, and batch normalization in neural networks, resulting in lower operation rates.
  • the embodiments of the present application provide a multi-operator computing method and device for a neural network model to solve the problems existing in the related art, and the technical solutions are as follows:
  • an embodiment of the present application provides a multi-operator computing method for a neural network model, including:
  • obtaining a configuration instruction and determining, according to the configuration instruction, multiple operation devices corresponding to multiple operators and the execution order of the multiple operation devices, where the multiple operators are obtained by decomposing an operation formula and the multiple operation devices are selected from a set of operation devices;
  • reading the pixels contained in the tensor corresponding to the original image to obtain original image data; and
  • controlling, according to the execution order, the plurality of operation devices to process the original image data in a serial execution manner to output final image data.
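The claimed flow can be sketched in software. The sketch below is illustrative only (the device set, names, and example formula are assumptions, not from the patent): an operation formula is decomposed into operators, each operator is matched to an operation device selected from a device set, and the devices process the data serially in the configured execution order.

```python
# Illustrative sketch (all names and the example formula are assumptions):
# decompose a formula into operators, match each operator to a device from a
# device set, and run the devices serially over the image data.

DEVICE_SET = {
    "add": lambda x, c: x + c,       # adder
    "mul": lambda x, c: x * c,       # multiplier
    "square": lambda x, _: x * x,    # square operator
}

def run_pipeline(config, pixels):
    """config: ordered list of (device_name, constant) pairs (the execution order)."""
    data = list(pixels)                          # "original image data"
    for name, const in config:                   # serial execution
        device = DEVICE_SET[name]                # device selected from the set
        data = [device(v, const) for v in data]  # intermediate data i -> i+1
    return data                                  # "final image data"

# e.g. y = (x + 1)^2 * 0.5 decomposes into add -> square -> mul
result = run_pipeline([("add", 1.0), ("square", None), ("mul", 0.5)], [1.0, 2.0])
```

Because the devices are looked up from a shared set, the same hardware sketch serves any formula that decomposes into these operators, which is the configurability the claims describe.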
  • In one embodiment, the configuration instruction includes a preset data length, and reading the pixels contained in the tensor corresponding to the original image to obtain the original image data includes:
  • sending a read request to an external memory and/or an internal local buffer;
  • in the case that the read request is passed, reading the pixels contained in the tensor corresponding to the original image to obtain the original image data;
  • in the case that the length of the original image data is equal to the preset data length, stopping the reading of the pixels.
  • the configuration instruction includes a preset vector length
  • reading the pixels included in the tensor corresponding to the original image to obtain the original image data includes:
  • dividing the tensor into a plurality of vectors according to the preset vector length, the vectors including a plurality of pixels;
  • within each vector, reading the pixels in their order of arrangement, and repeatedly reading each pixel M1 times when it is read;
  • repeatedly reading each of the vectors M2 times to obtain the original image data, where both M1 and M2 are greater than or equal to 1.
  • In one embodiment, controlling the plurality of operation devices to process the original image data in a serial execution manner and outputting the final image data includes:
  • within one clock cycle, controlling the plurality of operation devices to process, in a serial execution manner, the original image data corresponding to a plurality of pixels in parallel, and outputting the final image data.
  • In one embodiment, the configuration instruction includes a mapping relationship table between the output terminal of each operation device and the input terminals of the remaining operation devices, and controlling, according to the execution order, the plurality of operation devices to process the original image data in a serial execution manner and outputting the final image data includes:
  • controlling the original image data to be input to the first operation device for operation to obtain first intermediate data;
  • inputting the first intermediate data to the second operation device for operation to obtain second intermediate data, and so on, until the (N-1)th intermediate data is input to the Nth operation device for operation;
  • outputting the final image data, where N is a positive integer greater than or equal to 1.
  • the configuration instruction includes a constant, and the constant is obtained by decomposing the operation formula.
  • In one embodiment, the method further includes:
  • the final image data is downsampled.
  • the original image data, the first intermediate data to the N-1th intermediate data, and the final image data are all in a data format of 16-bit floating point numbers.
  • an embodiment of the present application provides a multi-operator computing device for a neural network model, including:
  • a configuration instruction acquisition module configured to acquire a configuration instruction and determine, according to the configuration instruction, a plurality of operation devices corresponding to a plurality of operators and the execution order of the plurality of operation devices, where the plurality of operators are obtained by decomposing an operation formula and the plurality of operation devices are selected from the operation device set;
  • the data reading module is used to read the pixels contained in the tensor corresponding to the original image to obtain the original image data;
  • the multi-operator operation module is configured to control the plurality of operation devices to process the original image data in a serial execution manner according to the execution sequence, and output final image data.
  • the configuration instruction includes a preset data length
  • the data reading module includes:
  • a read request sending submodule used to send a read request to the external memory and/or the internal local buffer
  • a data reading submodule configured to read the pixels contained in the tensor corresponding to the original image to obtain the original image data when the read request is passed;
  • a data reading stop sub-module is configured to stop reading the pixel point when the length of the original image data is equal to the preset data length.
  • the configuration instruction includes a preset vector length
  • the data reading submodule includes:
  • a vector dividing unit configured to divide the tensor into a plurality of vectors according to the preset vector length, and the vectors include a plurality of pixels;
  • a first reading unit configured to, within each vector, read the pixels in their order of arrangement and repeatedly read each pixel M1 times when it is read;
  • the second reading unit is used to repeatedly read each of the vectors M2 times to obtain the original image data, wherein both M1 and M2 are greater than or equal to 1.
  • the multi-operator operation module is configured to control the plurality of operation devices to process, within one clock cycle and in a serial execution manner, the original image data corresponding to the plurality of pixels in parallel, and to output the final image data.
  • the configuration instruction includes a mapping relationship table between the output terminal of each operation device and the input terminals of the remaining operation devices
  • the multi-operator operation module includes:
  • an execution order determination submodule configured to determine the execution order according to the mapping relationship table
  • a multi-operator operation sub-module configured to control, according to the execution order, the input of the original image data to the first operation device for operation to obtain first intermediate data, and to input the first intermediate data to the second operation device for operation to obtain second intermediate data, and so on, until the (N-1)th intermediate data is input to the Nth operation device for operation and the final image data is output, where N is a positive integer greater than or equal to 1.
  • the configuration instruction includes a constant, and the constant is obtained by decomposing the operation formula.
  • In one embodiment, the device further includes:
  • a down-sampling module for down-sampling the final image data.
  • the original image data, the first intermediate data to the N-1th intermediate data, and the final image data are all in a data format of 16-bit floating point numbers.
  • an electronic device comprising:
  • At least one processor and a memory communicatively coupled to the at least one processor;
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform any one of the above methods.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute any of the above methods.
  • An embodiment of the above application has the following advantages or beneficial effects: since any complex operation formula can be decomposed into multiple operators, the multiple operators are configured with corresponding multiple operation devices, and the multiple operation devices are used to serially process the original image data and output the final image data. The scheme can therefore support various types of complex operations in various neural networks, and the operations are programmable, which improves operation efficiency.
  • Since the operation devices corresponding to the multiple operators are selected from the operation device set, the multiple operation devices are configurable and reusable when various complex operations are performed, and corresponding hardware does not need to be designed for each complex operation, which effectively saves chip area and reduces chip cost.
  • General-purpose hardware accelerators such as a CPU, DSP, or GPU are not used directly to perform the various operations of the neural network model; instead, the multi-operator computing device of the neural network model provided in the present application is used, which avoids communication with general-purpose hardware accelerators such as a CPU, DSP, or GPU and improves computational timeliness.
  • FIG. 1 is a schematic diagram of a multi-operator computing method of a neural network model according to an embodiment of the present application
  • FIG. 2 is a scene diagram of a multi-operator computing device of a neural network model according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of a multi-operator operation method of a neural network model according to another embodiment of the present application.
  • FIG. 4 is a structural diagram of an internal local buffer according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a method for reading pixels included in a tensor corresponding to an original image according to another embodiment of the present application;
  • FIG. 6 is a scene diagram of a multi-operator computing method according to an embodiment of the present application.
  • FIG. 7 is a scene diagram of a multi-operator computing device of a neural network model according to another embodiment of the present application.
  • FIG. 8 is a schematic diagram of a multi-operator computing device of a neural network model according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a multi-operator computing device of a neural network model according to another embodiment of the present application.
  • FIG. 10 is a block diagram of an electronic device used to implement a multi-operator operation method of a neural network model according to an embodiment of the present application.
  • a multi-operator operation method of a neural network model including the following steps:
  • Step S110: acquire a configuration instruction, and determine, according to the configuration instruction, multiple operation devices corresponding to multiple operators and the execution order of the multiple operation devices, where the multiple operators are obtained by decomposing an operation formula and the multiple operation devices are selected from a set of operation devices;
  • Step S120: read the pixels contained in the tensor corresponding to the original image to obtain original image data;
  • Step S130: according to the execution order, control the plurality of operation devices to process the original image data in a serial execution manner, and output the final image data.
  • the multi-operator computing device of the neural network model may include a data reading module, a multi-operator computing module and a data writing module which are connected in sequence.
  • the multi-operator operation module can be set based on a mesh network (Meshnet).
  • the convolution accelerator includes a multi-operator computing device and a GLB (Global Local Buffer, an internal local buffer), and the multi-operator computing device is connected to the GLB.
  • DDR (Double Data Rate synchronous dynamic random-access memory) serves as the external memory.
  • the data reading module in the multi-operator computing device can read data from GLB and/or DDR.
  • GLB and/or DDR are provided with multiple storage areas, and each storage area can store tensors (Tensors) corresponding to different original images.
  • the tensor includes four dimensions of the original image: N (batch, the number of frames), C (the number of channels), H (height), and W (width).
  • NCHW is used to represent a four-dimensional image: N represents the number of frames in the batch of images, C represents the number of channels, H represents the number of pixels in the vertical direction of the image, and W represents the number of pixels in the horizontal direction.
  • the data reading module may read each pixel in the tensor corresponding to the original image from the GLB and/or DDR, and the original image data may include the value of one pixel or the values of multiple pixels.
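As a hedged illustration of how pixels of such a tensor could be addressed when the NCHW layout is treated as one flat vector (the function name and example values are assumptions for this sketch, not the patent's interface):

```python
# Hedged sketch (names assumed) of addressing a pixel when the NCHW tensor is
# treated as a single flat one-dimensional vector, as the description suggests.
def nchw_flat_index(n, c, h, w, C, H, W):
    # pixels are laid out with W varying fastest, then H, then C, then N
    return ((n * C + c) * H + h) * W + w

# A 1*2*30*40 tensor: one frame, two channels, 30 rows of 40 pixels each.
N, C, H, W = 1, 2, 30, 40
flat = list(range(N * C * H * W))  # stand-in for the stored pixel values
# pixel at frame 0, channel 1, row 2, column 3:
value = flat[nchw_flat_index(0, 1, 2, 3, C, H, W)]
```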
  • the configuration module provided by the upper-layer software is used to split any complex operation formula into multiple basic operators that the Meshnet network can support.
  • Basic operators may include addition operators, multiplication operators, square root operators, square operators, sine and cosine operators, and base logarithm operators.
  • a corresponding computing device is required to perform the operation of each operator. Therefore, in this embodiment, a set of computing devices is set, and the set of computing devices is used to implement the operations of commonly used operators in operations such as activation, pooling, and batch normalization in the neural network.
  • the set of operation devices can include adders, multipliers, one-to-two copy operators, sixteen-segment piece-wise linear fitting operators, one-of-two selection operators, comparators, dividers, binary logic operators, unary logic operators, rounding operators, square root operators, square operators, sine and cosine operators, base-e exponentiation operators, base-e logarithm operators, and so on.
  • the set of computing devices can be adaptively adjusted according to actual needs, which are all within the protection scope of this embodiment.
  • the input end of each operation device can be used as the input end of the multi-operator operation module for receiving original image data.
  • the output terminal of each operation device in the operation device set can be connected to the input terminals of the remaining operation devices, to ensure that the intermediate data output by the previous operation device is used as the input data of the next operation device and is input to the next operation device to continue the operation.
  • the output end of each operation device can also be used as the output end of the multi-operator operation module for outputting the final image data.
  • the configuration module queries the multiple operation devices corresponding to the multiple operators from the operation device set (each operation does not necessarily use all the operation devices in the set), and determines the execution order of the multiple operation devices according to the mathematical operation order of the multiple operators. Then, the configuration module sends a configuration instruction to the multi-operator operation module, where the configuration instruction includes the multiple operation devices corresponding to the multiple operators and the execution order of the multiple operation devices.
  • the multi-operator operation module obtains the configuration instruction from the configuration module and, on the other hand, reads the pixels contained in the tensor corresponding to the original image from the GLB and/or DDR to obtain the original image data. According to the execution order, the original image data is input to the first operation device for operation to obtain the first intermediate data, the first intermediate data is input to the second operation device for operation to obtain the second intermediate data, and so on, until the (N-1)th intermediate data is input to the Nth operation device for operation and the final image data is output.
  • the multi-operator operation module is provided with a logic control unit, which controls the entire process: reading the original image data and, according to the execution order, controlling the multiple operation devices to process the original image data in a serial execution manner and output the final image data. Finally, the final image data is written into the GLB and/or DDR using the data write-out module.
  • the tanh_shrink activation function (a type of activation function in the neural network structure) is tanh_shrink(x) = x - tanh(x) = x - (e^x - e^(-x)) / (e^x + e^(-x)).
  • the corresponding operation devices are as follows: the first operation device is a base-e exponentiation operator, the second operation device is a one-to-two copy operator, the third operation device is another one-to-two copy operator, the fourth operation device is an adder, the fifth operation device is a subtractor, the sixth operation device is a divider, and the seventh operation device is another subtractor.
  • the original image data is controlled to be input to the base-e exponentiation operator to obtain the first intermediate data, the first intermediate data is input to the one-to-two copy operator to obtain the second intermediate data, and so on, until the final image data is output from the subtractor (the seventh operation device).
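The decomposition can be checked numerically. The sketch below follows the identity tanh_shrink(x) = x - tanh(x); the exact wiring of the seven devices is not fully specified here, so the line-by-line mapping of the program to the devices is an assumption:

```python
import math

# Sketch of the tanh_shrink decomposition via the identity
# tanh_shrink(x) = x - tanh(x) = x - (e^x - e^-x) / (e^x + e^-x),
# using the base-e exponentiation, subtract, add and divide devices.
# The mapping of each line to one of the patent's seven devices is assumed.
def tanh_shrink(x):
    e_pos = math.exp(x)        # base-e exponentiation operator
    e_neg = 1.0 / e_pos        # e^-x, derived from the copied value
    num = e_pos - e_neg        # subtractor
    den = e_pos + e_neg        # adder
    tanh = num / den           # divider
    return x - tanh            # final subtractor
```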
  • Since any complex operation formula can be decomposed into multiple operators, multiple operation devices corresponding to the multiple operators are configured, and the multiple operation devices process the original image data in a serial execution manner and output the final image data. The scheme can therefore support various types of complex operations in various neural networks, and the operation is programmable, which improves operation efficiency.
  • Since the operation devices corresponding to the multiple operators are selected from the operation device set, the multiple operation devices are configurable when various complex operations are performed. For different complex mathematical operations, each operation device in the set may be reused, and there is no need to design corresponding hardware for each complex operation, which effectively saves chip area and reduces chip cost.
  • the configuration instruction includes a preset data length
  • step S120 includes:
  • Step S121: send a read request to the external memory and/or the internal local buffer;
  • Step S122: in the case that the read request is passed, read the pixels contained in the tensor corresponding to the original image to obtain the original image data;
  • Step S123: in the case that the length of the original image data is equal to the preset data length, stop reading pixels.
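Steps S121 to S123 can be sketched as a simple read loop (the function name and memory model are illustrative assumptions, not the patent's interface):

```python
# Illustrative sketch (names and memory model assumed) of steps S121-S123:
# request a read, stream pixels once the request is granted, and stop as soon
# as the preset data length is reached.
def read_original_image_data(memory, preset_data_length):
    data = []
    granted = len(memory) > 0       # stand-in for the read-request handshake (S121)
    if granted:                     # S122: the read request is passed
        for pixel in memory:
            data.append(pixel)
            if len(data) == preset_data_length:
                break               # S123: stop reading pixels
    return data

# A 1*2*30*40 tensor has 40 pixels per row; with a preset data length of 120,
# exactly three rows of pixels are read.
rows = list(range(6 * 40))
chunk = read_original_image_data(rows, 120)
```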
  • the data reading module may send one or more read requests to the DDR (external memory) and/or the GLB (internal local buffer). For example, if two tensors of original images are to be read, one read request can be sent to the DDR and another read request to the GLB; alternatively, two read requests can be sent to the DDR, or two read requests can be sent to the GLB.
  • the data reading module can read tensors corresponding to multiple frames of original images, and the reading method and the number of read tensors can be adaptively adjusted according to actual needs, which are all within the protection scope of this embodiment.
  • after the DDR and/or GLB receives the read request, it feeds back a result allowing the read to the data reading module. Then, the data reading module reads the pixels contained in the tensor corresponding to the original image, obtains the original image data, and sends the original image data to the multi-operator operation module.
  • the data reading module includes a mapping function (Map) unit and/or a broadcasting function (Broadcast) unit, which can implement the reading method of the mapping function and the reading method of the broadcast function.
  • GLB is the data cache SRAM (Static Random-Access Memory) of the convolution accelerator. It has a large storage space, and the data reading module can directly obtain data from GLB.
  • the GLB can include eight independent RAMs (Random Access Memory); each RAM has a depth of 512 and a width of 128 bits, and the eight independent RAMs are numbered bank0 to bank7.
  • the mapping functional unit needs one input in a single clock cycle, mapping one input to the eight independent RAMs of the GLB.
  • the GLB responds to a read request in a single clock cycle, and the mapping functional unit selects a bankA (A is any integer from 0 to 7) from bank0 to bank7 to read the tensors stored in bankA.
  • the broadcast functional unit requires two inputs in a single clock cycle, mapping the two inputs to the eight independent RAMs.
  • the GLB responds to two read requests in a single clock cycle, and the broadcast functional unit selects two banks, bankB and bankC (B and C are any integers from 0 to 7, and B is not equal to C), from bank0 to bank7 to read the data stored in bankB and bankC.
  • the data write-out module will send a write request to an independent RAM of the DDR and/or GLB in a single clock cycle. After the write request is passed, the data-write-out module will write the final image data into the DDR and/or GLB.
  • the data reading module can read the pixel points included in the tensor of the original image according to the configuration instruction to obtain the original image data. Since the configuration instruction sent by the configuration module to the data reading module includes the preset data length, the data reading module stops reading pixels when the length of the read original image data is equal to the preset data length.
  • the mapping functional unit supports reading the tensor corresponding to one original image from the GLB or DDR, treating the tensor corresponding to the single original image as a one-dimensional vector in which the pixels are arranged in NCHW order.
  • the mapping functional unit reads pixels from the GLB or DDR row by row, first along each row and then down the rows, until the entire four-dimensional image is read, and sends the read original image data to the multi-operator operation module in turn.
  • the NCHW in the tensor corresponding to the original image is 1*2*30*40, and the tensor is regarded as a one-dimensional vector.
  • the mapping functional unit does not necessarily have to read all the pixels of the NCHW in the tensor, but reads according to the preset data length.
  • the NCHW in the tensor is 1*2*30*40, which means that one line contains 40 pixels. If the preset data length is 120, then only three lines of pixels need to be read.
  • the configuration instruction includes a preset vector length
  • step S122 includes:
  • S1221 Divide the tensor into multiple vectors according to the preset vector length, and the vectors include multiple pixels;
  • the broadcast functional unit supports reading tensors corresponding to multiple frames of raw images in GLB and/or DDR.
  • the configuration module sends a configuration instruction to the data reading module, and the configuration instruction includes a preset vector length.
  • the configuration instruction may also include information such as the number of times M1 of repeated reading of pixel points, the number of times of repeated reading of vectors M2, and the like. Different tensors, vector lengths, pixel point repeated reading times M1, and vector repeated reading times M2 can be configured according to actual conditions, which are all within the protection scope of this embodiment.
  • the NCHW in the tensor corresponding to the first original image is 1*2*30*40
  • the NCHW in the tensor corresponding to the second original image is 1*3*20*40.
  • the first pixel X0 is repeatedly read three times to obtain (X0, X0, X0), the second pixel X1 is repeatedly read three times to obtain (X1, X1, X1), and so on, until every pixel in the vector being read has been read.
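The M1/M2 repeated-read pattern can be sketched as follows (the function name and argument names are assumptions for illustration):

```python
# Sketch of the broadcast-style repeated read (names assumed): the tensor is
# divided into vectors of a preset length, each pixel in a vector is read M1
# times in a row, and each vector as a whole is read M2 times.
def broadcast_read(tensor, vector_length, m1, m2):
    out = []
    for start in range(0, len(tensor), vector_length):
        vector = tensor[start:start + vector_length]
        for _ in range(m2):                  # repeat the whole vector M2 times
            for pixel in vector:
                out.extend([pixel] * m1)     # repeat each pixel M1 times
    return out

# With M1 = 3: X0 is read as (X0, X0, X0), then X1 as (X1, X1, X1), ...
stream = broadcast_read(["X0", "X1"], vector_length=2, m1=3, m2=1)
```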
  • In one embodiment, step S130 includes: within one clock cycle, controlling the multiple operation devices to process, in a serial execution manner, the original image data corresponding to multiple pixels in parallel, and outputting the final image data.
  • the data reading module reads multiple pixels from the DDR and/or GLB at a time. In one clock cycle, the data reading module sends the read original image data corresponding to multiple pixels to the multi-operator operation module, so that the multi-operator operation module can operate on the original image data corresponding to the multiple pixels in parallel at the same time. For example, in one clock cycle, the data reading module can send the original image data corresponding to four pixels to the multi-operator operation module.
  • in the 0th clock cycle, the data reading module can send the four points X00 to X03 in the first row to the multi-operator operation module; in the 1st clock cycle, the data reading module can send the four points X04 to X07, and so on.
  • now it is assumed that the pixel repetition count M1 is 3.
  • in the 0th clock cycle, the data reading module can send (X00, X00, X00, X01) to the multi-operator operation module; in the 1st clock cycle, the data reading module can send (X01, X01, X02, X02), and so on.
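The example cycles above can be reproduced by slicing the M1-repeated pixel stream into groups of four, one group per clock cycle (a sketch with assumed names):

```python
# Sketch (names assumed) of how the 4-pixel-per-cycle stream lines up with the
# pixel repetition count M1: the repeated pixel stream is simply sliced into
# groups of four, one group per clock cycle.
def cycles(pixels, m1, lanes=4):
    stream = [p for p in pixels for _ in range(m1)]
    return [tuple(stream[i:i + lanes]) for i in range(0, len(stream), lanes)]

row = ["X0%d" % i for i in range(8)]   # X00 .. X07
per_cycle = cycles(row, m1=3)
# cycle 0 carries ('X00', 'X00', 'X00', 'X01'),
# cycle 1 carries ('X01', 'X01', 'X02', 'X02'), matching the example above
```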
  • four instances of each operation device can be provided, so that the four identical operation devices work at the same time to process the original image data corresponding to the four pixels in parallel.
  • each operation device can also be provided in a larger number, for example, eight adders and eight subtractors, or in a smaller number, for example, two or three adders and two or three subtractors. Adaptive adjustment according to actual needs falls within the protection scope of this embodiment.
  • the data reading module sends the original image data corresponding to multiple pixels to the multi-operator operation module, so as to realize parallel operation and effectively improve the operation efficiency.
  • the configuration instruction includes a mapping relationship table between the output terminal of each computing device and the input terminals of the remaining computing devices, and step 130 includes:
  • Step 131: determine the execution order according to the mapping relationship table;
  • Step 132: according to the execution order, control the input of the original image data to the first operation device for operation to obtain the first intermediate data, input the first intermediate data to the second operation device for operation to obtain the second intermediate data, and so on, until the (N-1)th intermediate data is input to the Nth operation device for operation and the final image data is output, where N is a positive integer greater than or equal to 1.
  • the configuration instruction includes a constant, and the constant is obtained by decomposing the operation formula.
  • the computing device set includes 27 computing devices, from computing device 0 to computing device 26 .
  • Each computing device may include two or three input terminals, one output terminal or two output terminals.
  • the operation device may include two tensor input terminals, or two constant input terminals, or two tensor input terminals and one constant input terminal, etc., which can be set according to requirements.
  • the tensor input is used to input raw image data or intermediate image data.
  • Constant inputs are used to enter constants.
  • the operation formula may contain constants. For example, if the operation formula contains the decimal values 3.73 and 5.89, the first constant is configured as 3.73 and the second constant as 5.89. The number of constants is therefore related to the operation formula; of course, if there are no constants in the operation formula, no constants need to be configured.
  • a multi-operator operation method of a neural network model is provided.
  • the configuration module is used to decompose the operation formula to obtain multiple operators, and determine multiple operation devices corresponding to the multiple operators.
  • a mapping relationship table is generated between the output terminal of the previous operation device and the input terminal of the adjacent next operation device.
  • the input end of each computing device will have a corresponding output end number.
  • a preferred numbering method is to number the multiple tensor input terminals (e.g., two), the multiple constant input terminals (e.g., four), and the output terminals of all operation devices in one unified scheme, so as to facilitate the mapping between the input terminals of each operation device and the outputs of all devices.
  • the multi-operator operation module receives the configuration instruction sent by the configuration module, and determines the execution sequence according to the mapping relationship between the output end of each operation device and the input end of the remaining operation devices in the mapping relationship table.
  • first-level register (reg) is inserted after each operation is performed, in order to ensure that the timing of the input of the first tensor and the second tensor is consistent, additional computing devices can be called to perform calculations on the second tensor.
  • different storage areas in the GLB store tensors (first tensor and second tensor) corresponding to different original images
  • the data reading module reads the tensors contained in the tensors corresponding to the two original images from the GLB. Pixel points to obtain two channels of original image data. It is assumed that the original image data corresponding to the two tensors is sent to the multi-operator operation module.
  • Four computing devices are determined according to the configuration instruction, including the first computing device to the fourth computing device.
  • the first operation device is an adder 0, the second operation device is an adder 1, the third operation device is a square operator, and the fourth operation device is a comparator.
  • mapping relationship table in Table 1 two tensors and four constants are numbered uniformly with the output terminals of each operational device.
  • the two tensors and four constants are uniformly numbered 1 to 6, that is, the number of the first tensor is 1, the number of the second tensor is 2, the number of the first constant is 3, the number of the second constant is 4, The number of the third constant is 5, and the number of the fourth constant is 6. See the mapping table for the output terminal numbers of each computing device.
  • the execution order is obtained according to the mapping relationship table: the tensor input terminal of adder 0 is used to input the original image data corresponding to the first tensor, the constant input terminal is used to input the second constant, and the output terminal of adder 0 is used to input the original image data of the adder 1.
  • the tensor input terminal is connected; the tensor input terminal of the adder 1 is used to input the first intermediate data corresponding to the first tensor, the constant input terminal of the adder 1 corresponds to the input of the first constant, and the output terminal of the adder 1 is used for the square operation
  • the tensor input end of the square operator is connected to the tensor input end of the square operator; the tensor input end of the square operator is used to input the second intermediate data corresponding to the first tensor, and the output end of the square operator is connected to the tensor input end of the comparator;
  • the quantity input terminal is used to input the third intermediate data corresponding to the first tensor, the constant input terminal of the comparison operator is used to input the third constant, and the output terminal of the comparator is used to output the final image data.
  • any input terminal of each operation device can be a constant input, a tensor input, or a tensor output of other devices.
  • the output terminal and input terminal of each computing device in the computing device set can be controlled by the logic control unit to be turned on or off, indicating whether they can be used to transmit data.
  • the logic control unit controls the working state of each computing device according to the mapping relationship table. If the output terminal number corresponding to the input terminal of a certain computing device is 0, it means that the computing device will not work and can be in a closed state.
  • Step S140 down-sampling the final image data.
  • the down-sampling (Reduce) module down-samples the final image data
  • the input data can come from GLB or DDR or a multi-operator operation module, and can be any one of N, C, H, or W.
  • Dimension downsampling operation For example, the operations of finding the maximum value, minimum value, summation, subtraction, and multiplication of any dimension of N or C or H or W.
  • the input of downsampling can be the data read back by GLB or DDR, or the final image data output by the multi-operator operation module. If the multi-operator operation module does not work, you can directly downsample the original image data corresponding to the tensor, that is, downsample the pixels of any dimension in the tensor.
  • the format of the original image data, the format of the first intermediate data to the N-1th intermediate data, and the final image data are all data formats of 16-bit floating point numbers.
  • the operations involved in the multi-operator operation module and the downsampling module are both floating-point operations in BF16 format, which can effectively improve the operation precision.
  • Floating-point operations in BF16 (16-bit floating-point numbers, bfloat) format can be replaced with fixed-point operations in INT8 format, which can save the hardware area of the multi-operator operation module or downsampling module.
  • a multi-operator computing device of a neural network model including:
  • the configuration instruction acquisition module 110 is configured to acquire a configuration instruction, and according to the configuration instruction, determine multiple computing devices corresponding to multiple operators, and the execution order of the multiple computing devices, and the multiple operators are based on the operation formula. Decomposed, multiple computing devices are selected from the computing device set;
  • the data reading module 120 is used to read the pixel points contained in the tensor corresponding to the original image to obtain the original image data;
  • the multi-operator operation module 130 is configured to control the plurality of operation devices to process the original image data in a serial execution manner according to the execution sequence, and output final image data.
  • the general hardware accelerators such as CPU, DSP or GPU are not directly used in this embodiment to perform various operations of the neural network model, but the multi-operator computing device of the neural network model provided by this application is used, It avoids the communication with general hardware accelerators such as CPU, DSP or GPU, and improves the time efficiency of operation.
  • the configuration instruction includes a preset data length
  • the data reading module 120 includes:
  • a read request sending submodule 121 is used to send a read request to the external memory and/or the internal local buffer;
  • the data reading submodule 122 is configured to read the pixel points contained in the tensor corresponding to the original image to obtain the original image data when the read request is passed;
  • the data reading stop sub-module 123 is configured to stop reading the pixel point when the length of the original image data is equal to the preset data length.
  • the configuration instruction includes a preset vector length
  • the data reading sub-module 122 includes:
  • a vector dividing unit 1221 configured to divide the tensor into multiple vectors according to the preset vector length, and the vector includes multiple pixels;
  • the first reading unit 1222 is used in the vector to read the pixel points according to the arrangement sequence, and when each pixel point is read, each of the pixel points is repeatedly read M1 times;
  • the second reading unit 1223 is configured to repeatedly read each of the vectors M2 times to obtain the original image data, wherein both M1 and M2 are greater than or equal to 1.
  • the multi-operator operation module is configured to control the plurality of operation devices to perform parallel processing on the original image data corresponding to the plurality of pixels in a serial execution manner within one clock cycle, and output the the final image data.
  • the configuration instruction includes a mapping relationship table between the output terminal of each operation device and the input terminals of the remaining operation devices
  • the multi-operator operation module 130 includes:
  • an execution order determination submodule 131 configured to determine the execution order according to the mapping relationship table
  • the multi-operator operation sub-module 132 is configured to control the input of the original image data to the first operation device for operation according to the execution sequence, obtain the first intermediate data, and input the first intermediate data to the second operation device.
  • the operation device performs operations to obtain second intermediate data, until the N-1th intermediate data is input to the Nth operation device for operation, and the final image data is output, where N is a positive integer greater than or equal to 1.
  • the down-sampling module 140 is configured to down-sample the final image data.
  • the original image data, the first intermediate data to the N-1th intermediate data, and the final image data are all in a data format of 16-bit floating point numbers.
  • the present application further provides an electronic device and a readable storage medium.
  • FIG. 10 it is a block diagram of an electronic device of a multi-operator operation method of a neural network model according to an embodiment of the present application.
  • Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit implementations of the application described and/or claimed herein.
  • the electronic device includes: one or more processors 1001, a memory 1002, and interfaces for connecting various components, including a high-speed interface and a low-speed interface.
  • the various components are interconnected using different buses and may be mounted on a common motherboard or otherwise as desired.
  • the processor may process instructions for execution within the electronic device, including storing in or on memory to display a Graphical User Interface (GUI) on an external input/output device such as a display device coupled to the interface ) instructions for graphics information.
  • GUI Graphical User Interface
  • multiple processors and/or multiple buses may be used with multiple memories and multiple memories, if desired.
  • multiple electronic devices may be connected, each providing some of the necessary operations (eg, as a server array, a group of blade servers, or a multiprocessor system).
  • a processor 1001 is used as an example.
  • the memory 1002 is the non-transitory computer-readable storage medium provided by the present application.
  • the memory stores instructions executable by at least one processor, so that the at least one processor executes the multi-operator operation method of the neural network model provided by the present application.
  • the non-transitory computer-readable storage medium of the present application stores computer instructions, and the computer instructions are used to make the computer execute the multi-operator operation method of the neural network model provided by the present application.
  • the memory 1002 can be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as corresponding to the multi-operator operation method of a neural network model in the embodiments of the present application.
  • Program instructions/modules for example, the configuration instruction acquisition module 110, the data reading module 120, and the multi-operator operation module 130 shown in FIG. 8).
  • the processor 1001 executes various functional applications and data processing of the server by running the non-transitory software programs, instructions and modules stored in the memory 1002, that is, to realize the multi-operator operation of a neural network model in the above method embodiments. method.
  • the memory 1002 can include a stored program area and a stored data area, wherein the stored program area can store an operating system and an application program required by at least one function; data created by the use of the device, etc. Additionally, memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 1002 may optionally include memory located remotely relative to processor 1001, and these remote memories may be connected to the aforementioned electronic device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the above electronic device may further include: an input device 1003 and an output device 1004 .
  • the processor 1001 , the memory 1002 , the input device 1003 and the output device 1004 may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 10 .
  • the input device 1003 can receive input numerical or character information, and generate key signal input related to user settings and function control of the above-mentioned electronic equipment, such as a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more Input devices such as mouse buttons, trackballs, joysticks, etc.
  • Output devices 1004 may include display devices, auxiliary lighting devices (eg, LEDs), haptic feedback devices (eg, vibration motors), and the like.
  • the display device may include, but is not limited to, a liquid crystal display (Liquid Cr10stal Displa10, LCD), a light emitting diode (Light Emitting Diode, LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
  • Various implementations of the systems and techniques described herein can be implemented in digital electronic circuitry, integrated circuit systems, application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof .
  • ASICs application specific integrated circuits
  • These various embodiments may include being implemented in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor that The processor, which may be a special purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device an output device.
  • machine-readable medium and “computer-readable medium” refer to any computer program product, apparatus, and/or apparatus for providing machine instructions and/or data to a programmable processor ( For example, magnetic disks, optical disks, memories, programmable logic devices (PLDs)), including machine-readable media that receive machine instructions as machine-readable signals.
  • machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • the systems and techniques described herein may be implemented on a computer having: a display device (eg, a CRT (Cathode Ray Tube) or an LCD (liquid crystal) for displaying information to the user monitor); and a keyboard and pointing device (eg, mouse or trackball) through which a user can provide input to the computer.
  • a display device eg, a CRT (Cathode Ray Tube) or an LCD (liquid crystal) for displaying information to the user monitor
  • a keyboard and pointing device eg, mouse or trackball
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (eg, visual feedback, auditory feedback, or tactile feedback); and can be in any form (including acoustic input, voice input, or tactile input) to receive input from the user.
  • the systems and techniques described herein may be implemented on a computing system that includes back-end components (eg, as a data server), or a computing system that includes middleware components (eg, an application server), or a computing system that includes front-end components (eg, a user's computer having a graphical user interface or web browser through which a user may interact with implementations of the systems and techniques described herein), or including such backend components, middleware components, Or any combination of front-end components in a computing system.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communication network). Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN), and the Internet.
  • a computer system can include clients and servers.
  • Clients and servers are generally remote from each other and usually interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Abstract

The present application discloses a multi-operator operation method and apparatus for a neural network model. A specific implementation scheme is as follows: a configuration instruction is acquired, and multiple operation devices corresponding to multiple operators, as well as the execution order of the multiple operation devices, are determined according to the configuration instruction, where the multiple operators are obtained by decomposing an operation formula and the multiple operation devices are selected from a set of operation devices; the pixels contained in the tensor corresponding to an original image are read to obtain original image data; and, following the execution order, the multiple operation devices are controlled to process the original image data in a serial execution manner and output final image data. The scheme can support various types of complex operations in various neural networks, and the operations are programmable. At the same time, the operation devices are configurable and reusable, which effectively reduces chip area and saves chip cost.

Description

Multi-operator Operation Method and Apparatus for a Neural Network Model

Technical Field

The present application relates to the field of artificial intelligence, and in particular to the field of multi-operator operations.

Background

Besides the convolution operation itself, a convolutional neural network also involves operations such as activation, pooling, and batch normalization. Although these operations account for a small share of the whole network, they are essential. At present there are two ways to implement them. In the first, a separate hardware operation module is designed for each of the activation, pooling, batch-normalization, and similar operations, and these individual hardware modules are combined in various ways according to the actual scenario to complete a computing task. However, implementing each operation with its own hardware module not only increases chip area and production cost; hardware that realizes only one operation module can also perform only conventional simple calculations, and complex operations cannot be realized. In the second, general-purpose hardware accelerators such as a CPU (central processing unit), DSP (digital signal processor), or GPU (graphics processing unit) are used to implement activation, pooling, batch normalization, and the like. However, a CPU, DSP, or GPU is not designed specifically for these neural-network operations, so the operation rate is low.

Summary

Embodiments of the present application provide a multi-operator operation method and apparatus for a neural network model to solve the problems of the related art. The technical solution is as follows.

In a first aspect, an embodiment of the present application provides a multi-operator operation method for a neural network model, including:

acquiring a configuration instruction, and determining, according to the configuration instruction, multiple operation devices corresponding to multiple operators and the execution order of the multiple operation devices, the multiple operators being obtained by decomposing an operation formula, and the multiple operation devices being selected from a set of operation devices;

reading the pixels contained in the tensor corresponding to an original image to obtain original image data; and

controlling, according to the execution order, the multiple operation devices to process the original image data in a serial execution manner, and outputting final image data.
In one implementation, the configuration instruction includes a preset data length, and reading the pixels contained in the tensor corresponding to the original image to obtain the original image data includes:

sending a read request to an external memory and/or an internal local buffer;

when the read request is granted, reading the pixels contained in the tensor corresponding to the original image to obtain the original image data; and

stopping reading pixels when the length of the original image data equals the preset data length.

In one implementation, the configuration instruction includes a preset vector length, and reading the pixels contained in the tensor of the original image to obtain the original image data includes:

dividing the tensor into multiple vectors according to the preset vector length, each vector including multiple pixels;

within a vector, reading the pixels in their arrangement order, each pixel being read repeatedly M1 times; and

reading each vector repeatedly M2 times to obtain the original image data, where M1 and M2 are both greater than or equal to 1.
In one implementation, controlling the multiple operation devices, according to the execution order, to process the original image data in a serial execution manner and outputting the final image data includes:

within one clock cycle, controlling the multiple operation devices to process the original image data corresponding to multiple pixels in parallel, with serial execution per pixel, and outputting the final image data.

In one implementation, the configuration instruction includes a mapping table between the output of each operation device and the inputs of the remaining operation devices, and the processing according to the execution order includes:

determining the execution order according to the mapping table; and

according to the execution order, inputting the original image data to the first operation device to obtain first intermediate data, inputting the first intermediate data to the second operation device to obtain second intermediate data, and so on, until the (N-1)-th intermediate data is input to the N-th operation device and the final image data is output, where N is a positive integer greater than or equal to 1.

In one implementation, the configuration instruction includes constants, which are obtained by decomposing the operation formula.

In one implementation, the method further includes:

down-sampling the final image data.

In one implementation, the original image data, the first through (N-1)-th intermediate data, and the final image data are all in a 16-bit floating-point data format.
In a second aspect, an embodiment of the present application provides a multi-operator operation apparatus for a neural network model, including:

a configuration instruction acquisition module, configured to acquire a configuration instruction and determine, according to the configuration instruction, multiple operation devices corresponding to multiple operators and the execution order of the multiple operation devices, the multiple operators being obtained by decomposing an operation formula, and the multiple operation devices being selected from a set of operation devices;

a data reading module, configured to read the pixels contained in the tensor corresponding to an original image to obtain original image data; and

a multi-operator operation module, configured to control, according to the execution order, the multiple operation devices to process the original image data in a serial execution manner and output final image data.

In one implementation, the configuration instruction includes a preset data length, and the data reading module includes:

a read-request sending submodule, configured to send a read request to an external memory and/or an internal local buffer;

a data reading submodule, configured to, when the read request is granted, read the pixels contained in the tensor corresponding to the original image to obtain the original image data; and

a data-reading stop submodule, configured to stop reading pixels when the length of the original image data equals the preset data length.

In one implementation, the configuration instruction includes a preset vector length, and the data reading submodule includes:

a vector dividing unit, configured to divide the tensor into multiple vectors according to the preset vector length, each vector including multiple pixels;

a first reading unit, configured to read the pixels of a vector in their arrangement order, each pixel being read repeatedly M1 times; and

a second reading unit, configured to read each vector repeatedly M2 times to obtain the original image data, where M1 and M2 are both greater than or equal to 1.

In one implementation, the multi-operator operation module is configured to control, within one clock cycle, the multiple operation devices to process the original image data corresponding to multiple pixels in parallel, with serial execution per pixel, and to output the final image data.

In one implementation, the configuration instruction includes a mapping table between the output of each operation device and the inputs of the remaining operation devices, and the multi-operator operation module includes:

an execution-order determination submodule, configured to determine the execution order according to the mapping table; and

a multi-operator operation submodule, configured to, according to the execution order, input the original image data to the first operation device to obtain first intermediate data, input the first intermediate data to the second operation device to obtain second intermediate data, and so on, until the (N-1)-th intermediate data is input to the N-th operation device and the final image data is output, where N is a positive integer greater than or equal to 1.

In one implementation, the configuration instruction includes constants, which are obtained by decomposing the operation formula.

In one implementation, the apparatus further includes:

a down-sampling module, configured to down-sample the final image data.

In one implementation, the original image data, the first through (N-1)-th intermediate data, and the final image data are all in a 16-bit floating-point data format.
In a third aspect, an electronic device is provided, including:

at least one processor; and a memory communicatively connected to the at least one processor;

where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform any of the methods above.

In a fourth aspect, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to perform any of the methods above.

One embodiment of the above application has the following advantages or beneficial effects. Since any complex operation formula can be decomposed into multiple operators, corresponding operation devices are configured for those operators, and the devices process the original image data in a serial execution manner to output final image data, the scheme can support various types of complex operations in various neural networks; the operations are programmable, and operation efficiency is improved. Moreover, because the operation devices corresponding to the operators are selected from a device set, the devices are configurable and reusable across various complex operations: no dedicated hardware needs to be designed for each complex operation, which effectively saves chip area and reduces chip cost. Because this implementation does not directly use general-purpose hardware accelerators such as a CPU, DSP, or GPU to perform the operations of the neural network model, but instead uses the multi-operator operation apparatus provided by the present application, communication with such general-purpose accelerators is avoided and operation timeliness is improved.

Other effects of the above optional manners will be described below in combination with specific embodiments.
Brief Description of the Drawings

The drawings are provided for a better understanding of the solution and do not limit the present application. Among them:

FIG. 1 is a schematic diagram of a multi-operator operation method for a neural network model according to an embodiment of the present application;

FIG. 2 is a scenario diagram of a multi-operator operation apparatus for a neural network model according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a multi-operator operation method for a neural network model according to another embodiment of the present application;

FIG. 4 is a structural diagram of an internal local buffer according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a method for reading the pixels contained in a tensor corresponding to an original image according to another embodiment of the present application;

FIG. 6 is a scenario diagram of a multi-operator operation method according to an embodiment of the present application;

FIG. 7 is a scenario diagram of a multi-operator operation apparatus for a neural network model according to another embodiment of the present application;

FIG. 8 is a schematic diagram of a multi-operator operation apparatus for a neural network model according to an embodiment of the present application;

FIG. 9 is a schematic diagram of a multi-operator operation apparatus for a neural network model according to another embodiment of the present application;

FIG. 10 is a block diagram of an electronic device for implementing a multi-operator operation method of a neural network model according to an embodiment of the present application.
Detailed Description

Exemplary embodiments of the present application are described below with reference to the drawings, including various details of the embodiments to aid understanding; they should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.
As shown in FIG. 1, in one specific implementation, a multi-operator operation method for a neural network model is provided, including the following steps:

Step S110: acquiring a configuration instruction, and determining, according to the configuration instruction, multiple operation devices corresponding to multiple operators and the execution order of the multiple operation devices, the multiple operators being obtained by decomposing an operation formula, and the multiple operation devices being selected from a set of operation devices;

Step S120: reading the pixels contained in the tensor corresponding to an original image to obtain original image data;

Step S130: controlling, according to the execution order, the multiple operation devices to process the original image data in a serial execution manner, and outputting final image data.

In one example, as shown in FIG. 2, the multi-operator operation apparatus of the neural network model may include a data reading module, a multi-operator operation module, and a data write-out module connected in sequence. The multi-operator operation module may be built on a mesh (Meshnet) network.

The convolution accelerator internally includes the multi-operator operation apparatus and a GLB (global local buffer), and the multi-operator operation apparatus is connected to the GLB. A DDR (double data rate synchronous dynamic random-access memory) is arranged outside the convolution accelerator, and the multi-operator operation apparatus may also be connected to the DDR. The data reading module of the apparatus can read data from the GLB and/or the DDR.

Multiple storage areas are arranged in the GLB and/or DDR, and each storage area can store the tensor corresponding to a different original image. A tensor covers the four dimensions of an original image: N (batch), C (channels), H (height), and W (width); a four-dimensional image is usually denoted NCHW. Here N is the number of frames in the batch, H the number of pixels in the vertical direction, W the number of pixels in the horizontal direction, and C the number of channels (for example, C=1 for a black-and-white image and C=3 for an RGB color image). The data reading module can read every pixel of the tensor corresponding to an original image from the GLB and/or DDR; the original image data may include the value of one pixel or the values of multiple pixels.

A configuration module provided by upper-layer software is used to split any complex operation formula into multiple basic operators that the Meshnet network can support. Basic operators may include addition, multiplication, square root, square, sine/cosine, base-e logarithm, and other basic operations. At the same time, at the hardware level, there must be corresponding operation devices to execute each operator. Therefore, in this implementation a set of operation devices is provided, used to implement the operators commonly found in neural-network operations such as activation, pooling, and batch normalization. The set may include adders, multipliers, one-to-two copy devices, sixteen-segment piecewise-linear-fitting devices, two-to-one selectors, comparators, dividers, binary logic devices, unary logic devices, rounding devices, square-root devices, square devices, sine/cosine devices, base-e exponentiation devices, base-e logarithm devices, and so on. The device set can be adapted to actual needs, and all such adaptations fall within the protection scope of this implementation. The input of each operation device can serve as an input of the multi-operator operation module for receiving original image data. The output of each device in the set can be connected to the inputs of the remaining devices, so that the intermediate data output by one device is fed as input to the next device for further operation. The output of each device can also serve as an output of the multi-operator operation module for outputting final image data.

The configuration module looks up, in the device set, the devices corresponding to the multiple operators (a given operation does not necessarily use all devices in the set), and determines their execution order from the mathematical order of the operators. The configuration module then sends a configuration instruction to the multi-operator operation module; the configuration instruction contains the devices corresponding to the operators and their execution order.

The multi-operator operation module obtains the configuration instruction from the configuration module on the one hand, and on the other hand reads the pixels contained in the tensor of the original image from the GLB and/or DDR to obtain original image data. Following the execution order, it inputs the original image data to the first device to obtain first intermediate data, inputs the first intermediate data to the second device to obtain second intermediate data, and so on, until the (N-1)-th intermediate data is input to the N-th device and the final image data is output. A logic control unit in the multi-operator operation module controls this whole process of reading the original image data and driving the devices serially according to the execution order. Finally, the data write-out module writes the final image data into the GLB and/or DDR.

For example, the operation formula implementing the tanh_shrink activation function (one of the activation functions in neural network structures) is:

tanh_shrink(x) = x − tanh(x) = x − (e^x − e^(−x)) / (e^x + e^(−x))    (1)

Formula (1) is split so that the first operator is a base-e exponentiation, the second a one-to-two copy, the third another one-to-two copy, the fourth an addition, the fifth a subtraction, the sixth a division, and the seventh a subtraction. The corresponding devices are determined as: the first device a base-e exponentiation device, the second a one-to-two copy device, the third another one-to-two copy device, the fourth an adder, the fifth a subtractor, the sixth a divider, and the seventh another subtractor. Following the execution order from the first to the seventh device, the original image data is input to the base-e exponentiation device to obtain first intermediate data, the first intermediate data is input to the one-to-two copy device to obtain second intermediate data, and so on, until the final image data is output from the subtractor (the seventh device).

In this implementation, since any complex operation formula can be decomposed into multiple operators, corresponding operation devices are configured for those operators, and the devices process the original image data serially to output final image data, various types of complex operations in various neural networks can be supported; the operations are programmable, and operation efficiency is improved. Meanwhile, because the devices are selected from a device set, they are configurable across various complex operations. For different complex mathematical operations, every device in the set may be reused, so there is no need to design dedicated hardware for each complex operation, which effectively saves chip area and reduces chip cost.
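The serial pipeline and the tanh_shrink decomposition above can be sketched in software. This is a minimal hypothetical model, not the patent's hardware: the exact grouping of stages below (using the identity tanh(x) = (e^(2x) − 1) / (e^(2x) + 1)) is an assumption, whereas the patent's embodiment chains seven devices including one-to-two copy devices to duplicate intermediate streams.

```python
import math

# Hypothetical software model of the serial operator pipeline: each operation
# device consumes the previous device's intermediate data, and the last device
# emits the final data.
def run_pipeline(stages, x):
    data = x
    for stage in stages:   # serial execution: one device after another
        data = stage(data)
    return data

# tanh_shrink(x) = x - tanh(x), with tanh(x) = (e^{2x} - 1) / (e^{2x} + 1),
# decomposed into basic operators from the device set (exp, add, sub, div).
def tanh_shrink(x):
    a = math.exp(2.0 * x)   # base-e exponentiation device
    num = a - 1.0           # subtractor (constant input 1)
    den = a + 1.0           # adder (constant input 1)
    return x - num / den    # divider, then final subtractor

assert abs(tanh_shrink(0.5) - (0.5 - math.tanh(0.5))) < 1e-12
assert run_pipeline([math.exp, lambda t: t + 1.0], 0.0) == 2.0
```

Modeling each stage as a plain function mirrors the key property of the embodiment: any formula reduces to a chain of reusable basic devices rather than a dedicated hardware block.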
In one implementation, as shown in FIG. 3, the configuration instruction includes a preset data length, and step S120 includes:

Step S121: sending a read request to an external memory and/or an internal local buffer;

Step S122: when the read request is granted, reading the pixels contained in the tensor corresponding to the original image to obtain the original image data;

Step S123: stopping reading pixels when the length of the original image data equals the preset data length.

In one example, the data reading module may send one or more read requests to the DDR (external memory) and/or the GLB (internal local buffer). For example, to read the tensors of two frames of original images, it may send one read request to the DDR and another to the GLB; or send two read requests to the DDR; or send two read requests to the GLB. The data reading module can read the tensors of multiple frames of original images; the reading mode and the number of tensors read can be adapted to actual needs, and all such adaptations fall within the protection scope of this implementation. After receiving a read request, the DDR and/or GLB returns a read grant to the data reading module. The data reading module then reads the pixels contained in the tensor of the original image, obtains the original image data, and sends it to the multi-operator operation module.

Specifically, the data reading module includes a map unit and/or a broadcast unit, which implement a map-style read and a broadcast-style read. The GLB is the data-cache SRAM (static random-access memory) of the convolution accelerator; its storage space is large, and the data reading module can fetch data from it directly. As shown in FIG. 4, the GLB may contain eight independent RAMs (random access memories), each 512 entries deep and 128 bits wide, numbered bank0 to bank7. Typically, the map unit needs one input per clock cycle and maps that single input onto the eight independent RAMs. The GLB serves one read request per clock cycle, and the map unit selects one bankA (A being any integer from 0 to 7) from bank0 to bank7 to read the tensor stored in bankA. The broadcast unit needs at least one input per clock cycle and maps two inputs onto the eight independent RAMs. The GLB serves two read requests per clock cycle, and the broadcast unit selects two banks, bankB and bankC (B and C being any integers from 0 to 7 with B not equal to C), to read the tensor of one original image stored in bankB and the tensor of another original image stored in bankC. In addition, the data write-out module sends a write request to the DDR and/or one independent RAM of the GLB per clock cycle; once the write request is granted, the data write-out module writes the final image data into the DDR and/or GLB.

When the DDR and/or GLB grants the read request, the data reading module can read the pixels contained in the tensor of the original image according to the configuration instruction to obtain the original image data. Since the configuration instruction sent by the configuration module to the data reading module includes a preset data length, the data reading module stops reading pixels once the length of the read original image data equals the preset data length. In the concrete reading process, the map unit supports reading the tensor of one original image from the GLB or DDR, treating the single tensor as a one-dimensional vector whose pixels are the points arranged in NCHW order. The map unit reads pixels row by row from the GLB or DDR until the whole four-dimensional image has been read, and sends the original image data to the multi-operator operation module in sequence. For example, if the NCHW of the tensor of an original image is 1*2*30*40, the tensor is treated as a one-dimensional vector. The map unit does not necessarily read all the pixels of the NCHW tensor, but reads according to the preset data length: NCHW of 1*2*30*40 means one row contains 40 pixels, so if the preset data length is 120, only three rows of pixels need to be read.

In this implementation, reading and writing data from the DDR and/or GLB is supported; reading and writing via the GLB can reduce the DDR read/write bandwidth. Moreover, since the configuration instruction includes the preset data length, pixel reading is more flexible.
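As a hedged sketch (names and the flat-buffer layout are illustrative assumptions, not the hardware interface), the preset-data-length read described above can be modeled as: pixels of a flattened NCHW tensor are read in row-major order, and reading stops once the requested number of pixels has been collected.

```python
# Hypothetical model of the preset-data-length read: walk the flattened tensor
# and stop as soon as the preset data length is reached.
def read_pixels(tensor_flat, preset_data_length):
    data = []
    for px in tensor_flat:
        if len(data) == preset_data_length:
            break  # length reached: stop reading pixels (step S123)
        data.append(px)
    return data

# A 1*2*30*40 tensor has rows of 40 pixels; a preset length of 120 reads 3 rows.
tensor = list(range(1 * 2 * 30 * 40))
assert len(read_pixels(tensor, 120)) == 120
assert read_pixels(tensor, 120)[-1] == 119  # last pixel of the third row
```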
In one implementation, as shown in FIG. 5, the configuration instruction includes a preset vector length, and step S122 includes:

S1221: dividing the tensor into multiple vectors according to the preset vector length, each vector including multiple pixels;

S1222: within a vector, reading the pixels in their arrangement order, each pixel being read repeatedly M1 times;

S1223: reading each vector repeatedly M2 times to obtain the original image data, where M1 and M2 are both greater than or equal to 1.

In one example, the broadcast unit supports reading the tensors of multiple frames of original images from the GLB and/or DDR. The configuration module sends the configuration instruction, which includes the preset vector length, to the data reading module. The data reading module divides each tensor into multiple vectors (the number of vectors being greater than or equal to 1) according to the preset vector length, the length of each vector equaling the preset vector length. Within each vector, the first pixel is read M1 times (M1 >= 1), then reading switches to the second pixel, which is read M1 times, and so on until all pixels of the vector have been read. Reading a single vector M2 times (M2 >= 1) is also supported: the first vector is read M2 times, then reading switches to the second vector, which is read M2 times, and so on until all vectors have been read. The original image data obtained from this repeated reading is sent to the multi-operator operation module. Besides the preset vector length, the configuration instruction may also include the pixel repeat count M1, the vector repeat count M2, and other information. Different tensors, vector lengths, pixel repeat counts M1, and vector repeat counts M2 can be configured according to the actual situation, and all fall within the protection scope of this implementation.

For example, the NCHW of the tensor of a first original image is 1*2*30*40, and the NCHW of the tensor of a second original image is 1*3*20*40. Suppose the first tensor is divided into four vectors of 80 pixels each, with pixel repeat count M1=3 and vector repeat count M2=2. Then the first pixel X0 is read three times, giving (X0, X0, X0), the second pixel X1 is read three times, giving (X1, X1, X1), and so on until every pixel in the vector has been read. This process is repeated twice, i.e. the vector is read twice, and reading ends after eight rows of the 1*2*30*40 image have been read. The second tensor is divided into eight vectors of 30 pixels each, with pixel repeat count M1=8 and vector repeat count M2=1 (or M1=4 and M2=2); the first pixel Y0 is read eight times, giving (Y0, Y0, Y0, Y0, Y0, Y0, Y0, Y0), the second pixel Y1 is read eight times, giving (Y1, Y1, Y1, Y1, Y1, Y1, Y1, Y1), and so on until every pixel in the vector has been read. Reading ends after six rows of the 1*3*20*40 image have been read. If the numbers of pixels read from the first and second tensors respectively reach the preset data length, reading stops: preset data length = N1 * length of tensor 1 after looping = N2 * length of tensor 2 after looping (N1, N2 >= 1, integers). Each time the data reading module reads multiple pixels from the GLB, it can send the corresponding original image data to the multi-operator operation module; assuming M1=1, the four pixels read from the GLB are all sent to the multi-operator operation module. Following the execution order, the multiple operation devices are controlled to process the original image data of the multiple pixels in a serial execution manner, and the final image data is output.

In this implementation, since each pixel of a vector is read M1 times and each vector is read M2 times, an up-sampling operation, i.e. nearest-neighbor interpolation, is realized automatically.
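The M1/M2 broadcast read pattern above can be sketched as follows (a minimal software model with assumed names): each pixel in a vector is repeated M1 times, and each whole vector is repeated M2 times, which reproduces nearest-neighbor-style replication along the pixel axis and the vector axis.

```python
# Hypothetical model of the repeated broadcast read: pixel-level repetition (M1)
# followed by vector-level repetition (M2).
def repeat_read(vectors, m1, m2):
    out = []
    for vec in vectors:
        expanded = [px for px in vec for _ in range(m1)]  # each pixel M1 times
        out.extend(expanded * m2)                         # whole vector M2 times
    return out

# Matches the example in the text: X0 is read three times, giving (X0, X0, X0),
# and the whole vector is then read twice (M1=3, M2=2).
vectors = [["X0", "X1"], ["X2", "X3"]]
assert repeat_read(vectors, m1=3, m2=2) == (
    ["X0"] * 3 + ["X1"] * 3) * 2 + (["X2"] * 3 + ["X3"] * 3) * 2
```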
In one implementation, step S130 includes:

within one clock cycle, controlling the multiple operation devices to process the original image data corresponding to multiple pixels in parallel, with serial execution per pixel, and outputting the final image data.

In one example, the data reading module reads multiple pixels at a time from the DDR and/or GLB. Within one clock cycle, the data reading module sends the original image data of the pixels it has read to the multi-operator operation module, so that the module can compute on the data of multiple pixels in parallel at the same time. For example, within one clock cycle, the data reading module may send the original image data of four pixels to the multi-operator operation module. For the map function, in clock cycle 0 the data reading module can send the first four points of the first row, X00 to X03, to the multi-operator operation module; in clock cycle 1 it can send the next four points of the first row, X04 to X07, and so on. For the broadcast function, assuming the pixel repeat count M1 is 3, in clock cycle 0 the data reading module may send (X00, X00, X00, X01) to the multi-operator operation module; in clock cycle 1 it may send (X01, X01, X02, X02), and so on. Accordingly, four instances of each kind of operation device can be provided in the multi-operator operation module; the four identical devices work simultaneously and process the original image data of four pixels in parallel. Of course, more instances of each device can be provided, for example eight adders and eight subtractors, or fewer instances, for example two or three adders and two or three subtractors. Adaptations to actual needs all fall within the protection scope of this implementation.

In this implementation, within one clock cycle the data reading module sends the original image data of multiple pixels to the multi-operator operation module, realizing parallel operation and effectively improving operation efficiency.
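The per-clock parallelism above can be sketched as lanes (a hypothetical model: in hardware the four device copies run simultaneously, whereas the Python loop below only emulates the lanes sequentially). Each lane applies the full serial operator chain to its own pixel.

```python
# Hypothetical model of one clock cycle: up to four identical device lanes each
# run the serial operator chain on one pixel.
def clock_cycle(pixels, stages):
    assert len(pixels) <= 4  # four copies of each device in this embodiment
    outputs = []
    for px in pixels:          # lanes are independent (parallel in hardware)
        data = px
        for stage in stages:   # serial execution inside a lane
            data = stage(data)
        outputs.append(data)
    return outputs

stages = [lambda t: t + 2.0, lambda t: t * t]  # adder device, then square device
assert clock_cycle([0.0, 1.0, 2.0, 3.0], stages) == [4.0, 9.0, 16.0, 25.0]
```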
In one implementation, as shown in FIG. 3, the configuration instruction includes a mapping table between the output of each operation device and the inputs of the remaining operation devices, and step S130 includes:

Step 131: determining the execution order according to the mapping table;

Step 132: according to the execution order, inputting the original image data to the first operation device to obtain first intermediate data, inputting the first intermediate data to the second operation device to obtain second intermediate data, and so on, until the (N-1)-th intermediate data is input to the N-th operation device and the final image data is output, where N is a positive integer greater than or equal to 1.

In one implementation, the configuration instruction includes constants, which are obtained by decomposing the operation formula.

In one example, suppose the operation device set includes 27 operation devices, device 0 to device 26. Each device may include two or three inputs, and one or two outputs. For example, a device may have two tensor inputs, or two constant inputs, or two tensor inputs and one constant input, etc., set according to need. A tensor input is used to input original image data or intermediate image data; a constant input is used to input a constant. An operation formula may contain constants, for example the formula

(formula image: PCTCN2020105217-appb-000002)

in which 3.73 and 5.89 are both decimal values, so the first constant is configured as 3.73 and the second constant as 5.89. The number of constants is therefore related to the operation formula; of course, if the formula contains no constants, no constants need be configured.

As shown in FIG. 6, a multi-operator operation method for a neural network model is provided. The configuration module decomposes the operation formula into multiple operators, determines the corresponding operation devices, and, based on the mathematical relationships among the operators, generates a mapping table between the output of each device and the input of the next adjacent device. In the mapping table, every device input has a corresponding output number. A preferred numbering scheme unifies the default numbering of the tensor inputs (for example two), the constant inputs (for example four), and the outputs of all devices, making it convenient to map every device input against all device outputs, tensor inputs, and constant inputs in tabular form. Since the configuration instruction includes the mapping table, the output number corresponding to each device input is determined from the configuration instruction, and the mapping between each device's output and the inputs of the remaining devices is determined from those numbers. The multi-operator operation module receives the configuration instruction sent by the configuration module and determines the execution order from the output-to-input mappings in the table. Because a one-stage register (reg) is inserted after each operation, additional operation devices can be called to perform computation on the second tensor, in order to keep the input timing of the first tensor and the second tensor consistent.

In one example, different storage areas of the GLB store the tensors corresponding to different original images (a first tensor and a second tensor), and the data reading module reads the pixels contained in the tensors of the two original images from the GLB to obtain two streams of original image data. Suppose the original image data corresponding to the two tensors is sent to the multi-operator operation module. Four operation devices are determined from the configuration instruction, the first through the fourth: the first device is adder 0, the second adder 1, the third a square device, and the fourth a comparator. In the mapping table shown in Table 1, the two tensors and four constants are numbered uniformly with the outputs of the devices: the tensors and constants take numbers 1 to 6, i.e. the first tensor is number 1, the second tensor number 2, the first constant number 3, the second constant number 4, the third constant number 5, and the fourth constant number 6. For the output numbers of the devices, see the mapping table. The execution order obtained from the mapping table is: the tensor input of adder 0 receives the original image data of the first tensor, its constant input receives the second constant, and the output of adder 0 is connected to the tensor input of adder 1; the tensor input of adder 1 receives the first intermediate data of the first tensor, its constant input receives the first constant, and its output is connected to the tensor input of the square device; the tensor input of the square device receives the second intermediate data of the first tensor, and its output is connected to the tensor input of the comparator; the tensor input of the comparator receives the third intermediate data of the first tensor, its constant input receives the third constant, and its output outputs the final image data. It should be noted that any input of any device may be a constant input, a tensor input, or the tensor output of another device. The outputs and inputs of the devices in the set can be opened or closed by the logic control unit, indicating whether they can be used to transmit data. Specifically, the logic control unit controls the working state of each device according to the mapping table; if the output number corresponding to a device's input is 0, the device will not work and can be placed in a closed state.

Table 1 Mapping table

(table images: PCTCN2020105217-appb-000003, PCTCN2020105217-appb-000004)
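The mapping-table mechanism above can be sketched in software. Note that Table 1 itself is an image not reproduced here, so the device output numbers (7 onward) below are illustrative assumptions; only the source numbering 1 to 6 and the adder0 → adder1 → square → comparator chain come from the surrounding text.

```python
# Hypothetical rendering of the Table 1 mapping: tensor/constant inputs and
# device outputs share one numbering space, and each device records which
# source number feeds each input (0 = unused, device can be closed).

# Sources 1-6: tensor1, tensor2, const1..const4; device outputs numbered 7+.
MAPPING = {
    "adder0":  {"tensor_in": 1, "const_in": 4, "out": 7},   # tensor1 + const2
    "adder1":  {"tensor_in": 7, "const_in": 3, "out": 8},   # + const1
    "square":  {"tensor_in": 8, "const_in": 0, "out": 9},
    "compare": {"tensor_in": 9, "const_in": 5, "out": 10},  # compare with const3
}

def execution_order(mapping):
    """Order devices so each runs only after the device producing its input."""
    produced = {1, 2, 3, 4, 5, 6}  # tensors and constants are available first
    order, pending = [], dict(mapping)
    while pending:
        for name, cfg in list(pending.items()):
            needs = {v for k, v in cfg.items() if k != "out" and v != 0}
            if needs <= produced:  # all inputs already produced
                order.append(name)
                produced.add(cfg["out"])
                del pending[name]
    return order

assert execution_order(MAPPING) == ["adder0", "adder1", "square", "compare"]
```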
In one implementation, as shown in FIG. 3, the method further includes:

Step S140: down-sampling the final image data.

In one example, as shown in FIG. 7, a down-sampling (Reduce) module down-samples the final image data. Its input data can come from the GLB, the DDR, or the multi-operator operation module, and the down-sampling operation can act on any one of the N, C, H, or W dimensions: for example, taking the maximum, minimum, sum, difference, or product over any one of those dimensions. The input of down-sampling can be data read back from the GLB or DDR, or the final image data output by the multi-operator operation module. When the multi-operator operation module is not working, the original image data corresponding to the tensor can be down-sampled directly, i.e. the pixels of any one dimension of the tensor are down-sampled.
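A hedged sketch of the Reduce operation follows: applying a binary reduction (max, min, sum, product, and so on) along any single N/C/H/W axis of a flat NCHW pixel buffer. The function name and the flat row-major layout are illustrative assumptions, not the module's actual interface.

```python
import itertools
from functools import reduce

# Hypothetical model of the Reduce module: reduce a flat NCHW buffer along one
# chosen axis with a binary operator.
def reduce_axis(data, axis, op, shape):
    n, c, h, w = shape
    strides = (c * h * w, h * w, w, 1)  # row-major NCHW strides
    out_shape = [s for i, s in enumerate(shape) if i != axis]
    out = []
    for pos in itertools.product(*(range(s) for s in out_shape)):
        idxs = list(pos)
        vals = []
        for k in range(shape[axis]):          # walk the reduced axis
            full = idxs[:axis] + [k] + idxs[axis:]
            flat = sum(f * s for f, s in zip(full, strides))
            vals.append(data[flat])
        out.append(reduce(op, vals))
    return out

# maximum over the W dimension of a 1*1*2*3 image
data = [1, 5, 2,
        7, 0, 3]
assert reduce_axis(data, axis=3, op=max, shape=(1, 1, 2, 3)) == [5, 7]
```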
In one implementation, the format of the original image data, the format of the first through (N-1)-th intermediate data, and the final image data are all 16-bit floating-point data formats.

In one example, the operations involved in the multi-operator operation module and the down-sampling module are all floating-point operations in BF16 format, which can effectively improve operation precision. The BF16 (16-bit floating-point, bfloat) floating-point operations can also be replaced with fixed-point operations in INT8 format, which saves hardware area in the multi-operator operation module or the down-sampling module.
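As background to the BF16 choice, bfloat16 keeps the 8 exponent bits of an IEEE 754 float32 and truncates the mantissa to 7 bits, so a value can be converted (as a sketch; hardware may round rather than truncate) by dropping the low 16 bits of the float32 bit pattern:

```python
import struct

# Truncating conversion between float32 and bfloat16 bit patterns.
def float_to_bf16_bits(x):
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16  # keep sign + 8 exponent bits + top 7 mantissa bits

def bf16_bits_to_float(b):
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

v = bf16_bits_to_float(float_to_bf16_bits(3.14159))
assert abs(v - 3.14159) < 0.03  # BF16 keeps roughly 2-3 decimal digits
```

Because BF16 shares float32's exponent range, it avoids the overflow issues of INT8 fixed point, at the cost of mantissa precision; this is the precision/area trade-off the paragraph above describes.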
As shown in FIG. 8, in another specific implementation, a multi-operator operation apparatus for a neural network model is provided, including:

a configuration instruction acquisition module 110, configured to acquire a configuration instruction and determine, according to the configuration instruction, multiple operation devices corresponding to multiple operators and the execution order of the multiple operation devices, the multiple operators being obtained by decomposing an operation formula, and the multiple operation devices being selected from a set of operation devices;

a data reading module 120, configured to read the pixels contained in the tensor corresponding to an original image to obtain original image data; and

a multi-operator operation module 130, configured to control, according to the execution order, the multiple operation devices to process the original image data in a serial execution manner and output final image data. In this implementation, general-purpose hardware accelerators such as a CPU, DSP, or GPU are not used directly to perform the various operations of the neural network model; instead, the multi-operator operation apparatus provided by the present application is used, which avoids communication with such general-purpose accelerators and improves operation timeliness.

In one implementation, as shown in FIG. 9, the configuration instruction includes a preset data length, and the data reading module 120 includes:

a read-request sending submodule 121, configured to send a read request to an external memory and/or an internal local buffer;

a data reading submodule 122, configured to, when the read request is granted, read the pixels contained in the tensor corresponding to the original image to obtain the original image data; and

a data-reading stop submodule 123, configured to stop reading pixels when the length of the original image data equals the preset data length.

In one implementation, as shown in FIG. 10, the configuration instruction includes a preset vector length, and the data reading submodule 122 includes:

a vector dividing unit 1221, configured to divide the tensor into multiple vectors according to the preset vector length, each vector including multiple pixels;

a first reading unit 1222, configured to read the pixels of a vector in their arrangement order, each pixel being read repeatedly M1 times; and

a second reading unit 1223, configured to read each vector repeatedly M2 times to obtain the original image data, where M1 and M2 are both greater than or equal to 1.

In one implementation, the multi-operator operation module is configured to control, within one clock cycle, the multiple operation devices to process the original image data corresponding to multiple pixels in parallel, with serial execution per pixel, and to output the final image data.

In one implementation, as shown in FIG. 9, the configuration instruction includes a mapping table between the output of each operation device and the inputs of the remaining operation devices, and the multi-operator operation module 130 includes:

an execution-order determination submodule 131, configured to determine the execution order according to the mapping table; and

a multi-operator operation submodule 132, configured to, according to the execution order, input the original image data to the first operation device to obtain first intermediate data, input the first intermediate data to the second operation device to obtain second intermediate data, and so on, until the (N-1)-th intermediate data is input to the N-th operation device and the final image data is output, where N is a positive integer greater than or equal to 1.

In one implementation, as shown in FIG. 9, the apparatus further includes:

a down-sampling module 140, configured to down-sample the final image data.

In one implementation, the original image data, the first through (N-1)-th intermediate data, and the final image data are all in a 16-bit floating-point data format.

For the functions of the modules in the apparatuses of the embodiments of the present application, reference may be made to the corresponding descriptions in the method above, which are not repeated here.
根据本申请的实施例,本申请还提供了一种电子设备和一种可读存储介质。
如图10所示,是根据本申请实施例的一种神经网络模型的多算子运算方法的电子设备的框图。电子设备旨在表示各种形式的数字计算机,诸如,膝上型计算机、台式计算机、工作台、个人数字助理、服务器、刀片式服务器、大型计算机、和其它适合的计算机。电子设备还可以表示各种形式的移动装置,诸如,个人数字处理、蜂窝电话、智能电话、可穿戴设备和其它类似的计算装置。本文所示的部件、它们的连接和关系、以及它们的功能仅仅作为示例,并且不意在限制本文中描述的和/或者要求的本申请的实现。
如图10所示,该电子设备包括:一个或多个处理器1001、存储器1002,以及用于连接各部件的接口,包括高速接口和低速接口。各个部件利用不同的总线互相连接,并且可以被安装在公共主板上或者根据需要以其它方式安装。处理器可以对在电子设备内执行的指令进行处理,包括存储在存储器中或者存储器上以在外部输入/输出装置(诸如,耦合至接口的显示设备)上显示图形用户界面(Graphical User Interface,GUI)的图形信息的指令。在其它实施方式中,若需要,可以将多个处理器和/或多条总线与多个存储器和多个存储器一起使用。同样,可以连接多个电子设备,各个设备提供部分必要的操作(例如,作为服务器阵列、一组刀片式服务器、或者多处理器***)。 图10中以一个处理器1001为例。
The memory 1002 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the multi-operator operation method of a neural network model provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the multi-operator operation method of a neural network model provided by the present application.
As a non-transitory computer-readable storage medium, the memory 1002 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the multi-operator operation method of a neural network model in the embodiments of the present application (for example, the configuration instruction acquiring module 110, the data reading module 120, and the multi-operator operation module 130 shown in FIG. 8). By running the non-transitory software programs, instructions, and modules stored in the memory 1002, the processor 1001 executes the various functional applications and data processing of the server, that is, implements the multi-operator operation method of a neural network model in the above method embodiments.
The memory 1002 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the electronic device for the multi-operator operation method of a neural network model, and the like. In addition, the memory 1002 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 1002 may optionally include memories remotely located relative to the processor 1001, and these remote memories may be connected to the above electronic device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The above electronic device may further include: an input apparatus 1003 and an output apparatus 1004. The processor 1001, the memory 1002, the input apparatus 1003, and the output apparatus 1004 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 10.
The input apparatus 1003 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the above electronic device; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output apparatus 1004 may include a display device, an auxiliary lighting apparatus (for example, an LED), a tactile feedback apparatus (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (Liquid Crystal Display, LCD), a light-emitting diode (Light Emitting Diode, LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
Various implementations of the systems and techniques described herein may be realized in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (Application Specific Integrated Circuits, ASIC), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmit data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, a magnetic disk, an optical disk, a memory, or a programmable logic device (programmable logic device, PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display apparatus for displaying information to the user (for example, a CRT (Cathode Ray Tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, voice input, or tactile input.
The systems and techniques described herein may be implemented in a computing system including a back-end component (for example, as a data server), or a computing system including a middleware component (for example, an application server), or a computing system including a front-end component (for example, a user computer with a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system including any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include: local area networks (Local Area Network, LAN), wide area networks (Wide Area Network, WAN), and the Internet.
A computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs running on the respective computers and having a client-server relationship with each other.
It should be understood that steps may be reordered, added, or deleted using the various flows shown above. For example, the steps recorded in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (18)

  1. A multi-operator operation method for a neural network model, comprising:
    acquiring a configuration instruction, and determining, according to the configuration instruction, a plurality of operation devices corresponding to a plurality of operators and an execution order of the plurality of operation devices, the plurality of operators being obtained by decomposing an operation formula, and the plurality of operation devices being selected from a set of operation devices;
    reading pixels contained in a tensor corresponding to an original image to obtain original image data;
    controlling, in accordance with the execution order, the plurality of operation devices to process the original image data in a serial manner, and outputting final image data.
  2. The method according to claim 1, wherein the configuration instruction includes a preset data length, and the reading pixels contained in a tensor corresponding to an original image to obtain original image data comprises:
    sending a read request to an external memory and/or an internal local buffer;
    when the read request is granted, reading the pixels contained in the tensor corresponding to the original image to obtain the original image data;
    stopping reading the pixels when the length of the original image data equals the preset data length.
  3. The method according to claim 2, wherein the configuration instruction includes a preset vector length, and the reading pixels contained in the tensor of the original image to obtain the original image data comprises:
    splitting the tensor into a plurality of vectors according to the preset vector length, each vector comprising a plurality of pixels;
    reading, within each vector, the pixels in their arrangement order, each pixel being read repeatedly M1 times when it is reached;
    reading each vector repeatedly M2 times to obtain the original image data, wherein M1 and M2 are each greater than or equal to 1.
  4. The method according to any one of claims 1-3, wherein the controlling, in accordance with the execution order, the plurality of operation devices to process the original image data in a serial manner and outputting final image data comprises:
    controlling, within one clock cycle, the plurality of operation devices to process, in a serial manner, the original image data corresponding to a plurality of pixels in parallel, and outputting the final image data.
  5. The method according to claim 1, wherein the configuration instruction includes a mapping table between the output of each operation device and the inputs of the remaining operation devices, and the controlling, in accordance with the execution order, the plurality of operation devices to process the original image data in a serial manner and outputting final image data comprises:
    determining the execution order according to the mapping table;
    according to the execution order, feeding the original image data into a first operation device for operation to obtain first intermediate data, feeding the first intermediate data into a second operation device for operation to obtain second intermediate data, and so on, until (N-1)-th intermediate data is fed into an N-th operation device for operation, and outputting the final image data, wherein N is a positive integer greater than or equal to 1.
  6. The method according to claim 1, wherein the configuration instruction includes constants, and the constants are obtained by decomposing the operation formula.
  7. The method according to claim 1, further comprising:
    downsampling the final image data.
  8. The method according to claim 5, wherein the original image data, the first through (N-1)-th intermediate data, and the final image data are all in a 16-bit floating-point data format.
  9. A multi-operator operation apparatus for a neural network model, comprising:
    a configuration instruction acquiring module, configured to acquire a configuration instruction and determine, according to the configuration instruction, a plurality of operation devices corresponding to a plurality of operators and an execution order of the plurality of operation devices, the plurality of operators being obtained by decomposing an operation formula, and the plurality of operation devices being selected from a set of operation devices;
    a data reading module, configured to read pixels contained in a tensor corresponding to an original image to obtain original image data;
    a multi-operator operation module, configured to control, in accordance with the execution order, the plurality of operation devices to process the original image data in a serial manner, and to output final image data.
  10. The apparatus according to claim 9, wherein the configuration instruction includes a preset data length, and the data reading module comprises:
    a read-request sending sub-module, configured to send a read request to an external memory and/or an internal local buffer;
    a data reading sub-module, configured to, when the read request is granted, read the pixels contained in the tensor corresponding to the original image to obtain the original image data;
    a data-reading stop sub-module, configured to stop reading the pixels when the length of the original image data equals the preset data length.
  11. The apparatus according to claim 10, wherein the configuration instruction includes a preset vector length, and the data reading sub-module comprises:
    a vector dividing unit, configured to split the tensor into a plurality of vectors according to the preset vector length, each vector comprising a plurality of pixels;
    a first reading unit, configured to read the pixels within a vector in their arrangement order, each pixel being read repeatedly M1 times when it is reached;
    a second reading unit, configured to read each vector repeatedly M2 times to obtain the original image data, wherein M1 and M2 are each greater than or equal to 1.
  12. The apparatus according to any one of claims 9-11, wherein the multi-operator operation module is configured to control, within one clock cycle, the plurality of operation devices to process, in a serial manner, the original image data corresponding to a plurality of pixels in parallel, and to output the final image data.
  13. The apparatus according to claim 9, wherein the configuration instruction includes a mapping table between the output of each operation device and the inputs of the remaining operation devices, and the multi-operator operation module comprises:
    an execution-order determining sub-module, configured to determine the execution order according to the mapping table;
    a multi-operator operation sub-module, configured to, according to the execution order, feed the original image data into a first operation device for operation to obtain first intermediate data, feed the first intermediate data into a second operation device for operation to obtain second intermediate data, and so on, until (N-1)-th intermediate data is fed into an N-th operation device for operation, which outputs the final image data, wherein N is a positive integer greater than or equal to 1.
  14. The apparatus according to claim 9, wherein the configuration instruction includes constants, and the constants are obtained by decomposing the operation formula.
  15. The apparatus according to claim 9, further comprising:
    a downsampling module, configured to downsample the final image data.
  16. The apparatus according to claim 13, wherein the original image data, the first through (N-1)-th intermediate data, and the final image data are all in a 16-bit floating-point data format.
  17. An electronic device, comprising:
    at least one processor; and a memory communicatively connected to the at least one processor;
    wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor is able to perform the method according to any one of claims 1-8.
  18. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause the computer to perform the method according to any one of claims 1-8.
PCT/CN2020/105217 2020-07-28 2020-07-28 Multi-operator operation method and apparatus for neural network model WO2022021073A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/105217 WO2022021073A1 (zh) 2020-07-28 2020-07-28 Multi-operator operation method and apparatus for neural network model
CN202080102306.1A CN116134446A (zh) 2020-07-28 2020-07-28 Multi-operator operation method and apparatus for neural network model


Publications (1)

Publication Number Publication Date
WO2022021073A1 true WO2022021073A1 (zh) 2022-02-03

Family

ID=80037244

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105217 WO2022021073A1 (zh) 2020-07-28 2020-07-28 Multi-operator operation method and apparatus for neural network model

Country Status (2)

Country Link
CN (1) CN116134446A (zh)
WO (1) WO2022021073A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190354400A1 (en) * 2017-05-10 2019-11-21 Atlantic Technical Organization, Llc System and method of schedule validation and optimization of machine learning flows for cloud computing
CN110503195A (zh) * 2019-08-14 2019-11-26 北京中科寒武纪科技有限公司 利用人工智能处理器执行任务的方法及其相关产品
CN110503199A (zh) * 2019-08-14 2019-11-26 北京中科寒武纪科技有限公司 运算节点的拆分方法和装置、电子设备和存储介质
CN110826708A (zh) * 2019-09-24 2020-02-21 上海寒武纪信息科技有限公司 一种用多核处理器实现神经网络模型拆分方法及相关产品
CN111126558A (zh) * 2018-10-31 2020-05-08 北京嘉楠捷思信息技术有限公司 一种卷积神经网络计算加速方法及装置、设备、介质
CN111242321A (zh) * 2019-04-18 2020-06-05 中科寒武纪科技股份有限公司 一种数据处理方法及相关产品

Also Published As

Publication number Publication date
CN116134446A (zh) 2023-05-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
  Ref document number: 20946659
  Country of ref document: EP
  Kind code of ref document: A1
NENP Non-entry into the national phase
  Ref country code: DE
122 Ep: pct application non-entry in european phase
  Ref document number: 20946659
  Country of ref document: EP
  Kind code of ref document: A1