WO2017185256A1 - Apparatus and method for executing an RMSprop gradient descent algorithm - Google Patents

Apparatus and method for executing an RMSprop gradient descent algorithm

Info

Publication number
WO2017185256A1
WO2017185256A1 · PCT/CN2016/080354 · CN2016080354W
Authority
WO
WIPO (PCT)
Prior art keywords
vector
instruction
module
unit
updated
Prior art date
Application number
PCT/CN2016/080354
Other languages
English (en)
Chinese (zh)
Inventor
刘少礼
郭崎
陈天石
陈云霁
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/080354 priority Critical patent/WO2017185256A1/fr
Publication of WO2017185256A1 publication Critical patent/WO2017185256A1/fr

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures

Definitions

  • The present invention relates to the application of the RMSprop algorithm, and in particular to an apparatus and method for performing the RMSprop gradient descent algorithm, i.e., a hardware implementation of the RMSprop gradient descent optimization algorithm.
  • Gradient descent optimization algorithms are widely used in function approximation, optimization, pattern recognition, and image processing.
  • The RMSprop algorithm is one such gradient descent optimization algorithm. Because it is easy to implement, computationally light, requires little storage, and handles mini-batch data sets well, it is widely used, and a dedicated device implementing the RMSprop algorithm can significantly increase execution speed.
  • A known method of performing the RMSprop gradient descent algorithm is to use a general-purpose processor.
  • That method supports the algorithm by executing general-purpose instructions with a general-purpose register file and general-purpose functional units.
  • One disadvantage of this method is that a single general-purpose processor offers low arithmetic performance, and when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck.
  • In addition, the general-purpose processor must decode the operations of the RMSprop algorithm into a long sequence of arithmetic and memory-access instructions, and the front-end decoding incurs a large power overhead.
  • Another known method of performing the RMSprop gradient descent algorithm is to use a graphics processing unit (GPU).
  • That method supports the algorithm by executing general single-instruction-multiple-data (SIMD) instructions with a general-purpose register file and general-purpose stream processing units.
  • Since the GPU is a device dedicated to graphics rendering and scientific computation, without special support for the RMSprop gradient descent algorithm it still requires a large amount of front-end decoding to perform the relevant operations, which introduces considerable extra overhead.
  • In addition, the GPU has only a small on-chip cache, so the intermediate data required by the RMSprop gradient descent algorithm, such as the mean-square vector, must be repeatedly transferred from off-chip; off-chip bandwidth thus becomes the main performance bottleneck and brings a large power cost.
  • The main object of the present invention is to provide an apparatus and method for performing the RMSprop gradient descent algorithm that solve the problems of insufficient general-purpose processor performance and large front-end decoding cost, avoid repeatedly reading data from memory, and reduce memory-access bandwidth.
  • To this end, the present invention provides an apparatus for performing the RMSprop gradient descent algorithm, comprising a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4, and a data processing module 5, wherein:
  • the direct memory access unit 1 is configured to access an external designated space, read and write data for the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;
  • the instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache them;
  • the controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data buffer unit 4, or the data processing module 5;
  • the data buffer unit 4 is configured to cache the mean-square vector during initialization and during each data update;
  • the data processing module 5 is configured to update the mean-square vector and the parameters to be updated, write the updated mean-square vector into the data buffer unit 4, and write the updated parameters to the external designated space through the direct memory access unit 1.
  • The direct memory access unit 1 writes instructions from the external designated space into the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly to the external designated space.
  • The controller unit 3 decodes each instruction it reads into micro-instructions that control the behavior of the direct memory access unit 1, the data buffer unit 4, or the data processing module 5: it controls the direct memory access unit 1 to read data from, and write data to, externally designated addresses; controls the data buffer unit 4 to obtain through the direct memory access unit 1 the data required for the operation from the externally designated address; controls the data processing module 5 to perform the update of the parameters to be updated; and controls the data transfer between the data buffer unit 4 and the data processing module 5.
  • The data buffer unit 4 initializes the mean-square vector RMS_0 during initialization. In each data update, RMS_{t-1} is read out into the data processing module 5, updated there to RMS_t, and written back to the data buffer unit 4. Throughout the operation of the device, a copy of the mean-square vector RMS_t is always kept inside the data buffer unit 4.
  • The data processing module 5 reads the mean-square vector RMS_{t-1} from the data buffer unit 4, and reads the parameter vector θ_{t-1} to be updated, the gradient vector g_t, the global update step size α, and the mean-square update rate δ from the external designated space through the direct memory access unit 1. It updates RMS_{t-1} to RMS_t, uses RMS_t to update the parameter vector θ_{t-1} to θ_t, writes RMS_t back to the data buffer unit 4, and writes θ_t back to the external designated space through the direct memory access unit 1.
  • The data processing module 5 updates the mean-square vector according to the formula RMS_t = (1−δ)·RMS_{t-1} + δ·g_t², where the square of the gradient vector is taken element-wise.
  • The data processing module 5 updates the parameter vector θ_{t-1} to θ_t according to the formula θ_t = θ_{t-1} − α·g_t/√(RMS_t), where the division and square root are taken element-wise.
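The two per-iteration updates performed by the data processing module can be sketched in plain Python. This is a minimal sketch: the published formulas are rendered as images in the source, so the exact forms below are inferred from the sub-module operations described later in the text, and no epsilon guard term is mentioned there.

```python
import math

def rmsprop_step(theta, grad, rms_prev, alpha, delta):
    # One RMSprop iteration, element by element.
    # alpha: global update step size; delta: mean-square update rate.
    # Forms inferred from the described sub-module operations
    # ((1 - delta) * RMS_{t-1}, delta-scaled squared gradient, etc.).
    rms = [(1.0 - delta) * r + delta * g * g for r, g in zip(rms_prev, grad)]
    theta_new = [p - alpha * g / math.sqrt(r)
                 for p, g, r in zip(theta, grad, rms)]
    return theta_new, rms
```

Every operation here is element-wise, matching the parallel sub-module design described below.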
  • The data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic operation sub-module 56. The sub-modules 52 to 56 are connected in parallel with one another, and the operation control sub-module 51 is connected in series with each of them.
  • All vector operations are element-wise, and the elements of a vector are operated on in parallel when an operation is performed on it.
  • the present invention also provides a method for performing an RMSprop gradient descent algorithm, the method comprising:
  • First, the mean-square vector RMS_0 is initialized, and the parameter vector θ to be updated and the corresponding gradient vector are obtained from the designated storage unit.
  • In step S1, an instruction-prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space.
  • In step S2, operation begins: the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read from the external address space all instructions related to the RMSprop gradient descent computation and cache them in the instruction cache unit 2;
  • In step S3, the controller unit 3 reads a hyperparameter-read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the global update step size α from the external space.
  • In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, initializes the mean-square vector RMS_{t-1} in the data buffer unit 4 and sets the iteration count t in the data processing unit 5 to 1;
  • In step S5, the controller unit 3 reads a parameter-read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the parameter vector θ_{t-1} to be updated and the corresponding gradient vector g_t from the external designated space and send them to the data processing module 5;
  • In step S6, the controller unit 3 reads a data-transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the mean-square vector RMS_{t-1} from the data buffer unit 4 to the data processing unit 5.
  • Using the mean-square vector RMS_{t-1}, the gradient vector g_t, and the mean-square update rate δ, the mean-square vector is updated according to the formula RMS_t = (1−δ)·RMS_{t-1} + δ·g_t².
  • The implementation specifically includes: the controller unit 3 reads a mean-square-vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the update of RMS_{t-1};
  • the update instruction is sent to the operation control sub-module 51, which issues the corresponding operations: operation instruction 1 (INS_1) is sent to the basic operation sub-module 56, which computes (1−δ); operation instruction 2 (INS_2) is sent to the vector multiplication parallel operation sub-module 53, which computes (1−δ)·RMS_{t-1} and the δ-scaled element-wise square of the gradient, respectively; the two products are then summed in the vector addition parallel operation sub-module 52 to obtain RMS_t.
  • After the mean-square vector RMS_t has been updated, the controller unit 3 reads a data-transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated RMS_t from the data processing unit 5 to the data buffer unit 4.
  • The gradient vector is then divided element-wise by the square root of the mean-square vector and multiplied by the global update step size α to obtain the corresponding gradient descent amount, and the parameter vector θ_{t-1} is updated to θ_t according to the formula θ_t = θ_{t-1} − α·g_t/√(RMS_t).
  • The implementation specifically includes: the controller unit 3 reads a parameter-vector update instruction from the instruction cache unit 2 and performs the update according to the decoded micro-instruction. In the update operation, the parameter-vector update instruction is sent to the operation control sub-module 51, which controls the relevant operation modules as follows: operation instruction 4 (INS_4) is sent to the basic operation sub-module 56, which computes −α and increments the iteration count t by 1; operation instruction 5 (INS_5) is sent to the vector square root parallel operation sub-module 55, which computes √(RMS_t); operation instruction 6 (INS_6) is sent to the vector multiplication parallel operation sub-module 53, which computes −α·g_t; after these two operations are completed, operation instruction 7 (INS_7) is sent to the vector division parallel operation sub-module 54, which computes −α·g_t/√(RMS_t); operation instruction 8 (INS_8) is sent to the vector addition parallel operation sub-module 52, which adds this quotient to θ_{t-1} to obtain θ_t. Here θ_{t-1} denotes the value of the parameter vector before the t-th update, and the t-th iteration updates θ_{t-1} to θ_t. The operation control sub-module 51 then sends operation instruction 9 (INS_9) to the vector division parallel operation sub-module 54, which computes the vector used in the subsequent convergence judgment.
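The sequence of sub-module operations described for the parameter update can be sketched step by step. The instruction numbers in the comments refer to those named in the text; the intermediate variable names are illustrative, not taken from the source.

```python
import math

def update_parameters_stepwise(theta_prev, grad, rms, alpha):
    # Decomposition of the parameter update into the described sub-module
    # operations. Each list comprehension models one element-wise
    # parallel sub-module.
    s = -alpha                                        # INS_4: basic operation sub-module computes -alpha
    r = [math.sqrt(x) for x in rms]                   # INS_5: vector square root sub-module
    p = [s * g for g in grad]                         # INS_6: vector multiplication sub-module
    d = [pi / ri for pi, ri in zip(p, r)]             # INS_7: vector division sub-module
    return [t + di for t, di in zip(theta_prev, d)]   # INS_8: vector addition sub-module
```

The final result equals θ_{t-1} − α·g_t/√(RMS_t), computed through the same intermediate quantities the sub-modules would produce.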
  • The controller unit 3 further reads a parameter write-back instruction (DATABACK_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
  • Repeating the process until the vector to be updated converges includes determining whether it has converged. The specific determination process is as follows: the controller unit 3 reads a convergence-judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
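The convergence judgment can be sketched as follows. Note that the text compares a quantity temp2 against a threshold ct but does not define temp2; here it is assumed to be the Euclidean norm of the parameter change, which is a common convergence measure, not something the source states.

```python
import math

def has_converged(theta_new, theta_old, ct=1e-6):
    # temp2 is assumed (not defined in the text) to be the Euclidean norm
    # of the change in the parameter vector; ct is the convergence threshold.
    temp2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_new, theta_old)))
    return temp2 < ct
```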
  • By adopting a device dedicated to executing the RMSprop gradient descent algorithm, the apparatus and method provided by the present invention solve the problems of insufficient general-purpose processor performance and large front-end decoding cost, and accelerate the execution of related applications.
  • In the apparatus and method provided by the present invention, temporarily storing the intermediate mean-square vector in the data buffer unit avoids repeatedly reading data from memory, reduces the IO between the device and the external address space, lowers the memory-access bandwidth requirement, and removes the off-chip-bandwidth bottleneck.
  • In the apparatus and method provided by the present invention, the degree of parallelism is greatly improved because the data processing module performs vector operations with dedicated parallel operation sub-modules.
  • FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention.
  • FIG. 2 illustrates an example block diagram of a data processing module in an apparatus for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention.
  • FIG. 3 shows a flow chart of a method for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention.
  • An apparatus and method for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention accelerate the application of the algorithm as follows. First, the mean-square vector RMS_0 is initialized, and the parameter vector θ to be updated and the corresponding gradient vector are obtained from the designated storage unit. Then, in each iteration, the previous mean-square vector RMS_{t-1}, the gradient vector g_t, and the mean-square update rate δ are first used to update the mean-square vector, i.e., RMS_t = (1−δ)·RMS_{t-1} + δ·g_t². The gradient vector is then divided by the square root of the mean-square vector and multiplied by the global update step size α to obtain the corresponding gradient descent amount, and the vector to be updated is updated, i.e., θ_t = θ_{t-1} − α·g_t/√(RMS_t). The entire process is repeated until the vector to be updated converges.
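The overall iteration just described can be sketched in Python. This is a sketch under stated assumptions: the update formulas are inferred from the sub-module operations (the published formulas are images), grad_fn and max_iters are illustrative, and a tiny constant guards the division because the text mentions no epsilon term.

```python
import math

def rmsprop_descent(theta0, grad_fn, alpha, delta, ct=1e-6, max_iters=1000):
    # Iterate the mean-square-vector update and the parameter update until
    # the norm of the parameter change falls below the threshold ct.
    theta = list(theta0)
    rms = [0.0] * len(theta)                       # RMS_0 initialised
    for t in range(1, max_iters + 1):
        g = grad_fn(theta)
        rms = [(1.0 - delta) * r + delta * gi * gi for r, gi in zip(rms, g)]
        theta_new = [p - alpha * gi / (math.sqrt(r) + 1e-12)
                     for p, gi, r in zip(theta, g, rms)]
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_new, theta)))
        theta = theta_new
        if diff < ct:                              # convergence test (temp2 < ct)
            return theta, t
    return theta, max_iters
```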
  • the device includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4, and a data processing module 5, all of which can be implemented by hardware circuits.
  • The direct memory access unit 1 is configured to access the external designated space, read and write data for the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data. Specifically, it writes instructions from the external designated space into the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly to the external designated space.
  • The instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache them.
  • The controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data buffer unit 4, or the data processing module 5. Each micro-instruction is sent to the direct memory access unit 1, the data buffer unit 4, or the data processing module 5: the controller controls the direct memory access unit 1 to read data from and write data to externally designated addresses, controls the data buffer unit 4 to obtain through the direct memory access unit 1 the data required for the operation from the externally designated address, controls the data processing module 5 to perform the update of the parameters to be updated, and controls the data transfer between the data buffer unit 4 and the data processing module 5.
  • The data buffer unit 4 is configured to cache the mean-square vector during initialization and each data update. Specifically, it initializes the mean-square vector RMS_0 at initialization; in each data update, RMS_{t-1} is read out into the data processing module 5, updated there to RMS_t, and then written back to the data buffer unit 4.
  • A copy of the mean-square vector RMS_t is always kept inside the data buffer unit 4 throughout the operation of the device.
  • Because the intermediate mean-square vector is stored temporarily in the data buffer unit, data need not be repeatedly read from memory; the IO between the device and the external address space is reduced, and the memory-access bandwidth requirement is lowered.
  • The data processing module 5 is configured to update the mean-square vector and the parameters to be updated, write the updated mean-square vector into the data buffer unit 4, and write the updated parameters to the external designated space through the direct memory access unit 1.
  • Specifically, the data processing module 5 reads the mean-square vector RMS_{t-1} from the data buffer unit 4, and reads the parameter vector θ_{t-1} to be updated, the gradient vector g_t, the global update step size α, and the mean-square update rate δ from the external designated space through the direct memory access unit 1. It updates RMS_{t-1} to RMS_t, uses RMS_t to update the parameter θ_{t-1} to θ_t, i.e., θ_t = θ_{t-1} − α·g_t/√(RMS_t), writes RMS_t back to the data buffer unit 4, and writes θ_t back to the external designated space through the direct memory access unit 1.
  • Since the data processing module performs vector operations with parallel operation sub-modules, the degree of parallelism is greatly improved; the operating frequency can therefore remain low, and the power overhead is small.
  • The data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic operation sub-module 56. The sub-modules 52 to 56 are connected in parallel with one another, and the operation control sub-module 51 is connected in series with each of them.
  • The vector operations are element-wise, and the elements of a vector are operated on in parallel when an operation is performed on it.
  • FIG. 3 shows a flow chart of a method for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention, which includes the following steps:
  • In step S1, an instruction-prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space.
  • In step S2, operation begins: the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read from the external address space all instructions related to the RMSprop gradient descent computation and cache them in the instruction cache unit 2;
  • In step S3, the controller unit 3 reads a hyperparameter-read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the global update step size α from the external space.
  • In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, initializes the mean-square vector RMS_{t-1} in the data buffer unit 4 and sets the iteration count t in the data processing unit 5 to 1;
  • In step S5, the controller unit 3 reads a parameter-read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the parameter vector θ_{t-1} to be updated and the corresponding gradient vector g_t from the external designated space and send them to the data processing module 5;
  • In step S6, the controller unit 3 reads a data-transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the mean-square vector RMS_{t-1} from the data buffer unit 4 to the data processing unit 5.
  • In step S7, the controller unit 3 reads a mean-square-vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the update of RMS_{t-1}.
  • The update instruction is sent to the operation control sub-module 51, which issues the corresponding operations: operation instruction 1 (INS_1) is sent to the basic operation sub-module 56, which computes (1−δ); operation instruction 2 (INS_2) is sent to the vector multiplication parallel operation sub-module 53, which computes (1−δ)·RMS_{t-1} and the δ-scaled element-wise square of the gradient, respectively; the two products are then summed in the vector addition parallel operation sub-module 52 to obtain RMS_t.
  • In step S8, the controller unit 3 reads a data-transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated mean-square vector RMS_t from the data processing unit 5 to the data buffer unit 4.
  • In step S9, the controller unit 3 reads a parameter-vector update instruction from the instruction cache unit 2 and performs the parameter-vector update according to the decoded micro-instruction.
  • The parameter-vector update instruction is sent to the operation control sub-module 51, which controls the relevant operation modules as follows: operation instruction 4 (INS_4) is sent to the basic operation sub-module 56, which computes −α and increments the iteration count t by 1; operation instruction 5 (INS_5) is sent to the vector square root parallel operation sub-module 55, which computes √(RMS_t); operation instruction 6 (INS_6) is sent to the vector multiplication parallel operation sub-module 53, which computes −α·g_t; after these two operations are completed, operation instruction 7 (INS_7) is sent to the vector division parallel operation sub-module 54, which computes −α·g_t/√(RMS_t); operation instruction 8 (INS_8) is sent to the vector addition parallel operation sub-module 52, which adds this quotient to θ_{t-1} to obtain θ_t. Here θ_{t-1} denotes the value of the parameter vector before the t-th update, and the t-th iteration updates θ_{t-1} to θ_t. The operation control sub-module 51 then sends operation instruction 9 (INS_9) to the vector division parallel operation sub-module 54, which computes the vector used in the subsequent convergence judgment.
  • In step S10, the controller unit 3 reads a parameter write-back instruction (DATABACK_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
  • In step S11, the controller unit 3 reads a convergence-judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, the data processing module 5 determines whether the updated parameter vector has converged: if temp2 < ct, it has converged and the operation ends; otherwise, the flow returns to step S5 and execution continues.
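As a rough software model of the data flow in steps S1 to S11, the sketch below keeps the mean-square vector on-chip across iterations while only the parameter and gradient vectors cross the DMA boundary each step, illustrating the bandwidth saving the text claims. The class, attribute, and counter names are illustrative, not taken from the source.

```python
import math

class RMSpropDevice:
    # Toy model: self.rms plays the role of the data buffer unit 4, which
    # holds the mean-square vector for the whole run; self.dma_words counts
    # the words moved over the "DMA" link to the external space.
    def __init__(self, size, alpha, delta):
        self.alpha, self.delta = alpha, delta
        self.rms = [0.0] * size      # RMS_0 initialised in the on-chip buffer
        self.dma_words = 0

    def step(self, theta, grad):
        self.dma_words += len(theta) + len(grad)   # DMA in: theta and gradient
        self.rms = [(1.0 - self.delta) * r + self.delta * g * g
                    for r, g in zip(self.rms, grad)]
        theta = [p - self.alpha * g / math.sqrt(r)
                 for p, g, r in zip(theta, grad, self.rms)]
        self.dma_words += len(theta)               # DMA out: updated theta only
        return theta                               # RMS never leaves the chip
```

Per iteration only 3n words cross the boundary for an n-element parameter vector; the mean-square vector contributes no off-chip traffic at all.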
  • By adopting a device dedicated to executing the RMSprop gradient descent algorithm, the present invention solves the problems of insufficient general-purpose processor performance and large front-end decoding cost, and accelerates the execution of related applications.
  • The use of the data buffer unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth requirement.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention relates to an apparatus and method for executing an RMSprop gradient descent algorithm. The apparatus comprises: a direct memory access unit, an instruction cache unit, a controller unit, a data buffer unit, and a data processing module. The method comprises the following steps: first, reading a gradient vector and the vector of values to be updated, and initializing a mean-square vector; then, in each iteration, first updating the mean-square vector using the gradient vector, then using the mean-square vector to compute the corresponding gradient descent amount for the update, updating the parameter vector to be updated, and repeating the process until the vector to be updated converges. Throughout the process, the mean-square vector is always stored in the data cache unit. With the present invention, the RMSprop gradient descent algorithm can be applied and the efficiency of data processing can be improved considerably.
PCT/CN2016/080354 2016-04-27 2016-04-27 Apparatus and method for executing RMSprop gradient descent algorithm WO2017185256A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080354 WO2017185256A1 (fr) 2016-04-27 2016-04-27 Apparatus and method for executing RMSprop gradient descent algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080354 WO2017185256A1 (fr) 2016-04-27 2016-04-27 Apparatus and method for executing RMSprop gradient descent algorithm

Publications (1)

Publication Number Publication Date
WO2017185256A1 true WO2017185256A1 (fr) 2017-11-02

Family

ID=60161731

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/080354 WO2017185256A1 (fr) Apparatus and method for executing an RMSprop gradient descent algorithm 2016-04-27 2016-04-27

Country Status (1)

Country Link
WO (1) WO2017185256A1 (fr)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016037351A1 (fr) * 2014-09-12 2016-03-17 Microsoft Corporation Computing system for training neural networks
CN105512723A (zh) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network computing device and method for sparse connections


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255270A (zh) * 2021-05-14 2021-08-13 西安交通大学 Jacobi stencil computation acceleration method, ***, medium, and storage device
CN113255270B (zh) * 2021-05-14 2024-04-02 西安交通大学 Jacobi stencil computation acceleration method, ***, medium, and storage device
CN114461579A (zh) * 2021-12-13 2022-05-10 杭州加速科技有限公司 Processing method, ***, and ATE device for parallel reading and dynamic scheduling of Pattern files

Similar Documents

Publication Publication Date Title
WO2017185257A1 (fr) Device and method for executing an Adam gradient descent training algorithm
WO2017124644A1 (fr) Device and method for compression coding of an artificial neural network
US10153011B2 (en) Multiple register memory access instructions, processors, methods, and systems
WO2017124648A1 (fr) Vector computing device
CN111353589B (zh) Apparatus and method for executing forward operations of an artificial neural network
CN111353588B (zh) Apparatus and method for executing backward training of an artificial neural network
WO2017185389A1 (fr) Device and method for executing matrix multiplication operations
JP6340097B2 (ja) Vector move instruction controlled by read mask and write mask
CN109062608B (zh) Vectorized read and write mask update instructions for recursive computations on independent data
CN111260025B (zh) Apparatus and operation method for executing LSTM neural network operations
WO2017124647A1 (fr) Matrix computation apparatus
WO2018120016A1 (fr) Apparatus for executing LSTM neural network operations, and operation method
WO2017185396A1 (fr) Device and method for use in executing matrix addition/subtraction operations
WO2017185411A1 (fr) Apparatus and method for executing an Adagrad gradient descent training algorithm
US20170185888A1 (en) Interconnection Scheme for Reconfigurable Neuromorphic Hardware
KR101817459B1 (ko) Instruction for shifting bits left while pulling ones into the least significant bits
KR102556033B1 (ko) Packed data alignment plus compute instructions, processors, methods, and systems
WO2017185393A1 (fr) Apparatus and method for executing a vector inner-product operation
WO2017185336A1 (fr) Apparatus and method for executing a pooling operation
WO2017185392A1 (fr) Device and method for performing four fundamental vector arithmetic operations
WO2017185404A1 (fr) Apparatus and method for implementing a vector logical operation
WO2017185256A1 (fr) Apparatus and method for executing an RMSprop gradient descent algorithm
CN107315569B (zh) Apparatus and method for executing the RMSprop gradient descent algorithm
WO2017181336A1 (fr) Maxout layer operation apparatus and method
CN107341540B (zh) Apparatus and method for executing a Hessian-Free training algorithm

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899769

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16899769

Country of ref document: EP

Kind code of ref document: A1