WO2024065865A1 - Memory optimization method and device for neural network computing - Google Patents

Memory optimization method and device for neural network computing

Info

Publication number
WO2024065865A1
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
life cycle
memory
node
cycle interval
Prior art date
Application number
PCT/CN2022/124000
Other languages
English (en)
French (fr)
Inventor
王宏升
陈光
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室 filed Critical 之江实验室
Priority to US18/072,969 priority Critical patent/US20240104395A1/en
Publication of WO2024065865A1 publication Critical patent/WO2024065865A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0253Garbage collection, i.e. reclamation of unreferenced memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Definitions

  • the present invention relates to the technical field of computer systems based on a specific computing model, and in particular to a memory optimization method and device for neural network computing.
  • the purpose of the present invention is to provide a memory optimization method and device for neural network computing, so as to reduce the persistent dependence of tensor variables on, and their occupation of, the memory resources of deep learning operating systems, reduce the memory overhead required by tensor variables in the data flow, and lower the hardware memory requirements of large models.
  • a memory optimization method for neural network computing includes the following steps:
  • Step S1: reconstruct the computation graph into a topologically ordered computation graph;
  • Step S2: construct the life cycle intervals of the tensor variables;
  • Step S3: construct a scan line over the life cycle intervals;
  • Step S4: allocate tensor variables to free registers;
  • Step S5: allocate the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers;
  • Step S6: allocate the registers of expired life cycle intervals to tensor variables that exceed the number of available registers;
  • Step S7: add the tensor variables transferred to memory back to the active life cycle intervals and allocate free registers to them.
  • step S1 specifically includes the following sub-steps:
  • Step S11: traverse the computation graph in post-order to obtain a subgraph visit list;
  • Step S12: reverse the post-order subgraph visit list to obtain the topological order of the computation graph;
  • Step S13: reconstruct the computation graph according to the topological order to obtain the topologically ordered computation graph.
  • the post-order is such that, when a node of the computation graph is visited, the successor nodes of that node are recursively visited first.
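  • As a concrete illustration of steps S11-S13, the following is a minimal Python sketch, not the patent's own code; the edge set of the example DAG is a hypothetical one, chosen so that the post-order and topological orders match those reported for Embodiment 1:

```python
# Step S1 sketch: post-order depth-first traversal, then reverse the visit
# list to obtain the topological order of the computation graph.
def topological_order(graph, root):
    visited, post_order = set(), []

    def dfs(node):
        visited.add(node)
        for succ in graph.get(node, ()):   # visit successor nodes first, recursively
            if succ not in visited:
                dfs(succ)
        post_order.append(node)            # appended only after all its successors

    dfs(root)
    return list(reversed(post_order))      # reverse post-order = topological order

# Hypothetical subgraph DAG: its post-order visit list is D, B, E, C, F, A,
# whose reversal A, F, C, E, B, D is a valid topological order.
g = {"A": ["B", "F"], "B": ["D"], "F": ["C"], "C": ["E"], "E": ["D"], "D": []}
assert topological_order(g, "A") == ["A", "F", "C", "E", "B", "D"]
```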
  • step S2 specifically constructs a life cycle interval for the tensor variables contained in each node; the life cycle interval of a tensor variable starts at the position of the first node at which the tensor variable is live and ends at the position of the last node at which it is live.
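  • A minimal sketch of step S2, under the assumption that the topologically ordered graph is given as a list of nodes whose def/use sets of tensor variables are known (the Node fields here are hypothetical, not the patent's data structure):

```python
# Step S2 sketch: the life cycle interval of a tensor variable spans from the
# position of the first node where it is live to the position of the last one.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    defs: set = field(default_factory=set)
    uses: set = field(default_factory=set)

def life_cycle_intervals(nodes):
    intervals = {}                            # variable -> [start_pos, end_pos]
    for pos, node in enumerate(nodes):
        for var in node.defs | node.uses:
            interval = intervals.setdefault(var, [pos, pos])
            interval[1] = pos                 # extend to the last node where var is live
    return intervals

nodes = [Node("V1", defs={"a0"}), Node("V2", uses={"a0"}), Node("V3", uses={"a0"})]
print(life_cycle_intervals(nodes))            # {'a0': [0, 2]}, i.e. from V1 to V3
```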
  • step S3 specifically constructs, at the starting node of the topologically ordered computation graph, a scan line parallel to the life cycle intervals; as the scan line moves from the starting end of the life cycle intervals to their ending end, it is used to observe whether free registers can be allocated to tensor variables during data-flow execution.
  • step S5 specifically: when the execution flow is at a node that has neither free registers nor an expired, already-scanned life cycle interval that can be removed from the active life cycle intervals, the tensor variable held in the register of the tensor variable whose life cycle interval has the farthest end point is transferred to memory, and the released register is then allocated to the tensor variable that exceeds the number of available registers.
  • step S6 specifically: when the execution flow is at a node and the scan line has passed through the life cycle interval of a register-allocated tensor variable, that tensor variable is removed from the active life cycle intervals, its register is recycled into the free register list, and the freed register is allocated to the tensor variable that exceeds the number of available registers.
  • step S7 specifically: when the execution flow is at a node and a free register exists, a tensor variable that was transferred to memory is added back to the active life cycle intervals, and the free register is allocated to its corresponding life cycle interval.
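  • Steps S3-S6 together amount to a linear-scan-style allocation over the intervals. The following is a minimal sketch assuming `intervals` maps each tensor variable to its (start, end) node positions; the reload of spilled variables when a register frees up again (step S7) is omitted for brevity:

```python
# Steps S3-S6 sketch: the "scan line" is the sweep over interval start points in
# increasing order; expired active intervals release their registers (S6), and
# when no register is free the active interval with the farthest end point is
# spilled to memory (S5).
def linear_scan(intervals, num_regs):
    order = sorted(intervals.items(), key=lambda kv: kv[1][0])   # by start point
    free = [f"r{i}" for i in range(1, num_regs + 1)]             # free register list
    active = []                                                  # (end, var) pairs of live intervals
    assignment, spilled = {}, set()

    for var, (start, end) in order:
        # S6: retire intervals the scan line has already passed; recycle registers.
        for item in [a for a in active if a[0] < start]:
            active.remove(item)
            free.append(assignment[item[1]])
        if free:                                  # S4: a free register exists
            assignment[var] = free.pop(0)
        else:                                     # S5: spill the farthest end point
            far_end, far_var = max(active)
            if far_end > end:                     # take over the spilled variable's register
                assignment[var] = assignment[far_var]
                active.remove((far_end, far_var))
                spilled.add(far_var)
            else:
                spilled.add(var)                  # the new interval itself is spilled
                continue
        active.append((end, var))
    return assignment, spilled
```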
  • the present invention also provides a memory optimization device for neural network computing, comprising a memory and one or more processors, wherein the memory stores executable code, and the one or more processors, when executing the executable code, implement the memory optimization method for neural network computing described in any of the above embodiments.
  • the present invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, it implements the memory optimization method for neural network computing described in any of the above embodiments.
  • the present invention provides a mapping relationship between tensor variables generated during the execution of a computational graph and physical registers and memory, and provides an optimization method based on the mapping relationship.
  • Registers can store the storage locations of tensor variables generated during the execution of a computational graph in memory.
  • the traditional method of storing tensor variables is to store their values directly in memory.
  • since the value of a tensor variable can be stored either in memory or in a register, and registers can be accessed directly by the central processing unit at high speed, the register-assisted memory optimization method proposed in the present invention optimizes the memory of the data flow of computation graphs used for neural network computing, reduces the memory overhead required by tensor variables in the data flow, and lowers the hardware memory requirements of large models.
  • the memory optimization method for neural network calculations improves the computational efficiency of the entire computational graph and saves hardware and time costs.
  • FIG. 1 is a schematic flow chart of the memory optimization method for neural network computing of the present invention;
  • FIG. 2 is a schematic diagram of reconstructing the computation graph into a topological structure in Embodiment 1;
  • FIG. 3 is the topologically ordered computation graph of Embodiment 1;
  • FIG. 4 shows the construction of life cycle intervals for the tensor variables contained in the nodes of the topologically ordered computation graph in Embodiment 1;
  • FIG. 5 shows the first two tensor variables contained in the nodes of the topologically ordered computation graph being allocated to two registers in Embodiment 1;
  • FIG. 6 shows a tensor variable in a register being transferred to memory and a new tensor variable being allocated to the freed register in Embodiment 1;
  • FIG. 7 is the computation graph for neural network computing of Embodiment 2;
  • FIG. 8 shows the construction of life cycle intervals for the tensor variables in the data flow in Embodiment 2;
  • FIG. 9 shows the scan line constructed over the tensor-variable life cycle intervals in Embodiment 2;
  • FIG. 10 shows register r_3 being allocated to variable x at node V_1 in Embodiment 2;
  • FIG. 11 shows register r_1 being allocated to variable y at node V_2 in Embodiment 2;
  • FIG. 12 shows register r_2 being allocated to variable z at node V_3 in Embodiment 2;
  • FIG. 13 shows the register r_3 of tensor variable x, whose interval l_x has the farthest end point, being allocated to tensor variable b, which exceeds the number of available registers, in Embodiment 2;
  • FIG. 14 shows the register r_1 of the expired life cycle interval l_y being allocated to tensor variable w, which exceeds the number of available registers, in Embodiment 2;
  • FIG. 15 shows tensor variables of expired life cycle intervals being removed from the active life-cycle-interval list and their registers recycled in Embodiment 2;
  • FIG. 16 shows the tensor variable of an expired life cycle interval being removed from the active life-cycle-interval list and the register recycled in Embodiment 2;
  • FIG. 17 shows the free register r_3 being allocated to the life cycle interval corresponding to l_r3 in Embodiment 2;
  • FIG. 18 is a schematic diagram of the memory optimization device for neural network computing of Embodiment 3.
  • a memory optimization method for neural network computing includes the following steps:
  • Step S1: reconstruct the computation graph into a topologically ordered computation graph;
  • Step S11: traverse the computation graph in post-order to obtain a subgraph visit list;
  • the post-order is such that, when a node of the computation graph is visited, the successor nodes of that node are recursively visited first.
  • Step S12: reverse the post-order subgraph visit list to obtain the topological order of the computation graph;
  • Step S13: reconstruct the computation graph according to the topological order to obtain the topologically ordered computation graph.
  • Step S2: construct the life cycle intervals of the tensor variables;
  • a life cycle interval is constructed for the tensor variables contained in each node; the life cycle interval of a tensor variable starts at the position of the first node at which the tensor variable is live and ends at the position of the last node at which it is live.
  • Step S3: construct a scan line over the life cycle intervals;
  • a scan line parallel to the life cycle intervals is constructed.
  • as the scan line moves from the starting end of the life cycle intervals to their ending end, it is used to observe whether free registers can be allocated to tensor variables during data-flow execution.
  • Step S4: allocate tensor variables to free registers;
  • Step S5: allocate the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers;
  • when the execution flow is at a node that has neither free registers nor an expired, already-scanned life cycle interval that can be removed from the active life cycle intervals, the tensor variable held in the register of the tensor variable whose life cycle interval has the farthest end point is transferred to memory, and the released register is then allocated to the tensor variable that exceeds the number of available registers.
  • Step S6: allocate the registers of expired life cycle intervals to tensor variables that exceed the number of available registers;
  • when the execution flow is at a node and the scan line has passed through the life cycle interval of a register-allocated tensor variable, that tensor variable is removed from the active life cycle intervals, its register is recycled into the free register list, and the freed register is allocated to the tensor variable that exceeds the number of available registers.
  • Step S7: add the tensor variables transferred to memory back to the active life cycle intervals and allocate free registers to them.
  • tf.random_uniform([5, 3]): generates a random tensor with a shape of 5 rows and 3 columns.
  • goto V_i: enter the computation flow of node V_i.
  • if expression goto V_i: evaluate whether the expression is true; if true, execute the computation flow of node V_i, otherwise execute the computation flow of the other branch node.
  • tf.add(x, y): element-wise addition of tensor x and tensor y.
  • tf.ones(a_i.shape): creates a tensor with the same shape as tensor a_i and all elements equal to 1.
  • tf.relu(x): feeds tensor x into a rectified linear unit.
  • tf.matmul(x, y): matrix multiplication of tensor x and tensor y.
  • return b_i: return to and execute the branch containing tensor variable b_i.
  • l_x: the life cycle interval of tensor variable x.
  • tf.subtract(x, y): element-wise subtraction of tensor y from tensor x.
  • r_i: indicates that the free register r_i is allocated to the tensor variable of the corresponding life cycle interval.
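  • For readers reproducing the glossary, the calls map onto present-day TensorFlow 2 names as sketched below; it is an assumption about the intended API that the patent's tf.random_uniform and tf.relu correspond to tf.random.uniform and tf.nn.relu in current releases:

```python
import tensorflow as tf

x = tf.random.uniform([5, 3])       # random tensor with 5 rows and 3 columns
y = tf.ones(x.shape)                # same shape as x, all elements equal to 1
s = tf.add(x, y)                    # element-wise addition of x and y
d = tf.subtract(x, y)               # element-wise subtraction of y from x
a = tf.nn.relu(s)                   # feed the tensor into a rectified linear unit
m = tf.matmul(a, tf.ones([3, 5]))   # matrix multiplication: (5,3) x (3,5) -> (5,5)
```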
  • Embodiment 1:
  • step S1: reconstruct the computation graph into a topologically ordered computation graph; see FIG. 2.
  • Step S11: traverse the computation graph in post-order to obtain the subgraph visit list: D, B, E, C, F, A;
  • the post-order is such that, when a node of the computation graph is visited, the successor nodes of that node are recursively visited first.
  • the post-order traversal ensures that, for any path from node V_A to node V_B in the computation graph, node V_B is visited before node V_A.
  • Step S12: reverse the post-order subgraph visit list to obtain the topological order of the computation graph: A, F, C, E, B, D;
  • the reverse post-order node list is obtained by reversing the list of nodes visited in post-order in the first step.
  • the reverse post-order node list ensures that if the graph contains a path from node V_A to node V_B, then node V_A appears before node V_B in the resulting topological order.
  • the reverse post-order process guarantees that the topologically ordered computation graph visits a node V_c before any other node that V_c points to.
  • Step S13: reconstruct the computation graph according to the topological order to obtain the topologically ordered computation graph; see FIG. 3.
  • step S2: construct the life cycle intervals of the tensor variables; see FIG. 4.
  • a life cycle interval is constructed for the tensor variables contained in each node; the life cycle interval of a tensor variable starts at the position of the first node at which the tensor variable is live and ends at the position of the last node at which it is live.
  • for a tensor variable v contained in a node, the corresponding life cycle interval l_v starts at the position of the first node at which v is live and ends at the position of the last node at which v is live.
  • Step 1: construct the life cycle interval l_a0 of tensor variable a_0; it starts at node V_1 and ends at node V_3.
  • Step 2: construct the life cycle interval l_a1 of tensor variable a_1; it starts at node V_4. Since there is an edge from subgraph E to subgraph D, tensor variable a_1 passes through node V_8 to reach subgraph D, so l_a1 ends at node V_8.
  • Step 3: construct the life cycle interval l_a2 of tensor variable a_2; it starts at node V_5. Since there is an edge from subgraph E to subgraph D, tensor variable a_2 passes through node V_8 to reach subgraph D, so l_a2 ends at node V_8.
  • Step S3: construct a scan line over the life cycle intervals;
  • a scan line parallel to the life cycle intervals is constructed.
  • as the scan line moves from the starting end of the life cycle intervals to their ending end, it is used to observe whether free registers can be allocated to tensor variables during data-flow execution.
  • step S4: allocate tensor variables to free registers; see FIG. 5.
  • allocating the tensor variables contained in the nodes of the topologically ordered computation graph to the two registers r_0 and r_1 comprises the following steps:
  • Step 1: allocate tensor variable a_0 to register r_0.
  • Step 2: allocate tensor variable a_1 to register r_1.
  • Step S5: allocate the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers;
  • when the execution flow is at a node V_i that has neither free registers nor an expired, already-scanned life cycle interval that can be removed from the active life cycle intervals, the tensor variable i held in register r_i, allocated to the tensor variable whose life cycle interval has the farthest end point, is transferred to memory, and the released register r_i is then allocated to the tensor variable j that exceeds the number of available registers.
  • Step S6: allocate the register of the expired life cycle interval l_i to the tensor variable j that exceeds the number of available registers;
  • step S7: add the tensor variables transferred to memory back to the active life cycle intervals and allocate free registers to them; see FIG. 6.
  • whenever the data flow passes through a node that redefines tensor variable i, the tensor variable i in register r_i must be stored to memory; whenever the data flow passes through a node that uses tensor variable i, tensor variable i must be loaded from memory into register r_i.
  • Step 1: since nodes V_1 and V_9 both contain definitions of tensor variable a_0, the tensor variable a_0 in register r_0 must be stored to memory at nodes V_1 and V_9, at the positions marked in FIG. 6.
  • Step 2: since nodes V_2, V_4, V_5, V_9 and V_3 all contain uses of tensor variable a_0, tensor variable a_0 must be loaded from memory into register r_0 at those nodes.
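  • The store/load rule of this step can be summarized in a short sketch, reusing the hypothetical Node type from the earlier sketch; `reg_of` is an assumed mapping from a spilled variable to the register it is reloaded into:

```python
# Step S7 sketch: insert a store after every node that (re)defines a spilled
# tensor variable, and a load before every node that uses it.
def insert_spill_code(nodes, spilled, reg_of):
    program = []
    for node in nodes:
        for var in node.uses & spilled:
            program.append(f"load {var}: memory -> {reg_of[var]}")   # load at each use
        program.append(node.name)
        for var in node.defs & spilled:
            program.append(f"store {var}: {reg_of[var]} -> memory")  # store at each def
    return program

nodes = [Node("V1", defs={"a0"}), Node("V2", uses={"a0"}), Node("V9", defs={"a0"})]
print(insert_spill_code(nodes, {"a0"}, {"a0": "r0"}))
```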
  • Embodiment 2: a memory optimization method for neural network computing, in which three registers are allocated to the tensor variables in the execution flow of the computation graph for neural network computing during memory optimization, as follows (see FIG. 7):
  • Step S1: reconstruct the computation graph into a topologically ordered computation graph, as shown on the left side of FIG. 8.
  • Step S2: construct the life cycle intervals of the tensor variables, as shown on the right side of FIG. 8.
  • Step S3: construct a scan line over the life cycle intervals;
  • at the starting node V_1 of the topologically ordered computation graph, a scan line parallel to the starting line of the life cycle intervals is constructed.
  • the scan line assists in observing the status of free registers and tensor variables.
  • as the scan line moves from the starting end of the life cycle intervals to their ending end, it observes whether free registers can be allocated to tensor variables during data-flow execution; see FIG. 9, where the top horizontal line is the scan line.
  • Step S4: allocate tensor variables to free registers;
  • see FIG. 10: the free register r_3 is allocated to tensor variable x.
  • at the starting position of the scan line, i.e. at node V_1, the free register r_3 is found to be available for tensor variable x.
  • see FIG. 11: register r_1 is allocated to tensor variable y at node V_2.
  • when the scan line reaches node V_2, it has passed through the life cycle interval of register r_1, so that interval can be removed from the active life-cycle-interval list and register r_1 recycled into the free register list.
  • the free register r_1 can then be allocated to tensor variable y.
  • see FIG. 12: register r_2 is allocated to tensor variable z at node V_3.
  • when the scan line reaches node V_3, it has passed through the life cycle interval of register r_2, so that interval can be removed from the active life-cycle-interval list and register r_2 recycled into the free register list.
  • the free register r_2 can then be allocated to tensor variable z.
  • Step S5: allocate the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers;
  • see FIG. 13: when the scan line reaches node V_4, there are neither free registers nor expired, already-scanned life cycle intervals that could be removed from the active life-cycle-interval list, so the tensor variable in register r_3, allocated to tensor variable x whose interval l_x has the farthest end point, is transferred to memory, and the released register r_3 is allocated to tensor variable b, which exceeds the number of available registers; since x is now stored in memory, its life cycle interval is redrawn as a dashed line.
  • see FIG. 14: the register of the expired life cycle interval l_y is allocated to tensor variable w, which exceeds the number of available registers.
  • when the scan line reaches node V_5, it has passed through the life cycle interval l_y of register r_1 allocated to tensor variable y, so tensor variable y can be removed from the active life-cycle-interval list and register r_1 recycled into the free register list.
  • the free register r_1 can then be allocated to tensor variable w, which exceeds the number of available registers.
  • Step S6: allocate the registers of expired life cycle intervals to tensor variables that exceed the number of available registers;
  • see FIG. 15: the registers of expired life cycle intervals are recycled into the free register list.
  • when the scan line reaches the end position of node V_8, it has passed through the life cycle interval l_z of register r_2 allocated to tensor variable z and the life cycle interval l_w of register r_1 allocated to tensor variable w; the tensor variables z and w of the expired intervals l_z and l_w are therefore removed from the active life-cycle-interval list, and registers r_2 and r_1 are recycled into the free register list.
  • see FIG. 16: the registers of expired life cycle intervals are recycled into the free register pool and free registers are allocated to active life cycle intervals.
  • when the scan line reaches node V_9, it has passed through the life cycle interval l_b of register r_3 allocated to tensor variable b; tensor variable b is therefore removed from the active life-cycle-interval list, and register r_3 is recycled into the free register list.
  • Step S7: add the tensor variables transferred to memory back to the active life cycle intervals and allocate free registers to them; see FIG. 17: when the scan line reaches node V_10, the free register r_2 is found, the variable x previously transferred to memory is added back to the active life-cycle-interval list, and r_2 is allocated to the life cycle interval corresponding to l_x.
  • the present invention also provides an embodiment 3 of a memory optimization device for neural network computing.
  • a memory optimization device for neural network computing provided in Example 3 of the present invention includes a memory and one or more processors, wherein the memory stores executable code, and when the one or more processors execute the executable code, they are used to implement a memory optimization method for neural network computing in the above-mentioned embodiment.
  • Embodiment 3 of a memory optimization device for neural network computing of the present invention can be applied to any device with data processing capability, and the device with data processing capability can be a device or apparatus such as a computer.
  • Device Embodiment 3 may be implemented by software, or by hardware or a combination of software and hardware. Taking software implementation as an example, as a device in the logical sense it is formed by the processor of the device with data processing capability on which it is located reading the corresponding computer program instructions from non-volatile memory into memory for execution. At the hardware level, FIG. 18 shows a hardware structure diagram of a device with data processing capability on which the memory optimization device for neural network computing of the present invention is located.
  • in addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 18, the device with data processing capability on which the device of Embodiment 3 is located may also include other hardware according to its actual functions, which will not be described in detail here.
  • for related details, reference may be made to the corresponding description of the method embodiment.
  • the device embodiment 3 described above is merely illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention, and a person of ordinary skill in the art can understand and implement it without creative effort.
  • An embodiment of the present invention further provides a computer-readable storage medium having a program stored thereon.
  • the program is executed by a processor, a memory optimization method for neural network calculation in the above embodiment is implemented.
  • the computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the aforementioned embodiments, such as a hard disk or a memory.
  • the computer-readable storage medium may also be an external storage device of any device with data processing capability, such as a plug-in hard disk, a smart media card (SMC), an SD card, a flash card, etc. equipped on the device.
  • the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capability.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capability, and may also be used to temporarily store data that has been output or is to be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention discloses a memory optimization method and device for neural network computing, comprising the following steps: step S1: reconstructing the computation graph into a topologically ordered computation graph; step S2: constructing life cycle intervals for the tensor variables; step S3: constructing a scan line over the life cycle intervals; step S4: allocating tensor variables to free registers; step S5: allocating the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers; step S6: allocating the registers of expired life cycle intervals to tensor variables that exceed the number of available registers; step S7: adding the tensor variables transferred to memory back to the active life cycle intervals and allocating free registers to them. The present invention optimizes the memory used by the data flow of computation graphs for neural network computing, reduces the memory overhead required by tensor variables in the data flow, and lowers the hardware memory requirements of large models.

Description

Memory optimization method and device for neural network computing
This application claims priority to Chinese patent application No. 202211177786.5, entitled "Memory optimization method and device for neural network computing", filed with the China National Intellectual Property Administration on September 27, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of computer systems based on specific computational models, and in particular to a memory optimization method and device for neural network computing.
Background Art
As the demand of industry for large-scale neural network applications in complex scenarios becomes increasingly urgent, the memory footprint of large models keeps growing, and the memory resources of artificial-intelligence hardware operating systems can no longer satisfy the memory requirements of large-model training; optimizing memory techniques for neural network computing has therefore become extremely important.
To this end, we propose a memory optimization method and device for neural network computing.
Summary of the Invention
The purpose of the present invention is to provide a memory optimization method and device for neural network computing, so as to reduce the persistent dependence of tensor variables on, and their occupation of, the memory resources of deep learning operating systems, reduce the memory overhead required by tensor variables in the data flow, and lower the hardware memory requirements of large models.
The technical solution adopted by the present invention is as follows:
A memory optimization method for neural network computing comprises the following steps:
Step S1: reconstruct the computation graph into a topologically ordered computation graph;
Step S2: construct the life cycle intervals of the tensor variables;
Step S3: construct a scan line over the life cycle intervals;
Step S4: allocate tensor variables to free registers;
Step S5: allocate the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers;
Step S6: allocate the registers of expired life cycle intervals to tensor variables that exceed the number of available registers;
Step S7: add the tensor variables transferred to memory back to the active life cycle intervals and allocate free registers to them.
Further, step S1 specifically comprises the following sub-steps:
Step S11: traverse the computation graph in post-order to obtain a subgraph visit list;
Step S12: reverse the post-order subgraph visit list to obtain the topological order of the computation graph;
Step S13: reconstruct the computation graph according to the topological order to obtain the topologically ordered computation graph.
Further, the post-order is such that, when a node of the computation graph is visited, the successor nodes of that node are recursively visited first.
Further, step S2 specifically constructs a life cycle interval for the tensor variables contained in each node; the life cycle interval of a tensor variable starts at the position of the first node at which the tensor variable is live and ends at the position of the last node at which it is live.
Further, step S3 specifically constructs, at the starting node of the topologically ordered computation graph, a scan line parallel to the life cycle intervals; as the scan line moves from the starting end of the life cycle intervals to their ending end, it is used to observe whether free registers can be allocated to tensor variables during data-flow execution.
Further, step S5 specifically: when the execution flow is at a node that has neither free registers nor an expired, already-scanned life cycle interval that can be removed from the active life cycle intervals, the tensor variable held in the register of the tensor variable whose life cycle interval has the farthest end point is transferred to memory, and the released register is then allocated to the tensor variable that exceeds the number of available registers.
Further, step S6 specifically: when the execution flow is at a node and the scan line has passed through the life cycle interval of a register-allocated tensor variable, that tensor variable is removed from the active life cycle intervals, its register is recycled into the free register list, and the freed register is allocated to the tensor variable that exceeds the number of available registers.
Further, step S7 specifically: when the execution flow is at a node and a free register exists, the tensor variable transferred to memory is added back to the active life cycle intervals, and the free register is allocated to its corresponding life cycle interval.
The present invention also provides a memory optimization device for neural network computing, comprising a memory and one or more processors, wherein the memory stores executable code, and the one or more processors, when executing the executable code, implement the memory optimization method for neural network computing described in any of the above embodiments.
The present invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, it implements the memory optimization method for neural network computing described in any of the above embodiments.
The beneficial effects of the present invention are as follows: the present invention provides a mapping relationship between the tensor variables generated during the execution of a computation graph and physical registers and memory, and provides an optimization method based on this mapping relationship. Registers can store the memory locations of the tensor variables generated during computation-graph execution. The traditional way of storing tensor variables is to store their values directly in memory. Since the value of a tensor variable can be stored either in memory or in a register, and registers can be accessed directly by the central processing unit at high speed, the register-assisted memory optimization method proposed in the present invention optimizes the memory of the data flow of computation graphs used for neural network computing, reduces the memory overhead required by tensor variables in the data flow, and lowers the hardware memory requirements of large models. The memory optimization method for neural network computing improves the computational efficiency of the whole computation graph and saves hardware and time costs.
Brief Description of the Drawings
FIG. 1 is a schematic flow chart of the memory optimization method for neural network computing of the present invention;
FIG. 2 is a schematic diagram of reconstructing the computation graph into a topological structure in Embodiment 1;
FIG. 3 is the topologically ordered computation graph of Embodiment 1;
FIG. 4 shows the construction of life cycle intervals for the tensor variables contained in the nodes of the topologically ordered computation graph in Embodiment 1;
FIG. 5 shows the first two tensor variables contained in the nodes of the topologically ordered computation graph being allocated to two registers in Embodiment 1;
FIG. 6 shows a tensor variable in a register being transferred to memory and a new tensor variable being allocated to the freed register in Embodiment 1;
FIG. 7 is the computation graph for neural network computing of Embodiment 2;
FIG. 8 shows the construction of life cycle intervals for the tensor variables in the data flow in Embodiment 2;
FIG. 9 shows the scan line constructed over the tensor-variable life cycle intervals in Embodiment 2;
FIG. 10 shows register r_3 being allocated to variable x at node V_1 in Embodiment 2;
FIG. 11 shows register r_1 being allocated to variable y at node V_2 in Embodiment 2;
FIG. 12 shows register r_2 being allocated to variable z at node V_3 in Embodiment 2;
FIG. 13 shows the register r_3 of tensor variable x, whose interval l_x has the farthest end point, being allocated to tensor variable b, which exceeds the number of available registers, in Embodiment 2;
FIG. 14 shows the register r_1 of the expired life cycle interval l_y being allocated to tensor variable w, which exceeds the number of available registers, in Embodiment 2;
FIG. 15 shows tensor variables of expired life cycle intervals being removed from the active life-cycle-interval list and their registers recycled in Embodiment 2;
FIG. 16 shows the tensor variable of an expired life cycle interval being removed from the active life-cycle-interval list and the register recycled in Embodiment 2;
FIG. 17 shows the free register r_3 being allocated to the life cycle interval corresponding to l_r3 in Embodiment 2;
FIG. 18 is a schematic diagram of the memory optimization device for neural network computing of Embodiment 3.
Detailed Description of the Embodiments
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention or its application or use. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to FIG. 1, a memory optimization method for neural network computing comprises the following steps:
Step S1: reconstruct the computation graph into a topologically ordered computation graph;
Step S11: traverse the computation graph in post-order to obtain a subgraph visit list;
The post-order is such that, when a node of the computation graph is visited, the successor nodes of that node are recursively visited first.
Step S12: reverse the post-order subgraph visit list to obtain the topological order of the computation graph;
Step S13: reconstruct the computation graph according to the topological order to obtain the topologically ordered computation graph.
Step S2: construct the life cycle intervals of the tensor variables;
Specifically, a life cycle interval is constructed for the tensor variables contained in each node; the life cycle interval of a tensor variable starts at the position of the first node at which the tensor variable is live and ends at the position of the last node at which it is live.
Step S3: construct a scan line over the life cycle intervals;
At the starting node of the topologically ordered computation graph, a scan line parallel to the life cycle intervals is constructed; as the scan line moves from the starting end of the life cycle intervals to their ending end, it is used to observe whether free registers can be allocated to tensor variables during data-flow execution.
Step S4: allocate tensor variables to free registers;
Step S5: allocate the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers;
When the execution flow is at a node that has neither free registers nor an expired, already-scanned life cycle interval that can be removed from the active life cycle intervals, the tensor variable held in the register of the tensor variable whose life cycle interval has the farthest end point is transferred to memory, and the released register is then allocated to the tensor variable that exceeds the number of available registers.
Step S6: allocate the registers of expired life cycle intervals to tensor variables that exceed the number of available registers;
When the execution flow is at a node and the scan line has passed through the life cycle interval of a register-allocated tensor variable, that tensor variable is removed from the active life cycle intervals, its register is recycled into the free register list, and the freed register is allocated to the tensor variable that exceeds the number of available registers.
Step S7: add the tensor variables transferred to memory back to the active life cycle intervals and allocate free registers to them.
When the execution flow is at a node and a free register exists, the tensor variable transferred to memory is added back to the active life cycle intervals, and the free register is allocated to its corresponding life cycle interval.
The functions appearing in the drawings of the following embodiments are defined as follows:
tf.random_uniform([5, 3]): generates a random tensor with a shape of 5 rows and 3 columns.
goto V_i: enter the computation flow of node V_i.
if expression goto V_i: evaluate whether the expression is true; if true, execute the computation flow of node V_i, otherwise execute the computation flow of the other branch node.
tf.add(x, y): element-wise addition of tensor x and tensor y.
tf.ones(a_i.shape): creates a tensor with the same shape as tensor a_i and all elements equal to 1.
[Formula PCTCN2022124000-appb-000001]: denotes the routing selector that selects the correct definition of tensor variable a from tensor variables a_i and a_j.
tf.relu(x): feeds tensor x into a rectified linear unit.
tf.matmul(x, y): matrix multiplication of tensor x and tensor y.
return b_i: return to and execute the branch containing tensor variable b_i.
l_x: the life cycle interval of tensor variable x.
tf.subtract(x, y): element-wise subtraction of tensor y from tensor x.
r_i: indicates that the free register r_i is allocated to the tensor variable of the corresponding life cycle interval.
[Formula PCTCN2022124000-appb-000002]: a store operation, storing the tensor variable a_0 in register r_i to memory.
[Formula PCTCN2022124000-appb-000003]: a load operation, loading the tensor variable a_0 from memory into register r_i.
Embodiment 1:
Referring to FIG. 2, step S1: reconstruct the computation graph into a topologically ordered computation graph;
Step S11: traverse the computation graph in post-order to obtain the subgraph visit list: D, B, E, C, F, A;
The post-order is such that, when a node of the computation graph is visited, the successor nodes of that node are recursively visited first.
Whenever a node V_c of the computation graph has been completely visited in post-order, all edges of node V_c have already been visited. The post-order traversal ensures that, for any path from node V_A to node V_B in the computation graph, node V_B is visited before node V_A.
Step S12: reverse the post-order subgraph visit list to obtain the topological order of the computation graph: A, F, C, E, B, D;
The reverse post-order node list is obtained by reversing the list of nodes visited in post-order in the first step. It ensures that if the graph contains a path from node V_A to node V_B, then node V_A appears before node V_B in the resulting topological order. The reverse post-order process guarantees that the topologically ordered computation graph visits a node V_c before any other node that V_c points to.
Step S13: reconstruct the computation graph according to the topological order to obtain the topologically ordered computation graph; see FIG. 3.
Referring to FIG. 4, step S2: construct the life cycle intervals of the tensor variables;
Specifically, a life cycle interval is constructed for the tensor variables contained in each node; for a tensor variable v contained in a node, the corresponding life cycle interval l_v starts at the position of the first node at which v is live and ends at the position of the last node at which v is live.
Step 1: construct the life cycle interval l_a0 of tensor variable a_0; it starts at node V_1 and ends at node V_3.
Step 2: construct the life cycle interval l_a1 of tensor variable a_1; it starts at node V_4. Since there is an edge from subgraph E to subgraph D, tensor variable a_1 passes through node V_8 to reach subgraph D, so l_a1 ends at node V_8.
Step 3: construct the life cycle interval l_a2 of tensor variable a_2; it starts at node V_5. Since there is an edge from subgraph E to subgraph D, tensor variable a_2 passes through node V_8 to reach subgraph D, so l_a2 ends at node V_8.
Step S3: construct a scan line over the life cycle intervals;
At the starting node of the topologically ordered computation graph, a scan line parallel to the life cycle intervals is constructed; as the scan line moves from the starting end of the life cycle intervals to their ending end, it is used to observe whether free registers can be allocated to tensor variables during data-flow execution.
Referring to FIG. 5, step S4: allocate tensor variables to free registers;
Allocating the tensor variables contained in the nodes of the topologically ordered computation graph to the two registers r_0 and r_1 comprises the following steps:
Step 1: allocate tensor variable a_0 to register r_0.
Step 2: allocate tensor variable a_1 to register r_1.
Step S5: allocate the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers;
When the execution flow is at a node V_i that has neither free registers nor an expired, already-scanned life cycle interval that can be removed from the active life cycle intervals, the tensor variable i held in register r_i, allocated to the tensor variable whose life cycle interval has the farthest end point, is transferred to memory, and the released register r_i is then allocated to the tensor variable j that exceeds the number of available registers.
Step S6: allocate the register of the expired life cycle interval l_i to the tensor variable j that exceeds the number of available registers;
When the execution flow is at a node V_i and the scan line has passed through the life cycle interval l_i of register r_i allocated to tensor variable i, the tensor variable i is removed from the active life cycle intervals, the register r_i is recycled into the free register list, and the free register r_i is allocated to the tensor variable j that exceeds the number of available registers.
Referring to FIG. 6, step S7: add the tensor variables transferred to memory back to the active life cycle intervals and allocate free registers to them.
When the execution flow is at a node V_i and a free register r_i exists, the tensor variable i transferred to memory is added back to the active life cycle intervals, and the free register r_i is allocated to the corresponding life cycle interval l_i.
Whenever the data flow passes through a node that redefines tensor variable i, the tensor variable i in register r_i must be stored to memory; whenever the data flow passes through a node that uses tensor variable i, tensor variable i must be loaded from memory into register r_i. The process of adding a tensor variable transferred to memory back to the active interval list is shown at the positions marked [PCTCN2022124000-appb-000012] in the figure.
Step 1: since nodes V_1 and V_9 both contain definitions of tensor variable a_0, the tensor variable a_0 in register r_0 must be stored to memory at nodes V_1 and V_9, at the positions marked [PCTCN2022124000-appb-000013] in FIG. 6.
Step 2: since nodes V_2, V_4, V_5, V_9 and V_3 all contain uses of tensor variable a_0, tensor variable a_0 must be loaded from memory into register r_0 at those nodes.
Referring to FIG. 7, Embodiment 2: a memory optimization method for neural network computing, in which three registers are allocated to the tensor variables in the execution flow of the computation graph for neural network computing during memory optimization, as follows:
Step S1: reconstruct the computation graph into a topologically ordered computation graph, as shown on the left side of FIG. 8.
Step S2: construct the life cycle intervals of the tensor variables, as shown on the right side of FIG. 8.
Step S3: construct a scan line over the life cycle intervals;
At the starting node V_1 of the topologically ordered computation graph, a scan line parallel to the starting line of the life cycle intervals is constructed. The scan line assists in observing the status of free registers and tensor variables: as it moves from the starting end of the life cycle intervals to their ending end, it observes whether free registers can be allocated to tensor variables during data-flow execution; see FIG. 9, where the top horizontal line is the scan line.
Step S4: allocate tensor variables to free registers;
Referring to FIG. 10, the free register r_3 is allocated to tensor variable x: at the starting position of the scan line, i.e. at node V_1, the free register r_3 is found to be available for tensor variable x.
Referring to FIG. 11, register r_1 is allocated to tensor variable y at node V_2. When the scan line reaches node V_2, it has passed through the life cycle interval of register r_1, so that interval can be removed from the active life-cycle-interval list and register r_1 recycled into the free register list. The free register r_1 can then be allocated to tensor variable y.
Referring to FIG. 12, register r_2 is allocated to tensor variable z at node V_3. When the scan line reaches node V_3, it has passed through the life cycle interval of register r_2, so that interval can be removed from the active life-cycle-interval list and register r_2 recycled into the free register list. The free register r_2 can then be allocated to tensor variable z.
Step S5: allocate the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers;
Referring to FIG. 13, when the scan line reaches node V_4, there are neither free registers nor expired, already-scanned life cycle intervals that could be removed from the active life-cycle-interval list. The tensor variable in register r_3, allocated to tensor variable x whose life cycle interval has the farthest end point, is therefore transferred to memory, and the released register r_3 is allocated to tensor variable b, which exceeds the number of available registers. Since tensor variable x is now stored in memory, its life cycle interval is redrawn as a dashed line.
Referring to FIG. 14, the register of the expired life cycle interval l_y is allocated to tensor variable w, which exceeds the number of available registers. When the scan line reaches node V_5, it has passed through the life cycle interval l_y of register r_1 allocated to tensor variable y, so tensor variable y can be removed from the active life-cycle-interval list and register r_1 recycled into the free register list. The free register r_1 can then be allocated to tensor variable w.
Step S6: allocate the registers of expired life cycle intervals to tensor variables that exceed the number of available registers;
Referring to FIG. 15, the registers of expired life cycle intervals are recycled into the free register list. When the scan line reaches the end position of node V_8, it has passed through the life cycle interval l_z of register r_2 allocated to tensor variable z and the life cycle interval l_w of register r_1 allocated to tensor variable w. The tensor variables z and w of the expired intervals l_z and l_w are therefore removed from the active life-cycle-interval list, and registers r_2 and r_1 are recycled into the free register list.
Referring to FIG. 16, the registers of expired life cycle intervals are recycled into the free register pool and free registers are allocated to active life cycle intervals. When the scan line reaches node V_9, it has passed through the life cycle interval l_b of register r_3 allocated to tensor variable b, so tensor variable b is removed from the active life-cycle-interval list and register r_3 is recycled into the free register list. At node V_9 the free register r_1 is found and allocated to the corresponding life cycle interval [PCTCN2022124000-appb-000014]; when the scan line reaches node V_10, the free register r_3 is found and allocated to the corresponding life cycle interval [PCTCN2022124000-appb-000015].
Step S7: add the tensor variables transferred to memory back to the active life cycle intervals and allocate free registers to them.
Referring to FIG. 17, when the scan line reaches node V_10, the free register r_2 is found; the variable x previously transferred to memory is added back to the active life-cycle-interval list, and the free register r_2 is allocated to the life cycle interval corresponding to l_x.
Corresponding to the foregoing embodiments of the memory optimization method for neural network computing, the present invention also provides Embodiment 3 of a memory optimization device for neural network computing.
Referring to FIG. 18, the memory optimization device for neural network computing provided in Embodiment 3 of the present invention comprises a memory and one or more processors; the memory stores executable code, and the one or more processors, when executing the executable code, implement the memory optimization method for neural network computing of the above embodiments.
Embodiment 3 of the memory optimization device for neural network computing of the present invention can be applied to any device with data processing capability, such as a computer. The device embodiment may be implemented by software, or by hardware or a combination of software and hardware. Taking software implementation as an example, as a device in the logical sense it is formed by the processor of the device with data processing capability on which it is located reading the corresponding computer program instructions from non-volatile memory into memory for execution. At the hardware level, FIG. 18 shows a hardware structure diagram of a device with data processing capability on which the memory optimization device for neural network computing of the present invention is located; in addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 18, the device on which the device of Embodiment 3 is located may also include other hardware according to its actual functions, which will not be described in detail here.
The implementation of the functions and roles of each unit in the above device is detailed in the implementation of the corresponding steps in the above method and will not be repeated here.
Since the device embodiment basically corresponds to the method embodiment, for relevant details reference may be made to the description of the method embodiment. The device embodiment 3 described above is merely illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention, and a person of ordinary skill in the art can understand and implement it without creative effort.
An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, it implements the memory optimization method for neural network computing of the above embodiments.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. It may also be an external storage device of such a device, such as a plug-in hard disk, smart media card (SMC), SD card, or flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of a device with data processing capability. It is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been or will be output.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention; various modifications and variations may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

  1. A memory optimization method for neural network computing, characterized by comprising the following steps:
    Step S1: reconstruct the computation graph into a topologically ordered computation graph;
    Step S2: construct the life cycle intervals of the tensor variables;
    Step S3: construct a scan line over the life cycle intervals;
    Step S4: allocate tensor variables to free registers;
    Step S5: allocate the register of the tensor variable whose life cycle interval has the farthest end point to a tensor variable that exceeds the number of available registers;
    Step S6: allocate the registers of expired life cycle intervals to tensor variables that exceed the number of available registers;
    Step S7: add the tensor variables transferred to memory back to the active life cycle intervals and allocate free registers to them.
  2. The memory optimization method for neural network computing according to claim 1, characterized in that step S1 specifically comprises the following sub-steps:
    Step S11: traverse the computation graph in post-order to obtain a subgraph visit list;
    Step S12: reverse the post-order subgraph visit list to obtain the topological order of the computation graph;
    Step S13: reconstruct the computation graph according to the topological order to obtain the topologically ordered computation graph.
  3. The memory optimization method for neural network computing according to claim 2, characterized in that the post-order is such that, when a node of the computation graph is visited, the successor nodes of that node are recursively visited first.
  4. The memory optimization method for neural network computing according to claim 1, characterized in that step S2 specifically constructs a life cycle interval for the tensor variables contained in each node; the life cycle interval of a tensor variable starts at the position of the first node at which the tensor variable is live and ends at the position of the last node at which it is live.
  5. The memory optimization method for neural network computing according to claim 1, characterized in that step S3 specifically constructs, at the starting node of the topologically ordered computation graph, a scan line parallel to the life cycle intervals; as the scan line moves from the starting end of the life cycle intervals to their ending end, it is used to observe whether free registers can be allocated to tensor variables during data-flow execution.
  6. The memory optimization method for neural network computing according to claim 1, characterized in that step S5 specifically: when the execution flow is at a node that has neither free registers nor an expired, already-scanned life cycle interval that can be removed from the active life cycle intervals, the tensor variable held in the register of the tensor variable whose life cycle interval has the farthest end point is transferred to memory, and the released register is then allocated to the tensor variable that exceeds the number of available registers.
  7. The memory optimization method for neural network computing according to claim 1, characterized in that step S6 specifically: when the execution flow is at a node and the scan line has passed through the life cycle interval of a register-allocated tensor variable, that tensor variable is removed from the active life cycle intervals, its register is recycled into the free register list, and the freed register is allocated to the tensor variable that exceeds the number of available registers.
  8. The memory optimization method for neural network computing according to claim 1, characterized in that step S7 specifically: when the execution flow is at a node and a free register exists, the tensor variable transferred to memory is added back to the active life cycle intervals, and the free register is allocated to the corresponding life cycle interval.
  9. A memory optimization device for neural network computing, characterized by comprising a memory and one or more processors, wherein the memory stores executable code, and the one or more processors, when executing the executable code, implement the memory optimization method for neural network computing according to any one of claims 1-8.
  10. A computer-readable storage medium, characterized in that a program is stored thereon; when the program is executed by a processor, it implements the memory optimization method for neural network computing according to any one of claims 1-8.
PCT/CN2022/124000 2022-09-27 2022-10-09 Memory optimization method and device for neural network computing WO2024065865A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/072,969 US20240104395A1 (en) 2022-09-27 2022-12-01 Memory optimization method and device oriented to neural network computing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211177786.5A CN115269205B (zh) 2022-09-27 2022-09-27 一种面向神经网络计算的内存优化方法和装置
CN202211177786.5 2022-09-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/072,969 Continuation US20240104395A1 (en) 2022-09-27 2022-12-01 Memory optimization method and device oriented to neural network computing

Publications (1)

Publication Number Publication Date
WO2024065865A1 true WO2024065865A1 (zh) 2024-04-04

Family

ID=83756875

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124000 WO2024065865A1 (zh) 2022-09-27 2022-10-09 一种面向神经网络计算的内存优化方法和装置

Country Status (2)

Country Link
CN (1) CN115269205B (zh)
WO (1) WO2024065865A1 (zh)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100552631C (zh) * 2008-03-06 2009-10-21 中国人民解放军国防科学技术大学 一种利用剩余资源分配寄存器的方法
CN105653472A (zh) * 2015-12-31 2016-06-08 北京中科晶上科技有限公司 缓存辅助的向量寄存器堆的缓冲方法
CN107609641B (zh) * 2017-08-30 2020-07-03 清华大学 稀疏神经网络架构及其实现方法
CN108874445A (zh) * 2017-10-30 2018-11-23 上海寒武纪信息科技有限公司 神经网络处理器及使用处理器执行向量点积指令的方法
US11644834B2 (en) * 2017-11-10 2023-05-09 Nvidia Corporation Systems and methods for safe and reliable autonomous vehicles
US20210064987A1 (en) * 2019-09-03 2021-03-04 Nvidia Corporation Processor and system to convert tensor operations in machine learning
CN113050951A (zh) * 2021-03-31 2021-06-29 上海天旦网络科技发展有限公司 基于计算图的协议描述和解码方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814971A (zh) * 2020-06-30 2020-10-23 杭州国芯科技股份有限公司 一种神经网络的内存分配方法
US20220035544A1 (en) * 2020-07-31 2022-02-03 Sigmastar Technology Ltd. Memory allocation method and device, and electronic apparatus
CN112948001A (zh) * 2021-03-25 2021-06-11 安徽寒武纪信息科技有限公司 设定张量硬件配置的方法、可读存储介质及装置
CN114936099A (zh) * 2022-07-25 2022-08-23 之江实验室 一种用于神经网络计算的图优化方法和装置

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093452A (zh) * 2024-04-22 2024-05-28 北京壁仞科技开发有限公司 一种内存架构映射方法、设备、存储介质及程序产品

Also Published As

Publication number Publication date
CN115269205B (zh) 2022-12-27
CN115269205A (zh) 2022-11-01

Similar Documents

Publication Publication Date Title
Peyton Jones Parallel implementations of functional programming languages
US5249295A (en) Digital computer register allocation and code spilling using interference graph coloring
Kriemann Parallel-matrix arithmetics on shared memory systems
US6708331B1 (en) Method for automatic parallelization of software
WO2024021192A1 (zh) 一种用于神经网络计算的图优化方法和装置
Gidenstam et al. Efficient and reliable lock-free memory reclamation based on reference counting
WO2024065867A1 (zh) 一种用于神经网络编译的内存优化方法及装置
WO2024065865A1 (zh) 一种面向神经网络计算的内存优化方法和装置
Shu et al. Chare kernel—a runtime support system for parallel computations
CN101847096B (zh) 包含栈变量函数的优化方法
WO2023093185A1 (zh) 一种用于神经网络计算的数据流动方法和装置
Cole et al. Analysis of randomized work stealing with false sharing
US8458671B1 (en) Method and system for stack back-tracing in computer programs
CN114385089B (zh) 一种基于交叉编址的动态bank存储方法、装置及电子设备
US11983168B2 (en) Block verification method, apparatus and device
US20240104395A1 (en) Memory optimization method and device oriented to neural network computing
CN113485832B (zh) 用于对物理内存池进行分配管理的方法及装置、物理内存池
US20240104016A1 (en) Intermediate Representation Method and Apparatus for Compiling Computation Graphs
CN112507171A (zh) 一种任务调度方法、智能终端及存储介质
WO2023082901A1 (zh) 一种用于计算图编译的优化方法及装置
Kuchumov et al. Staccato: shared-memory work-stealing task scheduler with cache-aware memory management
CN105912404B (zh) 一种基于磁盘的大规模图数据中寻找强连通分量的方法
Hesselink et al. Wait-free concurrent memory management by create and read until deletion (CaRuD)
CN114237903A (zh) 内存分配优化方法、装置、电子设备、介质及程序产品
Kristensen et al. Managing communication latency-hiding at runtime for parallel programming languages and libraries

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22960469

Country of ref document: EP

Kind code of ref document: A1