CN113918507A - Method and device for adapting deep learning framework to AI acceleration chip - Google Patents

Method and device for adapting deep learning framework to AI acceleration chip

Info

Publication number
CN113918507A
Authority
CN
China
Prior art keywords
chip
memory
deep learning
type
acceleration
Prior art date
Legal status
Granted
Application number
CN202111497148.7A
Other languages
Chinese (zh)
Other versions
CN113918507B (en)
Inventor
王拓
杨非
黄振华
鲍虎军
华炜
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202111497148.7A
Publication of CN113918507A
Application granted
Publication of CN113918507B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817Specially adapted for signal processing, e.g. Harvard architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Signal Processing (AREA)
  • Neurology (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a method and a device for adapting a deep learning framework to an AI acceleration chip. The method is divided into three stages: chip type definition, chip type registration, and chip memory support. Chip type definition writes the chip types to be supported into a proto file as enumeration values, so that the framework can correctly identify the chip type. Chip type registration registers the infrastructure required by the chip into hash tables, so that the framework can conveniently look up the corresponding components by chip type when needed. Chip memory support moves the operations on the chip's memory into the framework, so that the framework can manage the chip's memory space in a unified way. The invention simplifies the work of adapting a deep learning framework to an AI acceleration chip.

Description

Method and device for adapting deep learning framework to AI acceleration chip
Technical Field
The invention belongs to the field of deep learning basic software, and relates to a method and a device for adapting a deep learning framework to an AI acceleration chip.
Background
A deep learning framework is the operating system of the artificial intelligence field. Through five core components, namely tensors, tensor operations (Ops), computation graphs, automatic differentiation tools, and hardware extension packages (such as cuBLAS and cuDNN), it helps users conveniently implement various deep learning algorithms and fully release the computing resources of the underlying hardware.
AI acceleration chips, also known as AI accelerators or computing cards, are hardware dedicated to handling the massive computing tasks in artificial intelligence applications. Compared with traditional chips, an AI chip has a larger scale, a more complex structure and stronger computing capability, and provides powerful support for computing power.
The variety of AI acceleration chips is growing by the day, and the field is flourishing with many competing designs. Supporting more types of AI accelerators at the bottom layer of a deep learning framework improves the framework's compatibility, allows the most suitable hardware to be chosen for each application scenario, and fully releases the hardware's computing power. However, because each AI acceleration chip has a different hardware architecture and a different way of operating, supporting each new device in a deep learning framework has so far required starting the whole process from scratch and repeating a great deal of work.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a method and a device for adapting a deep learning framework to an AI acceleration chip. The method simplifies the adaptation work through three main steps: chip type definition, chip type registration, and chip memory support. The specific technical scheme is as follows:
A method for adapting a deep learning framework to an AI acceleration chip mainly comprises three stages:
a chip type definition stage, which defines the AI acceleration chip types to be supported in a custom or newly written file based on a data transmission format such as Protobuf. The chip types are defined as an enumeration, which is used to distinguish different kinds of chips inside the deep learning framework so that the framework can process each chip according to its enumeration value. Infrastructure inside the framework, such as the device context manager, the device thread, the stream index generator and the computing core Kernel, is strongly bound to the chip type: different chips operate differently, so each piece of infrastructure has a different implementation per chip. Taking the computing Kernel as an example, an OpenBLAS library may be used on the CPU, a cuBLAS library may be used on the GPU, and the cnrt and cnnl libraries are used on the Cambricon MLU; after the chip type registration stage, the deep learning framework can automatically select the corresponding Kernel implementation according to the chip type (a minimal sketch of such an enumeration is given after the three stages below);
a chip type registration stage, in which the chip types and the device context managers, device threads, stream index generators and computing core Kernels related to the AI acceleration chip are registered in their respective hash tables; a registration mechanism based on the singleton pattern maps each chip type one-to-one to its context manager, device thread, stream index generator and computing Kernel, so that the framework can conveniently find the corresponding content by chip type when needed, the corresponding content being the context manager, device thread, stream index generator and computing Kernel of that chip type;
a chip memory support stage, in which the operations related to the AI acceleration chip's memory are placed into the deep learning framework, so that the framework can manage the chip's memory space in a unified way.
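As an illustration of the chip type definition stage referred to above, the following minimal sketch (C++; all names are assumptions for illustration, not the framework's actual identifiers) shows the kind of enumeration involved. In the actual framework the enumeration lives in a .proto file and the corresponding code is generated from it:

    // Sketch of the chip-type enumeration. In practice the entry is added to the
    // enum inside a .proto file, roughly:
    //   enum DeviceType { kInvalidDevice = 0; kCPU = 1; kCUDA = 2; kMLU = 3; kNewAIChip = 4; }
    // and Protobuf generates the equivalent of the C++ type below.
    enum class DeviceType : int {
      kInvalidDevice = 0,
      kCPU           = 1,  // host processor
      kCUDA          = 2,  // NVIDIA GPU
      kMLU           = 3,  // Cambricon MLU
      kNewAIChip     = 4   // the AI acceleration chip being adapted (assumed name)
    };

Adding the single enumeration value is all the definition stage requires; every later lookup (context manager, device thread, stream index generator, Kernel) is keyed by this value.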
Preferably, in the chip type definition stage, the type of the AI acceleration chip is added to a data structure related to the chip type definition.
Preferably, the key of each hash table is the chip type to be registered, and the value is the processing function corresponding to that chip, which completes operations such as creating various handles, computing on the chip, and managing memory;
the device context manager is a method capable of generating chip operation handles; during registration, the device context manager is registered in a hash table and corresponds one-to-one to the chip type, specifically: a context handle is created for generating the various handles used in the chip's computing process, and then a device context manager is created which, by calling the context handle, provides various handles to external callers and performs device synchronization, the handles including stream handles and chip operation handles;
the stream index generator is used to generate the corresponding stream index numbers for different operations; its registration likewise creates a hash table whose key is the chip type and whose value is the corresponding stream index generator.
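A minimal sketch of this registration mechanism follows (C++; the class names are illustrative assumptions, not the framework's actual API). The same shape of table, a process-wide singleton keyed by the chip type, is reused for the device thread, the stream index generator and the computing Kernel:

    #include <functional>
    #include <memory>
    #include <unordered_map>

    enum class DeviceType { kInvalidDevice, kCPU, kCUDA, kMLU, kNewAIChip };  // as in the chip type definition sketch

    // Illustrative interface of a device context manager: it creates the handles
    // used during on-chip computation and performs device synchronization.
    class DeviceCtxManager {
     public:
      virtual ~DeviceCtxManager() = default;
      virtual void* CreateStreamHandle() = 0;   // e.g. wraps a cudaStream_t on a GPU
      virtual void* CreateComputeHandle() = 0;  // e.g. wraps a cublasHandle_t on a GPU
      virtual void SyncDevice() = 0;
    };

    // Singleton registry: one hash table keyed by chip type, holding a factory
    // that builds the manager for that chip.
    class DeviceCtxRegistry {
     public:
      using Factory = std::function<std::unique_ptr<DeviceCtxManager>()>;
      static DeviceCtxRegistry& Get() {
        static DeviceCtxRegistry instance;  // the singleton pattern mentioned above
        return instance;
      }
      void Register(DeviceType dev, Factory f) { table_[dev] = std::move(f); }
      std::unique_ptr<DeviceCtxManager> New(DeviceType dev) const {
        return table_.at(dev)();  // throws if the chip type was never registered
      }
     private:
      std::unordered_map<DeviceType, Factory> table_;
    };

A concrete chip then needs only one Register call per table, for example DeviceCtxRegistry::Get().Register(DeviceType::kNewAIChip, []{ return std::make_unique<NewChipCtxManager>(); }) executed once at static-initialization time, NewChipCtxManager being a hypothetical implementation for the new chip; the stream index generator and the device thread are registered in exactly the same way, each in its own table with the chip type as the key.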
Preferably, device thread registration creates a device thread related to the chip type, which is used to create the thread that starts the on-chip computing process; after the device thread is created, it is registered in the device-thread hash table, completing the one-to-one correspondence between threads and chip types.
Preferably, computing Kernel registration first implements the computation logic inside the Kernel, then uses a two-tuple formed by the chip type and the data type as the key, and registers the Kernel as the value in the Kernel-related hash table.
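A sketch of this two-level key follows (C++; the names are illustrative assumptions). The point is that one operator may have a separate Kernel for every (chip type, data type) combination, and the framework finds the right one by looking up the pair:

    #include <functional>
    #include <map>
    #include <utility>

    enum class DeviceType { kInvalidDevice, kCPU, kCUDA, kMLU, kNewAIChip };  // as in the earlier sketch
    enum class DataType { kFloat, kFloat16, kInt8 };                          // illustrative data types

    // A computing Kernel is sketched here as a callable holding the op's computation logic.
    using Kernel = std::function<void()>;

    // Registry keyed by the (chip type, data type) two-tuple described above.
    class KernelRegistry {
     public:
      using Key = std::pair<DeviceType, DataType>;
      static KernelRegistry& Get() {
        static KernelRegistry instance;
        return instance;
      }
      void Register(DeviceType dev, DataType dtype, Kernel k) {
        table_[{dev, dtype}] = std::move(k);
      }
      const Kernel& Lookup(DeviceType dev, DataType dtype) const {
        return table_.at({dev, dtype});  // throws if no Kernel was registered for this pair
      }
     private:
      std::map<Key, Kernel> table_;  // std::map used so the pair key needs no custom hash
    };

For example, a float matmul implemented with the new chip's BLAS-like library would be registered once, KernelRegistry::Get().Register(DeviceType::kNewAIChip, DataType::kFloat, NewChipMatmulKernel), and found again whenever the graph places a matmul of that data type on that chip (NewChipMatmulKernel being a hypothetical function).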
Preferably, in the chip memory support stage, the memory types of the AI acceleration chip to be supported are defined based on different types of data transmission formats, and the on-chip memory is allocated, released, partitioned into segments, and copied to and from.
Preferably, the memory type of the AI acceleration chip that needs to be supported is defined in order to distinguish the memory types of different chips inside the framework, and the memory type of the AI acceleration chip is added to the data structure related to memory types.
Preferably, the unified management of the chip's storage space includes memory allocation, where the memory allocation includes: uniformly allocating a first storage space in the chip's storage space, and dividing the first storage space into segments of different sizes according to the space required by the different modules, such as convolution and pooling; the division is done either by adding an offset to the starting address or by calling a specific API (application programming interface) provided by the chip; memory release is used for unified release of the memory.
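A minimal sketch of the base-address-plus-offset division is given below (C++; the class and its parameters are illustrative assumptions, not the framework's actual code). It shows only the splitting itself; how the first storage space is obtained from the chip is left to the chip's runtime:

    #include <cstddef>

    // Carves segments for the individual modules (convolution, pooling, ...) out of
    // one large, uniformly allocated block by advancing an offset from its base address.
    class ChunkSplitter {
     public:
      ChunkSplitter(void* base, size_t total_bytes)
          : base_(static_cast<char*>(base)), total_(total_bytes) {}

      // Returns the start of a segment of `bytes` bytes aligned to `align`,
      // or nullptr if the pre-allocated block is exhausted.
      void* Split(size_t bytes, size_t align = 512) {
        size_t start = (offset_ + align - 1) / align * align;  // round up to alignment
        if (start + bytes > total_) { return nullptr; }
        offset_ = start + bytes;
        return base_ + start;
      }

     private:
      char*  base_;         // base address of the first storage space
      size_t total_;        // total size of that space
      size_t offset_ = 0;   // bytes already handed out
    };

With a pre-computed memory plan, each module simply receives its segment in turn, e.g. void* conv_buf = splitter.Split(conv_bytes); void* pool_buf = splitter.Split(pool_bytes); the alternative of calling a chip-specific partitioning API replaces this class entirely.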
Preferably, the unified management of the chip's storage space includes data copying for the on-chip memory; the AI acceleration chip copies data to and from the host before starting a computation and after completing it, specifically: while the AI acceleration chip is in use, data is copied from the host memory to the on-chip memory, and the result is copied back to the host memory after the computation completes; during the computation, data is also copied between the host and the chip and between chips; during a memory copy, the data is handled according to the different combinations of copy source and destination.
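A sketch of a copy routine dispatched on the source and destination follows (C++). The CUDA calls in the GPU branch are the real CUDA runtime API; the branch for the new chip is a placeholder to be filled with the vendor's own copy function:

    #include <cstddef>
    #include <cstring>
    #include <cuda_runtime.h>  // used only by the GPU branch of this sketch

    enum class DeviceType { kInvalidDevice, kCPU, kCUDA, kMLU, kNewAIChip };  // as in the earlier sketch
    enum class MemcpyKind { kHostToDevice, kDeviceToHost, kDeviceToDevice, kHostToHost };

    // Copies `bytes` bytes, dispatching on the chip type and on the source/destination
    // combination, which is the case analysis described in the text.
    void FrameworkMemcpy(DeviceType dev, MemcpyKind kind,
                         void* dst, const void* src, size_t bytes) {
      if (kind == MemcpyKind::kHostToHost) { std::memcpy(dst, src, bytes); return; }
      switch (dev) {
        case DeviceType::kCUDA: {
          // Real CUDA runtime calls; the direction enum maps one-to-one.
          cudaMemcpyKind k = (kind == MemcpyKind::kHostToDevice) ? cudaMemcpyHostToDevice
                           : (kind == MemcpyKind::kDeviceToHost) ? cudaMemcpyDeviceToHost
                                                                 : cudaMemcpyDeviceToDevice;
          cudaMemcpy(dst, src, bytes, k);
          break;
        }
        case DeviceType::kNewAIChip:
          // Placeholder: call the vendor runtime's copy function here, e.g.
          // new_chip_memcpy(dst, src, bytes, kind);  (assumed name, not a real API)
          break;
        default:
          std::memcpy(dst, src, bytes);  // plain host memory as a fallback
      }
    }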
The device for adapting a deep learning framework to an AI acceleration chip comprises one or more processors and is configured to implement the above method for adapting a deep learning framework to an AI acceleration chip.
The invention has the beneficial effects that:
the invention simplifies the work of the deep learning framework adaptive AI acceleration chip.
Drawings
FIG. 1 is a schematic diagram of the overall process of adapting a deep learning framework to an AI acceleration chip according to the present invention;
FIG. 2 is a diagram illustrating the registration of the device context manager and the device thread when adapting a deep learning framework to an AI acceleration chip according to the present invention;
FIG. 3 is a schematic diagram of the infrastructure to be registered in the chip type registration step when adapting a deep learning framework to an AI acceleration chip according to the present invention;
FIG. 4 is a flow chart of allocating and releasing on-chip memory space when adapting a deep learning framework to an AI acceleration chip according to the present invention;
FIG. 5 is a block diagram of an apparatus for adapting a deep learning framework to an AI acceleration chip according to the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, the method for adapting a deep learning framework to an AI acceleration chip of the present invention mainly includes three stages: chip type definition, chip type registration, and chip memory support. Chip type definition writes the chip types to be supported, as enumeration values, into the relevant data structure of a proto file so that the framework can correctly identify them; Protobuf is used as the serialization tool to explain the whole process here, and tools of the same kind also include JSON, Hessian and the like, but Protobuf is what the current mainstream deep learning frameworks use. Chip type registration registers the infrastructure required by the chip into hash tables, so that the framework can conveniently find the corresponding content by chip type when needed. Chip memory support moves the operations on the chip's memory into the framework, so that the framework can manage the chip's memory space in a unified way.
The chip type definition phase defines, in a custom or newly written file based on a data transmission format such as Protobuf, the AI acceleration chip types to be supported as an enumeration. The enumeration is used to distinguish different kinds of chips inside the deep learning framework, so that the framework processes each chip according to its enumeration value. Infrastructure inside the framework, such as the device context manager, the device thread, the stream index generator and the computing core Kernel, is strongly bound to the chip type, because different chips operate differently and each piece of infrastructure therefore has a different implementation per chip. Taking the computing Kernel as an example, an OpenBLAS library may be used on the CPU, a cuBLAS library may be used on the GPU, and the cnrt and cnnl libraries are used on the Cambricon MLU. Through the chip type registration stage, the deep learning framework can automatically select the corresponding Kernel implementation according to the chip type.
As shown in fig. 3, chip registration covers several pieces of infrastructure: the chip type, the device context manager, the device thread, the stream index generator and the computing core Kernel. Using a registration mechanism based on the singleton pattern, the device context manager, device thread, stream index generator and computing Kernel corresponding to each chip are registered in their respective hash tables, so that the chip type maps one-to-one to each of them. The key of each hash table is the chip type to be registered, which is essentially an enumeration value; the value is the processing function corresponding to that chip, which completes operations such as creating handles, computing on the chip and managing memory. Operating an AI acceleration chip often requires handles; a handle is essentially a pointer to the resources needed to complete an operation. For example, multi-stream execution on a GPU uses the stream handle cudaStream_t, the cuBLAS library uses cublasHandle_t, and the cuDNN library uses cudnnHandle_t; the Enflame DTU chip uses a topsContext_t handle. The value stored in the hash table is a class type containing various member functions, and handle creation is implemented by these member functions. After registration, the hash table can be queried with the chip type, so that the corresponding processing function is easy to find. Specifically, the registration process here may use four hash tables whose keys are all chip types, such as CPU, GPU, MLU and so on; the values are the pieces of infrastructure listed above: the value of the first table is the context manager, the value of the second table is the device thread, the value of the third table is the stream index generator, and the value of the fourth table is the computing Kernel, as shown in Tables 1 to 4 below;
Table 1. Context manager registry
Key   Value
CPU   Context manager for the CPU
GPU   Context manager for the GPU
MLU   Context manager for the MLU

Table 2. Device thread registry
Key   Value
CPU   Device thread for the CPU
GPU   Device thread for the GPU
MLU   Device thread for the MLU

Table 3. Stream index generator registry
Key   Value
CPU   Stream index generator for the CPU
GPU   Stream index generator for the GPU
MLU   Stream index generator for the MLU

Table 4. Computing Kernel registry
Key   Value
CPU   Computing Kernel for the CPU
GPU   Computing Kernel for the GPU
MLU   Computing Kernel for the MLU
As shown in fig. 2, the registration of the two main pieces of infrastructure, the device context manager and the device thread, proceeds as follows. The device context manager provides the method for generating chip operation handles. Registration puts a key-value pair into a hash table, with the chip type as the key and the device context manager as the value; later, the corresponding method can be conveniently looked up through the chip type and the hash table to obtain the corresponding handle. The device thread contains the method for creating a thread associated with the chip type, and its registration process is identical to that of the device context manager, except that it uses another hash table.
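The lookup side of this registration can be sketched as follows (C++, continuing the illustrative names used in the registration sketch earlier; the device-thread registry named in the comment is likewise an assumed name):

    #include <memory>

    // Given only the chip type, the framework recovers everything else from the registries.
    void PrepareDevice(DeviceType dev) {
      // 1. Query the context-manager table and build the manager for this chip.
      std::unique_ptr<DeviceCtxManager> ctx = DeviceCtxRegistry::Get().New(dev);

      // 2. Ask the manager for the handles the chip's libraries require
      //    (on a GPU these would wrap cudaStream_t, cublasHandle_t and so on).
      void* stream_handle  = ctx->CreateStreamHandle();
      void* compute_handle = ctx->CreateComputeHandle();

      // 3. The device-thread table is queried with the same key in the same way,
      //    yielding the thread object that drives the on-chip computation:
      // DeviceThread* thread = DeviceThreadRegistry::Get().Lookup(dev);  (assumed name)

      (void)stream_handle; (void)compute_handle;  // handed to Kernels in the real framework
    }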
In the chip memory support step, the on-chip memory is managed in a unified way during training, that is, allocated and released in a unified way; a large, uniformly allocated storage space is cut into small segments for different purposes. Because the AI acceleration chip must copy data to and from the host before starting a computation and after completing it, methods for allocating, releasing and partitioning memory space and for copying data must be implemented for the current chip.
As shown in fig. 4, which is the flow chart of on-chip memory allocation, partitioning and release, the storage space is allocated and managed in a unified way while the program runs. Specifically, the program first determines whether the required space is on the chip. If not, no processing is performed. If it is, the required space, usually a large block, is allocated according to a pre-computed size and then divided according to the needs of the different parts. The division can be done in two ways: 1) base address plus offset; 2) calling a specific API of the chip; the concrete way depends on the particular chip. Once the memory space is allocated, the subsequent training task can run. After training finishes, the memory must be released before the program exits: the program first checks whether the space to be released is on the chip, releases it if so, and otherwise exits directly.
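The flow of fig. 4 can be condensed into the following sketch (C++; std::malloc and std::free stand in for the chip runtime's allocation calls, which differ from vendor to vendor, while the control flow stays the same):

    #include <cstddef>
    #include <cstdlib>

    class OnChipMemoryPool {
     public:
      // Allocation: only act if the required space lives on the chip, then grab the
      // whole block whose size was pre-computed before training starts.
      bool Allocate(bool required_space_is_on_chip, size_t precomputed_total_bytes) {
        if (!required_space_is_on_chip) { return false; }  // "not on chip": no processing
        base_  = std::malloc(precomputed_total_bytes);     // stand-in for the chip's malloc
        total_ = precomputed_total_bytes;
        // The block is then divided per module, either by base address + offset
        // (see the splitter sketched earlier) or by a chip-specific partitioning API.
        return base_ != nullptr;
      }

      // Release: after training, check again whether the space is on the chip,
      // free it before the program exits, otherwise exit directly.
      void Release(bool required_space_is_on_chip) {
        if (!required_space_is_on_chip || base_ == nullptr) { return; }
        std::free(base_);                                   // stand-in for the chip's free
        base_  = nullptr;
        total_ = 0;
      }

     private:
      void*  base_  = nullptr;
      size_t total_ = 0;
    };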
Corresponding to the foregoing embodiment of the method for adapting a deep learning framework to an AI acceleration chip, the invention also provides an embodiment of a device for adapting a deep learning framework to an AI acceleration chip.
Referring to fig. 5, an apparatus for adapting an AI acceleration chip by a deep learning framework according to an embodiment of the present invention includes one or more processors, and is configured to implement the method for adapting an AI acceleration chip by a deep learning framework in the foregoing embodiment.
The embodiment of the apparatus for adapting a deep learning framework to an AI acceleration chip can be applied to any device with data processing capability, such as a computer or another device or apparatus. The apparatus embodiment may be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, the apparatus, as a logical device, is formed by the processor of the device reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, fig. 5 shows a hardware structure diagram of the device with data processing capability on which the apparatus is located; besides the processor, memory, network interface and non-volatile memory shown in fig. 5, the device may also include other hardware according to its actual function, which is not described again here.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
An embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the method for adapting the deep learning framework to the AI acceleration chip in the foregoing embodiments.
The computer-readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in the foregoing embodiments. The computer-readable storage medium may also be an external storage device of that device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash card (Flash Card) provided on the device. Further, the computer-readable storage medium may include both the internal storage unit and the external storage device of the device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.
The basic principles of the present disclosure have been described above in connection with specific embodiments. It should be noted, however, that those skilled in the art will understand that all or any of the steps or components of the method and apparatus of the present disclosure may be implemented in hardware, firmware, software or a combination thereof, in any computing device (including processors, storage media and the like) or in a network of computing devices, which those skilled in the art can accomplish with basic programming skills after reading the description of the present disclosure.
Thus, the objects of the present disclosure may also be achieved by running a program or a set of programs on any computing device. The computing device may be a general purpose device as is well known. Thus, the object of the present disclosure can also be achieved merely by providing a program product containing program code for implementing the method or apparatus. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. It is to be understood that the storage medium may be any known storage medium or any storage medium developed in the future.
It is also noted that in the apparatus and methods of the present disclosure, it is apparent that individual components or steps may be disassembled and/or re-assembled. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
The above detailed description should not be construed as limiting the scope of the disclosure. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (10)

1. A method for adapting a deep learning framework to an AI acceleration chip, characterized by comprising three stages:
a chip type definition stage, in which the AI acceleration chip types to be supported are defined in a custom or newly written file based on different types of data transmission formats, the AI acceleration chip types comprising enumeration types used to distinguish different kinds of chips in the deep learning framework, so that the deep learning framework performs the corresponding processing according to the different enumeration values;
a chip type registration stage, in which the chip types and the device context managers, device threads, stream index generators and computing core Kernels related to the AI acceleration chip are registered in their respective hash tables, and a registration mechanism based on the singleton pattern maps the chip types one-to-one to the context managers, device threads, stream index generators and computing Kernels;
a chip memory support stage, in which the operations related to the AI acceleration chip's memory are placed into the deep learning framework, so that the framework manages the chip's memory space in a unified way.
2. The method for adapting a deep learning framework to an AI acceleration chip of claim 1, wherein in the chip type definition stage the type of the AI acceleration chip is added to the data structure related to the chip type definition.
3. The method for adapting a deep learning framework to an AI acceleration chip of claim 1, wherein the key of the hash table is the chip type to be registered and the value is the processing function corresponding to that chip, which completes operations such as creating various handles, computing on the chip and managing memory;
the device context manager is a method capable of generating chip operation handles; during registration the device context manager is registered in its hash table and corresponds one-to-one to the chip type, specifically: a context handle is created, and a device context manager is created by calling the context handle, the device context manager being used to provide various handles to external callers and to perform device synchronization, the handles comprising stream handles and chip operation handles;
the stream index generator is used to generate the corresponding stream index numbers for different operations; its registration likewise creates a hash table whose key is the chip type and whose value is the corresponding stream index generator.
4. The method for adapting a deep learning framework to an AI acceleration chip of claim 1, wherein device thread registration creates a device thread related to the chip type, which is used to create the thread that starts the on-chip computing process, and after creation the device thread is registered in its hash table, completing the one-to-one correspondence between threads and chip types.
5. The method for adapting a deep learning framework to an AI acceleration chip of claim 1, wherein computing Kernel registration first implements the computation logic inside the computing Kernel, then uses a two-tuple composed of the chip type and the data type as the key, and registers the computing Kernel as the value in the hash table related to the computing Kernel.
6. The method for adapting a deep learning framework to an AI acceleration chip of claim 1, wherein in the chip memory support phase the memory types of the AI acceleration chip to be supported are defined based on different types of data transmission formats, and the on-chip memory is allocated, released, partitioned and copied.
7. The method for adapting a deep learning framework to an AI acceleration chip of claim 6, wherein the memory type of the AI acceleration chip to be supported is defined in order to distinguish the memory types of different chips within the framework, and the memory type of the AI acceleration chip is added to the data structure related to memory types.
8. The method for adapting a deep learning framework to an AI acceleration chip of claim 6, wherein the unified management of the memory space of the chip comprises memory allocation, and the memory allocation comprises: uniformly allocating a first storage space in the memory space of the chip, and dividing the first storage space into segments of different sizes according to the space required by the different modules, the division being done either by adding an offset to the starting address or by calling a specific API provided by the chip; memory release is used for unified release of the memory.
9. The method for adapting a deep learning framework to an AI acceleration chip of claim 6, wherein the unified management of the memory space of the chip comprises data copying for the on-chip memory, and the AI acceleration chip copies data to and from the host before starting a computation and after completing it, specifically: data is copied from the host memory to the on-chip memory while the AI acceleration chip is in use, the result is copied back to the host memory after the computation completes, data is copied between the host and the chip and between chips during the computation, and during a memory copy the data is handled according to the different combinations of copy source and destination.
10. An apparatus for adapting a deep learning framework to an AI acceleration chip, characterized by comprising one or more processors configured to implement the method for adapting a deep learning framework to an AI acceleration chip of any one of claims 1-9.
CN202111497148.7A; priority date 2021-12-09; filing date 2021-12-09; Method and device for adapting deep learning framework to AI acceleration chip; status: Active; granted as CN113918507B

Priority Applications (1)

Application number: CN202111497148.7A (granted as CN113918507B); Title: Method and device for adapting deep learning framework to AI acceleration chip

Applications Claiming Priority (1)

Application number: CN202111497148.7A (granted as CN113918507B); Title: Method and device for adapting deep learning framework to AI acceleration chip

Publications (2)

Publication Number   Publication Date
CN113918507A (en)    2022-01-11
CN113918507B (en)    2022-04-08

Family

ID=79248860

Family Applications (1)

CN202111497148.7A (Active, granted as CN113918507B): Method and device for adapting deep learning framework to AI acceleration chip

Country Status (1)

Country Link
CN (1) CN113918507B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341852A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Balancing memory consumption of multiple graphics processing units in deep learning
CN108388532A (en) * 2018-03-13 2018-08-10 算丰科技(北京)有限公司 The AI operations that configurable hardware calculates power accelerate board and its processing method, server
CN111400021A (en) * 2019-01-02 2020-07-10 ***通信有限公司研究院 Deep learning method, device and system
CN110955530A (en) * 2020-02-25 2020-04-03 深圳鲲云信息科技有限公司 Deep learning engine parallel processing data method, device, equipment and storage medium
CN112232497A (en) * 2020-10-12 2021-01-15 苏州浪潮智能科技有限公司 Method, system, device and medium for compiling AI chip

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167437A (en) * 2023-04-18 2023-05-26 之江实验室 Chip management system, method, device and storage medium
CN116185371A (en) * 2023-04-24 2023-05-30 北京大学 Hardware device registration method, device, equipment and storage medium
CN116185371B (en) * 2023-04-24 2023-09-19 北京大学 Hardware device registration method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113918507B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN113918507B (en) Method and device for adapting deep learning framework to AI acceleration chip
US20200202246A1 (en) Distributed computing system, and data transmission method and apparatus in distributed computing system
CN105808328B (en) The methods, devices and systems of task schedule
CN110032369A (en) A kind of code automatic generation method, device and medium
CN111488205B (en) Scheduling method and scheduling system for heterogeneous hardware architecture
CN114741207B (en) GPU resource scheduling method and system based on multi-dimensional combination parallelism
CN104516769B (en) For the method for the switching between verifying logic zone configuration, medium and system
CN107113341A (en) The system of the high-throughput processing of affairs in the Distributed Relation Database Management System divided for data
CN106095563B (en) Flexible physical function and virtual function mapping
US8626799B2 (en) Mapping data structures
CN111338695A (en) Data processing method based on pipeline technology and related product
CN109471725A (en) Resource allocation methods, device and server
CN108513658A (en) A kind of transaction methods and device
CN107133243A (en) A kind of data processing method and server
CN113010286A (en) Parallel task scheduling method and device, computer equipment and storage medium
JP2020194522A (en) Method, apparatus, device, and medium for processing data
CN105335135B (en) Data processing method and central node
CN110515734A (en) The load processing method and device of data processing task
KR20150117522A (en) Graphics state manage apparatus and method
CN108984105B (en) Method and device for distributing replication tasks in network storage device
CN114493980A (en) Kernel function transmission method, device and equipment
CN114691566A (en) AI model operation method, loading method and device and IC chip
GB2601354A (en) Apparatus and method
CN115374024A (en) Memory data sorting method and related equipment
CN115577760B (en) Data processing method, system and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant