WO2020124948A1 - Processing method for a network offline model, artificial intelligence processing device and related products - Google Patents
Processing method for a network offline model, artificial intelligence processing device and related products
- Publication number
- WO2020124948A1 (PCT/CN2019/087631)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- network
- sub-network
- quantization
- parameter
- layer
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
Definitions
- The application is named "a neural network quantization method, device and related products".
- This application relates to the field of information processing technology, and in particular to a processing method of a network offline model, an artificial intelligence processing device, and related products.
- the terminal obtains and processes information based on its processor.
- however, processing information by running a software program on the processor is limited by the type of the network model; that is, for some new network models, the processor is not compatible with that network type or version.
- moreover, the network offline model running on the processor is built under a machine learning framework; when the network model is constructed, no distinction is made between the various layers of the network, so a single processor is not compatible with the various network offline models.
- An embodiment of the present application provides an offline model processing method.
- the type identifier of the offline network is saved, so that all types of offline networks compatible with the type identifier can be executed.
- an embodiment of the present application provides a method for processing a network offline model, the method includes:
- obtaining the operation unit information of each sub-network in the network offline model, where the operation unit information includes the correspondence between the sub-network and the operation unit type, and the operation unit type includes a general processing unit type or an artificial intelligence processing unit type;
- defining sub-network operation parameters in the network offline model under construction according to the operation unit information to obtain a constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
- an artificial intelligence processing device for an offline model includes:
- an obtaining module, used to obtain the operating unit information of each sub-network in the network offline model, where the operating unit information includes the correspondence between the sub-network and the operating unit type, and the operating unit type includes a general processing unit type or an artificial intelligence processing unit type;
- a building module, used to define sub-network operating parameters in the network offline model under construction according to the operating unit information to obtain a constructed network offline model, where the sub-network operating parameters are used to represent the operating unit type of each sub-network.
- an embodiment of the present application provides a computer device, including a memory and a processor, where a computer program that can be run on the processor is stored in the memory, and the processor implements the method of the first aspect when executing the computer program.
- an embodiment of the present application provides a readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the method according to the first aspect is implemented.
- an embodiment of the present application provides a combined processing device, wherein the combined processing device includes the artificial intelligence processing device described in the second aspect, a universal interconnection interface, and other processing devices;
- the artificial intelligence processing device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
- an embodiment of the present application provides a parameter processing method, which is applied to an artificial intelligence chip in which an upper-layer language interface and a deep learning framework are deployed.
- the deep learning framework includes a container, and the container is interfaced with the upper-layer language interface; the method includes:
- the upper-layer language interface injects a first parameter into the container, where the first parameter is used to describe the degree of parallelism of the deep learning framework;
- the deep learning framework obtains the first parameter from the container, interacts the first parameter with the module data of the deep learning framework to obtain a second parameter, and passes the second parameter into the container, where the second parameter is used to monitor the parallel computing performance of the deep learning framework described by the first parameter, and the container is a class or structure for storing parameters;
- the upper-layer language interface obtains the second parameter from the container.
- the method further includes: the container includes a parameter data field, and the parameter data field is used to point to the first parameter and the second parameter.
- the first parameter includes data parallelism and model parallelism.
- the second parameter includes the channel elapsed time and the channel elapsed time sum.
- the interacting of the first parameter with the module data of the deep learning framework to obtain the second parameter includes:
- transferring the data parallelism to the module of the deep learning framework for data interaction to obtain the channel elapsed time (CET) and the channel elapsed time sum (CETS) corresponding to the data parallelism;
- transferring the model parallelism to the module of the deep learning framework for data interaction to obtain the CET and CETS corresponding to the model parallelism.
- the deep learning framework is MXNet deep learning framework.
- the deep learning framework further includes a carrier
- the method further includes:
- parameter transfer interaction between the container and the module of the deep learning framework is performed through the carrier, where the parameters include the first parameter and the second parameter.
- the artificial intelligence chip further includes an underlying library module, and the method further includes:
- parameter transfer interaction between the container and the underlying library module is performed through the carrier, where the parameters include the first parameter and the second parameter.
- the container includes a native class or structure in the deep learning framework, or a class or structure independently created in the deep learning framework for the artificial intelligence chip.
- an embodiment of the present application provides a parameter processing device applied to an artificial intelligence chip in which an upper-layer language interface and a deep learning framework are deployed, where the deep learning framework includes a container and the container is interfaced with the upper-layer language interface; the device includes:
- a writing module configured to write a first parameter into the container through the upper-layer language interface, where the first parameter is used to describe the degree of parallelism of the deep learning framework;
- a calculation module configured to obtain the first parameter from the container through the deep learning framework, interact the first parameter with the data of the module of the deep learning framework to obtain a second parameter, and
- pass the second parameter into the container, where the second parameter is used to monitor the performance of parallel operations, and the container is a class or structure used to store parameters;
- an obtaining module configured to obtain the second parameter from the container through the upper-layer language interface.
- an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs.
- the one or more programs are stored in the memory and are configured to be executed by the processor, and the programs include instructions for performing the steps in the method of the sixth aspect.
- an embodiment of the present application provides a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes a computer to perform the method described in the sixth aspect.
- an embodiment of the present application provides a chip, including the parameter processing apparatus provided in the seventh aspect.
- an embodiment of the present application provides a chip packaging structure including the chip described in the tenth aspect above;
- an embodiment of the present application provides a board card including the chip packaging structure described in the eleventh aspect.
- an embodiment of the present application provides an electronic device including the chip packaging structure described in the eleventh aspect or the board card described in the twelfth aspect.
- an embodiment of the present application provides a storage medium for storing a computer program for electronic data exchange, wherein the computer program causes the computer to execute the instructions of the steps described in any method of the sixth aspect.
- an embodiment of the present application provides a neural network quantization method, including:
- obtaining the weights and input data of the target quantization layer of the original neural network, where the target quantization layer is at least one of the calculation layers of the original neural network;
- determining the quantization parameters of the weights of the corresponding layer by using the weights of the target quantization layer of the original neural network; determining the quantization parameters of the input data of the corresponding layer by using the input data of the target quantization layer of the original neural network; where the weights and input data of the target quantization layer follow the principle of no distortion at the maximum absolute value;
- quantizing the target quantization layer of the original neural network according to the quantization parameter of the weights and the quantization parameter of the input data.
- the calculation layer includes at least one of a convolution layer, a fully connected layer, an LRN normalization layer, a deconvolution layer, a Reorg layer, and a Normalize normalization layer.
- the step of determining the quantization parameter of the weight of the corresponding layer by using the weight of the target quantization layer of the original neural network includes:
- the first quantization parameter and the second quantization parameter of the weight of the corresponding layer are determined according to the maximum value of the absolute value of the weight of each layer in the target quantization layer.
- the step of using the input data of the target quantization layer of the original neural network to determine the quantization parameter of the input data of the corresponding layer includes:
- the first quantization parameter and the second quantization parameter of the input data of the corresponding layer are determined according to the maximum value of the absolute value of the input data of each layer in the target quantization layer.
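- To make the maximum-absolute-value-without-distortion principle concrete, the following is a minimal Python sketch; the interpretation of the first quantization parameter as a power-of-two shift and the second as a residual scaling factor for 8-bit quantization is an assumption for illustration, not stated by the application:

```python
import math

def quantization_params(values, bits=8):
    # Map the maximum absolute value onto the largest representable integer
    # so that the largest value does not saturate (no distortion at the max).
    max_abs = max(abs(v) for v in values)
    int_max = 2 ** (bits - 1) - 1            # e.g. 127 for int8
    if max_abs == 0:
        return 0, 1.0
    shift = math.ceil(math.log2(max_abs / int_max))  # assumed first parameter
    scale = max_abs / (int_max * 2.0 ** shift)       # assumed second parameter
    return shift, scale

def quantize(values, shift, scale, bits=8):
    int_max = 2 ** (bits - 1) - 1
    step = scale * 2.0 ** shift              # real value represented by one step
    return [max(-int_max - 1, min(int_max, round(v / step))) for v in values]

weights = [0.02, -1.3, 0.7, 0.004]
shift, scale = quantization_params(weights)
print(quantize(weights, shift, scale))       # integers within [-128, 127]
```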
- the method further includes:
- the first quantization method includes:
- the second quantization method includes:
- the third quantization method includes:
- the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
- the method further includes:
- obtaining a weight quantization intermediate parameter of the corresponding channel by using the first weight quantization parameter and the second weight quantization parameter of each channel of each layer in the target quantization layer, where the target quantization layer includes a convolutional layer and/or a fully connected layer;
- the input data quantization intermediate parameter of each layer in the target quantization layer is used to obtain the input data quantization result of the corresponding layer.
- the method further includes:
- each of the target quantization layers of the original neural network is processed using a first quantization method, a second quantization method, or a third quantization method, where the target quantization layer further includes at least one layer of the calculation layers of the original neural network other than the convolutional layer and/or the fully connected layer;
- the first quantization method includes:
- the second quantization method includes:
- the third quantization method includes:
- the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
- an embodiment of the present application provides a neural network quantization device.
- the device includes:
- a data reading unit used to obtain the weight and input data of the target quantization layer of the original neural network; wherein, the target quantization layer is at least one of the calculation layers of the original neural network;
- a quantization parameter determination unit, for determining the quantization parameter of the weights of the corresponding layer by using the weights of the target quantization layer of the original neural network, and determining the quantization parameter of the input data of the corresponding layer by using the input data of the target quantization layer of the original neural network; where the weights and input data of the target quantization layer follow the principle of no distortion at the maximum absolute value;
- the quantization unit is configured to quantize the target quantization layer of the original neural network according to the quantization parameter of the weight value and the quantization parameter of the input data.
- the calculation layer includes at least one of a convolution layer, a fully connected layer, an LRN normalization layer, a deconvolution layer, a Reorg layer, and a Normalize normalization layer.
- the quantization parameter determination unit is specifically configured to obtain the maximum absolute value of the weights of each layer in the target quantization layer, and to determine the first quantization parameter and the second quantization parameter of the weights of the corresponding layer according to that maximum absolute value.
- the quantization parameter determination unit is specifically configured to obtain the maximum absolute value of the input data of each layer in the target quantization layer, and to determine the first quantization parameter and the second quantization parameter of the input data of the corresponding layer according to that maximum absolute value.
- the device further includes:
- a processing unit, configured to process each of the target quantization layers of the original neural network using the first quantization method, the second quantization method, or the third quantization method; wherein,
- the first quantization method includes:
- the second quantization method includes:
- the third quantization method includes:
- the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
- the device further includes:
- a processing unit, configured to obtain a weight quantization intermediate parameter of the corresponding channel by using the first weight quantization parameter and the second weight quantization parameter of each channel of each layer in the target quantization layer; where the target quantization layer includes a convolutional layer and/or a fully connected layer;
- the input data quantization intermediate parameter of each layer in the target quantization layer is used to obtain the input data quantization result of the corresponding layer.
- the processing unit is further configured to process each of the target quantization layers of the original neural network using the first quantization method, the second quantization method, or the third quantization method; where the target quantization layer further includes at least one layer of the calculation layers of the original neural network other than the convolutional layer and/or the fully connected layer;
- the first quantization method includes:
- the second quantization method includes:
- the third quantization method includes:
- the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
- an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
- the processor implements the method described in the fifteenth aspect when executing the computer program.
- an embodiment of the present application provides a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes a computer to perform the method according to the fifteenth aspect.
- an embodiment of the present application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform the method described in the fifteenth aspect.
- the operation unit information of the network offline model is obtained, and when the network offline model is constructed, the operation parameters of each sub-network are defined and the operation unit type of each
- sub-network is marked in the operation parameters, so as to classify the sub-networks of the network offline model; in this way, when the network offline model is run, each sub-network is assigned to the corresponding processor, achieving compatible operation of the network offline model and enriching the functions of the artificial intelligence processing device.
- an upper-layer language interface and a deep learning framework are deployed in the artificial intelligence chip.
- the deep learning framework includes a container, and the container is connected to the upper-layer language interface.
- the upper-layer language interface writes the first parameter into the container; the deep learning framework then obtains the first parameter from the container, combines the first parameter with the module parameters of the deep learning framework to obtain the second parameter, and passes the second parameter to the container; finally, the upper-layer language interface obtains the second parameter from the container and provides it to the user.
- since the first parameter is used to describe the degree of parallelism of the deep learning framework and the second parameter is used to monitor the performance of parallel operations, writing the first parameter into the container improves the effect of parallel operations in the deep learning framework, and the statistics and acquisition of the second parameter improve the monitorability of parallel computing performance.
- the target quantization layer of the original neural network is quantized: the quantization parameter of the weights and the quantization parameter of the input data are obtained, and the quantization of the target quantization layer is then completed according to these quantization parameters.
- when the quantized target quantization layer performs operations, since the input data and the weights are quantized data, the storage space of the weights and of the input data is reduced, and the bit width of the calculations is correspondingly reduced; this therefore has the advantages of reducing the amount of calculation, increasing the calculation speed, saving storage space, reducing power consumption, and saving costs.
- FIG. 1 is a schematic flowchart of a method for processing a network offline model provided by an embodiment of this application;
- FIG. 2 is a schematic flowchart of another method for processing a network offline model provided by an embodiment of the present application;
- FIG. 3 is a schematic structural diagram of an artificial intelligence device of a network offline model provided by an embodiment of the present application
- FIG. 4 is a block diagram of functional units of an artificial intelligence device of a network offline model provided by an embodiment of the present application
- FIG. 5A is a schematic structural diagram of an artificial intelligence chip provided by an embodiment of the present application;
- FIG. 5B is a schematic flowchart of a parameter processing method disclosed in an application example;
- FIG. 6 is a schematic flowchart of another parameter processing method provided by an embodiment of the present application.
- FIG. 7 is a schematic flowchart of another parameter processing method provided by an embodiment of the present application;
- FIG. 8 is a schematic structural diagram of a parameter processing device provided by an embodiment of the present application;
- FIG. 9A is a schematic diagram of a combined processing device provided by an embodiment of the present application;
- FIG. 9B is a structural diagram of another combined processing device provided by an embodiment of the present application.
- FIG. 10 is a schematic structural diagram of a board provided by an embodiment of the present application.
- FIG. 11 is a schematic structural diagram of a neural network architecture
- FIG. 12 is a schematic flowchart of a neural network quantization method provided by an embodiment of the present application.
- FIG. 13A is a schematic diagram of the weight structure of the convolutional layer provided by this application;
- FIG. 13B is a schematic diagram of the data structure of one channel of the weights of the convolutional layer provided by this application;
- FIG. 14 is a schematic flowchart of a quantization operation device provided by an embodiment of the present application.
- FIG. 15 is a structural diagram of an electronic device provided by an embodiment of the present application.
- the artificial intelligence processing device in this application may include a smart phone (such as an Android phone, an iOS phone, or a Windows phone), a tablet computer, a palmtop computer, a laptop computer, a mobile Internet device (MID, Mobile Internet Devices), or a wearable device.
- FIG. 1 is a schematic flowchart of a method for processing a network offline model provided by an embodiment of the present application.
- the method is applied to an artificial intelligence device.
- the artificial intelligence device includes a general-purpose processor and an artificial intelligence processor.
- the method includes the content shown in steps S101 to S102:
- Step S101 Obtain the operation unit information of each sub-network in the network offline model, where the operation unit information includes the correspondence between the sub-network and the operation unit type, and the operation unit type includes a general processing unit type or an artificial intelligence processing unit type.
- the operation unit information further includes entry function information of the sub-network, and the entry function information is used when the artificial intelligence processing unit runs the sub-network;
- the offline instructions corresponding to the sub-network are retrieved through the entry function, and the offline instructions of some sub-networks are pre-compiled to speed up the running of the network offline model.
- the general-purpose processor may include a central processing unit (CPU), a graphics processing unit (GPU), and/or an image processing unit (IPU);
- the artificial intelligence processor includes a machine learning processing unit (MLU), and the artificial intelligence processor can be integrated from multiple MLUs to form a multi-core artificial intelligence processor.
- before the operation unit information of each sub-network in the network offline model is obtained, it is first determined whether multiple network layers of the network offline model can be fused; if so, the network layers that can be fused are merged into one sub-network, and each network layer that cannot be fused is taken as a single sub-network. After the fusion operation is performed on the network offline model, several sub-networks corresponding to the network offline model are obtained.
- each sub-network can be a separate network layer, or a sub-network can be obtained by fusing several network layers; for example, when the network offline model includes the convolution layer Convolution, the normalization layer BatchNorm, and the scaling layer Scale,
- the convolution layer Convolution, the normalization layer BatchNorm, and the scaling layer Scale in the network offline model can be fused to obtain one sub-network, as in the sketch below.
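- As a minimal sketch of the fusion step above (assuming a hypothetical fusibility rule in which adjacent Convolution, BatchNorm, and Scale layers may be merged), a linear list of layers can be partitioned into sub-networks as follows:

```python
# Hypothetical fusibility rule: these layer types may be merged when adjacent.
FUSIBLE = {"Convolution", "BatchNorm", "Scale"}

def partition_into_subnetworks(layers):
    """Group a linear list of layer types into sub-networks."""
    subnets, current = [], []
    for layer in layers:
        if layer in FUSIBLE:
            current.append(layer)            # extend the current fusible run
        else:
            if current:
                subnets.append(current)      # close the fused sub-network
                current = []
            subnets.append([layer])          # a non-fusible layer stands alone
    if current:
        subnets.append(current)
    return subnets

model = ["Convolution", "BatchNorm", "Scale", "Pooling", "Convolution", "ReLU"]
print(partition_into_subnetworks(model))
# [['Convolution', 'BatchNorm', 'Scale'], ['Pooling'], ['Convolution'], ['ReLU']]
```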
- the operation unit information of each sub-network in the network offline model is obtained to determine the operation unit type of each sub-network, so that when the network offline model is constructed, the operation unit type of each sub-network can be defined in the field corresponding to the operation unit type.
- Step S102 Define sub-network operation parameters in the constructed network offline model according to the operation unit information to obtain a constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
- the artificial intelligence device uses a pre-installed machine learning framework to build a network offline model.
- the following takes the convolutional neural network framework caffe (Convolutional Architecture for Fast Feature Embedding) as an example to specifically describe the construction of a network offline model.
- generating an offline file requires two inputs: a prototxt file containing the network information, and a caffemodel file containing the trained weights and offsets.
- caffe first calls the underlying library interface to create an offline file; caffe then divides the entire network described by the prototxt input into several sub-networks according to whether each layer can run on the artificial intelligence processor, so that several of the sub-networks can be executed on the artificial intelligence processor.
- for each such sub-network, the caffe framework calls the underlying library interface to compile the sub-network into offline instructions that can run on the artificial intelligence processor; the caffe framework then saves the generated offline instructions to the pre-generated offline file by calling the interface provided by the underlying library.
- to obtain the weights, caffe first loads the trained caffemodel;
- the weight and offset data are read out and stored in the corresponding blob, where blob is a data structure defined in caffe that is used to transfer data between layers.
- these weight and offset data are passed to the underlying library when caffe calls the underlying library to generate offline instructions; caffe then calls the underlying interface of the underlying library to store the offline instructions, weights, and offset data together in the offline file. This generation flow is outlined in the sketch below.
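- The generation flow described above can be outlined as the following hedged Python sketch; every class and function here is a hypothetical stand-in, not the real caffe or underlying-library API:

```python
from dataclasses import dataclass, field

@dataclass
class SubNetwork:
    name: str
    layers: list
    runs_on_ai_processor: bool               # set when splitting the network

@dataclass
class OfflineFile:
    """Stand-in for the offline file created by the underlying library."""
    segments: dict = field(default_factory=dict)

def compile_subnet(subnet):
    # Stand-in for the underlying-library compiler that produces offline
    # instructions; the real interface is not described in this text.
    return f"offline-instructions({subnet.name})"

def generate_offline_file(subnets, weights):
    offline = OfflineFile()
    for sn in subnets:
        if sn.runs_on_ai_processor:
            # Offline instructions, weights and offsets are stored together.
            offline.segments[sn.name] = (compile_subnet(sn), weights.get(sn.name))
    return offline

subnets = [SubNetwork("subnet0", ["Convolution", "BatchNorm"], True),
           SubNetwork("subnet1", ["CustomLayer"], False)]
print(generate_offline_file(subnets, {"subnet0": [0.1, -0.2]}).segments)
```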
- caffe calls the underlying library to compile the subnet to generate offline instructions
- when running the sub-network, caffe can specify that it runs on several cores, which is called the specified model parallelism; the sub-network can then be run as one model across those cores.
- custom unit information is also stored, and each sub-network corresponds to one piece of unit information.
- the unit information can be generated through the protobuf mechanism, and caffe can append the unit information to the end of the offline file by calling the relevant interface provided by protobuf; this information is used later when the offline file is run.
- a unit information format named .SegmentInfoUnit may be pre-defined, which is used to save the sub-network operating parameters of each sub-network.
- the sub-network operation parameters include sub-network name, operation unit type and sub-network parameter information
- the sub-network parameter information may be used to indicate resource scheduling of the processor when executing the sub-network.
- the sub-network parameter information may include convolution kernel information, etc., and may be used to represent resource information of an artificial intelligence processing unit that needs to be deployed to operate the sub-network.
- the unit information can also store the index identifier of the offline instructions corresponding to each sub-network and the index identifier of the calculation parameters, which makes it convenient to read the offline instructions and calculation parameters corresponding to each sub-network from the offline file. The unit information is then appended to the offline file caffemodel, so that the sub-network operating parameters of each sub-network, as well as the offline instructions and calculation parameters corresponding to the sub-network, can be read from the offline file through the underlying interface of caffe based on the index identifiers.
- the calculation parameters are parameter data related to the operation of each sub-network; for example, when the sub-network is a convolutional layer, the calculation parameters are the weights and the offset (if the convolutional layer has no offset, the offset is zero); as another example, if the sub-network is an activation layer, the calculation parameter is the activation function. A sketch of this unit information structure follows below.
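- The unit information described above might look as follows; this is a hedged Python sketch, and the field names are assumptions based on the description (the application itself uses a protobuf-defined .SegmentInfoUnit format):

```python
from dataclasses import dataclass
from enum import Enum

class OperationUnitType(Enum):
    GENERAL_PROCESSING_UNIT = 0      # e.g. CPU / GPU / IPU
    AI_PROCESSING_UNIT = 1           # e.g. MLU

@dataclass
class SegmentInfoUnit:
    """Sketch of the per-sub-network unit information appended to the
    offline file; the field names are assumptions based on the description."""
    subnetwork_name: str
    unit_type: OperationUnitType
    parameter_info: dict              # e.g. convolution kernel information
    offline_instruction_index: int    # index of the subnet's offline instructions
    calculation_parameter_index: int  # index of weights, offsets, etc.

unit = SegmentInfoUnit("subnet0", OperationUnitType.AI_PROCESSING_UNIT,
                       {"kernel": (3, 3)}, 0, 0)
print(unit)
```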
- storing the sub-network operating parameters of each sub-network in a data structure corresponding to each sub-network may proceed as follows: based on the Protocol Buffers mechanism, a preset BP Message is obtained; the fields in the layers of each sub-network that match the BP Message are compiled into a binary file through the compiler in the Protocol Buffers mechanism, and the binary file is saved in a data structure in the .SegmentInfoUnit format.
- the Protocol Buffers mechanism is only an exemplary illustration, and this application does not limit the mechanism used to store the sub-network operating parameters.
- saving the operating unit type provides a new method for saving the network offline model; moreover, based on the saved operating unit type of each sub-network, different operating units can be used to run different network layers.
- in this way, the operation of the network offline model is made more flexible and more compatible, and it can be applied to various artificial intelligence devices.
- FIG. 2 is a schematic flowchart of another method for processing a network offline model provided by an embodiment of the present application.
- the method is applied to an artificial intelligence device.
- the artificial intelligence device may include a general-purpose processor and an artificial intelligence processor.
- the method includes the content shown in steps S201-S205:
- Step S201 Obtain the operation unit information of each sub-network in the network offline model, where the operation unit information includes the correspondence between the sub-network and the operation unit type, and the operation unit type includes a general processing unit type or an artificial intelligence processing unit type.
- Step S202 Define the sub-network operation parameters in the constructed network offline model according to the operation unit information to obtain the constructed network offline model.
- the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
- Step S203 Determine the operation unit corresponding to the target subnetwork according to the subnetwork operation parameters, and the target subnetwork is any subnetwork of the network offline model.
- Step S204 Run the target sub-network in an operation unit corresponding to the target sub-network to implement running the network offline model.
- the implementation process of running the target sub-network on the corresponding operation unit may be as follows: the data structure is traversed sequentially through the interface of the machine learning framework to read the network operation parameters of the network offline model, and according to the network operation parameters,
- the operation unit that executes the target sub-network is determined, together with the operation units of the previous and next sub-networks connected to the target sub-network; that is, to complete the forward inference process, the operation unit of the target sub-network is instructed to obtain its input data from the operation unit of the previous sub-network and to send the output result of the target sub-network as input data to the operation unit of the next sub-network.
- for example, if the operation unit type in the network operation parameters of the target sub-network is the artificial intelligence processing unit type,
- the operation unit type of the previous sub-network is the general processing unit type,
- and the operation unit type of the next sub-network is the general processing unit type, the artificial intelligence processing unit is instructed to obtain data from the general processing unit, and
- the output results are sent to the general processing unit in accordance with the running order of the network offline model, completing the forward inference process of the network offline model. A sketch of this dispatch loop follows below.
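- The dispatch loop of the forward inference process can be sketched as follows; the runner functions are hypothetical stand-ins for executing a sub-network on a general processing unit or an artificial intelligence processing unit:

```python
def run_on_general_unit(subnet, data):
    # Stand-in for executing a sub-network on the CPU/GPU/IPU.
    return f"cpu({subnet}:{data})"

def run_on_ai_unit(subnet, data):
    # Stand-in for executing precompiled offline instructions on the MLU.
    return f"mlu({subnet}:{data})"

RUNNERS = {"general": run_on_general_unit, "ai": run_on_ai_unit}

def forward_inference(subnets, input_data):
    """subnets: (name, unit_type) pairs in the network's running order."""
    data = input_data
    for name, unit_type in subnets:
        # The output of the previous sub-network is the input of the next.
        data = RUNNERS[unit_type](name, data)
    return data

plan = [("subnet0", "general"), ("subnet1", "ai"), ("subnet2", "general")]
print(forward_inference(plan, "image"))
```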
- a general processing unit and an artificial intelligence processing unit are provided in the artificial intelligence processing device; the operation unit of each sub-network is determined based on the operating parameters of each sub-network, and the corresponding operation
- unit then runs the sub-network, so that when the artificial intelligence processing unit does not support the operation of a sub-network, the general-purpose processing unit runs that sub-network; that is, the general-purpose processing unit and the artificial intelligence processing unit work together to be compatible with
- all types of network offline models, which increases the scope of application of network offline models. When the general processing unit and the artificial intelligence processing unit work together, the network layers that can run on the artificial intelligence processing unit are placed on the artificial intelligence processing unit; compared with
- putting the entire network offline model on the general processing unit for execution, this speeds up the inference process of the entire offline network, and offline instructions are generated in advance for the network layers that can run on the artificial intelligence processing unit, saving the time consumed by generating offline instructions during execution; in addition, the general processing unit can perform part or all of the operations of the network offline model, reducing the work pressure of the artificial intelligence processing unit.
- the implementation process of determining the operation unit corresponding to the target sub-network according to the sub-network operating parameters may be as follows: the model parallelism of the network offline model is acquired, and according to the scheduling mechanism of the artificial intelligence processing unit, the model parallelism, and the sub-network operating parameters, the artificial intelligence processing unit corresponding to the target sub-network is determined.
- specifically, the offline instructions corresponding to the target sub-network are read from the offline file of the network offline model, and the offline instructions are parsed to obtain the model parallelism contained in them;
- from the model parallelism, the number of artificial intelligence processing units required to run the target sub-network is obtained;
- according to the scheduling mechanism of the artificial intelligence processing units, that number of artificial intelligence processing units is deployed and designated as the artificial intelligence processing units that run the target sub-network, and the offline instructions and calculation
- parameters corresponding to the sub-network are distributed to those artificial intelligence processing units to complete the operation of the target sub-network.
- the model parallelism of each sub-network can be set in advance, that is, the number of artificial intelligence processing units required to run the sub-network can be specified, so that on the artificial intelligence processor a multi-core artificial intelligence processing unit can jointly execute the operations corresponding to the sub-network, improving the running speed of the sub-network. A sketch of this scheduling step follows below.
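- A minimal sketch of the scheduling step, assuming a plain take-first policy as a stand-in for the scheduling mechanism (which the text does not specify):

```python
def schedule_ai_units(free_cores, model_parallelism):
    """Pick as many MLU cores as the model parallelism requires; a plain
    take-first policy stands in for the real scheduling mechanism."""
    if len(free_cores) < model_parallelism:
        raise RuntimeError("not enough free artificial intelligence cores")
    chosen = free_cores[:model_parallelism]
    del free_cores[:model_parallelism]       # mark the chosen cores as busy
    return chosen

cores = [0, 1, 2, 3]
# Offline instructions and calculation parameters would then be
# distributed to the returned cores to run the target sub-network.
print(schedule_ai_units(cores, 2))           # [0, 1]
```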
- the implementation process of running each target sub-network may be as follows: the interface instruction used when calling the underlying library is acquired; the interface instruction is parsed to obtain the channel identifier included in the interface instruction; the channel used by the artificial
- intelligence processing unit to transmit data is determined according to the channel identifier; and the target sub-network is run on the artificial intelligence processing unit through that channel to run the network offline model.
- each target artificial intelligence processing unit contains multiple data transmission channels;
- designating the corresponding channel through the interface instruction to transmit the offline instructions and calculation parameters to the target artificial intelligence processing unit speeds up the read and write speed of the artificial intelligence processing unit and thereby accelerates the inference process of the network offline model.
- FIG. 3 is a schematic structural diagram of an artificial intelligence device of a network offline model provided by an embodiment of the present application.
- the artificial intelligence device 300 includes a general-purpose processor, an artificial intelligence processor, a memory, a communication interface, and one or more programs, where the one or more programs are different from the one or more application programs, and the one or more programs are stored in the memory and configured to be executed by the processor; the above
- the program includes instructions for performing the following steps:
- obtaining the operation unit information of each sub-network in the network offline model, where the operation unit information includes the correspondence between the sub-network and the operation unit type, and the operation unit type includes a general processing unit type or an artificial intelligence processing unit type;
- defining sub-network operation parameters in the network offline model under construction according to the operation unit information to obtain a constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
- each sub-network includes multiple network layers after fusion
- the sub-network operation parameters include sub-network name, operation unit type information and sub-network parameter information.
- the above program further includes instructions for performing the following steps:
- Execution of the constructed network offline model specifically includes instructions for performing the following steps:
- the above-mentioned program specifically includes instructions for performing the following steps:
- the artificial intelligence processing unit corresponding to the target sub-network is determined according to the scheduling mechanism of the artificial intelligence processing unit, the model parallelism, and the operating parameters of the sub-network.
- the operation unit corresponding to the target sub-network is an artificial intelligence processing unit
- the operation unit corresponding to the target sub-network runs the target sub-network to implement the running of the network offline model
- the above program specifically includes instructions for performing the following steps:
- the target sub-network is executed on the artificial intelligence processing unit through the channel to run the network offline model.
- FIG. 4 shows a possible functional unit block diagram of the artificial intelligence device 400 of the network offline model involved in the above embodiment.
- the artificial intelligence device 400 includes: an acquisition module 410 and a construction module 420;
- the obtaining module 410 is used to obtain the operating unit information of each sub-network in the offline model of the network.
- the operating unit information includes the correspondence between the sub-network and the operating unit type.
- the operating unit type includes a general processing unit type or an artificial intelligence processing unit type;
- the construction module 420 is used to define sub-network operation parameters in the network offline model under construction according to the operation unit information to obtain a constructed network offline model, where the sub-network operation parameters are used to represent the operation unit type of each sub-network.
- each sub-network includes multiple network layers after fusion.
- sub-network operation parameters include sub-network name, operation unit type information and sub-network parameter information.
- the artificial intelligence device 400 further includes: an execution module 430;
- the execution module 430 is used to run the constructed network offline model, specifically used for:
- the execution module 430 is specifically configured to:
- the artificial intelligence processing unit corresponding to the target sub-network is determined according to the scheduling mechanism of the artificial intelligence processing unit, the model parallelism, and the operating parameters of the sub-network.
- the operation unit corresponding to the target sub-network is an artificial intelligence processing unit
- the operation unit corresponding to the target sub-network runs the target sub-network to implement the operation of the network offline model
- the execution module 430 is specifically used for:
- An embodiment of the present application further provides a computer storage medium, where the computer storage medium stores a computer program, and the computer program is executed by a processor to implement some or all of the steps of any network offline model processing method described in the foregoing method embodiments.
- An embodiment of the present application further provides a computer program product; the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause the computer to execute some or all of the steps of any network offline model processing method described in the above method embodiments.
- the deep learning framework does not have a mechanism or method for setting parameters related to artificial intelligence chips, so users cannot set parameters for artificial intelligence chips or obtain data related to chip operation; how to improve this situation has become an urgent problem to be solved.
- the purpose of the present disclosure is to provide a parameter processing method and related products: a container is added, the first parameter used to describe the parallelism of the deep learning framework is written into the container, and the first
- parameter is then combined with other modules of the deep learning framework to obtain the second parameter used to monitor the performance of the parallel operation, which improves the calculation effect of the deep learning framework and increases the monitorability of the parallel operation performance.
- the artificial intelligence chip 10 includes an upper-layer language interface 101 and a deep learning framework 100.
- the upper-layer language interface is used to access a programming language
- the deep learning framework includes containers and modules of other deep learning frameworks.
- the container can interact with the modules of the deep learning framework.
- the modules of the deep learning framework include the graph executor module, various operator modules, and the engine module.
- the upper-layer language interface 101 may also be deployed on other chips or devices.
- the other chips or devices are connected to the artificial intelligence chip, and information exchange between the two can also be performed.
- the artificial intelligence chip 10 may also include an underlying library module 102, and the underlying library module includes an underlying runtime library and a driver module.
- the deep learning framework 100 also includes a carrier for data transfer between the container and other modules of the deep learning framework or the underlying library module.
- FIG. 5B is a schematic flowchart of a parameter processing method disclosed in the application example.
- the parameter processing method is applied to the artificial intelligence chip shown in FIG. 5A.
- the method specifically includes the following steps:
- the upper-layer language interface writes a first parameter to the container, where the first parameter is used to describe the degree of parallelism of the deep learning framework.
- Deep learning framework is a code framework used for deep learning projects.
- Currently popular deep learning frameworks include Tensorflow, Caffe, Theano, MXNet, Torch, and PyTorch.
- An interface is a shared boundary where two independent components in the system exchange information.
- the upper language interface and the deep learning framework are two independent components, so there is an interface between them for information interaction.
- Upper-level languages such as Python and R can be used in deep learning. Under normal circumstances, the upper-level language interface is directly connected to the deep learning framework.
- the lack of a related parameter setting mechanism in this interface prevents users from setting and acquiring parameters of the artificial intelligence chip; therefore, a new container is added below the upper-layer language interface for parameter setting and related data acquisition.
- the parameter data fields for parameter setting and parameter acquisition can be added in the container or in other modules, and the position for parameter setting and parameter acquisition is then designated as the container position.
- a container is a class or structure used to store data and belongs to a module in a deep learning framework.
- the container in the deep learning framework may be a native class or structure in the deep learning framework, and then add fields for parameter setting and parameter acquisition to the class or structure, such as the graphexecutor class.
- the container in the deep learning framework may also be a class or structure independently created by the user for the parameter processing method in the artificial intelligence chip, such as the mludevice device class.
- the method further includes: a parameter data field is included in the container, and the parameter data field is used to point to the first parameter and the second parameter.
- if the parameter data field is not created in the container, there is no data field related to the first parameter and the second parameter in the entire artificial intelligence chip, so it is impossible to set the first parameter, obtain the second parameter, or manage the first
- parameter and the second parameter. A hedged sketch of such a container follows below.
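- A hedged sketch of such a container in Python; all names are assumptions, and the parameter data field is modeled as a dictionary pointing at the first and second parameters:

```python
class ParameterContainer:
    """Container sketch: a class holding a parameter data field that points
    to the first and second parameters. All names are assumptions."""

    def __init__(self):
        self.params = {                  # the parameter data field
            "data_parallelism": None,    # first parameter
            "model_parallelism": None,   # first parameter
            "cet": None,                 # second parameter: channel elapsed time
            "cets": None,                # second parameter: channel elapsed time sum
        }

    def set_param(self, name, value):    # used by the upper-layer interface
        self.params[name] = value

    def get_param(self, name):           # used by framework modules and users
        return self.params[name]
```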
- the first parameter includes data parallelism and model parallelism.
- the deep learning framework in this embodiment is an MXNet deep learning framework.
- data parallel processing (DP) refers to the parallel processing of data by different cores or processing units,
- and the degree of data parallelism refers to the maximum number of parallel executions when data is processed in parallel;
- model parallel processing (MP) refers to the parallel processing of an operator or model on multiple cores,
- and the degree of model parallelism refers to the maximum number of parallel executions when a model or operator is processed in parallel.
- the set parallelism parameters can be matched with the hardware foundation of the artificial intelligence chip.
- when the scale, sparsity, or other characteristics of the input data differ, different parallelism parameters also need to be set.
- the set data parallelism and/or model parallelism are written through the programming language and then injected into the container through the upper-layer language interface, which completes the setting of the first parameter.
- MXNet is a deep learning framework that supports languages such as C++, Python, R, Scala, Julia, Matlab, and JavaScript; it supports imperative and symbolic programming and can run on any hardware including artificial intelligence chips, and it is currently one of the best deep learning frameworks. Therefore, the MXNet deep learning framework can be well combined with the method of the embodiments of the present application to complete the setting of the first parameter and the acquisition of the second parameter; a brief usage sketch follows below.
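- Setting the first parameter from the Python side could then look as follows, reusing the hypothetical ParameterContainer sketched above (this is illustrative, not an actual MXNet API):

```python
# Reusing the hypothetical ParameterContainer sketched above: the upper-layer
# language (here Python) writes the first parameter into the container,
# which completes the setting of the first parameter.
container = ParameterContainer()
container.set_param("data_parallelism", 4)   # e.g. 4-way data parallelism
container.set_param("model_parallelism", 2)  # e.g. 2-way model parallelism
```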
- the deep learning framework obtains the first parameter from the container, interacts the first parameter with the module data of the deep learning framework to obtain a second parameter, and passes the second parameter into the container, where the second parameter is used to monitor the performance of the parallel operation of the deep learning framework described by the first parameter.
- the module of the deep learning framework obtains the first parameter from the container.
- the modules of the deep learning framework include the graph executor module, various operator modules, and the engine module. For example, if an operator module needs to perform parallel operations, it needs to obtain the first parameter and then combine it with the other parameters in the operator module, such as the data size, to obtain the second parameter.
- the second parameter is the parameter used to monitor the performance of the parallel operation, and the obtained second parameter needs to be returned to the container.
- the second parameter includes the channel elapsed time and the channel elapsed time sum.
- interacting the first parameter with the module data of the deep learning framework to obtain the second parameter includes: transferring the data parallelism to the module of the deep learning framework for data interaction to obtain the channel elapsed time (CET) and the channel elapsed time sum (CETS) corresponding to the data parallelism; and transferring the model parallelism to the module of the deep learning framework for data interaction to obtain the CET and CETS corresponding to the model parallelism, where the CET and CETS are used to calculate the computing time of the operator.
- whether the deep learning framework adopts DP or MP,
- the channel elapsed time (Channel Elapsed Time, CET) and the channel elapsed time sum (Channel Elapsed Time Sum, CETS) are
- performance parameters of the parallel operation of the parallel channels and are used to calculate the computing time of the operator.
- the second parameter, obtained for a single module or the entire deep learning framework according to the first parameter and the modules of the deep learning framework, is transferred to the container, which completes the acquisition of the second parameter; a sketch of this measurement step follows below.
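- How a framework module might measure the CET per parallel channel, accumulate the CETS, and return both to the container is sketched below; timing each channel with time.perf_counter is an assumption, since the application does not state how the CET is measured (the container is the hypothetical one sketched earlier):

```python
import time

def run_parallel_and_report(container, channel_work):
    """channel_work: one no-argument callable per parallel channel.
    Measures a per-channel elapsed time (CET) and their sum (CETS) and
    writes both back into the container as the second parameter."""
    cet_per_channel, cets_total = [], 0.0
    for work in channel_work:
        start = time.perf_counter()
        work()                               # this channel's share of the op
        elapsed = time.perf_counter() - start
        cet_per_channel.append(elapsed)
        cets_total += elapsed
    container.set_param("cet", cet_per_channel)
    container.set_param("cets", cets_total)  # later used for operator time
```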
- the upper-layer language interface obtains a second parameter from the container.
- the upper-level language interface can obtain the second parameter from the container and expose it, so the second parameter is visible to the user.
- the user can monitor the computing performance of the deep learning framework through the second parameter, and then modify the first parameter or other parameters to adjust or improve the second parameter, thereby improving the computing effect of the deep learning framework.
- the deep learning framework further includes a carrier
- the method further includes: the container and the module of the deep learning framework perform data transmission and interaction through the carrier.
- the carrier is a class or structure used for data transfer and interaction in the deep learning framework.
- the container is not directly connected to the other modules of the deep learning framework, and data can be transferred between them through the carrier.
- the carrier in the MXNet framework may be the operator's context class OpContext.
- the first parameter may be assigned to the carrier, and then the carrier passes the first parameter to the module of the deep learning framework.
- the second parameter can also be transferred from the module of the deep learning framework to the container by the carrier.
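- A minimal OpContext-style carrier could look as follows; this is a hedged sketch reusing the hypothetical container from above, since the real carrier interface is not described:

```python
class Carrier:
    """OpContext-style carrier (sketch): shuttles parameters between the
    container and a framework module, which are not directly connected."""

    def __init__(self, container):
        self.container = container

    def deliver_first_parameter(self):
        # Copy the first parameter out of the container for a module call.
        return {k: self.container.get_param(k)
                for k in ("data_parallelism", "model_parallelism")}

    def return_second_parameter(self, cet, cets):
        # Bring the second parameter back from the module into the container.
        self.container.set_param("cet", cet)
        self.container.set_param("cets", cets)
```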
- the artificial intelligence chip further includes an underlying library module
- the method further includes: performing parameter transfer interaction between the container and the underlying library module through the carrier, where the parameters include the first parameter and the second parameter.
- the low-level library modules include low-level runtime libraries and driver modules.
- the parameters of these low-level libraries may also affect the parallel performance or other performance of the deep learning framework; therefore, the container can also interact with the low-level library modules through the carrier in order to obtain parallel computing performance parameters or other performance parameters.
- the upper-layer language interface and the deep learning framework are deployed in the artificial intelligence chip.
- the deep learning framework includes a container, and the container is connected to the upper-layer language interface.
- the upper-layer language interface writes the first parameter into the container.
- the deep learning framework obtains the first parameter from the container, combines the first parameter with the module parameters of the deep learning framework to obtain the second parameter, and passes the second parameter to the container; finally, the upper-level language interface obtains the second parameter from the container and provides it to the user.
- the first parameter is used to describe the degree of parallelism of the deep learning framework and the second parameter is used to monitor the performance of parallel operations
- this process improves the effect of parallel operations in the deep learning framework by writing the first parameter into the container, and
- the statistics and acquisition of the second parameter improve the monitorability of parallel computing performance.
- FIG. 6 is a schematic flowchart of another parameter processing method provided by an embodiment of the present application.
- the parameter processing method includes:
- the upper-level language interface injects the first parameter into the container, where the first parameter is used to describe the degree of parallelism of the deep learning framework;
- the deep learning framework further includes a carrier.
- the deep learning framework obtains the first parameter from the container, and interacts the first parameter with the module data of the deep learning framework through the carrier to obtain the second parameter;
- the deep learning framework passes the second parameter to the container through the carrier, and the second parameter is used to monitor the performance of the parallel operation;
- the artificial intelligence chip further includes an underlying library module, and the container and the underlying library module perform parameter transmission and interaction through the carrier, and the parameters include a first parameter and a second parameter.
- FIG. 7 is a schematic flowchart of another parameter processing method provided by an embodiment of the present application.
- the parameter processing method includes:
- S313: Inject the data parallelism and/or the model parallelism into the container through the upper-layer language interface;
- S314: Pass the data parallelism to the module of the deep learning framework for data interaction, to obtain the channel elapsed time (CET) and total channel elapsed time (CETS) corresponding to the data parallelism; the CET and CETS are used to calculate the computing time of the operators;
- S315: The upper-layer language interface obtains the CET and CETS corresponding to the data parallelism and/or the model parallelism from the container.
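- as a hedged illustration of steps S313 to S315, the following sketch reuses the hypothetical ParameterContainer and Carrier classes from the sketch above; the FrameworkModule and its fabricated CET/CETS figures are likewise placeholders, not the patent's implementation.

```python
# Hedged usage sketch of S313-S315 (reuses ParameterContainer and Carrier
# from the earlier sketch; all names and numbers are illustrative).

class FrameworkModule:
    """Stand-in for a deep learning framework module (e.g. an MXNet executor)."""
    def configure(self, first_parameter):
        parallelism = first_parameter["data_parallelism"]
        # a real framework would run operators in parallel and time each
        # channel; here we fabricate placeholder timings
        cet = [1.0 / parallelism] * parallelism   # per-channel elapsed times
        cets = sum(cet)                           # total channel elapsed time
        return {"CET": cet, "CETS": cets}

container = ParameterContainer()
carrier = Carrier(container)

# S313: the upper-layer language interface injects the first parameter
container.first_parameter = {"data_parallelism": 4}

# S314: the carrier passes it to the framework module and collects CET/CETS
carrier.return_second_parameter(carrier.deliver_first_parameter(FrameworkModule()))

# S315: the upper-layer language interface reads the second parameter back
print(container.second_parameter)   # {'CET': [0.25, 0.25, 0.25, 0.25], 'CETS': 1.0}
```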
- FIG. 8 is a parameter processing device provided by an embodiment of the present application, which is applied to an artificial intelligence chip shown in FIG. 5A.
- the parameter processing device 410 includes:
- a writing module 411 configured to write a first parameter into the container through the upper-layer language interface, wherein the first parameter is used to describe the degree of parallelism of the deep learning framework;
- the calculation module 412 is configured to obtain the first parameter from the container through the deep learning framework, interact the first parameter with the module data of the deep learning framework to obtain the second parameter, and transfer the second parameter to the container, where the second parameter is used to monitor the performance of parallel operations;
- the obtaining module 413 is configured to obtain the second parameter from the container through the upper-layer language interface.
- the upper-layer language interface first writes the first parameter into the container; the deep learning framework then obtains the first parameter from the container, combines the first parameter with the module parameters of the deep learning framework to obtain the second parameter, and passes the second parameter to the container; finally, the upper-layer language interface obtains the second parameter from the container and provides it to the user.
- the first parameter is used to describe the degree of parallelism of the deep learning framework, and the second parameter is used to monitor the performance of parallel operations. Writing the first parameter into the container improves the effect of parallel operations in the deep learning framework, and collecting and obtaining the second parameter improves the monitorability of parallel computing performance.
- the writing module is further used to include a parameter data field in the container, where the parameter data field is used to point to the first parameter and the second parameter.
- the first parameter includes data parallelism and model parallelism.
- the second parameter includes the channel elapsed time (CET) and the total channel elapsed time (CETS), where the CETS is the sum of the per-channel CETs.
- the calculation module is specifically used to: transfer the data parallelism and/or the model parallelism to the module of the deep learning framework for data interaction, and obtain the CET and CETS corresponding to that parallelism.
- the deep learning framework is an MXNet deep learning framework.
- the deep learning framework further includes a carrier.
- the calculation module is further configured to perform parameter transfer and interaction between the container and the module of the deep learning framework through the carrier, where the parameters include the first parameter and the second parameter.
- the artificial intelligence chip further includes an underlying library module, and the calculation module is further used to perform parameter transfer and interaction between the container and the underlying library module through the carrier, where the parameters include the first parameter and the second parameter.
- the container includes a native class or structure in the deep learning framework, or a class or structure independently created in the deep learning framework for the artificial intelligence chip.
- the present application also discloses a combined processing device, which includes the above-mentioned parameter processing device, a universal interconnection interface, and other processing devices.
- the parameter processing device interacts with the other processing devices to complete the operation specified by the user.
- FIG. 9A is a schematic diagram of the combined processing device.
- the other processing devices include one or more types of general-purpose or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor.
- the number of processors included in the other processing devices is not limited.
- the other processing devices serve as the interface between the parameter processing device and external data and control, performing functions such as data transfer and basic control of starting and stopping the parameter processing device; the other processing devices can also cooperate with the parameter processing device to complete computing tasks.
- the universal interconnection interface is used to transfer data and control instructions between the parameter processing device and other processing devices.
- the parameter processing device obtains the required input data from the other processing devices and writes it to an on-chip storage device of the parameter processing device; it can obtain control instructions from the other processing devices and write them to an on-chip control buffer of the parameter processing device; it can also read the data in the storage module of the parameter processing device and transmit it to the other processing devices.
- as shown in the structure diagram of another combined processing device, the combined processing device may further include a storage device, which is connected to the parameter processing device and the other processing devices respectively.
- the storage device is used to store data of the parameter processing device and the other processing device, and is particularly suitable for calculation data that cannot be completely saved in the internal storage of the parameter processing device or other processing device.
- the combined processing device can be used as a system-on-chip (SoC) for mobile phones, robots, drones, video surveillance equipment, and the like, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
- the universal interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, monitor, mouse, keyboard, network card, or Wi-Fi interface.
- a chip is also disclosed, which includes the above parameter processing device.
- a chip packaging structure is disclosed, which includes the above chip.
- a board card is disclosed, which includes the above chip packaging structure.
- FIG. 10 provides a board card.
- the board card may also include other supporting components, including but not limited to: a storage device 710, an interface device 720, and a control device 730;
- the storage device 710 is connected to the chip in the chip packaging structure through a bus, and is used to store data.
- the storage device may include multiple groups of storage units 711, and each group of storage units is connected to the chip by a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
- DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM.
- the storage device may include 4 groups of storage units, and each group of storage units may include multiple DDR4 chips. In one embodiment, the chip may include four 72-bit DDR4 controllers; of the 72 bits, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
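- the 25600 MB/s figure follows directly from the DDR4-3200 transfer rate and the 64-bit data width:

\[
3200\ \text{MT/s} \times \frac{64\ \text{bit}}{8\ \text{bit/byte}} = 3200 \times 8\ \text{MB/s} = 25600\ \text{MB/s}
\]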
- each group of the storage units includes multiple double-rate synchronous dynamic random access memories arranged in parallel.
- DDR can transfer data twice in one clock cycle.
- a controller for controlling DDR is provided in the chip for controlling data transmission and data storage of each storage unit.
- the interface device is electrically connected to the chip in the chip packaging structure.
- the interface device is used to realize data transmission between the chip and an external device (such as a server or a computer).
- the interface device may be a standard PCIE interface.
- the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
- the interface device may also be other interfaces.
- the present application does not limit the specific form of the above other interfaces, as long as the interface device can implement the transfer function.
- the calculation result of the chip is likewise transmitted back to an external device (such as a server) by the interface device.
- the control device is electrically connected to the chip.
- the control device is used to monitor the state of the chip.
- the chip and the control device may be electrically connected through an SPI interface.
- the control device may include a microcontroller (Micro Controller Unit, MCU).
- the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads; therefore, the chip may be in different working states such as heavy load and light load.
- the control device can realize the adjustment of the working state of multiple processing chips, multiple processing cores or multiple processing circuits in the chip.
- an electronic device is disclosed, which includes the above-mentioned board card.
- the electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
- the vehicles include airplanes, ships, and/or vehicles;
- the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and
- the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
- neural networks have broad and attractive prospects in the fields of system identification, pattern recognition, and intelligent control.
- in intelligent control, people are particularly interested in the self-learning function of neural networks, and regard this important feature as one of the keys to solving the problem of controller adaptability in automatic control.
- existing neural network architectures are all based on multi-bit architectures, such as the commonly used 32-bit architecture.
- these existing architectures occupy more bits, require more storage space and processing bandwidth, and increase costs.
- the embodiments of the present application provide a neural network quantization method and related products, which can reduce the number of bits of the neural network architecture, reduce storage space and processing bandwidth, and reduce costs.
- FIG. 11 provides a schematic diagram of a neural network architecture.
- the neural network architecture may include a multi-layer structure.
- the multi-layer structure may include: an input layer, convolution layer 1, a batch normalization (batchnorm) layer, convolution layer 2, intermediate layers (neural network architectures with different functions have different intermediate layers, and there is at least one intermediate layer), convolution layer n, fully connected layer 1, and an activation layer (using an activation function such as softmax).
- the layer with a large amount of calculation can be called a calculation layer, such as a convolution layer, a fully connected layer, etc.
- the above calculation layer can also include other types of layers.
- the neural network architecture in FIG. 11 is provided for illustration only, and the neural network in this application is not limited to the architecture shown in FIG. 11.
- FIG. 12 provides a neural network quantization method.
- This method can be implemented under the neural network architecture shown in FIG. 11.
- the method shown in FIG. 12 does not limit the structure of the neural network architecture.
- the method shown in FIG. 12 may be executed by a neural network chip.
- the method may also be implemented by a general-purpose chip, such as a central processing unit (CPU) or a graphics processing unit (GPU), or by an electronic device containing such a chip.
- the method is shown in Figure 12, and includes the following steps:
- Step S221 Obtain the weight and input data of the target quantization layer of the original neural network; wherein, the target quantization layer is at least one of the calculation layers of the original neural network;
- the original neural network in step S221 may be a known neural network, such as a trained neural network model, and the neural network model includes input data of the input layer.
- the at least one layer may specifically include one or more layers.
- the above calculation layer may include at least one of a convolution layer, a fully connected layer, an LRN normalization layer, a deconvolution layer, a Reorg layer, and a Normalize normalization layer.
- the above calculation layer may also be other layers, and this application does not limit the specific expression form of the calculation layer.
- Step S222 Use the weights of the target quantization layer of the original neural network to determine the quantization parameters of the weights of the corresponding layer; use the input data of the target quantization layer of the original neural network to determine the quantization parameters of the input data of the corresponding layer;
- the principle of maximum absolute value without distortion is adopted; that is, both the weights and the input data of the target quantization layer are quantized under the principle of maximum absolute value without distortion.
- Step S223 Quantify the target quantization layer of the original neural network according to the quantization parameter of the weight value and the quantization parameter of the input data.
- the implementation of the above step S223 may specifically include: storing the weight quantization parameters and the input data quantization parameters in the ini configuration file of the target quantization layer; if the target quantization layer is the first layer of the neural network, the above ini configuration file can also include the mean and variance.
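- as a hedged illustration only, the following sketch writes such a configuration file with Python's configparser; the section and key names are assumptions made for the example, since the patent does not specify them.

```python
# Hedged sketch: storing quantization parameters in a layer's ini
# configuration file, as step S223 describes. Section and key names
# are illustrative assumptions, not specified by the patent.
import configparser

config = configparser.ConfigParser()
config["conv1"] = {
    "weight_position": "-5",   # first quantization parameter of the weights
    "weight_scale": "1.02",    # second quantization parameter of the weights
    "input_position": "-7",    # first quantization parameter of the input data
    "input_scale": "0.98",     # second quantization parameter of the input data
    # mean and variance are stored only when this is the first layer:
    "mean": "104.0, 117.0, 123.0",
    "var": "1.0, 1.0, 1.0",
}

with open("conv1_quant.ini", "w") as f:
    config.write(f)
```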
- the technical solution provided by the present application quantizes the target quantization layer of the original neural network to obtain the quantization parameter of the weight and the quantization parameter of the input data, and then completes the quantization of the target quantization layer according to the quantization parameter.
- when the quantized target quantization layer performs operations, since the input data and the weights are both quantized data, the storage space needed for the weights and for the input data is reduced, and the bit width of the calculations is reduced accordingly; this has the advantages of reducing the amount of calculation, increasing the calculation speed, and reducing power consumption.
- the above determination of the quantization parameter of the weight of the corresponding layer using the weight of the target quantization layer of the original neural network may specifically include: obtaining the maximum absolute value of the weight of each layer in the target quantization layer, and determining the first quantization parameter and the second quantization parameter of the weight of the corresponding layer according to that maximum absolute value.
- the maximum absolute value of the above weight is the value with the largest absolute value among all elements of the weight; for example, if the weight contains 5 elements whose values are ±1, ±2, ±3, ±4, and ±5, then the maximum absolute value of the weight is 5.
- the above-mentioned use of the input data of the target quantization layer of the original neural network to determine the quantization parameter of the input data of the corresponding layer may specifically include:
- the first quantization parameter and the second quantization parameter of the input data of the corresponding layer are determined according to the maximum value of the absolute value of the input data of each layer in the target quantization layer.
- the maximum value of the absolute value of the input data may specifically be: the maximum value of the absolute value of all elements of the input data.
- the above method may further include:
- the first quantization method, the second quantization method, or the third quantization method may be used to process each layer of the target quantization layer of the original neural network. This may include: processing the weights of each layer in the target quantization layer with the first, second, or third quantization method to obtain the weight quantization result; and processing the input data of each layer in the target quantization layer with the first, second, or third quantization method to obtain the input data quantization result.
- the above first quantization method may include: quantizing the weight of the corresponding layer using the first quantization parameter of the weight of each layer in the target quantization layer to obtain the weight quantization result of the corresponding layer; and quantizing the input data of the corresponding layer using the first quantization parameter of the input data of each layer in the target quantization layer to obtain the input data quantization result of the corresponding layer.
- the first quantization method may specifically be: fp32 data ≈ fix8 data × 2^position.
- here, fp32 data may be a weight or an element value of the input data, fix8 data may be the corresponding quantized value in the weight quantization result or the input data quantization result, and position may be the first quantization parameter.
- abs_max is the maximum absolute value of the weight (or of the input data). Because fix8 data is 8-bit data, there are 8 bits, one of which is the sign bit; the integer part occupies 7 bits and the fractional part occupies 0 bits, so the maximum representable integer is 2^7 − 1 = 127, and 127 is used when calculating the position.
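- a minimal sketch of this position-based scheme follows; the concrete formula position = ceil(log2(abs_max / 127)) is an assumption consistent with the 127 bound above, not a formula quoted from the patent.

```python
# Hedged sketch of the first (position-based) quantization method.
# position = ceil(log2(abs_max / 127)) is an assumption consistent with
# the 127 bound discussed above.
import math

import numpy as np

def quantize_position(fp32_data: np.ndarray):
    """Quantize fp32 data to fix8 using only the first parameter, position."""
    abs_max = float(np.max(np.abs(fp32_data)))
    # choose position so that fp32 values scaled by 2**-position fit in [-127, 127]
    position = math.ceil(math.log2(abs_max / 127.0))
    fix8 = np.clip(np.round(fp32_data / 2.0 ** position), -127, 127).astype(np.int8)
    return fix8, position

def dequantize_position(fix8: np.ndarray, position: int) -> np.ndarray:
    """Recover approximate fp32 data: fp32 ~= fix8 * 2**position."""
    return fix8.astype(np.float32) * 2.0 ** position

# usage: quantize a toy weight vector and check the round trip
w = np.array([-1.0, 2.0, -3.0, 4.0, -5.0], dtype=np.float32)
q, pos = quantize_position(w)
print(pos, q, dequantize_position(q, pos))
```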
- the above second quantization method quantizes with a quantization intermediate parameter, new_scale.
- the above new_scale may be a quantization intermediate parameter, and scale may be the second quantization parameter; when the fp32 data is an element value of the weight, new_scale may be the weight quantization intermediate parameter, and when the fp32 data is an element value of the input data, new_scale may be the input data quantization intermediate parameter.
- the above third quantization method combines the first quantization parameter, position, with the second quantization parameter, scale.
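- the exact formulas of the second and third methods are not reproduced above, so the following is a hedged sketch of one standard reading: new_scale = abs_max / 127 for the scale-only method, and scale = 127 × 2^position / abs_max as the compensating factor when position and scale are combined. Both formulas are assumptions consistent with the surrounding definitions, not quotations from the patent.

```python
# Hedged sketch of the second (scale-based) and third (position + scale)
# quantization methods; the formulas are assumptions, not quoted.
import math

import numpy as np

def quantize_scale(fp32_data: np.ndarray):
    """Second method: quantize with a single multiplicative new_scale."""
    abs_max = float(np.max(np.abs(fp32_data)))
    new_scale = abs_max / 127.0                 # quantization intermediate parameter
    fix8 = np.clip(np.round(fp32_data / new_scale), -127, 127).astype(np.int8)
    return fix8, new_scale                      # fp32 ~= fix8 * new_scale

def quantize_position_scale(fp32_data: np.ndarray):
    """Third method: position gives a power-of-two step and scale compensates
    the slack so that abs_max maps exactly onto 127."""
    abs_max = float(np.max(np.abs(fp32_data)))
    position = math.ceil(math.log2(abs_max / 127.0))
    scale = 127.0 * 2.0 ** position / abs_max   # second quantization parameter
    fix8 = np.clip(np.round(fp32_data * scale / 2.0 ** position), -127, 127).astype(np.int8)
    return fix8, position, scale                # fp32 ~= fix8 * 2**position / scale

w = np.array([-1.0, 2.0, -3.0, 4.0, -5.0], dtype=np.float32)
print(quantize_scale(w))
print(quantize_position_scale(w))
```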
- the chip can choose according to the actual situation; that is, at the same layer, the input data may be quantized with the first quantization method while the weights are quantized with the second or third quantization method.
- any other combination of the three quantization methods may also be used.
- the present application does not limit which method is used to quantify the input data and weights.
- the above method may further include:
- when the target quantization layer includes a convolutional layer and/or a fully connected layer, the weight quantization intermediate parameter of each channel is used to obtain the weight quantization result of the corresponding channel, and the weight quantization results of all channels of each layer in the target quantization layer constitute the weight quantization result of the corresponding layer;
- each channel of each layer in the above target quantization layer may contain a data block of the layer's weight or input data.
- taking a convolutional layer as the target quantization layer, the weight may be the four-dimensional data [M, KH, KW, C] shown in FIG. 13A; each channel of the convolutional layer may contain a three-dimensional data block [KH, KW, C] (as shown in FIG. 13B). Each data block corresponds to one position and one scale, so if the convolutional layer has n channels, it has n data blocks, and the weight of the convolutional layer corresponds to n positions and n scales.
- n new_scale values can be obtained as weight quantization intermediate parameters; the compiler then converts the n new_scale values to obtain n position' values and n scale' values, selects the maximum value among the n position' values, compensates the n scale' values accordingly, and finally uses the following formula to obtain the weight quantization result of each data block.
- the formula is: fp32 data ≈ (fix8 data × 2^position'_max) / scale'', where position'_max is the maximum value selected from the n position' values, and scale'' is the compensated scale'.
- the weight quantization result corresponding to each data block constitutes the weight quantization result of the current convolutional layer.
- for the current convolutional layer, no matter how many channels or data blocks there are, there is only one set of input data, which therefore corresponds to 1 position and 1 scale.
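- a hedged sketch of this per-channel compensation follows, assuming (as above) new_scale = abs_max / 127 per channel, and assuming that compensation rescales each channel's scale' to the shared maximum position; the patent does not spell out these formulas.

```python
# Hedged sketch of per-channel weight quantization with a shared position.
# new_scale = abs_max / 127 and scale'' = scale' * 2**(position_max - position)
# are assumptions, not formulas quoted from the patent.
import math

import numpy as np

def quantize_per_channel(weight: np.ndarray):
    """weight has shape [M, KH, KW, C]; each of the n = M channels owns one
    [KH, KW, C] data block with its own position' and scale'."""
    n = weight.shape[0]
    positions, scales = [], []
    for m in range(n):
        abs_max = float(np.max(np.abs(weight[m])))
        new_scale = abs_max / 127.0                   # per-channel intermediate
        position = math.ceil(math.log2(new_scale))    # position' of this block
        positions.append(position)
        scales.append(2.0 ** position / new_scale)    # scale' of this block

    position_max = max(positions)                     # shared position'_max
    # compensate each scale' so that every block shares position'_max:
    scales2 = [s * 2.0 ** (position_max - p) for p, s in zip(positions, scales)]

    fix8 = np.empty(weight.shape, dtype=np.int8)
    for m in range(n):
        step = 2.0 ** position_max / scales2[m]       # equals this block's new_scale
        fix8[m] = np.clip(np.round(weight[m] / step), -127, 127)
    # dequantization per block: fp32 ~= fix8 * 2**position_max / scale''
    return fix8, position_max, scales2

w = np.random.randn(4, 3, 3, 8).astype(np.float32)
q, pos_max, comp_scales = quantize_per_channel(w)
print(pos_max, q.shape, len(comp_scales))
```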
- the quantization of other layers of the neural network may also be quantized using the above-mentioned first quantization method, second quantization method, or third quantization method.
- FIG. 14 is a neural network quantization device.
- the device includes:
- the data reading unit 421 is used to obtain the weight and input data of the target quantization layer of the original neural network; wherein, the target quantization layer is at least one of the calculation layers of the original neural network;
- the quantization parameter determination unit 422 is used to determine the quantization parameter of the weight of the corresponding layer using the weight of the target quantization layer of the original neural network, and to determine the quantization parameter of the input data of the corresponding layer using the input data of the target quantization layer of the original neural network; the weights and input data of the target quantization layer adopt the principle of maximum absolute value without distortion;
- the quantization unit 423 is configured to quantize the target quantization layer of the original neural network according to the quantization parameter of the weight value and the quantization parameter of the input data.
- the above quantization parameter determination unit 422 is specifically configured to obtain the maximum value of the absolute value of the weight of each layer in the target quantization layer; according to the weight of each layer in the target quantization layer The maximum value of the absolute value determines the first quantization parameter and the second quantization parameter of the weight value of the corresponding layer.
- the above quantization parameter determination unit 422 is specifically configured to obtain the maximum value of the absolute value of the input data of each layer in the target quantization layer; according to the input data of each layer in the target quantization layer The maximum value of the absolute value determines the first quantization parameter and the second quantization parameter of the input data of the corresponding layer.
- the device further includes:
- the processing unit 424 is configured to process each layer of the target quantization layer of the original neural network using the first quantization method, the second quantization method, or the third quantization method, where the three quantization methods are as described in the method embodiments above;
- the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
- the processing unit 424 is configured to obtain the weight quantization intermediate parameter of the corresponding channel by using the first weight quantization parameter and the second weight quantization parameter of each channel of each layer in the target quantization layer, where the target quantization layer includes a convolutional layer and/or a fully connected layer;
- the input data quantization intermediate parameter of each layer in the target quantization layer is used to obtain the input data quantization result of the corresponding layer.
- the processing unit 424 is further configured to process each layer of the target quantization layer of the original neural network using the first quantization method, the second quantization method, or the third quantization method, where the target quantization layer also includes at least one layer other than the convolutional layer and/or the fully connected layer among the computing layers of the original neural network, and the three quantization methods are as described above;
- the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
- FIG. 15 provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the method shown in FIG. 12 and its refinements are implemented.
- the above-mentioned processor may specifically be a general-purpose processor, such as a central processing unit (CPU) or a graphics processing unit (GPU).
- the above-mentioned processor may also be a dedicated processor for neural networks, such as a pulse array machine, a machine learning processor, etc.
- the above-mentioned processor may also be a processor combining a general-purpose processor and a neural network dedicated processor. This application does not limit the specific expression form of the above-mentioned processor.
- the above electronic equipment may include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
- the above vehicles include airplanes, ships, and/or road vehicles; the above household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound scanners, and/or electrocardiographs.
- Embodiments of the present application also provide a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes the computer to execute the method and the detailed solution shown in FIG. 12.
- An embodiment of the present application also provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to execute the method shown in FIG. 12 and its refinements.
- the disclosed device may be implemented in other ways.
- the device embodiments described above are only schematic.
- the division of the unit is only a logical function division.
- in actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above integrated unit may be implemented in the form of hardware or software program modules.
- the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory.
- the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, server, network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
- the aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disk.
- the program may be stored in a computer-readable memory, and the memory may include a flash disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.
Claims (15)
- A method for processing a network offline model, characterized in that the method includes: obtaining the operation unit information of each sub-network in the network offline model, where the operation unit information includes the correspondence between sub-networks and operation unit types, and the operation unit types include a general processing unit type or an artificial intelligence processing unit type; and, according to the operation unit information, defining sub-network operation parameters in the constructed network offline model to obtain the constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
- The method according to claim 1, wherein each sub-network includes multiple fused network layers.
- The method according to claim 1, wherein the sub-network operation parameters include a sub-network name, operation unit type information, and sub-network parameter information.
- The method according to claim 1, wherein the method further includes executing the constructed network offline model, which specifically includes: determining, according to the sub-network operation parameters, the operation unit corresponding to a target sub-network, where the target sub-network is any sub-network of the network offline model; and running the target sub-network on the operation unit corresponding to the target sub-network, so as to execute the network offline model.
- The method according to claim 4, wherein, if the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, determining the operation unit corresponding to the target sub-network according to the sub-network operation parameters includes: obtaining the model parallelism of the network offline model; and determining the artificial intelligence processing unit corresponding to the target sub-network according to an artificial intelligence processing unit scheduling mechanism, the model parallelism, and the sub-network operation parameters.
- The method according to claim 4, wherein, if the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, running each sub-network on its corresponding operation unit so as to run the network offline model includes: when calling an underlying library interface, obtaining a channel identifier passed in from the underlying interface; determining, according to the channel identifier, the channel through which the artificial intelligence processing unit transmits data; and executing the target sub-network on the artificial intelligence processing unit through the channel, so as to run the network offline model.
- An artificial intelligence processing device, characterized in that the device includes: an obtaining module, used to obtain the operation unit information of each sub-network in a network offline model, where the operation unit information includes the correspondence between sub-networks and operation unit types, and the operation unit types include a general processing unit type or an artificial intelligence processing unit type; and a building module, used to define, according to the operation unit information, sub-network operation parameters in the constructed network offline model to obtain the constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
- The device according to claim 7, wherein each sub-network includes multiple fused network layers.
- The device according to claim 7, wherein the sub-network operation parameters include a sub-network name, operation unit type information, and sub-network parameter information.
- The device according to claim 7, wherein the device further includes an execution module, used to execute the constructed network offline model, and specifically used to: determine, according to the sub-network operation parameters, the operation unit corresponding to a target sub-network, where the target sub-network is any sub-network of the network offline model; and run the target sub-network on the operation unit corresponding to the target sub-network, so as to run the network offline model.
- The device according to claim 10, wherein, if the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, in determining the operation unit corresponding to the target sub-network according to the sub-network operation parameters, the execution module is specifically used to: obtain the model parallelism of the network offline model; and determine the artificial intelligence processing unit corresponding to the target sub-network according to an artificial intelligence processing unit scheduling mechanism, the model parallelism, and the sub-network operation parameters.
- The device according to claim 10, wherein, if the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, in running the target sub-network on its corresponding operation unit so as to run the network offline model, the execution module is specifically used to: when calling an underlying library interface, obtain a channel identifier passed in from the underlying interface; determine, according to the channel identifier, the channel through which the artificial intelligence processing unit transmits data; and execute the target sub-network on the artificial intelligence processing unit through the channel, so as to run the network offline model.
- A computer device, including a memory and a processor, where a computer program executable on the processor is stored in the memory, characterized in that the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
- A readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
- A combined processing device, characterized in that the combined processing device includes the artificial intelligence processing device according to claim 7, a universal interconnection interface, and other processing devices; the artificial intelligence processing device interacts with the other processing devices to jointly complete computing operations specified by the user.
Applications Claiming Priority (6)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811570061.6A (CN109739514B) | 2018-12-21 | 2018-12-21 | Parameter processing method and related products |
| CN201811570061.6 | 2018-12-21 | | |
| CN201811654179.7 | 2018-12-29 | | |
| CN201811646109.7 | 2018-12-29 | | |
| CN201811646109.7A (CN109754072B) | 2018-12-29 | 2018-12-29 | Processing method of network offline model, artificial intelligence processing device, and related products |
| CN201811654179.7A (CN109754074A) | 2018-12-29 | 2018-12-29 | Neural network quantization method, device, and related products |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2020124948A1 | 2020-06-25 |

Family ID: 71101016

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2019/087631 (WO2020124948A1) | Processing method of network offline model, artificial intelligence processing device, and related products | 2018-12-21 | 2019-05-20 |
Citations (5)

| Publication Number | Priority Date | Publication Date | Title |
|---|---|---|---|
| CN106650913A | 2016-12-31 | 2017-05-10 | Vehicle flow density estimation method based on a deep convolutional neural network |
| CN106886023A | 2017-02-27 | 2017-06-23 | Radar echo extrapolation method based on a dynamic convolutional neural network |
| CN108615046A | 2018-03-16 | 2018-10-02 | Stored-grain pest detection and identification method and device |
| US20180365562A1 | 2017-06-20 | 2018-12-20 | Prediction of social media postings as trusted news or as types of suspicious news |
| CN109754072A | 2018-12-29 | 2019-05-14 | Processing method of network offline model, artificial intelligence processing device, and related products |
Legal Events

| Code | Title | Details |
|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19898361; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: PCT application non-entry in European phase | Ref document number: 19898361; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 21/01/2022) |