WO2020124948A1 - Processing method for a network offline model, artificial intelligence processing device, and related products


Info

Publication number
WO2020124948A1
WO2020124948A1 PCT/CN2019/087631 CN2019087631W WO2020124948A1 WO 2020124948 A1 WO2020124948 A1 WO 2020124948A1 CN 2019087631 W CN2019087631 W CN 2019087631W WO 2020124948 A1 WO2020124948 A1 WO 2020124948A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
sub
quantization
parameter
layer
Prior art date
Application number
PCT/CN2019/087631
Other languages
English (en)
French (fr)
Inventor
孔维广
黄亚玲
王进
沈宇斌
郭志斌
宋新开
刘少礼
吕秀全
张昊翀
杨辉
Original Assignee
中科寒武纪科技股份有限公司
Priority date
Filing date
Publication date
Priority claimed from CN201811570061.6A (CN109739514B)
Priority claimed from CN201811646109.7A (CN109754072B)
Priority claimed from CN201811654179.7A (CN109754074A)
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2020124948A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology

Definitions

  • One of the related priority applications is titled "A neural network quantization method, device and related products".
  • This application relates to the field of information processing technology, and in particular to a processing method of a network offline model, an artificial intelligence processing device, and related products.
  • In conventional technology, a terminal obtains and processes information by running software programs on a processor.
  • However, this way of processing information is limited by the type of network model; that is, for some new network models, the processor is not compatible with that network model version.
  • Moreover, the network offline model running on the processor is built under a given machine learning framework. When the network model is constructed, no distinction is made between the various layers of the network, so a single processor is not compatible with the various offline models of the network.
  • An embodiment of the present application provides an offline model processing method.
  • The type identifier of each part of the offline network is saved, so that all types of offline networks can be executed compatibly based on the saved type identifier.
  • an embodiment of the present application provides a method for processing a network offline model, the method includes:
  • obtaining the operation unit information of each sub-network in the network offline model, where the operation unit information includes the correspondence between the sub-network and the operation unit type, and the operation unit type includes a general processing unit type or an artificial intelligence processing unit type;
  • defining sub-network operation parameters in the network offline model being constructed according to the operation unit information, to obtain the constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
  • an embodiment of the present application provides an artificial intelligence processing device for a network offline model, the device including:
  • an obtaining module, used to obtain the operating unit information of each sub-network in the network offline model, where the operating unit information includes the correspondence between the sub-network and the operating unit type, and the operating unit type includes a general processing unit type or an artificial intelligence processing unit type;
  • a building module, used to define sub-network operating parameters in the network offline model being constructed according to the operating unit information, to obtain the constructed network offline model, where the sub-network operating parameters are used to represent the operating unit type of each sub-network.
  • an embodiment of the present application provides a computer device, including a memory and a processor, where a computer program runnable on the processor is stored in the memory, and the processor implements the method of the first aspect when executing the computer program.
  • an embodiment of the present application provides a readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the method according to the first aspect is implemented.
  • an embodiment of the present application provides a combined processing device, wherein the combined processing device includes the artificial intelligence processing device described in the second aspect, a universal interconnection interface, and other processing devices;
  • the artificial intelligence processing device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
  • an embodiment of the present application provides a parameter processing method, which is applied to an artificial intelligence chip in which an upper-layer language interface and a deep learning framework are deployed.
  • the deep learning framework includes a container, the container is interfaced with the upper-layer language interface, and the method includes:
  • the upper-layer language interface injects a first parameter into the container, where the first parameter is used to describe the degree of parallelism of the deep learning framework;
  • the deep learning framework obtains the first parameter from the container, interacts the first parameter with module data of the deep learning framework to obtain a second parameter, and passes the second parameter into the container, where the second parameter is used to monitor the parallel computing performance of the deep learning framework described by the first parameter, and the container is a class or structure for storing parameters;
  • the upper-layer language interface obtains the second parameter from the container.
  • the method further includes: the container includes a parameter data field, and the parameter data field is used to point to the first parameter and the second parameter.
  • the first parameter includes data parallelism and model parallelism.
  • the second parameter includes the channel elapsed time and the total channel elapsed time.
  • interacting the first parameter with the module data of the deep learning framework to obtain the second parameter includes:
  • passing the data parallelism to the module of the deep learning framework for data interaction to obtain the channel elapsed time (CET) and the total channel elapsed time (CETS) corresponding to the data parallelism;
  • passing the model parallelism to the module of the deep learning framework for data interaction to obtain the CET and CETS corresponding to the model parallelism.
  • the deep learning framework is MXNet deep learning framework.
  • the deep learning framework further includes a carrier
  • the method further includes:
  • Parameter transfer interaction between the container and the module of the deep learning framework is performed through the carrier, and the parameter includes a first parameter and a second parameter.
  • the artificial intelligence chip further includes an underlying library module, and the method further includes:
  • Parameter transfer interaction between the container and the underlying library module is performed through the carrier, and the parameter includes a first parameter and a second parameter.
  • the container includes a native class or structure in the deep learning framework, or a class or structure independently created in the deep learning framework for the artificial intelligence chip.
  • an embodiment of the present application provides a parameter processing device, which is applied to an artificial intelligence chip in which an upper-layer language interface and a deep learning framework are deployed, where the deep learning framework includes a container and the container is interfaced with the upper-layer language interface; the device includes:
  • a writing module configured to write a first parameter into the container through the upper-layer language interface, wherein the first parameter is used to describe the degree of parallelism of the deep learning framework
  • a calculation module, used to obtain the first parameter from the container through the deep learning framework, interact the first parameter with the data of the module of the deep learning framework to obtain the second parameter, and transfer the second parameter to the container, where the second parameter is used to monitor the performance of parallel operations, and the container is a class or structure used to store parameters;
  • An obtaining module configured to obtain a second parameter from the container through the upper-layer language interface.
  • an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs.
  • the one or more programs are stored in the memory and are configured to be executed by the processor, and the program includes instructions for performing the steps in the method of the sixth aspect.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes a computer to perform the method described in the sixth aspect.
  • an embodiment of the present application provides a chip, including the parameter processing apparatus provided in the seventh aspect.
  • an embodiment of the present application provides a chip packaging structure including the chip described in the tenth aspect above;
  • an embodiment of the present application provides a board card including the chip packaging structure described in the eleventh aspect.
  • an embodiment of the present application provides an electronic device including the chip packaging structure described in the eleventh aspect or the board card described in the twelfth aspect.
  • an embodiment of the present application provides a storage medium for storing a computer program for electronic data exchange, wherein the computer program causes the computer to execute the instructions of the steps described in any method of the sixth aspect.
  • an embodiment of the present application provides a neural network quantization method, including:
  • obtaining the weights and input data of the target quantization layer of the original neural network, where the target quantization layer is at least one of the calculation layers of the original neural network;
  • using the weights of the target quantization layer of the original neural network to determine the quantization parameters of the weights of the corresponding layer, and using the input data of the target quantization layer of the original neural network to determine the quantization parameters of the input data of the corresponding layer, where the weights and input data of the target quantization layer are quantized based on the principle of maximum absolute value without distortion;
  • quantizing the target quantization layer of the original neural network according to the quantization parameters of the weights and the quantization parameters of the input data.
  • the calculation layer includes at least one of a convolution layer, a fully connected layer, an LRN normalization layer, a deconvolution layer, a Reorg layer, and a Normalize normalization layer.
  • the step of determining the quantization parameter of the weight of the corresponding layer by using the weight of the target quantization layer of the original neural network includes:
  • the first quantization parameter and the second quantization parameter of the weight of the corresponding layer are determined according to the maximum value of the absolute value of the weight of each layer in the target quantization layer.
  • the step of using the input data of the target quantization layer of the original neural network to determine the quantization parameter of the input data of the corresponding layer includes:
  • the first quantization parameter and the second quantization parameter of the input data of the corresponding layer are determined according to the maximum value of the absolute value of the input data of each layer in the target quantization layer.
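  • The excerpt does not define what the first and second quantization parameters actually are; purely as an illustrative sketch of the maximum-absolute-value-without-distortion principle, the following Python snippet assumes they are a power-of-two bit position and a residual scale chosen per layer so that the largest absolute value maps exactly onto the largest representable fixed-point value.

```python
import numpy as np

def max_abs_quant_params(data, bits=8):
    """Derive two per-layer quantization parameters from the maximum absolute value,
    so that the largest value is representable without distortion. The parameter
    definitions (a bit position plus a residual scale) are assumptions, since the
    excerpt does not spell them out."""
    max_abs = max(float(np.max(np.abs(data))), 1e-12)   # maximum absolute value of the layer
    q_max = 2 ** (bits - 1) - 1                         # largest signed fixed-point magnitude (127 for 8 bits)
    position = int(np.ceil(np.log2(max_abs / q_max)))   # assumed first parameter: power-of-two bit position
    scale = max_abs / (q_max * 2.0 ** position)         # assumed second parameter: residual scale in (0, 1]
    return position, scale

def quantize(data, position, scale, bits=8):
    """Quantize data with the two parameters; the maximum absolute value maps to q_max exactly."""
    q_max = 2 ** (bits - 1) - 1
    q = np.round(data / (scale * 2.0 ** position))
    return np.clip(q, -q_max - 1, q_max).astype(np.int8)

# Per-layer usage on the weights of a target quantization layer; input data is handled the same way.
weights = np.random.randn(64, 3, 3, 3).astype(np.float32)
w_position, w_scale = max_abs_quant_params(weights)
q_weights = quantize(weights, w_position, w_scale)
```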
  • the method further includes:
  • the first quantization method includes:
  • the second quantization method includes:
  • the third quantization method includes:
  • the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
  • the method further includes:
  • the target quantization layer includes a convolutional layer and/or a fully connected layer;
  • the input data quantization intermediate parameter of each layer in the target quantization layer is used to obtain the input data quantization result of the corresponding layer.
  • the method further includes:
  • each of the target quantization layers of the original neural network is processed using a first quantization method, a second quantization method, or a third quantization method, where the target quantization layer further includes at least one layer of the calculation layers of the original neural network other than the convolutional layer and/or the fully connected layer;
  • the first quantization method includes:
  • the second quantization method includes:
  • the third quantization method includes:
  • the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
  • an embodiment of the present application provides a neural network quantization device.
  • the device includes:
  • a data reading unit used to obtain the weight and input data of the target quantization layer of the original neural network; wherein, the target quantization layer is at least one of the calculation layers of the original neural network;
  • a quantization parameter determination unit for determining the quantization parameter of the weight of the corresponding layer by using the weight of the target quantization layer of the original neural network; determining the input data of the corresponding layer by using the input data of the target quantization layer of the original neural network Quantization parameters; wherein, the weights and input data of the target quantization layer adopt the principle of maximum absolute value without distortion;
  • the quantization unit is configured to quantize the target quantization layer of the original neural network according to the quantization parameter of the weight value and the quantization parameter of the input data.
  • the calculation layer includes at least one of a convolution layer, a fully connected layer, an LRN normalization layer, a deconvolution layer, a Reorg layer, and a Normalize normalization layer.
  • the quantization parameter determination unit is specifically configured to obtain the maximum absolute value of the weights of each layer in the target quantization layer, and to determine the first quantization parameter and the second quantization parameter of the weights of the corresponding layer according to the maximum absolute value of the weights of each layer in the target quantization layer.
  • the quantization parameter determination unit is specifically configured to obtain the maximum absolute value of the input data of each layer in the target quantization layer, and to determine the first quantization parameter and the second quantization parameter of the input data of the corresponding layer according to the maximum absolute value of the input data of each layer in the target quantization layer.
  • the device further includes:
  • a processing unit, configured to apply a first quantization method, a second quantization method, or a third quantization method to each of the target quantization layers of the original neural network; wherein,
  • the first quantization method includes:
  • the second quantization method includes:
  • the third quantization method includes:
  • the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
  • the device further includes:
  • a processing unit, configured to obtain a weight quantization intermediate parameter of the corresponding channel by using the first weight quantization parameter and the second weight quantization parameter of each channel of each layer in the target quantization layer, where the target quantization layer includes a convolutional layer and/or a fully connected layer;
  • the input data quantization intermediate parameter of each layer in the target quantization layer is used to obtain the input data quantization result of the corresponding layer.
  • the processing unit is further configured to apply the first quantization method, the second quantization method, or the third quantization method to each of the target quantization layers of the original neural network, where the target quantization layer also includes at least one layer of the calculation layers of the original neural network other than the convolutional layer and/or the fully connected layer;
  • the first quantization method includes:
  • the second quantization method includes:
  • the third quantization method includes:
  • the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
  • an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • when the processor executes the computer program, the method described in the fifteenth aspect is implemented.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes a computer to perform the method according to the fifteenth aspect.
  • an embodiment of the present application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform the method described in the fifteenth aspect.
  • In the embodiments of the present application, the operation unit information of the network offline model is obtained, and when the network offline model is constructed, the operation parameters of each sub-network are defined, with the operation unit type of each sub-network marked in the operation parameters; in this way, the sub-networks of the network offline model are classified, so that when the network offline model is run, each sub-network is assigned to the corresponding processor, achieving compatible operation of the network offline model and enriching the application scenarios of artificial intelligence processing devices.
  • an upper-layer language interface and a deep learning framework are deployed in the artificial intelligence chip.
  • the deep-learning framework includes a container, and the container is connected to the upper-layer language interface.
  • The upper-layer language interface writes the first parameter into the container; the deep learning framework then obtains the first parameter from the container, combines the first parameter with the module parameters of the deep learning framework to obtain the second parameter, and passes the second parameter into the container; finally, the upper-layer language interface obtains the second parameter from the container and provides it to the user.
  • Since the first parameter is used to describe the degree of parallelism of the deep learning framework and the second parameter is used to monitor the performance of parallel operations, this process improves the effect of parallel operations in the deep learning framework by writing the first parameter into the container, and the statistics and acquisition of the second parameter improve the monitorability of parallel computing performance.
  • In the embodiments of the present application, the weights and input data of the target quantization layer of the original neural network are processed to obtain the quantization parameters of the weights and the quantization parameters of the input data, and the quantization of the target quantization layer is then completed according to these quantization parameters.
  • When the quantized target quantization layer performs operations, since the input data and the weights are quantized data, the storage space required for the weights and the input data is reduced and the bit width of the calculations is correspondingly reduced, so the method has the advantages of reducing the amount of calculation, increasing calculation speed, saving storage space, reducing power consumption, and saving costs.
  • FIG. 1 is a schematic flowchart of a method for processing a network offline model provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of another method for processing a network offline model provided by an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of an artificial intelligence device of a network offline model provided by an embodiment of the present application
  • FIG. 4 is a block diagram of functional units of an artificial intelligence device of a network offline model provided by an embodiment of the present application
  • FIG. 5A is a schematic diagram of an artificial intelligence chip provided by an embodiment of the present application;
  • FIG. 5B is a schematic flowchart of a parameter processing method disclosed in an application example;
  • FIG. 6 is a schematic flowchart of another parameter processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of another parameter processing method provided by an embodiment of the present application.
  • FIG. 9A is a schematic diagram of a combined processing device provided by an embodiment of the present application;
  • FIG. 9B is a structural diagram of another combined processing device provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a board provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a neural network architecture
  • FIG. 12 is a schematic flowchart of a neural network quantization method provided by an embodiment of the present application.
  • FIG. 13A is a schematic diagram of the weight structure of a convolutional layer provided by this application;
  • FIG. 13B is a schematic diagram of the data structure of one channel of the weights of a convolutional layer provided by this application;
  • FIG. 14 is a schematic flowchart of a quantization operation device provided by an embodiment of the present application.
  • FIG. 15 is a structural diagram of an electronic device provided by an embodiment of the present application.
  • The artificial intelligence processing device in this application may include a smart phone (such as an Android phone, an iOS phone, or a Windows phone), a tablet computer, a palmtop computer, a laptop computer, a mobile Internet device (MID, Mobile Internet Devices), or a wearable device.
  • FIG. 1 is a schematic flowchart of a method for processing a network offline model provided by an embodiment of the present application.
  • The method is applied to an artificial intelligence processing device.
  • The artificial intelligence processing device includes a general-purpose processor and an artificial intelligence processor.
  • the method includes the content shown in steps S101 to S102:
  • Step S101 Obtain the operation unit information of each sub-network in the offline model of the network, where the operation unit information includes the correspondence between the sub-network and the operation unit type, and the operation unit type includes a general processing unit type or an artificial intelligence processing unit type .
  • The operation unit information further includes entry function information of the sub-network; when the artificial intelligence processing unit runs the sub-network, the offline instructions corresponding to the sub-network are retrieved through the entry function.
  • The offline instructions of some sub-networks are pre-compiled to speed up the operation of the network offline model.
  • The general-purpose processor may include a central processing unit (CPU), a graphics processing unit (GPU), and/or an image processing unit (IPU).
  • The artificial intelligence processor includes a machine learning processing unit (MLU), and the artificial intelligence processor can integrate multiple MLUs to form a multi-core artificial intelligence processor.
  • Before obtaining the operating unit information of each sub-network in the network offline model, it is first determined whether multiple network layers of the network offline model can be fused; if so, the network layers that can be fused are merged into one sub-network, and each network layer that cannot be fused is taken as a single sub-network. After the fusion operation is performed on the network offline model, several sub-networks corresponding to the network offline model are obtained.
  • Each sub-network can be a single network layer, or a sub-network can be obtained by fusing several network layers. For example, when the network offline model includes a convolution layer (Convolution), a normalization layer (BatchNorm), and a scaling layer (Scale), the Convolution, BatchNorm, and Scale layers can be fused to obtain one sub-network.
  • The operating unit information of each sub-network in the network offline model is obtained to determine the operating unit type of each sub-network, so that, when the network offline model is constructed, the operating unit type of each sub-network is defined in the field corresponding to the operating unit type.
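  • As a hedged illustration only (the patent does not prescribe these data structures), the following Python sketch shows how layers might be fused into sub-networks and each sub-network tagged with its operation unit type; the fusable and supported layer sets and the "MLU"/"CPU" labels are assumptions.

```python
# Minimal sketch of grouping layers into sub-networks and recording the correspondence
# between each sub-network and its operation unit type. The layer names and the
# FUSABLE / AI_SUPPORTED sets are illustrative assumptions, not the patent's actual rules.
FUSABLE = {"Convolution", "BatchNorm", "Scale"}      # layers assumed fusable into one sub-network
AI_SUPPORTED = {"Convolution", "BatchNorm", "Scale", "Pooling", "InnerProduct"}

def build_operation_unit_info(layers):
    """layers: ordered list of layer type names from the offline model.
    Returns a list of (sub_network, operation_unit_type) pairs."""
    sub_networks, current = [], []
    for layer in layers:
        if layer in FUSABLE:
            current.append(layer)                    # keep accumulating fusable layers
        else:
            if current:
                sub_networks.append(current)         # close the fused sub-network
                current = []
            sub_networks.append([layer])             # a non-fusable layer forms its own sub-network
    if current:
        sub_networks.append(current)

    info = []
    for sub in sub_networks:
        unit = "MLU" if all(l in AI_SUPPORTED for l in sub) else "CPU"
        info.append((sub, unit))                     # correspondence: sub-network -> operation unit type
    return info

print(build_operation_unit_info(
    ["Convolution", "BatchNorm", "Scale", "CustomLayer", "InnerProduct"]))
# [(['Convolution', 'BatchNorm', 'Scale'], 'MLU'), (['CustomLayer'], 'CPU'), (['InnerProduct'], 'MLU')]
```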
  • Step S102 Define sub-network operation parameters in the constructed network offline model according to the operation unit information to obtain a constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
  • the artificial intelligence device uses a pre-installed machine learning framework to build a network offline model.
  • The following takes the convolutional neural network framework caffe (Convolutional Architecture for Fast Feature Embedding) as an example to specifically describe the construction of a network offline model.
  • generating offline files requires two inputs, one is a prototxt file containing network information, and the other is a caffemodel file containing the trained weights and offsets.
  • caffe first calls the underlying library interface to create an offline file, and then divides the entire network described by the prototxt input into several sub-networks according to whether each layer can run on the artificial intelligence processor, so that some of the sub-networks can be executed on the artificial intelligence processor.
  • the caffe framework will call the underlying library interface to compile the subnet into offline instructions that can run on the artificial intelligence processor. Then the caffe framework saves the generated offline instructions to the pre-generated offline file by calling the interface provided by the underlying library.
  • When generating offline instructions, caffe first loads the trained caffemodel; the weight and offset data are read out and stored in the corresponding blob, where blob is a data structure defined in caffe that is used to transfer data between layers.
  • These weights and offset data will be passed to the underlying library when caffe calls the underlying library to generate offline instructions, and then caffe calls the underlying interface of the underlying library to store the offline instructions, weights and bias data together in the offline file.
  • When caffe calls the underlying library to compile a sub-network into offline instructions, caffe can specify how many cores the sub-network can run on; this is called the specified model parallelism, and the sub-network is treated as one model.
  • Custom unit information is also stored in the offline file, and each sub-network corresponds to one piece of unit information.
  • The unit information can be generated through the protobuf mechanism, and caffe can append the unit information to the end of the offline file by calling the relevant interface provided by protobuf; this information is used later when the offline file is run.
  • Unit information in a pre-defined format, such as .SegmentInfoUnit, may be used to save the sub-network operating parameters of each sub-network.
  • the sub-network operation parameters include sub-network name, operation unit type and sub-network parameter information
  • the sub-network parameter information may be used to indicate resource scheduling of the processor when executing the sub-network.
  • the sub-network parameter information may include convolution kernel information, etc., and may be used to represent resource information of an artificial intelligence processing unit that needs to be deployed to operate the sub-network.
  • The unit information can also store the index identifier of the offline instructions corresponding to each sub-network and the index identifier of the calculation parameters, which makes it convenient to read the offline instructions and calculation parameters corresponding to each sub-network from the offline file. The unit information is then appended to the offline file, so that the sub-network operating parameters of each sub-network, as well as the offline instructions and calculation parameters corresponding to that sub-network, can be read from the offline file through the underlying interface of caffe based on the index identifiers.
  • The calculation parameters are parameter data related to the operation of each sub-network. For example, when the sub-network is a convolutional layer, the calculation parameters are the weights and the offset; if the convolutional layer has no offset, the offset is zero. As another example, if the sub-network is an activation layer, the calculation parameter is the activation function.
  • Storing the sub-network operating parameters of each sub-network in a data structure corresponding to that sub-network may be done as follows: based on the Protocol Buffers mechanism, a preset BP Message is obtained; the fields in the layers of each sub-network that match the BP Message are compiled into a binary file through the compiler of the Protocol Buffers mechanism; and the binary file is saved in a data structure in the .SegmentInfoUnit format.
  • The Protocol Buffers mechanism is only an exemplary illustration, and this application does not limit the manner in which the sub-network operating parameters are stored.
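  • Purely as an assumption-laden illustration, the sketch below shows what per-sub-network unit information of this kind could look like; the patent serializes it with Protocol Buffers into a .SegmentInfoUnit, whereas this self-contained example uses a plain dataclass serialized to JSON and appended to the offline file, and the field names are inferred from the text rather than taken from any actual format.

```python
# Hypothetical unit-information record per sub-network, appended after the offline file body.
import json
from dataclasses import dataclass, asdict

@dataclass
class SegmentInfoUnit:
    subnet_name: str          # sub-network name
    operation_unit_type: str  # "MLU" (artificial intelligence processing unit) or "CPU" (general processing unit)
    subnet_params: dict       # sub-network parameter info, e.g. convolution kernel / resource hints
    instruction_index: int    # index identifier of the sub-network's offline instructions in the file
    calc_param_index: int     # index identifier of the sub-network's calculation parameters (weights, offsets, ...)

units = [
    SegmentInfoUnit("conv_bn_scale_0", "MLU", {"kernel": [3, 3]}, instruction_index=0, calc_param_index=0),
    SegmentInfoUnit("custom_layer_1", "CPU", {}, instruction_index=1, calc_param_index=1),
]

with open("model.offline", "ab") as f:               # append the unit information to the end of the offline file
    f.write(json.dumps([asdict(u) for u in units]).encode("utf-8"))
```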
  • the operating unit type provides a new method for saving the offline model of the network; moreover, based on the saved operating unit type of each sub-network, different operating units can be used to run different network layers.
  • the operation of the network offline model can be made more flexible and more compatible and applied to various artificial intelligence devices.
  • FIG. 2 is a schematic flowchart of another method for processing a network offline model provided by an embodiment of the present application.
  • the method is applied to an artificial intelligence device.
  • the artificial intelligence device may include a general-purpose processor and an artificial intelligence processor.
  • the method includes the content shown in steps S201-S205:
  • Step S201 Obtain the operation unit information of each sub-network in the offline model of the network, where the operation unit information includes the correspondence between the sub-network and the operation unit type, and the operation unit type includes a general processing unit type or an artificial intelligence processing unit type .
  • Step S202 Define the sub-network operation parameters in the constructed network offline model according to the operation unit information to obtain the constructed network offline model.
  • the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
  • Step S203 Determine the operation unit corresponding to the target subnetwork according to the subnetwork operation parameters, and the target subnetwork is any subnetwork of the network offline model.
  • Step S204 Run the target sub-network in an operation unit corresponding to the target sub-network to implement running the network offline model.
  • The implementation process of running the target sub-network on the corresponding operating unit may be: sequentially traversing the data structures through the interface of the machine learning framework to read the network operating parameters of the network offline model; determining, according to the network operating parameters, the operation unit that executes the target sub-network as well as the operation units of the previous and next sub-networks connected to the target sub-network; and, to complete the forward inference process, instructing the operation unit of the target sub-network to obtain input data from the operation unit of the previous sub-network and to send the output result of the target sub-network as input data to the operation unit of the next sub-network.
  • For example, if the operation unit type in the network operation parameters of the target sub-network is the artificial intelligence processing unit type, the operating unit type of the previous sub-network is the general processing unit type, and the operating unit type of the next sub-network is the general processing unit type, then the artificial intelligence processing unit is instructed to obtain data from the general processing unit, operate on the acquired data, and send the output results to the general processing unit, so as to complete the forward inference process of the network offline model in accordance with its running order.
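  • A minimal sketch of this forward-inference dispatch, assuming hypothetical run_on_mlu / run_on_cpu stand-ins for executing offline instructions on the artificial intelligence processing unit and running a layer on the general processing unit:

```python
# Each sub-network's operating parameters name its operation unit; the output of one
# sub-network is handed to the next as input, in the model's running order.
def run_on_mlu(subnet, data):
    return f"mlu({subnet})[{data}]"          # placeholder for executing offline instructions

def run_on_cpu(subnet, data):
    return f"cpu({subnet})[{data}]"          # placeholder for running the layer on the general processor

def run_offline_model(subnet_params, input_data):
    """subnet_params: ordered list of dicts with 'name' and 'operation_unit_type'."""
    data = input_data
    for params in subnet_params:             # traverse sub-networks in the model's running order
        runner = run_on_mlu if params["operation_unit_type"] == "MLU" else run_on_cpu
        data = runner(params["name"], data)  # previous output becomes the next sub-network's input
    return data

print(run_offline_model(
    [{"name": "conv_bn_scale_0", "operation_unit_type": "MLU"},
     {"name": "custom_layer_1", "operation_unit_type": "CPU"}],
    "input"))
```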
  • In the embodiments of the present application, a general processing unit and an artificial intelligence processing unit are provided in the artificial intelligence processing device, the operating unit of each sub-network is determined based on the operating parameters of each sub-network, and the corresponding operating unit then runs that sub-network. In this way, when the artificial intelligence processing unit does not support the operation of a sub-network, the general processing unit runs it; that is, the general processing unit and the artificial intelligence processing unit work together to be compatible with all types of network offline models, which increases the scope of application of network offline models.
  • The general processing unit and the artificial intelligence processing unit cooperate to put the network layers that can run on the artificial intelligence processing unit onto the artificial intelligence processing unit; compared with executing the entire network offline model on the general processing unit, this speeds up the inference process of the entire offline network, and generating offline instructions in advance for the network layers that can run on the artificial intelligence processing unit saves the time that would otherwise be consumed by generating offline instructions during execution. In addition, the general processing unit can perform part or all of the operation of the network offline model, reducing the workload of the artificial intelligence processing unit.
  • The implementation process of determining the operation unit corresponding to the target sub-network according to the sub-network operating parameters may be: acquiring the model parallelism of the network offline model; and determining the artificial intelligence processing unit corresponding to the target sub-network according to the scheduling mechanism of the artificial intelligence processing unit, the model parallelism, and the sub-network operating parameters.
  • Specifically, the offline instructions corresponding to the target sub-network are read from the offline file of the network offline model, and the offline instructions are parsed to obtain the model parallelism contained in the offline instructions.
  • Based on the model parallelism, the number of artificial intelligence processing units required to run the target sub-network is obtained.
  • According to the scheduling mechanism of the artificial intelligence processing unit, a corresponding number of artificial intelligence processing units is deployed and designated as the artificial intelligence processing units that run the target sub-network, and the offline instructions and calculation parameters corresponding to this sub-network are distributed to these artificial intelligence processing units to complete the operation of the target sub-network.
  • The model parallelism of each sub-network can be set in advance, that is, the number of artificial intelligence processing units required to run the sub-network can be specified, so that on the artificial intelligence processor the multi-core artificial intelligence processing units can jointly execute the operation corresponding to the sub-network, which improves the running speed of the sub-network.
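  • As a rough illustration (the actual scheduling mechanism is not specified in this excerpt), the following sketch picks the artificial intelligence processing units for a sub-network from its pre-set model parallelism under a simple round-robin assumption; the core counts and policy are invented for the example.

```python
# Choose how many artificial intelligence processing units (cores) jointly run a sub-network.
def assign_cores(model_parallelism, total_cores, next_free=0):
    """Return the core ids that will jointly execute the sub-network."""
    if model_parallelism > total_cores:
        raise ValueError("model parallelism exceeds available AI processing units")
    return [(next_free + i) % total_cores for i in range(model_parallelism)]

# A sub-network whose offline instructions record a model parallelism of 4
# is dispatched to 4 of the 8 MLU cores, which execute it together.
print(assign_cores(model_parallelism=4, total_cores=8))   # [0, 1, 2, 3]
```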
  • The implementation process of running each target sub-network may be: acquiring the interface instruction when the underlying library is called; parsing the interface instruction to obtain the channel identifier included in the interface instruction; determining, according to the channel identifier, a channel used by the artificial intelligence processing unit for transmitting data; and running the target sub-network on the artificial intelligence processing unit through the channel to run the network offline model.
  • each target artificial intelligence processing unit contains multiple data transmission channels.
  • The corresponding channel is designated by the interface instruction to transmit the offline instructions and calculation parameters to the target artificial intelligence processing unit, thereby speeding up the read and write speed of the artificial intelligence processing unit and accelerating the inference process of the network offline model.
  • FIG. 3 is a schematic structural diagram of an artificial intelligence device of a network offline model provided by an embodiment of the present application.
  • The artificial intelligence device 300 includes a general-purpose processor, an artificial intelligence processor, a memory, a communication interface, and one or more programs, where the one or more programs are different from the one or more application programs, and the one or more programs are stored in the memory and configured to be executed by the processor; the above program includes instructions for performing the following steps:
  • obtaining the operation unit information of each sub-network in the network offline model, where the operation unit information includes the correspondence between the sub-network and the operation unit type, and the operation unit type includes a general processing unit type or an artificial intelligence processing unit type;
  • defining sub-network operation parameters in the network offline model being constructed according to the operation unit information, to obtain the constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
  • each sub-network includes multiple network layers after fusion
  • the sub-network operation parameters include sub-network name, operation unit type information and sub-network parameter information.
  • the above program further includes instructions for performing the following steps:
  • Execution of the constructed network offline model specifically includes instructions for performing the following steps:
  • the above-mentioned program specifically includes instructions for performing the following steps:
  • the artificial intelligence processing unit corresponding to the target sub-network is determined according to the scheduling mechanism of the artificial intelligence processing unit, the model parallelism, and the operating parameters of the sub-network.
  • When the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, in terms of running the target sub-network on the operation unit corresponding to the target sub-network to run the network offline model, the above program specifically includes instructions for performing the following steps:
  • the target sub-network is executed on the artificial intelligence processing unit through the channel to run the network offline model.
  • FIG. 4 shows a possible functional unit block diagram of the artificial intelligence device 400 of the network offline model involved in the above embodiment.
  • the artificial intelligence device 400 includes: an acquisition module 410 and a construction module 420;
  • the obtaining module 410 is used to obtain the operating unit information of each sub-network in the offline model of the network.
  • the operating unit information includes the correspondence between the sub-network and the operating unit type.
  • the operating unit type includes a general processing unit type or an artificial intelligence processing unit type;
  • the construction module 420 is used to define subnetwork operation parameters in the constructed network offline model according to the operation unit information to obtain a constructed network offline model, and the subnetwork operation parameters are used to represent the operation of each subnetwork Unit type.
  • each sub-network includes multiple network layers after fusion.
  • sub-network operation parameters include sub-network name, operation unit type information and sub-network parameter information.
  • the artificial intelligence device 400 further includes: an execution module 430;
  • the execution module 430 is used to run the constructed network offline model, specifically used for:
  • the execution module 430 is specifically configured to:
  • the artificial intelligence processing unit corresponding to the target sub-network is determined according to the scheduling mechanism of the artificial intelligence processing unit, the model parallelism, and the operating parameters of the sub-network.
  • When the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, in terms of running the target sub-network on the operation unit corresponding to the target sub-network to run the network offline model, the execution module 430 is specifically used for:
  • An embodiment of the present application further provides a computer storage medium, where the computer storage medium stores a computer program, and the computer program is executed by a processor to implement some or all steps of any network offline model processing method described in the foregoing method embodiments.
  • An embodiment of the present application further provides a computer program product; the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause the computer to execute some or all steps of any network offline model processing method described in the above method embodiments.
  • the deep learning framework does not have a mechanism and method for parameter setting related to artificial intelligence chips, resulting in users being unable to set parameters for artificial intelligence chips or obtain data related to chip operation. How to improve this situation has become an urgent problem to be solved.
  • The purpose of the present disclosure is to provide a parameter processing method and related products: a container is added, the first parameter used to describe the parallelism of the deep learning framework is written into the container, and the first parameter is then combined with other modules of the deep learning framework to obtain the second parameter for monitoring the performance of the parallel operation, which improves the computing effect of the deep learning framework and increases the monitorability of parallel computing performance.
  • the artificial intelligence chip 10 includes an upper-layer language interface 101 and a deep learning framework 100.
  • the upper-layer language interface is used to access a programming language
  • The deep learning framework includes a container and other modules of the deep learning framework.
  • the container can interact with the modules of the deep learning framework.
  • the modules of the deep learning framework include the graph executor module, various operator modules, and the engine module.
  • the upper-layer language interface 101 may also be deployed on other chips or devices.
  • the other chips or devices are connected to the artificial intelligence chip, and information exchange between the two can also be performed.
  • the artificial intelligence chip 10 may also include an underlying library module 102, and the underlying library module includes an underlying runtime library and a driver module.
  • the deep learning framework 100 also includes a carrier for data transfer between the container and other modules of the deep learning framework or the underlying library module.
  • FIG. 5B is a schematic flowchart of a parameter processing method disclosed in the application example.
  • the parameter processing method is applied to the artificial intelligence chip shown in FIG. 5A.
  • the method specifically includes the following steps:
  • the upper-layer language interface writes a first parameter to the container, where the first parameter is used to describe the degree of parallelism of the deep learning framework.
  • Deep learning framework is a code framework used for deep learning projects.
  • Currently popular deep learning frameworks include Tensorflow, Caffe, Theano, MXNet, Torch, and PyTorch.
  • An interface is a shared boundary where two independent components in the system exchange information.
  • the upper language interface and the deep learning framework are two independent components, so there is an interface between them for information interaction.
  • Upper-level languages such as Python and R can be used in deep learning. Under normal circumstances, the upper-level language interface is directly connected to the deep learning framework.
  • The lack of a related parameter setting mechanism in this interface prevents users from setting and acquiring parameters of the artificial intelligence chip. Therefore, a new container is added below the upper-layer language interface for parameter setting and related data acquisition.
  • the parameter data fields for parameter setting and parameter acquisition in the container can be added in the container or in other modules, and then the position of parameter setting and parameter acquisition is designated as the container position.
  • a container is a class or structure used to store data and belongs to a module in a deep learning framework.
  • the container in the deep learning framework may be a native class or structure in the deep learning framework, and then add fields for parameter setting and parameter acquisition to the class or structure, such as the graphexecutor class.
  • the container in the deep learning framework may also be a class or structure independently created by the user for the parameter processing method in the artificial intelligence chip, such as the mludevice device class.
  • the method further includes: a parameter data field is included in the container, and the parameter data field is used to point to the first parameter and the second parameter.
  • Before the parameter data field is created in the container, there is no data field related to the first parameter and the second parameter in the entire artificial intelligence chip, so it is impossible to set the first parameter or obtain the second parameter; the parameter data field is therefore what allows the first parameter and the second parameter to be managed.
  • the first parameter includes data parallelism and model parallelism.
  • the deep learning framework in this embodiment is an MXNet deep learning framework.
  • Data parallelism (DP) refers to the parallel processing of data by different cores or processing units, and the degree of data parallelism refers to the maximum number of parallel executions when data is processed in parallel; model parallelism (MP) refers to the parallel processing of an operator or model on multiple cores, and the degree of model parallelism refers to the maximum number of parallel executions when a model or operator is processed in parallel.
  • the set parallelism parameters can be matched with the hardware foundation of the artificial intelligence chip.
  • When the scale, sparsity, or other characteristics of the input data differ, different parallelism parameters also need to be set.
  • the set data parallelism and/or model parallelism are written through the programming language, and then injected into the container through the upper language interface, that is, the setting of the first parameter is completed.
  • MXNet is a deep learning framework that supports languages such as C++, Python, R, Scala, Julia, Matlab, and JavaScript; it supports imperative and symbolic programming, can run on various hardware including artificial intelligence chips, and is currently one of the best-known deep learning frameworks. Therefore, the MXNet deep learning framework can be well combined with the method of the embodiments of the present application to complete the setting of the first parameter and the acquisition of the second parameter.
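  • The following self-contained Python sketch illustrates the container idea under stated assumptions: the class and field names are invented for illustration, while the patent itself suggests reusing a native MXNet class such as the graph executor, or adding a dedicated device class.

```python
# A container is a class used to store parameters; its parameter data field points to the
# first parameter (injected by the upper-layer language interface) and the second parameter
# (filled in later by framework modules). All names here are illustrative assumptions.
class ParameterContainer:
    def __init__(self):
        self.param_data = {"first": {}, "second": {}}   # parameter data field

    # upper-layer language interface: inject the first parameter describing the parallelism
    def set_first_param(self, data_parallelism=None, model_parallelism=None):
        self.param_data["first"] = {"data_parallelism": data_parallelism,
                                    "model_parallelism": model_parallelism}

container = ParameterContainer()
container.set_first_param(data_parallelism=4, model_parallelism=2)   # describe the desired parallelism
```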
  • the deep learning framework obtains the first parameter from the container, interacts the first parameter with module data of the deep learning framework, obtains a second parameter, and passes the second parameter Into the container, the second parameter is used to monitor the performance of the parallel operation of the deep learning framework described by the first parameter.
  • the module of the deep learning framework obtains the first parameter from the container.
  • the modules of the deep learning framework include the graph executor module, various operator modules, and the engine module. For example, if each operator module needs to perform parallel operations, it needs to obtain the first parameter, and then combine the other parameters in the operator module according to the first parameter, such as data size, etc., to obtain the second parameter.
  • The second parameter is a parameter used to monitor the performance of the parallel operation, and the obtained second parameter needs to be returned to the container.
  • The second parameter includes the channel elapsed time and the total channel elapsed time.
  • Interacting the first parameter with the module data of the deep learning framework to obtain the second parameter includes: passing the data parallelism to the module of the deep learning framework for data interaction to obtain the channel elapsed time (CET) and the total channel elapsed time (CETS) corresponding to the data parallelism; and passing the model parallelism to the module of the deep learning framework for data interaction to obtain the CET and CETS corresponding to the model parallelism, where the CET and CETS are used to calculate the computing time of the operator.
  • Regardless of whether the deep learning framework adopts DP or MP, the channel elapsed time (Channel Elapsed Time, CET) and the total channel elapsed time (Channel Elapsed Time Sum, CETS) are performance parameters of the parallel operation of both parallel modes and are used to calculate the computing time of the operator.
  • the second parameter of the single module or the entire deep learning framework obtained according to the first parameter and the module of the deep learning framework is transferred to the container, that is, the acquisition of the second parameter is completed.
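  • Continuing the illustration, the sketch below shows a mock framework module reading the first parameter from a container (here just a dict standing in for the ParameterContainer sketched above), doing placeholder work, and writing CET and CETS back; only the flow of parameters follows the text, while the timing model itself is invented.

```python
import time

def operator_module(container, work_items=100_000):
    """A framework module obtains the first parameter, does its (mock) work,
    and writes CET / CETS back into the container's parameter data field."""
    dp = container["first"].get("data_parallelism") or 1
    start = time.perf_counter()
    for _ in range(work_items // dp):      # pretend the work is split across dp parallel channels
        pass
    cet = time.perf_counter() - start      # channel elapsed time for this operator
    container["second"]["CET"] = cet
    container["second"]["CETS"] = container["second"].get("CETS", 0.0) + cet  # running sum across operators

container = {"first": {"data_parallelism": 4, "model_parallelism": 2}, "second": {}}
operator_module(container)
print(container["second"])                 # the upper-layer interface reads CET / CETS from here
```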
  • the upper-layer language interface obtains a second parameter from the container.
  • The upper-layer language interface is connected to the container and can obtain the second parameter from the container and expose it, so that the second parameter is visible to the user.
  • The user can monitor the computing performance of the deep learning framework through the second parameter, and then modify the first parameter or other parameters to adjust or improve the second parameter, thereby improving the computing effect of the deep learning framework.
  • the deep learning framework further includes a carrier
  • the method further includes: the container and the module of the deep learning framework perform data transmission and interaction through the carrier.
  • the carrier is a class or structure used for data transfer and interaction in the deep learning framework.
  • the container is not directly related to other modules of deep learning, and data can be transferred through the carrier.
  • the carrier in the MXNet framework may be the operator's context class OpContext.
  • the first parameter may be assigned to the carrier, and then the carrier passes the first parameter to the module of the deep learning framework.
  • the second parameter can also be transferred from the module of the deep learning framework to the container by the carrier.
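  • A small sketch of the carrier idea, with an invented Carrier class standing in for something like MXNet's OpContext; it only illustrates the two directions of parameter transfer described above.

```python
# The container is not wired directly to the framework modules; a carrier ferries
# the first parameter to a module and returns second-parameter values to the container.
class Carrier:
    def __init__(self, container):
        self.container = container

    def first_param(self):                         # container -> module
        return self.container["first"]

    def report_second_param(self, name, value):    # module -> container
        self.container["second"][name] = value

container = {"first": {"model_parallelism": 2}, "second": {}}
carrier = Carrier(container)
mp = carrier.first_param()["model_parallelism"]    # a module reads the first parameter via the carrier
carrier.report_second_param("CET", 0.004)          # the module returns a second-parameter value via the carrier
print(container["second"])
```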
  • the artificial intelligence chip further includes an underlying library module
  • the method further includes: performing parameter transfer interaction between the container and the underlying library module through the carrier, and the parameter includes a first parameter and a second parameter .
  • the low-level library modules include low-level runtime libraries and driver modules.
  • The parameters of these low-level libraries may also affect the parallel performance or other performance of the deep learning framework; therefore, the container can also interact with the low-level library modules through the carrier in order to obtain parallel computing performance parameters or other performance parameters.
  • the upper-layer language interface and the deep learning framework are deployed in the artificial intelligence chip.
  • the deep-learning framework includes a container, and the container is connected to the upper-layer language interface.
  • the upper-layer language interface writes the first parameter into the container.
  • the deep learning framework obtains the first parameter from the container, combines the first parameter and the module parameter of the deep learning framework to obtain the second parameter, and passes the second parameter to the container, and finally the upper-level language interface obtains the second parameter from the container and Provided to users.
  • the first parameter is used to describe the degree of parallelism of the deep learning framework and the second parameter is used to monitor the performance of parallel operations
  • this process improves the effect of parallel operations in the deep learning framework by writing the first parameter to the container.
  • Statistics and acquisition of the second parameter improve the monitorability of parallel computing performance.
  • FIG. 6 is a schematic flowchart of another parameter processing method provided by an embodiment of the present application.
  • the parameter processing method includes:
  • the upper-level language interface injects the first parameter into the container, where the first parameter is used to describe the degree of parallelism of the deep learning framework;
  • the deep learning framework further includes a carrier.
  • the deep learning framework obtains the first parameter from the container and, through the carrier, interacts the first parameter with the module data of the deep learning framework to obtain the second parameter;
  • the deep learning framework passes the second parameter to the container through the carrier, and the second parameter is used to monitor the performance of the parallel operation;
  • the artificial intelligence chip further includes an underlying library module, and the container and the underlying library module perform parameter transmission and interaction through the carrier, and the parameters include a first parameter and a second parameter.
  • FIG. 7 is a schematic flowchart of another parameter processing method provided by an embodiment of the present application.
  • the parameter processing method includes:
  • S313 Inject the data parallelism and/or the model parallelism into the container through the upper-layer language interface
  • S314 Pass the data parallelism to the modules of the deep learning framework for data interaction to obtain the CET and CETS corresponding to the data parallelism, where the CET and the CETS are used to measure the computation time of operators;
  • the upper-layer language interface obtains the CETS and CET corresponding to the data parallelism and/or the model parallelism from the container.
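As a small numerical illustration (the figures are invented for the example, and CET/CETS are read here as per-channel elapsed times and their sum): if the data parallelism is set to 4 and the four parallel channels of an operator take 2.0 ms, 2.2 ms, 1.9 ms, and 2.1 ms, then the CET values are those four per-channel times and the CETS is their sum, 8.2 ms; comparing the CETS obtained under different parallelism settings indicates how an operator's computation time responds to the first parameter.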
  • FIG. 8 is a parameter processing device provided by an embodiment of the present application, which is applied to an artificial intelligence chip shown in FIG. 5A.
  • the parameter processing device 410 includes:
  • a writing module 411 configured to write a first parameter into the container through the upper-layer language interface, wherein the first parameter is used to describe the degree of parallelism of the deep learning framework;
  • the calculation module 412 is configured to obtain the first parameter from the container through the deep learning framework, interact the first parameter with data of the module of the deep learning framework, and obtain a second parameter, and The second parameter is transferred to the container, and the second parameter is used to monitor the performance of the parallel operation;
  • the obtaining module 413 is configured to obtain the second parameter from the container through the upper-layer language interface.
  • the upper-layer language interface first writes the first parameter into the container; the deep learning framework then obtains the first parameter from the container, combines it with the module parameters of the deep learning framework to obtain the second parameter, and passes the second parameter to the container; finally, the upper-layer language interface obtains the second parameter from the container and provides it to the user.
  • because the first parameter describes the degree of parallelism of the deep learning framework and the second parameter monitors the performance of parallel operations, writing the first parameter into the container improves the effect of parallel operations in the deep learning framework, and collecting and reporting the second parameter improves the monitorability of parallel computing performance.
  • the writing module is further used to:
  • a parameter data field is included in the container, and the parameter data field is used to point to the first parameter and the second parameter.
  • the first parameter includes data parallelism and model parallelism.
  • the second parameter includes the channel elapsed time and the channel elapsed time sum.
  • the calculation module is specifically used to:
  • CET: channel elapsed time
  • CETS: channel elapsed time sum
  • the model parallelism is transferred to the module of the deep learning framework for data interaction, and the CET and CETS corresponding to the data parallelism are obtained.
  • the deep learning framework is an MXNet deep learning framework.
  • the deep learning framework further includes a carrier
  • the calculation module is further configured to:
  • Parameter transfer interaction between the container and the module of the deep learning framework is performed through the carrier, and the parameter includes a first parameter and a second parameter.
  • the artificial intelligence chip further includes an underlying library module, and the calculation module is further used to:
  • Parameter transfer interaction between the container and the underlying library module is performed through the carrier, and the parameter includes a first parameter and a second parameter.
  • the container includes a native class or structure in the deep learning framework, or a class or structure independently created in the deep learning framework for the artificial intelligence chip.
  • the present application also discloses a combined processing device, which includes the above-mentioned parameter processing device, universal interconnection interface, and other processing devices.
  • the parameter processing device interacts with other processing devices to complete the operation specified by the user.
  • 9A is a schematic diagram of a combined processing device.
  • Other processing devices include one or more types of general-purpose/special-purpose processors such as central processing unit CPU, graphics processor GPU, neural network processor.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as an interface for parameter processing devices to interact with external data to achieve functions such as data handling, and complete basic control of the start and stop of this parameter processing device; other processing devices can also cooperate with parameter processing devices to complete calculation tasks .
  • the universal interconnection interface is used to transfer data and control instructions between the parameter processing device and other processing devices.
  • the parameter processing device obtains the required input data from other processing devices and writes it to the on-chip storage device of the parameter processing device; it can obtain control instructions from other processing devices and write them to the on-chip control buffer of the parameter processing device; it can also read the data in the storage module of the parameter processing device and transmit it to other processing devices.
  • another combined processing device may further include a storage device, and the storage device is connected to the parameter processing device and the other processing devices, respectively.
  • the storage device is used to store data of the parameter processing device and the other processing device, and is particularly suitable for calculation data that cannot be completely saved in the internal storage of the parameter processing device or other processing device.
  • the combined processing device can be used as an SoC (system on chip) for mobile phones, robots, drones, video surveillance equipment, etc., effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
  • in this case, the general interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
  • in some embodiments, a chip is also claimed, which includes the above parameter processing device.
  • in some embodiments, a chip packaging structure is claimed, which includes the above chip.
  • in some embodiments, a board card is claimed, which includes the above chip packaging structure.
  • FIG. 10 provides a board card.
  • the board card may also include other supporting components, including but not limited to: a storage device 710, a receiving device 720, and a control device 730;
  • the storage device 710 is connected to the chip in the chip packaging structure through a bus, and is used to store data.
  • the storage device may include multiple sets of storage units 711. Each group of the storage unit and the chip are connected by a bus. It can be understood that each group of the storage units may be DDR SDRAM (English: Double Data Rate SDRAM, double rate synchronous dynamic random access memory).
  • DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.
  • the storage device may include 4 groups of the storage units. Each group of storage units may include multiple DDR4 chips (granules). In one embodiment, the chip may include four 72-bit DDR4 controllers; among the 72 bits of each DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
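The 25600 MB/s figure can be checked with a back-of-the-envelope calculation (this check is an addition for illustration, not part of the original text):

```python
# Theoretical bandwidth of one 64-bit data channel populated with DDR4-3200 chips.
transfer_rate_mt_s = 3200      # DDR4-3200: 3200 mega-transfers per second
data_bus_bits = 64             # 64 of the controller's 72 bits carry data, 8 are ECC
bytes_per_transfer = data_bus_bits // 8

bandwidth_mb_s = transfer_rate_mt_s * bytes_per_transfer
print(bandwidth_mb_s)          # 25600 MB/s, matching the value quoted above
```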
  • each group of the storage units includes multiple double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transfer data twice in one clock cycle.
  • a controller for controlling DDR is provided in the chip for controlling data transmission and data storage of each storage unit.
  • the interface device is electrically connected to the chip in the chip packaging structure.
  • the interface device is used to realize data transmission between the chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
  • the interface device may also be other interfaces.
  • the present application does not limit the specific expressions of the other interfaces described above, and the interface unit may implement the transfer function.
  • the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
  • the control device is electrically connected to the chip.
  • the control device is used to monitor the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
  • MCU Micro Controller Unit
  • the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads; therefore, the chip may be in different working states such as multi-load and light-load.
  • the control device can realize the adjustment of the working state of multiple processing chips, multiple processing cores or multiple processing circuits in the chip.
  • in some embodiments, an electronic device is claimed, which includes the above-mentioned board card.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, still cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
  • the vehicles include airplanes, ships, and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • neural networks have broad and attractive prospects in the fields of system identification, pattern recognition, and intelligent control.
  • in intelligent control in particular, people are especially interested in the self-learning function of neural networks and regard this important feature of neural networks as one of the keys to solving the problem of controller adaptability in automatic control.
  • Existing neural network architectures are all based on multi-bit architectures, such as the commonly used 32-bit architecture.
  • data in existing neural network architectures therefore occupies many bits, which requires more storage space and processing bandwidth and increases cost.
  • the embodiments of the present application provide a neural network quantization method and related products, which can reduce the number of bits of the neural network architecture, reduce storage space and processing bandwidth, and reduce costs.
  • FIG. 11 provides a schematic diagram of a neural network architecture.
  • the neural network architecture may include a multi-layer structure.
  • the multi-layer structure may include: an input layer, convolution layer 1, a batch normalization (batchnorm) layer, convolution layer 2, intermediate layers (neural network architectures with different functions have different intermediate layers, and there may be at least one intermediate layer), convolution layer n, fully connected layer 1, and an activation layer (for example, with the softmax activation function).
  • a layer with a large amount of computation can be called a computation layer, such as a convolution layer, a fully connected layer, and so on.
  • the above calculation layer can also include other types of layers.
  • the neural network architecture in FIG. 11 is provided for illustration only, and the neural network in this application is not limited to the architecture shown in FIG. 11.
  • FIG. 12 provides a neural network quantization method.
  • This method can be implemented under the neural network architecture shown in FIG. 11.
  • the method shown in FIG. 12 does not limit the structure of the neural network architecture.
  • the method shown in FIG. 12 may be executed by a neural network chip.
  • the method may also be implemented by a general-purpose chip or by an electronic device containing a chip; the general-purpose chip is, for example, a central processing unit (CPU), a graphics processor (GPU), and so on.
  • the method is shown in Figure 12, and includes the following steps:
  • Step S221 Obtain the weight and input data of the target quantization layer of the original neural network; wherein, the target quantization layer is at least one of the calculation layers of the original neural network;
  • the original neural network in step S221 may be a known neural network, such as a trained neural network model, and the neural network model includes input data of the input layer.
  • the at least one layer may specifically include one or more layers.
  • the above calculation layer may include at least one of a convolution layer, a fully connected layer, an LRN normalization layer, a deconvolution layer, a Reorg layer, and a Normalize normalization layer.
  • the above calculation layer may also be other layers, and this application does not limit the specific expression form of the calculation layer.
  • Step S222 Use the weights of the target quantization layer of the original neural network to determine the quantization parameters of the weights of the corresponding layer; use the input data of the target quantization layer of the original neural network to determine the quantization parameters of the input data of the corresponding layer;
  • the maximum-absolute-value no-distortion principle is adopted when determining the quantization parameters, that is, both the weights and the input data of the target quantization layer use the principle that the maximum absolute value is represented without distortion.
  • Step S223 Quantify the target quantization layer of the original neural network according to the quantization parameter of the weight value and the quantization parameter of the input data.
  • the implementation of the above step S223 may specifically include: storing the weight quantization parameter and the input data quantization parameter in the ini configuration file of the target quantization layer; if the target quantization layer is the first layer of the neural network, the above ini configuration file may also include the mean and the variance.
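As an illustration of such a per-layer configuration file, the sketch below writes one possible layout with Python's configparser. The section and key names (weights, input, position, scale, mean, var) are assumptions chosen for the example rather than a format prescribed by the text.

```python
# Hypothetical layout of a target quantization layer's ini file; the section
# and key names are illustrative assumptions, not a prescribed format.
import configparser

config = configparser.ConfigParser()
config["weights"] = {"position": "-4", "scale": "1.25"}   # weight quantization parameters
config["input"] = {"position": "-2", "scale": "1.10"}     # input data quantization parameters
# Only present when the target quantization layer is the first layer of the network:
config["preprocess"] = {"mean": "104,117,123", "var": "1,1,1"}

with open("layer1_quant.ini", "w") as f:
    config.write(f)
```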
  • the technical solution provided by the present application quantizes the target quantization layer of the original neural network to obtain the quantization parameter of the weight and the quantization parameter of the input data, and then completes the quantization of the target quantization layer according to the quantization parameter.
  • when the target quantization layer quantized in this way performs operations, since both the input data and the weights are quantized data, the storage space for the weights and for the input data is reduced, and the amount of computation on the smaller number of bits is correspondingly reduced, so the approach has the advantages of reducing the amount of computation, increasing computation speed, and reducing power consumption.
  • using the weights of the target quantization layer of the original neural network to determine the quantization parameters of the weights of the corresponding layer may specifically include: obtaining the maximum absolute value of the weights of each layer in the target quantization layer, and determining the first quantization parameter and the second quantization parameter of the weights of the corresponding layer according to that maximum absolute value.
  • the maximum absolute value of the weights may specifically be the value with the largest absolute value among all elements of the weights; for example, if a weight contains 5 elements whose values are α1, α2, α3, α4, and α5, then the maximum absolute value of the weight is the largest of |α1|, |α2|, |α3|, |α4|, and |α5|.
  • the above-mentioned use of the input data of the target quantization layer of the original neural network to determine the quantization parameter of the input data of the corresponding layer may specifically include:
  • the first quantization parameter and the second quantization parameter of the input data of the corresponding layer are determined according to the maximum value of the absolute value of the input data of each layer in the target quantization layer.
  • the maximum value of the absolute value of the input data may specifically be: the maximum value of the absolute value of all elements of the input data.
  • the above method may further include:
  • processing each layer of the target quantization layer of the original neural network with the first quantization method, the second quantization method, or the third quantization method may include: processing the weights of each layer in the target quantization layer with the first, second, or third quantization method to obtain a weight quantization result; it may also include processing the input data of each layer in the target quantization layer with the first, second, or third quantization method to obtain an input data quantization result.
  • the above-mentioned first quantization method may include: quantizing the weight of the corresponding layer using the first quantization parameter of the weight of each layer in the target quantization layer to obtain the weight quantization result of the corresponding layer; using the target quantization The first quantization parameter of the input data of each layer in the layer quantizes the input data of the corresponding layer to obtain the input data quantization result of the corresponding layer.
  • the first quantization method may specifically be: fp32 data = fix8 data * 2^position.
  • fp32 data may be a weight or an element value of input data
  • fix8 data may be the corresponding quantized value of that element in the weight quantization result or in the input data quantization result
  • position may be a first quantization parameter.
  • abs_max is the maximum absolute value of the weights; because fix8 data is 8-bit data, there are 8 bits, one of which is the sign bit, the integer part occupies 7 bits, and the fractional part occupies 0 bits, so the maximum integer that can be represented is 2^7 - 1, and 127 is taken when calculating position.
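As a small worked example (the exact expression for position is not reproduced in this text, so it is assumed here to be the smallest integer satisfying abs_max ≤ 127 * 2^position): if abs_max = 6.35, then abs_max / 127 = 0.05 and log2(0.05) ≈ -4.32, so position = -4; the check 6.35 / 2^(-4) = 101.6 ≤ 127 confirms that the maximum absolute value is still representable, and the corresponding second quantization parameter is scale = 127 * 2^(-4) / 6.35 = 1.25.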
  • the above second quantization method may include:
  • the second quantization method may specifically be: fp32 data = fix8 data / new_scale, where new_scale = 2^(-position) * scale and scale = 127 * 2^position / abs_max.
  • new_scale may be a quantization intermediate parameter and scale may be a second quantization parameter; when the fp32 data is an element value of the weights, new_scale may be the weight quantization intermediate parameter, and when the fp32 data is an element value of the input data, new_scale may be the input data quantization intermediate parameter.
  • the third quantization method includes:
  • the third quantization method may specifically be: fp32 data = (fix8 data * 2^position) / scale.
  • in practice the chip can choose according to the actual situation; that is, within the same layer the input data may be quantized with the first quantization method while the weights are quantized with the second or third quantization method.
  • other combinations of the three quantization methods may also be used.
  • the present application does not limit which method is used to quantize the input data and the weights.
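The three methods can be brought together in a short sketch. One caveat: the exact formula for position is not given in this text, so the sketch assumes position = ceil(log2(abs_max / 127)) (the smallest integer for which abs_max maps onto 127 without overflow); scale, new_scale, and the dequantization rules are the ones stated above, and the helper names are invented for the example.

```python
# Sketch of the three per-tensor quantization schemes described above.
# The expression for position is an assumption (see the note above); scale and
# new_scale follow the formulas given in the text. Assumes abs_max > 0.
import math

import numpy as np


def quant_params(data):
    abs_max = float(np.max(np.abs(data)))
    position = math.ceil(math.log2(abs_max / 127.0))   # first quantization parameter (assumed form)
    scale = 127.0 * (2.0 ** position) / abs_max        # second quantization parameter
    return position, scale


def method1(data):
    # Dequantization rule: fp32 data = fix8 data * 2^position
    position, _ = quant_params(data)
    fix8 = np.clip(np.round(data / (2.0 ** position)), -127, 127).astype(np.int8)
    return fix8, position


def method2(data):
    # Dequantization rule: fp32 data = fix8 data / new_scale, new_scale = 2^(-position) * scale
    position, scale = quant_params(data)
    new_scale = (2.0 ** -position) * scale             # quantization intermediate parameter
    fix8 = np.clip(np.round(data * new_scale), -127, 127).astype(np.int8)
    return fix8, new_scale


def method3(data):
    # Dequantization rule: fp32 data = (fix8 data * 2^position) / scale
    position, scale = quant_params(data)
    fix8 = np.clip(np.round(data * scale / (2.0 ** position)), -127, 127).astype(np.int8)
    return fix8, position, scale
```

On this reading, the second and third methods describe the same numeric mapping (fix8 ≈ round(fp32 * 127 / abs_max)) and differ only in which parameters are carried, while the first method quantizes with the power-of-two step 2^position alone.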
  • the above method may further include:
  • the first weight quantization parameter and the second weight quantization parameter of each channel of each layer in the target quantization layer are used to obtain the weight quantization intermediate parameter of the corresponding channel; where the target quantization layer includes a convolution layer and/or a fully connected layer, the weight quantization intermediate parameter of each channel is used to obtain the weight quantization result of the corresponding channel, and the weight quantization results of all channels of each layer in the target quantization layer constitute the weight quantization result of the corresponding layer;
  • each channel of each layer in the above target quantization layer may contain one data block of that layer's weights or input data.
  • taking a convolution layer as an example of the target quantization layer, the weights may be the four-dimensional data M, KH, KW, C shown in FIG. 13A, and each channel of the convolution layer may contain one three-dimensional data block KH, KW, C (the data block itself is shown in FIG. 13B); each data block corresponds to one position and one scale, so if the convolution layer has n channels, there are n data blocks, and the weights of the convolution layer correspond to n positions and n scales.
  • according to new_scale = 2^(-position) * scale, n new_scale values can be obtained as weight quantization intermediate parameters; the compiler then converts the n new_scale values to obtain n position′ values and n scale′ values, selects the maximum value among the n position′ values, compensates the n scale′ values accordingly, and finally uses the following formula to obtain the weight quantization result of each data block.
  • the formula is: fp32 data = (fix8 data * 2^(position′_max)) / scale″
  • position′_max is the maximum value selected from the n position′ values
  • scale″ is the compensation result for scale′.
  • the weight quantization result corresponding to each data block constitutes the weight quantization result of the current convolutional layer.
  • For the current convolution layer, no matter how many channels or data blocks it has, there is one and only one set of input data, so the input data corresponds to 1 position and 1 scale.
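A sketch of the per-channel weight handling described above follows. Two points are assumptions made for illustration: position′ and scale′ are taken to be a per-channel decomposition new_scale = 2^(-position′) * scale′, and the compensation of scale′ toward the shared maximum position′ is done by the power-of-two gap, since the text names the compensation step without giving its exact formula.

```python
# Sketch of per-channel weight quantization for a convolution layer whose
# weights are the four-dimensional data M x KH x KW x C (one data block per channel).
# The scale' compensation rule is an assumption; see the note above.
import math

import numpy as np


def per_channel_quantize(weights):
    n = weights.shape[0]                     # number of channels / data blocks
    new_scales, positions = [], []
    for i in range(n):
        abs_max = float(np.max(np.abs(weights[i])))
        p = math.ceil(math.log2(abs_max / 127.0))      # per-channel position'
        positions.append(p)
        new_scales.append(127.0 / abs_max)             # new_scale = 2^(-p) * scale
    position_max = max(positions)                      # shared exponent
    # Compensate each channel's scale so that 2^position_max / scale'' == 1 / new_scale.
    scales_comp = [ns * (2.0 ** position_max) for ns in new_scales]
    fix8 = np.empty_like(weights, dtype=np.int8)
    for i in range(n):
        # Inverse of the dequantization rule: fp32 = (fix8 * 2^position_max) / scale''
        q = np.round(weights[i] * scales_comp[i] / (2.0 ** position_max))
        fix8[i] = np.clip(q, -127, 127).astype(np.int8)
    return fix8, position_max, scales_comp
```

The per-channel results fix8[i] together form the weight quantization result of the convolution layer, consistent with the formula fp32 data = (fix8 data * 2^(position′_max)) / scale″ given above.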
  • when the target quantization layer is a fully connected layer and/or a convolution layer, the other layers of the neural network may also be quantized using the above-mentioned first quantization method, second quantization method, or third quantization method.
  • FIG. 14 is a neural network quantization device.
  • the device includes:
  • the data reading unit 421 is used to obtain the weight and input data of the target quantization layer of the original neural network; wherein, the target quantization layer is at least one of the calculation layers of the original neural network;
  • the quantization parameter determination unit 422 is used to determine the quantization parameter of the weight of the corresponding layer by using the weight of the target quantization layer of the original neural network; and use the input data of the target quantization layer of the original neural network to determine the input data of the corresponding layer Quantization parameter; wherein, the weights and input data of the target quantization layer adopt the principle of maximum absolute value without distortion;
  • the quantization unit 423 is configured to quantize the target quantization layer of the original neural network according to the quantization parameter of the weight value and the quantization parameter of the input data.
  • the above quantization parameter determination unit 422 is specifically configured to obtain the maximum value of the absolute value of the weight of each layer in the target quantization layer; according to the weight of each layer in the target quantization layer The maximum value of the absolute value determines the first quantization parameter and the second quantization parameter of the weight value of the corresponding layer.
  • the above quantization parameter determination unit 422 is specifically configured to obtain the maximum value of the absolute value of the input data of each layer in the target quantization layer; according to the input data of each layer in the target quantization layer The maximum value of the absolute value determines the first quantization parameter and the second quantization parameter of the input data of the corresponding layer.
  • the device further includes:
  • the processing unit 424 is configured to process the first quantization method, the second quantization method, or the third quantization method for each of the target quantization layers of the original neural network; wherein,
  • the first quantization method includes:
  • the second quantization method includes:
  • the third quantization method includes:
  • the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
  • the processing unit 424 is configured to obtain the weight quantization intermediate parameter of the corresponding channel by using the first weight quantization parameter and the second weight quantization parameter of each channel of each layer in the target quantization layer; wherein ,
  • the target quantization layer includes a convolutional layer and/or a fully connected layer,
  • the input data quantization intermediate parameter of each layer in the target quantization layer is used to obtain the input data quantization result of the corresponding layer.
  • the processing unit 424 is further configured to process the first quantization method, the second quantization method, or the third quantization method on each of the target quantization layers of the original neural network; wherein, the target quantization The layer also includes at least one layer other than the convolution layer and/or the fully connected layer in the computing layer of the original neural network;
  • the first quantization method includes:
  • the second quantization method includes:
  • the third quantization method includes:
  • the input data quantization result of the corresponding layer is obtained by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
  • FIG. 15 provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • when the processor executes the computer program, the method shown in FIG. 12 and its refined schemes are implemented.
  • the above-mentioned processor may specifically be a general-purpose processor, such as a central processing unit CPU, an image processor GPU.
  • the above-mentioned processor may also be a dedicated neural network processor, such as a systolic array machine, a machine learning processor, etc.
  • the above-mentioned processor may also be a processor combining a general-purpose processor and a neural network dedicated processor. This application does not limit the specific expression form of the above-mentioned processor.
  • the above electronic equipment may include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, still cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
  • the above vehicles include airplanes, ships, and/or road vehicles; the above household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
  • Embodiments of the present application also provide a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes the computer to execute the method and the detailed solution shown in FIG. 12.
  • An embodiment of the present application also provides a computer program product; the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the method shown in FIG. 12 and its refined schemes.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the unit is only a logical function division.
  • there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or software program modules.
  • if the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, and the computer software product is stored in a memory.
  • Several instructions are included to enable a computer device (which may be a personal computer, server, network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned memory includes: U disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.
  • the program may be stored in a computer-readable memory, and the memory may include: a flash disk, Read-Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, an optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

本申请公开了一种网络离线模型的处理方法、人工智能处理装置及相关产品,其中,相关产品包括组合处理装置,所述组合处理装置包括该人工智能处理装置,通用互联接口和其它处理装置;所述人工智能处理装置与所述其它处理装置进行交互,共同完成用户指定的计算操作。本申请实施例有利于提高网络离线模型的运算速度。

Description

网络离线模型的处理方法、人工智能处理装置及相关产品
本申请要求:
于2018年12月29日提交中国专利局、申请号为2018116461097,申请名称为“网络离线模型的处理方法、人工智能处理装置及相关产品”;
于2018年12月29日提交中国专利局、申请号为2018116541797,申请名称为“一种神经网络量化方法、装置以及相关产品”;
于2018年12月21日提交中国专利局、申请号为2018115700616、申请名称为“参数处理方法及相关产品”;
以上三个中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及信息处理技术领域,具体涉及一种网络离线模型的处理方法、人工智能处理装置及相关产品。
背景技术
随着信息技术的不断发展和人们日益增长的需求,人们对信息及时性的要求越来越高了。目前,终端对信息的获取以及处理均是基于处理器实现的。在实践中发现,这种基于处理器运行软件程序来处理信息的方式,受限于网络模型的类型,也就是说,对于一些新生的网络模型,处理器对网络类型的版本不兼容。目前,在处理器上运行的网络离线模型,是在机器框架下构建的,在构建网络模型时,未对各层网络加以区分,导致单一处理器无法兼容各种网络离线模型。
发明内容
本申请实施例提供了一种离线模型的处理方法,在保存离线网络时,保存该离线网络的类型标识,以期依据类型标识兼容执行所有类型的离线网络。
第一方面,本申请实施例提供了一种网络离线模型的处理方法,该方法包括:
获取网络离线模型中各子网络的运行单元信息,所述运行单元信息包括子网络与运行单元类型之间的对应关系,所述运行单元类型包括通用处理单元类型或人工智能处理单元类型;
根据所述运行单元信息,在构建的所述网络离线模型中定义子网络运行参数,得到构建后的网络离线模型,所述子网络运行参数用于表示各子网络的运行单元类型。
第二方面,本申请实施例提供一种离线模型的人工智能装置,所述装置包括:
获取模块,用于获取网络离线模型中各子网络的运行单元信息,所述运行单元信息包括子网络与运行单元类型之间的对应关系,所述运行单元类型包括通用处理单元类型或人工智能处理单元类型;
构建模块,用于根据所述运行单元信息,在构建的所述网络离线模型中定义子网络运行参数,得到构建后的网络离线模型,所述子网络运行参数用于表示各子网络的运行单元类型。
第三方面,本申请实施例提供一种计算机设备,包括存储器、处理器,所述存储器上存储有可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如第一方面所述的方法。
第四方面,本申请实施例提供一种可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现如第一方面所述的方法。
第五方面,本申请实施例提供一种组合处理装置,其特征在于,所述组合处理装置包括如第二方面所述的人工智能处理装置,通用互联接口和其它处理装置;
所述人工智能处理装置与所述其它处理装置进行交互,共同完成用户指定的计算操作。
第六方面,本申请实施例提供了一种参数处理方法,应用于人工智能芯片,所述人工智能芯片中部署了上层语言接口和深度学习框架,所述深度学习框架中包括容器,所述容器与所述上层语言接口连接,所述方法包括:
所述上层语言接口将第一参数注入所述容器中,其中所述第一参数用于描述所述深度学习框架的并行程度;
所述深度学习框架从所述容器中获取所述第一参数,并将所述第一参数与所述深度学习框架的模块数据进行交互,获得第二参数,并将所述第二参数传递到所述容器中,所述第二参数用于监测所述第一参数描述的深度学习框架的并行运算性能,所述容器是用于存放参数的类或结构体;
所述上层语言接口从所述容器中获取第二参数。
可选情况下,在所述上层语言接口将第一参数写入容器中之前,所述方法还包括:所述容器中包括参数数据字段,所述参数数据字段用于指向第一参数和第二参数。
可选情况下,所述第一参数包括数据并行度和模型并行度。
可选情况下,所述第二参数包括通道消失时间和通道消失时间总和。
可选情况下,所述将所述第一参数与所述深度学习框架的模块数据进行交互,获得第二参数,包括:
将所述数据并行度传递到深度学习框架的模块进行数据交互,获得所述数据并行度对应的通道消失时间(CET)和通道消失时间总和(CETS),所述CETS和所述CET用于统计算子的计算时间;
将所述模型并行度传递到深度学习框架的模块进行数据交互,获得所述数据并行度对应的CET和CETS。
可选情况下,所述深度学习框架为MXNet深度学习框架。
可选情况下,所述深度学习框架还包括载体,所述方法还包括:
通过所述载体进行所述容器与所述深度学习框架的模块之间的参数传递交互,所述参数包括第一参数和第二参数。
可选情况下,所述人工智能芯片还包括底层库模块,所述方法还包括:
通过所述载体进行所述容器与所述底层库模块之间的参数传递交互,所述参数包括第一参数和第二参数。
可选情况下,所述容器包括所述深度学习框架中的原生类或结构体,或者针对所述人工智能芯片在所述深度学习框架中独立创建的类或结构体。
第七方面,本申请实施例提供了一种参数处理装置,应用于人工智能芯片,所述人工智能芯片中部署了上层语言接口和深度学习框架,所述深度学习框架中包括容器,所述容器与所述上层语言接口连接,所述装置包括:
写入模块,用于通过所述上层语言接口将第一参数写入容器中,其中所述第一参数用于描述所述深度学习框架的并行程度;
计算模块,用于通过所述深度学习框架从所述容器中获取所述第一参数,并将所述第一参数与所述深度学习框架的模块的数据进行交互,获得第二参数,并将所述第二参数传递到所述容器中,所述第二参数用于监测并行运算的性能,所述容器为用于存放参数的类或结构体;
获取模块,用于通过所述上层语言接口从所述容器中获取第二参数。
第八方面,本申请实施例提供了一种电子装置,包括处理器、存储器、通信接口,以及一个或多个程序,所述一个或多个程序被存储在所述存储器中,并且被配置给所述处理器执行,所述程序包括用于执行第六方面所述的方法中的步骤的指令。
第九方面,本申请实施例提供了一种计算机可读存储介质,存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行第六方面所述的方法。
第十方面,本申请实施例提供了一种芯片,包括第七方面提供的参数处理装置。
第十一方面,本申请实施例提供了一种芯片封装结构,该封装结构包括上述第十方面所述的芯片;
第十二方面,本申请实施例提供了一种板卡,该板卡包括上述第十一方面所述的芯片封装结构。
第十三方面,本申请实施例提供了一种电子装置,该电子装置包括上述第十一方面所述的芯片封装结构或者上述第十二方面所述的板卡。
第十四方面,本申请实施例提供了一种存储介质,用于存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行第六方面任一方法所述的步骤的指令。
第十五方面,本申请实施例提供了一种神经网络量化方法,包括:
获取原始神经网络的目标量化层的权值和输入数据;其中,所述目标量化层为所述原始神经网络的计算层中的至少一层;
利用所述原始神经网络的目标量化层的权值确定对应层的权值的量化参数;利用所述原始神经网络的目标量化层的输入数据确定对应层的输入数据的量化参数;其中,所述目标量化层的权值和输入数据均采用绝对值最大值不失真原则;
根据所述权值的量化参数和所述输入数据的量化参数对所述原始神经网络的目标量化层进行量化。
可选情况下,所述计算层包括:卷积层、全连接层、LRN归一化层,反卷积层、Reorg层,Normalize归一化层中的至少一种。
可选情况下,所述利用所述原始神经网络的目标量化层的权值确定对应层的权值的量化参数的步骤包括:
获取所述目标量化层中的每一层的权值的绝对值的最大值;
根据所述目标量化层中的每一层的权值的绝对值的最大值确定对应层的权值的第一量 化参数和第二量化参数。
可选情况下,所述利用所述原始神经网络的目标量化层的输入数据确定对应层的输入数据的量化参数的步骤包括:
获取所述目标量化层中的每一层的输入数据的绝对值的最大值;
根据所述目标量化层中的每一层的输入数据的绝对值的最大值确定对应层的输入数据的第一量化参数和第二量化参数。
可选情况下,所述方法还包括:
对所述原始神经网络的目标量化层中的每一层采用第一量化方法、第二量化方法或第三量化方法进行处理;其中,
所述第一量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数将对应层的权值进行量化,获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数将对应层的输入数据进行量化,获得对应层的输入数据量化结果;
所述第二量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数和第二量化参数获得对应层的权值量化中间参数;
根据所述权值量化中间参数获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数和第二量化参数获得对应层的输入数据量化中间参数;
根据输入数据量化中间参数获得对应层的输入数据量化结果;
所述第三量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数和第二量化参数获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数和第二量化参数获得对应层的输入数据量化结果。
可选情况下,所述方法还包括:
利用所述目标量化层中的每一层的每个通道的第一权值量化参数和第二权值量化参数获得对应通道的权值量化中间参数;其中,所述目标量化层包括卷积层和/或全连接层;
利用每个通道的权值量化中间参数获得对应通道的权值量化结果,所述目标量化层中的每一层的每个通道的权值量化结果构成对应层的权值量化结果;
利用所述目标量化层中的每一层的第一输入数据量化参数和第二输入数据量化参数获得对应层的输入数据量化中间参数;
利用所述目标量化层中的每一层的输入数据量化中间参数获得对应层的输入数据量化结果。
可选情况下,所述方法还包括:
对所述原始神经网络的目标量化层中的每一层采用第一量化方法、第二量化方法或第三量化方法进行处理;其中,所述目标量化层还包括所述原始神经网络的计算层中除了卷 积层和/或全连接层之外的其他至少一层;
所述第一量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数将对应层的权值进行量化,获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数将对应层的输入数据进行量化,获得对应层的输入数据量化结果;
所述第二量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数和第二量化参数获得对应层的权值量化中间参数;
根据所述权值量化中间参数获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数和第二量化参数获得对应层的输入数据量化中间参数;
根据输入数据量化中间参数获得对应层的输入数据量化结果;
所述第三量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数和第二量化参数获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数和第二量化参数获得对应层的输入数据量化结果。
第十六方面,本申请实施例提供了一种神经网络量化装置,所述装置包括:
数据读取单元,用于获取原始神经网络的目标量化层的权值和输入数据;其中,所述目标量化层为所述原始神经网络的计算层中的至少一层;
量化参数确定单元,用于利用所述原始神经网络的目标量化层的权值确定对应层的权值的量化参数;利用所述原始神经网络的目标量化层的输入数据确定对应层的输入数据的量化参数;其中,所述目标量化层的权值和输入数据均采用绝对值最大值不失真原则;
量化单元,用于根据所述权值的量化参数和所述输入数据的量化参数对所述原始神经网络的目标量化层进行量化。
可选情况下,所述计算层包括:卷积层、全连接层、LRN归一化层,反卷积层、Reorg层,Normalize归一化层中的至少一种。
可选情况下,所述量化参数确定单元,具体用于获取所述目标量化层中的每一层的权值的绝对值的最大值;根据所述目标量化层中的每一层的权值的绝对值的最大值确定对应层的权值的第一量化参数和第二量化参数。
可选情况下,所述量化参数确定单元,具体用于获取所述目标量化层中的每一层的输入数据的绝对值的最大值;根据所述目标量化层中的每一层的输入数据的绝对值的最大值确定对应层的输入数据的第一量化参数和第二量化参数。
可选情况下,所述装置还包括:
处理单元,用于对所述原始神经网络的目标量化层中的每一层采用第一量化方法、第二量化方法或第三量化方法进行处理;其中,
所述第一量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数将对应层的权值进行量化,获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数将对应层的输入数据进行量化,获得对应层的输入数据量化结果;
所述第二量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数和第二量化参数获得对应层的权值量化中间参数;
根据所述权值量化中间参数获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数和第二量化参数获得对应层的输入数据量化中间参数;
根据输入数据量化中间参数获得对应层的输入数据量化结果;
所述第三量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数和第二量化参数获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数和第二量化参数获得对应层的输入数据量化结果。
可选情况下,所述装置还包括:
处理单元,用于利用所述目标量化层中的每一层的每个通道的第一权值量化参数和第二权值量化参数获得对应通道的权值量化中间参数;其中,所述目标量化层包括卷积层和/或全连接层;
利用每个通道的权值量化中间参数获得对应通道的权值量化结果,所述目标量化层中的每一层的每个通道的权值量化结果构成对应层的权值量化结果;
利用所述目标量化层中的每一层的第一输入数据量化参数和第二输入数据量化参数获得对应层的输入数据量化中间参数;
利用所述目标量化层中的每一层的输入数据量化中间参数获得对应层的输入数据量化结果。
可选情况下,所述处理单元,还用于对所述原始神经网络的目标量化层中的每一层采用第一量化方法、第二量化方法或第三量化方法进行处理;其中,所述目标量化层还包括所述原始神经网络的计算层中除了卷积层和/或全连接层之外的其他至少一层;
所述第一量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数将对应层的权值进行量化,获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数将对应层的输入数据进行量化,获得对应层的输入数据量化结果;
所述第二量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数和第二量化参数获得对应层的权值量化中间参数;
根据所述权值量化中间参数获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数和第二量化参数获得对应层的输入数据量化中间参数;
根据输入数据量化中间参数获得对应层的输入数据量化结果;
所述第三量化方法包括:
利用所述目标量化层中的每一层的权值的第一量化参数和第二量化参数获得对应层的权值量化结果;
利用所述目标量化层中的每一层的输入数据的第一量化参数和第二量化参数获得对应层的输入数据量化结果。
第十七方面,本申请实施例提供了一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现第十五方面所述的方法。
第十八方面,本申请实施例提供了一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行第十五方面所述的方法。
第十九方面,本申请实施例提供了一种计算机程序产品,其特征在于,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行第十五方面所述的方法。
可以看出,在如第一方面至第五方面的申请实施例中,获取网络离线模型的运行单元信息,在构建该网络离线模型时,定义各个子网络的运行参数,在运行参数中标记各个子网络的运行单元类型,从而对网络离线模型的子网络进行分类,以便在运行该网络离线模型时,将各个子网络分配给各自对应的处理器运行,实现兼容运行该网络离线模型,丰富人工智能处理装置可运行的网络离线模型的类型。
在如第六方面至第十二方面的申请实施例中,在人工智能芯片中部署了上层语言接口和深度学习框架,深度学习框架中包括容器,容器与上层语言接口连接,首先上层语言接口将第一参数写入容器中,然后深度学习框架从容器中获取第一参数,结合第一参数和深度学习框架的模块参数获得第二参数,并将第二参数传递到容器中,最后上层语言接口从容器中获取第二参数并提供给用户。因为第一参数用于描述深度学习框架的并行程度,第二参数用于监测并行运算的性能,因此这个过程通过向容器中写入第一参数,提升了深度学习框架中的并行运算效果,通过统计并获取第二参数,提升了并行运算性能的可监测性。
在如第十三方面至第十七方面的申请实施例中,将原神经网络的目标量化层执行量化得到权值的量化参数以及输入数据的量化参数,然后依据该量化参数完成目标量化层的量化。这样量化后的目标量化层在执行运算时,由于该输入数据以及权值均为量化后的数据,因此其减少了权值的存储空间以及输入数据的存储空间,并且较少比特位的运算量也相应减少,因此其具有减少运算量,提高运算速度,节省存储空间、降低功耗、节省成本的优点。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域 普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种网络离线模型的处理方法;
图2为本申请实施例提供的另一种网络离线模型的处理方法;
图3为本申请实施例提供的一种网络离线模型的人工智能装置的结构示意图;
图4为本申请实施例提供的一种网络离线模型的人工智能装置的功能单元组成框图;
图5A为本申请实施例提供的一种人工智能芯片;
图5B是申请实施例公开的一种参数处理方法流程示意图;
图6是本申请实施例提供的另一种参数处理方法流程示意图;
图7是本申请实施例提供的另一种参数处理方法流程示意图;
图8为本申请实施例提供的一种参数处理装置;
图9A是本申请实施例提供的一种组合处理装置的示意图;
图9B是本申请实施例提供的另一种组合处理装置的结构图;
图10是本申请实施例提供的一种板卡的结构示意图;
图11为一种神经网络构架的结构示意图;
图12是本申请实施例提供的一种神经网络量化方法的流程示意图;
图13A是本申请提供的卷积层的权值结构示意图;
图13B是本申请提供的卷积层的权值的一个通道的数据结构示意图;
图14是本申请一个实施例提供的量化运算装置的流程示意图;
图15是本申请实施例提供的一种电子设备的结构图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、***、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结果或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
本申请中的人工智能处理装置可以包括智能手机(如Android手机、iOS手机、Windows Phone手机等)、平板电脑、掌上电脑、笔记本电脑、移动互联网设备MID(Mobile Internet Devices,简称:MID)或穿戴式设备等,上述电子设备仅是举例,而非穷举,包含但不限于上述人工智能处理装置。
首先,参阅图1,图1为本申请实施例提供的一种网络离线模型的处理方法的流程示意图,该方法应用于网络离线模型,该网络离线模型包括通用处理器和人工智能处理器,该方法包括如步骤S101~S102中所示的内容:
步骤S101、获取网络离线模型中各子网络的运行单元信息,所述运行单元信息包括子网络与运行单元类型之间的对应关系,所述运行单元类型包括通用处理单元类型或人工智能处理单元类型。
可选的,在子网络的运行单元类型为人工智能处理单元类型时,所述运行单元信息还包括该子网络的入口函数信息,该入口函数信息用于在人工智能处理单元运行该子网络时,通过该入口函数调取与该子网络对应的离线指令,通过预先编译好部分子网络的离线指令,加快了网络离线模型的运行速度。
其中,通用处理器中可以包括中央处理单元CPU(Central Processing Unit,简称:CPU)、图形处理单元GPU(Graphics Processing Unit,简称:GPU)和/或图像处理单元IPU(Image Processing Unit,简称:IPU)中的一种或几种的组合,该人工智能处理器包括机器学习处理器单元MLU(Machine Learning Processing Unit,简称:MLU),其中,人工智能处理器可由多个MLU集成,组成为一个具有多核的人工智能处理器。
可选的,在获取网络离线模型中各子网络的运行单元信息之前,首先确定该网络离线模型的多个网络层是否可以融合,如是,将可以融合的多个网络层融合为一个子网络,将不能融合的网络层作为一个单独的子网络,在对该网络离线模型执行融合操作后,得到与该网络离线模型对应的若干个子网络。故每个子网络可以是一个单独的网络层,也可由几个网络层融合得到一个子网络,举例来说,如该网络离线模型中包含卷积层Convolution、归一化层BatchNorm以及缩放层Scale时,可以将该网络离线模型中的卷积层Convolution,归一化层BatchNorm以及缩放层Scale融合,得到一个子网络。
可选的,在将该网络离线模型融合后,获取该网络离线模型中各子网络的运行单元信息,以确定每个子网络的运行单元类型,以在构建该网络离线模型时,在与网络的运行单元类型对应的字段中定义各个子网络的运行单元类型。
步骤S102、根据所述运行单元信息,在构建的所述网络离线模型中定义子网络运行参数,得到构建后的网络离线模型,所述子网络运行参数用于表示各子网络的运行单元类型。
可选的,该人工智能装置利用预先安装的机器学习框架构建网络离线模型,下面以卷积神经网络框架caffe(Convolutional Architecture for Fast Feature Embedding,简称:caffe)为例对构建网络离线模型做具体说明。
对于caffe框架来说,生成离线文件需要两个输入,一个是包含网络信息的prototxt文件,另一个是包含已经训练好的权重和偏置的caffemodel文件。在生成离线文件时,首先caffe先调用底层库接口创建一个离线文件,然后caffe会将输入的prototxt整个网络根据每一层是否可以在人工智能处理器上运行划分为若干个子网络,然后若干子网络可以在人工智能处理器上执行。caffe框架则会调用底层库接口将该子网络编译成能在人工智能处理器上运行的离线指令。接着caffe框架通过调用底层库提供的接口将生成的离线指令保存到预先生成好的离线文件中,同时对于像卷积和全连接等需要权重的层,caffe会先从已经训练好的caffemodel中将权重和偏置数据读出并存放在相应的blob中,其中blob为caffe里面 定义的一种数据结构,用于层与层之间传递数据。这些权重和偏置数据会在caffe调用底层库生成离线指令的时候一同传递给底层库,然后caffe调用底层库的相关接口将离线指令,权重以及偏置数据一起存放到离线文件中。另外,在caffe调用底层库编译子网络生成离线指令的时候,caffe可以指定当运行该子网络时可以在几个核上运行,也就是所谓的指定模型并行度,该子网络可当成一个模型。
离线文件中除了存放离线指令和权重、偏置等数据外,还会存放自定义的单元信息,每个子网络对应一个单元信息。单元信息可以通过protobuf机制生成,并且caffe可以通过调用protobuf提供的相关接口将该单元信息追加到离线文件的后面,这些信息用于后面运行离线文件时使用。
可选的,在本申请的一个实施例中,可以预先定义格式为.SegmentInfoUnit单元信息,其用于保存每个子网络的子网络运行参数。其中,该子网络运行参数包括子网络名称、运行单元类型和子网络参数信息,该子网络参数信息可以用于指示在执行该子网络时对处理器的资源调度。举例来说,子网络参数信息可以包括卷积核信息等,可以用于表示运行该子网络需要调配的人工智能处理单元的资源信息。
可选的,该单元信息还可以保存与各子网络对应的离线指令的索引标识以及计算参数的索引标识,该索引标识便于从离线文件中读取与各子网络对应的离线指令以及计算参数,然后,将该单元信息追加在该离线文件caffemodel中,以便基于该索引标识,通过caffe的底层接口从该离线文件中读取每个子网络的子网络运行参数以及与该子网络对应的离线指令以及计算参数。
其中,该计算参数为与每个子网络运算相关的参数数据,例如,当该子网络为卷积层时,该计算参数为权值和偏置,如该卷积层无偏置时,偏置为零,再如,如该子网络为激活层时,该计算参数为激活函数。
在一可能的示例中,将每个子网络的子网络运行参数保存在与每个子网络对应的数据结构中可以为:基于Protocol Buffers机制,获取预先设置的BP Message,通过Protocol Buffers机制中的编译器将每个子网络的layer(子网络中的层)中的符合该BP Message中的字段编译成二进制文件,将该二进制文件保存在格式为.SegmentInfoUnit的数据结构中。当然,Protocol Buffers机制仅为示例性说明,本申请不对保存子网络运行参数的网络信息做唯一限定。
可以看出,在本申请实施例中,通过获取子网络的运行单元信息,在构建网络离线模型时,定义每个子网络的运行参数,使构建好的离线模型的离线文件中保存有各个子网络的运行单元类型,提供了一种新型保存网络离线模型的方法;而且,基于保存的各个子网络的运行单元类型,可以由不同的运行单元来运行不同的网络层,当模型中有新的层时,通过灵活指定新增层的运行单元,可以使网络离线模型的运行更加灵活,更兼容的应用到各种人工智能装置中。
参阅图2,图2为本申请实施例提供的另一种网络离线模型的处理方法的流程示意图,该方法应用于人工智能装置,该人工智能装置可以包括通用处理器和人工智能处理器,该方法包括如步骤S201~S205中所示的内容:
步骤S201、获取网络离线模型中各子网络的运行单元信息,所述运行单元信息包括子 网络与运行单元类型之间的对应关系,所述运行单元类型包括通用处理单元类型或人工智能处理单元类型。
步骤S202、根据所述运行单元信息,在构建的所述网络离线模型中定义子网络运行参数,得到构建后的网络离线模型,所述子网络运行参数用于表示各子网络的运行单元类型。
步骤S203、根据所述子网络运行参数,确定目标子网络对应的运行单元,所述目标子网络为所述网络离线模型的任一子网络。
步骤S204、在所述目标子网络对应的运行单元运行所述目标子网络,以实现运行所述网络离线模型。
可选的,将所述目标子网络在对应的运行单元上运行的实现过程可以为:通过机器学习框架的接口依次遍历该数据结构读取网络离线模型的网络运行参数,依据该网络运行参数确定执行该目标子网络的运行单元,以及与该目标子网络连接的上一个子网络以及下一个子网络的运行单元,即完成前向推理过程,指示该目标子网络的运行单元从上一个子网络的运行单元处获取输入数据,并将目标子网络的输出结果作为输入数据发送给下一个子网络的运行单元,举例来说,如该目标子网络的网络运行参数中的运行单元类型为人工智能处理单元类型,上一个子网络的运行单元类型为通用处理单元类型,下一个子网络的运行单元类型为通用处理单元类型,则指示人工智能处理单元从通用处理单元获取数据,将获取到的数据作为输入数据,并将得到的输出结果发送给通用处理单元,完成对该网络离线模型的前向推理过程,按照该网络离线模型的运行顺序运行。
可以看出,在本申请实施例中,在人工智能处理装置中设置有通用处理单元和人工智能处理单元,基于每个子网络的运行参数判断出每个子网络的运行单元,然后,由相应的运行单元运行该子网络,从而实现在人工智能处理单元不支持该子网络的运算时,由通用处理单元来运行该子网络的运算,即利用通用处理单元和人工智能处理单元协同工作,能够兼容运行所有类型的网络离线模型,从而提高网络离线模型的应用范围,而且通用处理单元和人工智能处理单元协同工作,将能在人工智能处理单元运行的网络层放到人工智能处理单元上运行,相对于将整个网络离线模型全部放在通用处理单元执行来说,加速了整个离线网络的推理过程,而且,对可以在人工智能处理单元上运行的网络层预先生成离线指令,节省了边执行边生成离线指令所消耗的时间;另外可以由通用处理单元执行网络离线模型的部分或全部运算,降低人工智能处理单元的工作压力。
在一可能的示例中,在根据所述子网络运行参数,确定目标子网络对应的运行单元的实现过程可以为:获取所述网络离线模型的模型并行度;根据人工智能处理单元调度机制、所述模型并行度和所述子网络运行参数,确定所述目标子网络对应的人工智能处理单元。
在上述可能的示例中,在确定所述目标子网络对应的人工智能处理单元时,从该网络离线模型的离线文件中读取与该目标子网络对应的离线指令,解析该离线指令,得到该离线指令中蕴含的模型并行度,依据该模型并行度,得到运行该目标子网络时所需的人工智能处理单元的数量,获取人工智能处理单元的调度机制,依据该调度机制从人工智能处理器中调配与该数量对应的多个人工智能处理单元,将与该数量对应的多个人工智能处理单元指定为运行该目标子网络的人工智能处理单元,将与该子网络对应的离线指令以及计算参数分发给该多个人工智能处理单元,以完成该目标子网络的运算。在本示例中,可预先 设定每个子网络的模型并行度,即指定运行该子网络所需的人工智能处理单元的数量,以便在人工智能处理器上,实现多核人工智能处理单元共同执行与该子网络对应的运算,提高该子网络的运行速度。
在一可能的示例中,当每个人工智能处理单元中有多个处理线程时,即每个人工智能处理单元中包含多个数据传输通道时,所述将所述目标子网络在对应的运行单元上执行,以运行所述网络离线模型的实现过程可以为:获取调用底层库时的接口指令;解析该接口指令,得到该接口指令中包含的通道标识;根据所述通道标识确定所述人工智能处理单元传输数据的通道;通过所述通道将所述目标子网络在所述人工智能处理单元上运行,以运行所述网络离线模型。在本示例中,每个目标人工智能处理单元包含多个数据传输通道,在调用底层库时,通过接口指令指定相应的通道向目标人工智能处理单元传输离线指令以及计算参数,从而加快该人工智能处理单元的读写速度,加速网络离线模型的推理过程。
参阅图3,图3为本申请实施例提供的一种网络离线模型的人工智能装置的结构示意图,如图3所示,该人工智能装置300包括通用处理器和人工智能处理器、存储器、通信接口以及一个或多个程序,其中,上述一个或多个程序不同于上述一个或多个应用程序,且上述一个或多个程序被存储在上述存储器中,并且被配置给上述处理器执行,上述程序包括用于执行以下步骤的指令:
获取网络离线模型中各子网络的运行单元信息,所述运行单元信息包括子网络与运行单元类型之间的对应关系,所述运行单元类型包括通用处理单元类型或人工智能处理单元类型;
根据所述运行单元信息,在构建的所述网络离线模型中定义子网络运行参数,得到构建后的网络离线模型,所述子网络运行参数用于表示各子网络的运行单元类型。
其中,各子网络包括融合后的多个网络层;
其中,子网络运行参数包括子网络名称、运行单元类型信息和子网络参数信息。
在一可能的示例中,上述程序还包括用于执行以下步骤的指令:
执行所述构建后的网络离线模型,具体包括用于执行以下步骤的指令:
根据所述子网络运行参数,确定目标子网络对应的运行单元,所述目标子网络为所述网络离线模型的任一子网络;
在所述目标子网络对应的运行单元运行所述目标子网络,以实现运行所述网络离线模型。
在一可能的示例中,若所述目标子网络对应的运行单元为人工智能处理单元,在根据所述子网络运行参数,确定目标子网络对应的运行单元时,上述程序具体包括用于执行以下步骤的指令:
获取所述网络离线模型的模型并行度;
根据人工智能处理单元调度机制、所述模型并行度和所述子网络运行参数,确定所述目标子网络对应的人工智能处理单元。
在一可能的示例中,若所述目标子网络对应的运行单元为人工智能处理单元,在所述目标子网络对应的运行单元运行所述目标子网络,以实现运行所述网络离线模型时,上述程序具体包括用于执行以下步骤的指令:
在调用底层库接口时,获取从所述底层接口传入的通道标识;
根据所述通道标识确定所述人工智能处理单元传输数据的通道;
通过所述通道将所述目标子网络在所述人工智能处理单元上执行,以运行所述网络离线模型。
参阅图4,图4示出了上述实施例中所涉及的网络离线模型的人工智能装置400的一种可能的功能单元组成框图,人工智能装置400包括:获取模块410、构建模块420;
获取模块410,用于获取网络离线模型中各子网络的运行单元信息,所述运行单元信息包括子网络与运行单元类型之间的对应关系,所述运行单元类型包括通用处理单元类型或人工智能处理单元类型;
构建模块420,用于根据所述运行单元信息,在构建的所述网络离线模型中定义子网络运行参数,得到构建后的网络离线模型,所述子网络运行参数用于表示各子网络的运行单元类型。
其中,各个子网络包括融合后的多个网络层。
其中,所述子网络运行参数包括子网络名称、运行单元类型信息和子网络参数信息。
在一可能的示例中,人工智能装置400还包括:执行模块430;
执行模块430,用于运行所述构建后的网络离线模型,具体用于:
根据所述子网络运行参数,确定目标子网络对应的运行单元,所述目标子网络为所述网络离线模型的任一子网络;
在所述目标子网络对应的运行单元运行所述目标子网络,以实现运行所述网络离线模型。
在一可能的示例中,若所述目标子网络对应的运行单元为人工智能处理单元,在根据所述子网络运行参数,确定目标子网络对应的运行单元方面,执行模块430具体用于:
获取所述网络离线模型的模型并行度;
根据人工智能处理单元调度机制、所述模型并行度和所述子网络运行参数,确定所述目标子网络对应的人工智能处理单元。
在一可能的示例中,若所述目标子网络对应的运行单元为人工智能处理单元,在所述目标子网络对应的运行单元运行所述目标子网络,以实现运行所述网络离线模型方面,执行模块430具体用于:
在调用底层库接口时,获取从所述底层接口传入的通道标识;
根据所述通道标识确定所述人工智能处理单元传输数据的通道;
通过所述通道将所述目标子网络在所述人工智能处理单元上运行,以运行所述网络离线模型。
本申请实施例还提供一种计算机存储介质,其中,该计算机存储介质存储用于存储计算机程序,其中,该计算机程序被处理器执行,以实现如上述方法实施例中记载的任何一种离线模型的处理方法的部分或全部步骤。
本申请实施例还提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如上述方法实施例中记载的任何一种离线模型的处理方法的部分或全部步骤。
另外,随着人工智能行业的发展,越来越多的深度学习框架被大家再开发和使用。而在深度学习框架配套人工智能芯片开发使用过程中,通常需要用户对框架设定一些参数来达到更好的计算效果,或者获得框架中的一些参数来监测框架的运行状态。
目前深度学习框架没有针对人工智能芯片相关的参数设定机制和方式,导致用户无法针对人工智能芯片进行参数设定或芯片运行相关数据的获取。如何对这一现状进行改进成了亟待解决的问题。
有鉴于此,本公开的目的在于提供一种参数处理方法及相关产品,通过新增容器,然后将用于描述深度学习框架并行程度的第一参数写入容器中,再将容器中的第一参数与深度学习框架其他模块结合获得用于监测并行运算性能的第二参数,提升了深度学习框架的计算效果,同时增加了并行运算性能的可监测性。
请参阅图5A,图5A为本申请实施例提供的一种人工智能芯片,如图5A所示,人工智能芯片10包括上层语言接口101和深度学习框架100,上层语言接口用于接入编程语言,深度学习框架中包括容器和其他深度学习框架的模块,容器能够与深度学习框架的模块进行数据交互,深度学习框架的模块包括有graph executor模块、各个算子模块以及engine模块等。可选的,上层语言接口101也可以部署在其他芯片或装置上,其他芯片或装置与人工智能芯片连接,两者之间也能进行信息交互。另外,人工智能芯片10也可以包括底层库模块102,底层库模块包括底层运行时库和驱动模块等。深度学习框架100中还包括载体,用于进行容器与深度学习框架其他模块或者底层库模块之间的数据传递。
请参阅图5B,图5B是申请实施例公开的一种参数处理方法流程示意图,本参数处理方法应用于如图5A所示的人工智能芯片,如图5B所示,本方法具体包括如下步骤:
S111、所述上层语言接口将第一参数写入容器中,其中所述第一参数用于描述所述深度学习框架的并行程度。
深度学习框架是用于进行深度学习项目的代码框架,目前流行的深度学习框架包括Tensorflow、Caffe、Theano、MXNet、Torch和PyTorch等。接口是***中两个独立的部件进行信息交换的共享边界。上层语言接口与深度学习框架是两个独立部件,因此它们之间存在接口,用于进行信息交互。上层语言例如Python,R语言等,都能够用于深度学习中,常规情况下,上层语言接口与深度学习框架直接连接。但是,这个接口中缺少相关的参数设定机制,使得用户无法对人工智能芯片进行参数设定和参数获取,因此,在上层语言接口的下层新增容器,用于进行参数设定和相关数据的获取。对于在容器中进行参数设定和参数获取的参数数据字段,可以在容器中新增,也可以在其他模块新增,然后指定参数设定和参数获取的位置为容器位置。
容器是用于存放数据的类或结构体,属于深度学习框架中的一个模块。深度学习框架中的容器可以是深度学习框架中的原生类或结构体,然后在该类或结构体中新增用于进行参数设定和参数获取的字段,例如graphexecutor类。或者,深度学习框架中的容器也可以是用户为人工智能芯片中的参数处理方法独立创建的类或结构,例如mludevice设备类。
可选的,该方法还包括:所述容器中包括参数数据字段,所述参数数据字段用于指向第一参数和第二参数。
具体地,在容器中创建参数数据字段之前,整个人工智能芯片中没有关于第一参数和第二参数的数据字段,因此也就无法进行第一参数的设定和第二参数的获取。在容器中创建涉及第一参数和第二参数的参数数据字段,用于指示第一参数和第二参数的获取方式、与其他模块或接口的交互方式,以及数据存储位置等,也便于对第一参数和第二参数进行管理。另外,也可以在别的位置创建参数数据字段,但是通过容器进行数据存储。
可选的,第一参数包括数据并行度和模型并行度。
可选的,该实施例中的深度学习框架为MXNet深度学习框架。
数据并行(data parallelism或data parallel processing,DP)是指不同内核或处理单元对数据进行并行处理,数据并行度是指对数据进行并行处理时,并行执行的最大数目;模型并行(model parallelism或model parallel processing,MP)是指一个算子或模型在多个内核上进行并行处理,模型并行度是指对模型或算子进行并行处理时,并行执行的最大数目。当MXNet深度学习框架在人工智能芯片上运行时,运算量庞大,为了减少运算时间,提高运算效率,需要采用DP或MP,或者同时采用两种并行运算。而为了达到更好的运算效果,需要对数据并行度和模型并行度进行设置,一方面要使设置的并行度参数能够与人工智能芯片的硬件基础相匹配,另一方面,当输入数据的规模、稀疏度或者其他特征不同时,也需要设置不同的并行度参数。将设定的数据并行度和/或模型并行度通过编程语言写入,然后通过上层语言接口注入容器中,即完成第一参数的设定。
MXNet是一个深度学习框架,支持C++,Python,R,Scala,Julia,Matlab以及JavaScript等语言,支持命令和符号编程,可以运行在包括人工智能芯片的任何硬件上,是目前最优秀的深度学习框架之一。因此采用MXNet深度学习框架能够很好地与本申请实施例的方法相结合,完成第一参数的设置和第二参数的获取。
S112、所述深度学习框架从所述容器中获取所述第一参数,将所述第一参数与所述深度学习框架的模块数据进行交互,获得第二参数,并将所述第二参数传递到所述容器中,所述第二参数用于监测所述第一参数描述的深度学习框架的并行运算的性能。
第一参数设定完成并注入容器中后,深度学习框架的模块从容器中获取第一参数,深度学习框架的模块包括graph executor模块、各个算子模块以及engine模块等。例如各个算子模块如果需要进行并行运算,则需要获取第一参数,然后根据第一参数结合算子模块中的其他参数,例如数据尺寸等,即可获得第二参数,第二参数是用于监测并行运算性能的参数,获得的第二参数需要传回容器中。
可选的,第二参数包括通道消失时间和通道消失时间总和。
可选的,将第一参数与深度学习框架的模块数据进行交互,获得第二参数,包括:将数据并行度传递到深度学习框架的模块进行数据交互,获得数据并行度对应的通道消失时间(CET)和通道消失时间总和(CETS);将模型并行度传递到深度学习框架的模块进行数据交互,获得数据并行度对应的CET和CETS,其中CETS和CET用于统计算子的计算时间。
具体地,在深度学习框架采用DP或MP时,都有多个并行通道,通道消失时间(Channel Elapsed Time,CET)和通道消失时间总和(Channel Elapsed Time Sum,CETS),都是用来描述多个并行通道进行并行运算的性能参数,用于统计算子的计算时间。将根据第一参数和深度学习框架的模块获得的单个模块或者整个深度学习框架的第二参数传递到容器中, 即完成第二参数的获取。
S113、所述上层语言接口从所述容器中获取第二参数。
上层语言接口与容器能够从容器中获取第二参数并进行暴露,那么第二参数对于用户来说是可见的,用户可以通过第二参数监测深度学习框架的运算性能,进而可以通过修改第一参数或其他参数对第二参数进行调整或改进,提升深度学习框架的运算效果。
可选的,深度学习框架还包括载体,该方法还包括:容器与深度学习框架的模块通过载体进行数据传递交互。
载体是深度学习框架中用来进行数据传递交互的类或结构体,容器与深度学习的其他模块没有直接关联,即可通过载体进行数据传递。例如MXNet框架中的载体可以是算子的上下文类OpContext,容器在注入第一参数后,可以将第一参数赋值给载体,载体再将第一参数传递给深度学习框架的模块。同样的,第二参数也可以由载体从深度学习框架的模块传递到容器。
可选的,人工智能芯片还包括底层库模块,该方法还包括:通过所述载体进行所述容器与所述底层库模块之间的参数传递交互,所述参数包括第一参数和第二参数。
具体地,底层库模块包括底层运行时库和驱动模块等,这些底层库的参数也可能影响到深度学习框架的并行性能或其他性能,因此容器也可以通过载体与底层库模块进行数据交互,以便获取并行运算性能参数或其他性能参数。
可见,在本申请实施例中,人工智能芯片中部署了上层语言接口和深度学习框架,深度学习框架中包括容器,容器与上层语言接口连接,首先上层语言接口将第一参数写入容器中,然后深度学习框架从容器中获取第一参数,结合第一参数和深度学习框架的模块参数获得第二参数,并将第二参数传递到容器中,最后上层语言接口从容器中获取第二参数并提供给用户。因为第一参数用于描述深度学习框架的并行程度,第二参数用于监测并行运算的性能,因此这个过程通过向容器中写入第一参数,提升了深度学习框架中的并行运算效果,通过统计并获取第二参数,提升了并行运算性能的可监测性。
与上述一致的,请参阅图6,图6是本申请实施例提供的另一种参数处理方法流程示意图,如图6所示,所述参数处理方法包括:
S211、在容器中创建人工智能芯片相关的参数数据字段,所述参数数据字段涉及第一参数和第二参数;
S212、上层语言接口将所述第一参数注入所述容器中,其中所述第一参数用于描述所述深度学习框架的并行程度;
S213、所述深度学习框架还包括载体,所述深度学习框架从所述容器中获取所述第一参数,通过所述载体将所述第一参数与深度学习框架的模块数据进行交互,获得第二参数;
S214、所述深度学习框架通过所述载体将所述第二参数传递到所述容器中,所述第二参数用于监测并行运算的性能;
S215、人工智能芯片还包括底层库模块,所述容器与所述底层库模块通过所述载体进行参数的传递交互,所述参数包括第一参数和第二参数。
其中,上述S211-S215的具体描述可以参照S111-S113所描述的参数处理方法的相应描述,在此不再赘述。
可见本申请实施例中,通过在深度学习框架中新增容器,然后通过载体进行深度学习框架和容器之间的参数交互,以及底层库模块与容器之间的参数交互,因为第一参数用于描述深度学习框架的并行程度,第二参数用于监测并行运算的性能,因此这个过程通过向容器中写入第一参数,提升了深度学习框架中的并行运算效果,通过统计并获取第二参数,提升了并行运算性能的可监测性。
与上述一致的,请参阅图7,图7是本申请实施例提供的另一种参数处理方法流程示意图,如图7所示,所述参数处理方法包括:
S311、设定数据并行度,所述数据并行度用于描述不同内核处理数据的不同部分时,并行执行的最大数目;
S312、设定模型并行度,所述模型并行度用于描述一个算子或模型在多个内核上进行运算时,并行执行的最大数目;
S313、通过所述上层语言接口将所述数据并行度和/或所述模型并行度注入所述容器中;
S314、将所述数据并行度传递到深度学习框架的模块进行数据交互,获得所述数据并行度对应的CET和CETS,所述CETS和所述CET用于统计算子的计算时间;
S315、将所述模型并行度传递到深度学习框架的模块进行数据交互,获得所述数据并行度对应的CET和CETS;
S316、将所述数据并行度和/或所述模型并行度对应的CETS和CET传递到所述容器中;
S317、所述上层语言接口从所述容器中获取所述数据并行度和/或所述模型并行度对应的CETS和CET。
其中,上述步骤S311-步骤S317的具体描述可以参照S111-S113所描述的参数处理方法的相应描述,在此不再赘述。
可见本申请实施例中,通过在深度学习框架中新增容器,然后通过载体进行深度学习框架和容器之间的参数交互,以及底层库模块与容器之间的参数交互,通过设置数据并行度和/或所述模型并行度,提升了深度学习框架中的并行运算效果,通过统计并获取第二参数,通过获取CETS和CET提升了并行运算性能的可监测性。
请参阅图8,图8为本申请实施例提供的一种参数处理装置,应用于如图5A所示的人工智能芯片,如图8所示,本参数处理装置410包括:
写入模块411,用于通过所述上层语言接口将第一参数写入容器中,其中所述第一参数用于描述所述深度学习框架的并行程度;
计算模块412,用于通过所述深度学习框架从所述容器中获取所述第一参数,将所述第一参数与所述深度学习框架的模块的数据进行交互,获得第二参数,并将所述第二参数传递到所述容器中,所述第二参数用于监测并行运算的性能;
获取模块413,用于通过所述上层语言接口从所述容器中获取第二参数。
其中,上述参数处理装置的具体描述可以参照S111-S113所描述的参数处理方法的相应描述,在此不再赘述。
可见,在本申请实施例的参数处理装置中,首先上层语言接口将第一参数写入容器中,然后深度学习框架从容器中获取第一参数,结合第一参数和深度学习框架的模块参数获得第二参数,并将第二参数传递到容器中,最后上层语言接口从容器中获取第二参数并提供给用户。因为第一参数用于描述深度学习框架的并行程度,第二参数用于监测并行运算的性能,因此这个过程通过向容器中写入第一参数,提升了深度学习框架中的并行运算效果,通过统计并获取第二参数,提升了并行运算性能的可监测性。
在一种可选的实施例中,所述写入模块还用于:
在所述容器中包括参数数据字段,所述参数数据字段用于指向第一参数和第二参数。
在一种可选的实施例中,所述第一参数包括数据并行度和模型并行度。
在一种可选的实施例中,所述第二参数为通道消失时间和通道消失时间总和。
在一种可选的实施例中,所述计算模块具体用于:
将所述数据并行度传递到深度学习框架的模块进行数据交互,获得所述数据并行度对应的通道消失时间(CET)和通道消失时间总和(CETS),所述CETS和所述CET用于统计算子的计算时间;
将所述模型并行度传递到深度学习框架的模块进行数据交互,获得所述数据并行度对应的CET和CETS。
在一种可选的实施例中,所述深度学习框架为MXNet深度学习框架。
在一种可选的实施例中,所述深度学习框架还包括载体,所述计算模块还用于:
通过所述载体进行所述容器与所述深度学习框架的模块之间的参数传递交互,所述参数包括第一参数和第二参数。
在一种可选的实施例中,所述人工智能芯片还包括底层库模块,所述计算模块还用于:
通过所述载体进行所述容器与所述底层库模块之间的参数传递交互,所述参数包括第一参数和第二参数。
在一种可选的实施例中,所述容器包括所述深度学习框架中的原生类或结构体,或者针对所述人工智能芯片在所述深度学习框架中独立创建的类或结构体。
本申请还揭露了一个组合处理装置,其包括上述的参数处理装置,通用互联接口,和其他处理装置。参数处理装置与其他处理装置进行交互,共同完成用户指定的操作。图9A为组合处理装置的示意图。
其他处理装置,包括中央处理器CPU、图形处理器GPU、神经网络处理器等通用/专用处理器中的一种或以上的处理器类型。其他处理装置所包括的处理器数量不做限制。其他处理装置作为参数处理装置与外部数据进行交互控制的接口,实现例如数据搬运等功能,完成对本参数处理装置的开启、停止等基本控制;其他处理装置也可以和参数处理装置协作共同完成运算任务。
通用互联接口,用于在所述参数处理装置与其他处理装置间传输数据和控制指令。该参数处理装置从其他处理装置中获取所需的输入数据,写入参数处理装置片上的存储装置;可以从其他处理装置中获取控制指令,写入参数处理装置片上的控制缓存;也可以读取参数处理装置的存储模块中的数据并传输给其他处理装置。
可选的,如图9B所示的另一种组合处理装置的结构图,还可以包括存储装置,存储装置分别与所述参数处理装置和所述其他处理装置连接。存储装置用于保存所述参数处理装置和所述其他处理装置的数据,尤其适用于在本参数处理装置或其他处理装置的内部存储中无法全部保存的运算数据。
该组合处理装置可以作为手机、机器人、无人机、视频监控设备等设备的SOC片上***,有效降低控制部分的核心面积,提高处理速度,降低整体功耗。此情况时,该组合处理装置的通用互联接口与设备的某些部件相连接。某些部件譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口。
在一些实施例里,还申请了一种芯片,其包括了上述参数处理装置。
在一些实施例里,申请了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,申请了一种板卡,其包括了上述芯片封装结构。参阅图10,图10提供了一种板卡,上述板卡除了包括上述芯片以外,还可以包括其他的配套部件,该配套部件包括但不限于:存储器件710、接收装置720和控制器件730;
所述存储器件710与所述芯片封装结构内的芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元711。每一组所述存储单元与所述芯片通过总线连接。可以理解,每一组所述存储单元可以是DDR SDRAM(英文:Double Data Rate SDRAM,双倍速率同步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中,所述存储装置可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒(芯片)。在一个实施例中,所述芯片内部可以包括4个72位DDR4控制器,上述72位DDR4控制器中64bit用于传输数据,8bit用于ECC校验。可以理解,当每一组所述存储单元中采用DDR4-3200颗粒时,数据传输的理论带宽可达到25600MB/s。
在一个实施例中,每一组所述存储单元包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于对每个所述存储单元的数据传输与数据存储的控制。
所述接口装置与所述芯片封装结构内的芯片电连接。所述接口装置用于实现所述芯片与外部设备(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以为标准PCIE接口。比如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。优选的,当采用PCIE 3.0X 16接口传输时,理论带宽可达到16000MB/s。在另一个实施例中,所述接口装置还可以是其他的接口,本申请并不限制上述其他的接口的具体表现形式,所述接口单元能够实现转接功能即可。另外,所述芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
所述控制器件与所述芯片电连接。所述控制器件用于对所述芯片的状态进行监控。具体的,所述芯片与所述控制器件可以通过SPI接口电连接。所述控制器件可以包括单片机(Micro Controller Unit,MCU)。如所述芯片可以包括多个处理芯片、多个处理核或多个处理电路,可以带动多个负载。因此,所述芯片可以处于多负载和轻负载等不同的工作状态。通过所述控制装置可以实现对所述芯片中多个处理芯片、多个处理核或多个处理电路的工 作状态的调控。
在一些实施例里,申请了一种电子设备,其包括了上述板卡。
电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本申请所必须的。
另外,神经网络在***辨识、模式识别、智能控制等领域有着广泛而吸引人的前景,特别在智能控制中,人们对神经网络的自学习功能尤其感兴趣,并且把神经网络这一重要特点看作是解决自动控制中控制器适应能力这个难题的关键钥匙之一。
现有的神经网络构架均是基于多比特的构架,例如目前常用的32Bit构架,现有的神经网络构架的数据占用的比特位较多,需要较高的存储空间以及处理带宽,提高了成本。
有鉴于此,本申请实施例提供了一种神经网络量化方法及相关产品,可降低神经网络构架的比特位数,降低存储空间以及处理带宽,降低成本。
参阅图11,图11提供了一种神经网络构架示意图,如图11所示,神经网络构架可以包括多层结构,该多层结构如图11所示,可以包括:输入层、卷积层1、批规范化(batchnorm)层、卷积层2、中间层(依据不同功能的神经网络构架具有不同的中间层,该中间层可以为至少一层)、卷积层n、全连接层1、激活(例如激活函数:softmax)层。对于神经网络构架,对于计算量较大的层可以称为计算层,例如卷积层、全连接层等等,当然在实际应用中,上述计算层还可以包含其他类型的层,另外,本申请提供的图11中的神经网络构架仅仅是为了举例说明,本申请中的神经网络并不局限如图11所示的构架。
Referring to Fig. 12, Fig. 12 provides a neural network quantization method. The method may be implemented under the neural network architecture shown in Fig. 11; in practical applications, it may also be implemented under other neural network architectures, and the method shown in Fig. 12 does not restrict the structure of the neural network architecture. The method shown in Fig. 12 may be executed by a neural network chip; in practical applications, it may also be implemented by a general-purpose chip or by an electronic device containing a chip, where the general-purpose chip is, for example, a central processing unit (CPU) or a graphics processing unit (GPU). As shown in Fig. 12, the method includes the following steps:
Step S221: obtaining the weights and input data of a target quantization layer of an original neural network, where the target quantization layer is at least one of the computation layers of the original neural network;
The original neural network in step S221 may be a known neural network, for example a trained neural network model, and the neural network model contains the input data of the input layer.
The above at least one layer may specifically include one layer or multiple layers.
Optionally, the computation layers may include at least one of: a convolutional layer, a fully connected layer, an LRN normalization layer, a deconvolution layer, a Reorg layer, and a Normalize layer. In practical applications, the computation layer may also be another layer, and the present application does not limit the specific form of the computation layer.
Step S222: determining the quantization parameters of the weights of the corresponding layer by using the weights of the target quantization layer of the original neural network, and determining the quantization parameters of the input data of the corresponding layer by using the input data of the target quantization layer of the original neural network;
When the quantization parameters are determined in step S222, the principle that the maximum absolute value is not distorted is adopted; that is, both the weights and the input data of the target quantization layer use the maximum-absolute-value no-distortion principle.
Step S223: quantizing the target quantization layer of the original neural network according to the quantization parameters of the weights and the quantization parameters of the input data.
The implementation of step S223 may specifically include: storing the quantization parameters of the weights and the quantization parameters of the input data in an ini configuration file of the target quantization layer; if the target quantization layer is the first layer of the neural network, the ini configuration file may further include the mean and the variance.
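As a minimal sketch of what the per-layer ini configuration file of step S223 might contain, the snippet below writes such a file with Python's standard configparser; the section and key names (weight_position, weight_scale, input_position, input_scale, mean, var) are illustrative assumptions rather than a format mandated by this application.

```python
# Illustrative sketch: write the quantization parameters of one target
# quantization layer into an ini configuration file, as described in step S223.
import configparser

config = configparser.ConfigParser()
config["conv1"] = {
    "weight_position": "-3",        # first quantization parameter of the weights
    "weight_scale": "1.02",         # second quantization parameter of the weights
    "input_position": "2",          # first quantization parameter of the input data
    "input_scale": "0.98",          # second quantization parameter of the input data
    "mean": "104.0,117.0,123.0",    # only stored when this is the first layer
    "var": "58.4,57.1,57.4",
}
with open("conv1_quant.ini", "w") as f:
    config.write(f)
```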
In the technical solution provided by the present application, quantization is performed on the target quantization layer of the original neural network to obtain the quantization parameters of the weights and the quantization parameters of the input data, and the quantization of the target quantization layer is then completed according to these quantization parameters. When the quantized target quantization layer performs operations, both the input data and the weights are quantized data, which reduces the storage space of the weights and of the input data; operations on data with fewer bits are correspondingly cheaper, so the solution has the advantages of reducing the amount of computation, increasing the computation speed, and lowering the power consumption.
Optionally, determining the quantization parameters of the weights of the corresponding layer by using the weights of the target quantization layer of the original neural network may specifically include:
obtaining the maximum absolute value of the weights of each layer in the target quantization layer, and determining the first quantization parameter and the second quantization parameter of the weights of the corresponding layer according to the maximum absolute value of the weights of each layer in the target quantization layer.
The maximum absolute value of the weights may specifically be the value with the largest absolute value among all elements of the weights. For example, if the weights contain five elements whose values are α1, α2, α3, α4, and α5, the maximum absolute value of the weights is the maximum of |α1|, |α2|, |α3|, |α4|, and |α5|.
Optionally, determining the quantization parameters of the input data of the corresponding layer by using the input data of the target quantization layer of the original neural network may specifically include:
obtaining the maximum absolute value of the input data of each layer in the target quantization layer;
determining the first quantization parameter and the second quantization parameter of the input data of the corresponding layer according to the maximum absolute value of the input data of each layer in the target quantization layer.
The maximum absolute value of the input data may specifically be the largest absolute value among all elements of the input data.
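A minimal sketch of deriving the two quantization parameters from the maximum absolute value follows, assuming the first parameter is the position and the second is the scale as defined by the formulas given further below (scale = 127 × 2^position / abs_max); the rounding used for position is an assumption reconstructed from those definitions, and the helper name is illustrative.

```python
# Derive (position, scale) from the maximum absolute value of all elements.
import math

def quant_params_from_abs_max(values):
    abs_max = max(abs(v) for v in values)
    # Smallest position such that abs_max fits into the 7 integer bits of fix8,
    # i.e. abs_max <= 127 * 2**position.
    position = math.ceil(math.log2(abs_max / 127))
    scale = 127 * 2 ** position / abs_max
    return position, scale

weights = [0.8, -2.3, 1.1, -0.05, 3.7]      # example weight elements, abs_max = 3.7
print(quant_params_from_abs_max(weights))
```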
Optionally, the method may further include:
processing each layer in the target quantization layer of the original neural network with a first quantization method, a second quantization method, or a third quantization method. Specifically, this may include: processing the weights of each layer in the target quantization layer with the first, second, or third quantization method to obtain a weight quantization result; it may also include: processing the input data of each layer in the target quantization layer with the first, second, or third quantization method to obtain an input data quantization result.
Here,
the first quantization method may include: quantizing the weights of the corresponding layer by using the first quantization parameter of the weights of each layer in the target quantization layer to obtain the weight quantization result of the corresponding layer; and quantizing the input data of the corresponding layer by using the first quantization parameter of the input data of each layer in the target quantization layer to obtain the input data quantization result of the corresponding layer.
The first quantization method may specifically be:
fp32 data = fix8 data × 2^position
where the fp32 data may be one element value of the weights or of the input data, the fix8 data may be the corresponding quantized value of that element in the weight quantization result or the input data quantization result, and position may be the first quantization parameter. The expression for position is:
position = ⌈log₂(abs_max / 127)⌉
where abs_max is the maximum absolute value of the weights (or of the input data). Since fix8 data is 8-bit data with one sign bit, seven integer bits, and zero fractional bits, the largest representable integer value is 2^7 − 1 = 127, which is the value used when computing position.
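A minimal sketch of the first quantization method follows, assuming round-to-nearest and clamping to the signed 8-bit range; the helper names are illustrative and not part of the original method.

```python
# First quantization method: fix8 = round(fp32 / 2**position),
# so that fp32 data ≈ fix8 data * 2**position (the formula above).
def quantize_first_method(fp32_values, position):
    step = 2.0 ** position
    return [max(-128, min(127, round(v / step))) for v in fp32_values]

def dequantize_first_method(fix8_values, position):
    return [q * 2.0 ** position for q in fix8_values]

fix8 = quantize_first_method([0.8, -2.3, 3.7], position=-5)   # position from abs_max = 3.7
print(fix8, dequantize_first_method(fix8, position=-5))
```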
The second quantization method may include:
obtaining the weight quantization intermediate parameter of the corresponding layer by using the first quantization parameter and the second quantization parameter of the weights of each layer in the target quantization layer; obtaining the weight quantization result of the corresponding layer according to the weight quantization intermediate parameter; obtaining the input data quantization intermediate parameter of the corresponding layer by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer; and obtaining the input data quantization result of the corresponding layer according to the input data quantization intermediate parameter.
The second quantization method may specifically be:
fp32 data = fix8 data / new_scale
where new_scale = 2^(−position) × scale, and scale = 127 × 2^position / abs_max.
new_scale may be the quantization intermediate parameter and scale may be the second quantization parameter; when the fp32 data is one element value of the weights, new_scale is the weight quantization intermediate parameter, and when the fp32 data is one element value of the input data, new_scale is the input data quantization intermediate parameter.
Optionally, the third quantization method includes:
obtaining the weight quantization result of the corresponding layer by using the first quantization parameter and the second quantization parameter of the weights of each layer in the target quantization layer; and obtaining the input data quantization result of the corresponding layer by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
The third quantization method may specifically be:
fp32 data = (fix8 data × 2^position) / scale.
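A minimal sketch covering the second and third quantization methods above; as before, the rounding and clamping choices are assumptions, and the helper names are illustrative.

```python
# Second and third quantization methods, using position and scale as defined above.
def second_method(fp32_values, position, scale):
    """fix8 = round(fp32 * new_scale), where new_scale = 2**-position * scale,
    so that fp32 data ≈ fix8 data / new_scale."""
    new_scale = 2.0 ** (-position) * scale
    fix8 = [max(-128, min(127, round(v * new_scale))) for v in fp32_values]
    recovered = [q / new_scale for q in fix8]
    return fix8, recovered

def third_method(fp32_values, position, scale):
    """fix8 = round(fp32 * scale / 2**position),
    so that fp32 data ≈ (fix8 data * 2**position) / scale."""
    fix8 = [max(-128, min(127, round(v * scale / 2.0 ** position))) for v in fp32_values]
    recovered = [q * 2.0 ** position / scale for q in fix8]
    return fix8, recovered

# With position and scale derived from abs_max as in the earlier sketch,
# both methods map abs_max onto the full fix8 range of ±127.
position, scale = -5, 127 * 2 ** -5 / 3.7
print(second_method([0.8, -2.3, 3.7], position, scale))
print(third_method([0.8, -2.3, 3.7], position, scale))
```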
In practical applications, the chip may select among the first, second, and third quantization methods according to the actual situation; that is, within the same layer the input data may be quantized with the first quantization method while the weights are quantized with the second or the third quantization method. Other combinations of the three quantization methods may also be used in practice, and the present application does not limit which specific method is used for quantizing the input data and the weights.
Optionally, the method may further include:
obtaining the weight quantization intermediate parameter of the corresponding channel by using the first weight quantization parameter and the second weight quantization parameter of each channel of each layer in the target quantization layer, where the target quantization layer includes a convolutional layer and/or a fully connected layer; obtaining the weight quantization result of the corresponding channel by using the weight quantization intermediate parameter of each channel, where the weight quantization results of all channels of each layer in the target quantization layer constitute the weight quantization result of the corresponding layer;
obtaining the input data quantization intermediate parameter of the corresponding layer by using the first input data quantization parameter and the second input data quantization parameter of each layer in the target quantization layer; and obtaining the input data quantization result of the corresponding layer by using the input data quantization intermediate parameter of each layer in the target quantization layer.
Each channel of each layer in the target quantization layer may contain one data block of the weights or of the input data of that layer. Taking a convolutional layer as an example of the target quantization layer, the weights may be the four-dimensional data M, KH, KW, C shown in Fig. 13A, and each channel of the convolutional layer may contain one three-dimensional data block KH, KW, C shown in Fig. 13A (as shown in Fig. 13B). Each data block corresponds to one position and one scale, so if the convolutional layer has n channels there are n data blocks, and the weights of the convolutional layer correspond to n position values and n scale values. According to new_scale = 2^(−position) × scale, n new_scale values can be obtained as the weight quantization intermediate parameters; the compiler then performs a conversion using the n new_scale values to obtain n position′ values and n scale′ values, selects the maximum of the n position′ values, compensates the n scale′ values accordingly, and finally obtains the weight quantization result of each data block by the following formula:
fp32 data = (fix8 data × 2^position′_max) / scale″.
In the above formula, position′_max is the maximum value selected from the n position′ values, and scale″ is the result of compensating scale′.
The weight quantization results corresponding to the data blocks are combined into the weight quantization result of the current convolutional layer. For the current convolutional layer, no matter how many channels or data blocks it has, there is one and only one set of input data, which therefore corresponds to one position and one scale. According to new_scale = 2^(−position) × scale, one new_scale is obtained as the input data quantization intermediate parameter, and the input data quantization result is finally obtained according to fp32 data = fix8 data / new_scale.
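The following is one possible reading of the per-channel compensation described above, sketched under the assumption that each channel's new_scale is re-expressed as a position′/scale′ pair in the same way position and scale are defined, all channels are aligned to the largest position′, and each scale′ is compensated so that the dequantization relation still holds; this is an illustrative sketch, not the compiler's actual implementation.

```python
# Illustrative per-channel compensation: one new_scale per data block (channel).
import math

def split_new_scale(new_scale):
    """Re-derive (position', scale') with scale' in [1, 2),
    so that new_scale = 2**(-position') * scale'."""
    position_p = -math.floor(math.log2(new_scale))
    scale_p = new_scale * 2.0 ** position_p
    return position_p, scale_p

def compensate_channels(new_scales):
    """Align every channel to the largest position' and compensate its scale',
    keeping 2**(-position_max) * scale'' == new_scale for each channel."""
    split = [split_new_scale(ns) for ns in new_scales]
    position_max = max(p for p, _ in split)
    return position_max, [s * 2.0 ** (position_max - p) for p, s in split]

# One new_scale per channel of a convolutional layer's weights.
position_max, scales_pp = compensate_channels([34.3, 17.8, 61.0])
print(position_max, scales_pp)
```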
When the target quantization layer is a fully connected layer and/or a convolutional layer, the other layers of the neural network may still be quantized using the first, second, or third quantization method described above.
Referring to Fig. 14, Fig. 14 provides a neural network quantization apparatus, which includes:
a data reading unit 421, configured to obtain the weights and input data of a target quantization layer of an original neural network, where the target quantization layer is at least one of the computation layers of the original neural network;
a quantization parameter determination unit 422, configured to determine the quantization parameters of the weights of the corresponding layer by using the weights of the target quantization layer of the original neural network, and to determine the quantization parameters of the input data of the corresponding layer by using the input data of the target quantization layer of the original neural network, where both the weights and the input data of the target quantization layer use the principle that the maximum absolute value is not distorted;
a quantization unit 423, configured to quantize the target quantization layer of the original neural network according to the quantization parameters of the weights and the quantization parameters of the input data.
Optionally, the quantization parameter determination unit 422 is specifically configured to obtain the maximum absolute value of the weights of each layer in the target quantization layer, and to determine the first quantization parameter and the second quantization parameter of the weights of the corresponding layer according to the maximum absolute value of the weights of each layer in the target quantization layer.
Optionally, the quantization parameter determination unit 422 is specifically configured to obtain the maximum absolute value of the input data of each layer in the target quantization layer, and to determine the first quantization parameter and the second quantization parameter of the input data of the corresponding layer according to the maximum absolute value of the input data of each layer in the target quantization layer.
Optionally, the apparatus further includes:
a processing unit 424, configured to process each layer in the target quantization layer of the original neural network with a first quantization method, a second quantization method, or a third quantization method, where:
the first quantization method includes:
quantizing the weights of the corresponding layer by using the first quantization parameter of the weights of each layer in the target quantization layer to obtain the weight quantization result of the corresponding layer;
quantizing the input data of the corresponding layer by using the first quantization parameter of the input data of each layer in the target quantization layer to obtain the input data quantization result of the corresponding layer;
the second quantization method includes:
obtaining the weight quantization intermediate parameter of the corresponding layer by using the first quantization parameter and the second quantization parameter of the weights of each layer in the target quantization layer;
obtaining the weight quantization result of the corresponding layer according to the weight quantization intermediate parameter;
obtaining the input data quantization intermediate parameter of the corresponding layer by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer;
obtaining the input data quantization result of the corresponding layer according to the input data quantization intermediate parameter;
the third quantization method includes:
obtaining the weight quantization result of the corresponding layer by using the first quantization parameter and the second quantization parameter of the weights of each layer in the target quantization layer;
obtaining the input data quantization result of the corresponding layer by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
Optionally, the processing unit 424 is configured to obtain the weight quantization intermediate parameter of the corresponding channel by using the first weight quantization parameter and the second weight quantization parameter of each channel of each layer in the target quantization layer, where the target quantization layer includes a convolutional layer and/or a fully connected layer;
to obtain the weight quantization result of the corresponding channel by using the weight quantization intermediate parameter of each channel, where the weight quantization results of all channels of each layer in the target quantization layer constitute the weight quantization result of the corresponding layer;
to obtain the input data quantization intermediate parameter of the corresponding layer by using the first input data quantization parameter and the second input data quantization parameter of each layer in the target quantization layer;
and to obtain the input data quantization result of the corresponding layer by using the input data quantization intermediate parameter of each layer in the target quantization layer.
Optionally, the processing unit 424 is further configured to process each layer in the target quantization layer of the original neural network with the first quantization method, the second quantization method, or the third quantization method, where the target quantization layer further includes at least one layer of the computation layers of the original neural network other than the convolutional layer and/or the fully connected layer;
the first quantization method includes:
quantizing the weights of the corresponding layer by using the first quantization parameter of the weights of each layer in the target quantization layer to obtain the weight quantization result of the corresponding layer;
quantizing the input data of the corresponding layer by using the first quantization parameter of the input data of each layer in the target quantization layer to obtain the input data quantization result of the corresponding layer;
the second quantization method includes:
obtaining the weight quantization intermediate parameter of the corresponding layer by using the first quantization parameter and the second quantization parameter of the weights of each layer in the target quantization layer;
obtaining the weight quantization result of the corresponding layer according to the weight quantization intermediate parameter;
obtaining the input data quantization intermediate parameter of the corresponding layer by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer;
obtaining the input data quantization result of the corresponding layer according to the input data quantization intermediate parameter;
the third quantization method includes:
obtaining the weight quantization result of the corresponding layer by using the first quantization parameter and the second quantization parameter of the weights of each layer in the target quantization layer;
obtaining the input data quantization result of the corresponding layer by using the first quantization parameter and the second quantization parameter of the input data of each layer in the target quantization layer.
For the specific implementation of the first, second, and third quantization methods, reference may be made to the description of the method embodiment shown in Fig. 12, which is not repeated here.
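Tying the units of Fig. 14 together, the following minimal sketch shows how the data reading, quantization parameter determination, and quantization steps might compose; the class and method names are illustrative assumptions, and only the first quantization method is applied here.

```python
# Illustrative composition of the apparatus of Fig. 14.
import math

class NeuralNetworkQuantizer:
    def __init__(self, target_layers):
        # data reading unit 421: {layer name: (weights, input data)}
        self.target_layers = target_layers

    @staticmethod
    def determine_position(values):
        # quantization parameter determination unit 422 (first parameter only)
        abs_max = max(abs(v) for v in values)
        return math.ceil(math.log2(abs_max / 127))

    def quantize(self):
        # quantization unit 423, here applying the first quantization method
        # to both the weights and the input data of every target layer.
        result = {}
        for name, (weights, inputs) in self.target_layers.items():
            w_pos = self.determine_position(weights)
            x_pos = self.determine_position(inputs)
            to_fix8 = lambda vs, p: [max(-128, min(127, round(v / 2.0 ** p))) for v in vs]
            result[name] = {"weights": to_fix8(weights, w_pos),
                            "inputs": to_fix8(inputs, x_pos)}
        return result

q = NeuralNetworkQuantizer({"conv1": ([0.8, -2.3, 3.7], [12.0, -7.5, 3.3])})
print(q.quantize())
```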
Referring to Fig. 15, Fig. 15 provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method shown in Fig. 12 and its refined solutions.
The processor may specifically be a general-purpose processor, for example a central processing unit (CPU) or a graphics processing unit (GPU); in practical applications, the processor may also be a dedicated neural network processor, for example a systolic array machine or a machine learning processor; the processor may also combine a general-purpose processor with a dedicated neural network processor. The present application does not limit the specific form of the processor.
The electronic device may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a still camera, a video camera, a projector, a watch, earphones, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program for electronic data exchange, where the computer program causes a computer to execute the method shown in Fig. 12 and its refined solutions.
An embodiment of the present application further provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to execute the method shown in Fig. 12 and its refined solutions.
It should be noted that, for brevity, the foregoing method embodiments are all described as a series of action combinations. However, those skilled in the art should understand that the present application is not limited by the described order of actions, since some steps may be performed in other orders or simultaneously according to the present application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a logical functional division, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that all or some of the steps in the various methods of the above embodiments can be completed by a program instructing related hardware. The program may be stored in a computer-readable memory, and the memory may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The embodiments of the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific implementations and the scope of application based on the idea of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (15)

  1. A method for processing a network offline model, characterized in that the method comprises:
    obtaining operation unit information of each sub-network in the network offline model, the operation unit information comprising a correspondence between the sub-network and an operation unit type, and the operation unit type comprising a general processing unit type or an artificial intelligence processing unit type;
    defining sub-network operation parameters in the constructed network offline model according to the operation unit information to obtain the constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
  2. The method according to claim 1, characterized in that each sub-network comprises a plurality of fused network layers.
  3. The method according to claim 1, characterized in that the sub-network operation parameters comprise a sub-network name, operation unit type information, and sub-network parameter information.
  4. The method according to claim 1, characterized in that the method further comprises:
    executing the constructed network offline model, which specifically comprises:
    determining, according to the sub-network operation parameters, the operation unit corresponding to a target sub-network, where the target sub-network is any sub-network of the network offline model;
    running the target sub-network on the operation unit corresponding to the target sub-network, so as to execute the network offline model.
  5. The method according to claim 4, characterized in that, if the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, the determining, according to the sub-network operation parameters, the operation unit corresponding to the target sub-network comprises:
    obtaining the model parallelism degree of the network offline model;
    determining the artificial intelligence processing unit corresponding to the target sub-network according to an artificial intelligence processing unit scheduling mechanism, the model parallelism degree, and the sub-network operation parameters.
  6. The method according to claim 4, characterized in that, if the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, the running the target sub-network on the operation unit corresponding to the target sub-network so as to run the network offline model comprises:
    obtaining, when calling an underlying library interface, a channel identifier passed in from the underlying interface;
    determining, according to the channel identifier, the channel through which the artificial intelligence processing unit transfers data;
    executing the target sub-network on the artificial intelligence processing unit through the channel, so as to run the network offline model.
  7. An artificial intelligence processing apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain operation unit information of each sub-network in a network offline model, the operation unit information comprising a correspondence between the sub-network and an operation unit type, and the operation unit type comprising a general processing unit type or an artificial intelligence processing unit type;
    a construction module, configured to define sub-network operation parameters in the constructed network offline model according to the operation unit information to obtain the constructed network offline model, where the sub-network operation parameters are used to indicate the operation unit type of each sub-network.
  8. The apparatus according to claim 7, characterized in that each sub-network comprises a plurality of fused network layers.
  9. The apparatus according to claim 7, characterized in that the sub-network operation parameters comprise a sub-network name, operation unit type information, and sub-network parameter information.
  10. The apparatus according to claim 7, characterized in that the apparatus further comprises an execution module;
    the execution module is configured to execute the constructed network offline model, and is specifically configured to:
    determine, according to the sub-network operation parameters, the operation unit corresponding to a target sub-network, where the target sub-network is any sub-network of the network offline model;
    run the target sub-network on the operation unit corresponding to the target sub-network, so as to run the network offline model.
  11. The apparatus according to claim 10, characterized in that, if the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, in determining, according to the sub-network operation parameters, the operation unit corresponding to the target sub-network, the execution module is specifically configured to:
    obtain the model parallelism degree of the network offline model;
    determine the artificial intelligence processing unit corresponding to the target sub-network according to an artificial intelligence processing unit scheduling mechanism, the model parallelism degree, and the sub-network operation parameters.
  12. The apparatus according to claim 10, characterized in that, if the operation unit corresponding to the target sub-network is an artificial intelligence processing unit, in respect of running the target sub-network on the operation unit corresponding to the target sub-network so as to run the network offline model, the execution module is specifically configured to:
    obtain, when calling an underlying library interface, a channel identifier passed in from the underlying interface;
    determine, according to the channel identifier, the channel through which the artificial intelligence processing unit transfers data;
    execute the target sub-network on the artificial intelligence processing unit through the channel, so as to run the network offline model.
  13. A computer device, comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.
  14. A readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
  15. A combined processing apparatus, characterized in that the combined processing apparatus comprises the artificial intelligence processing apparatus according to claim 7, a universal interconnect interface, and other processing apparatuses;
    the artificial intelligence processing apparatus interacts with the other processing apparatuses to jointly complete computation operations specified by the user.