WO2020192587A1 - Artificial Intelligence Computing Device and Related Products


Info

Publication number
WO2020192587A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
storage
calculation
load
preset
Prior art date
Application number
PCT/CN2020/080447
Other languages
English (en)
French (fr)
Inventor
王楠
陈小兵
孙咏哲
赵永威
陈黎明
武志辉
佟亨文
Original Assignee
中科寒武纪科技股份有限公司
Priority date
Filing date
Publication date
Priority claimed from CN201910226552.7A (CN111723920B)
Priority claimed from CN201910226678.4A (CN111723921B)
Priority claimed from CN201910316537.1A (CN110070176A)
Application filed by 中科寒武纪科技股份有限公司
Priority to US17/440,529 (US11983535B2)
Publication of WO2020192587A1

Classifications

    • G06F9/325: Address formation of the next instruction for loops, e.g. loop detection or loop counter
    • G06F9/30032: Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G06F9/30145: Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/381: Loop buffering
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • This application relates to the field of information processing technology, in particular to an artificial intelligence computing device and related products.
  • Artificial neural networks are powerful algorithms that have in recent years been applied to fields such as image and language processing.
  • The emergence of artificial intelligence computing devices enables neural networks to be supported by hardware and to perform calculations more efficiently.
  • Artificial intelligence computing devices generally have their own instruction set.
  • the instruction set may contain a large number of instructions to be executed; executing them all takes a long time and reduces efficiency. The set may also contain instructions that are executed repeatedly: for example, during data loading, if the data size is large, the data must be transferred multiple times to complete the address-space conversion; another example is the repeated addition and multiplication operations in template operations.
  • if repeated instructions are simply executed as ordinary operations, each instruction corresponds to a section of execution code, and the code corresponding to the repeated instructions occupies extra storage space.
  • the embodiments of the present application provide an artificial intelligence computing device and related products, which can reduce the code amount of instruction information of instructions and improve instruction calculation efficiency.
  • an artificial intelligence computing device which includes a controller unit and an execution unit; wherein,
  • the controller unit is configured to obtain a first instruction set to be executed; and, obtain a second instruction set;
  • the controller unit is also used to determine whether a loop body is formed between the first instruction set and the second instruction set;
  • the execution unit is configured to execute the instructions in the second instruction set according to the instruction information of the first instruction set when a loop body is formed between the first instruction set and the second instruction set.
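A minimal sketch of this loop-body idea (all function and field names are illustrative assumptions, not taken from the patent): two instruction sets form a loop body when their opcode sequences match, so the second set can be executed by reusing the first set's decoded instruction information and substituting only the new operands.

```python
# Hypothetical sketch: detect a loop body between two instruction sets by
# comparing opcode sequences; names are illustrative, not from the patent.
def forms_loop_body(first_set, second_set):
    """Two instruction sets form a loop body when they contain the same
    opcodes in the same order (only operands such as addresses differ)."""
    if len(first_set) != len(second_set):
        return False
    return all(a["opcode"] == b["opcode"] for a, b in zip(first_set, second_set))

def execute(second_set, first_set_info):
    """If a loop body is formed, execute the second set by reusing the
    instruction information cached for the first set."""
    executed = []
    for instr, info in zip(second_set, first_set_info):
        # Reuse the cached decode (info); substitute only the new operand.
        executed.append((info["opcode"], instr["operand"]))
    return executed

first = [{"opcode": "LOAD", "operand": 0x100},
         {"opcode": "CONV", "operand": 0x200},
         {"opcode": "STORE", "operand": 0x300}]
second = [{"opcode": "LOAD", "operand": 0x400},
          {"opcode": "CONV", "operand": 0x500},
          {"opcode": "STORE", "operand": 0x600}]
assert forms_loop_body(first, second)
```

Because only operands change between loop iterations, caching the first set's decode avoids re-fetching and re-decoding identical opcodes on every pass.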
  • an embodiment of the present application provides an artificial intelligence computing method, which is applied to an artificial intelligence computing device, and the method includes:
  • the instructions in the second instruction set are executed according to the instruction information of the first instruction set.
  • embodiments of the present application provide a machine learning computing device, which includes one or more artificial intelligence computing devices described in the first aspect.
  • the machine learning computing device is used to obtain the data to be calculated and control information from other processing devices, execute the specified machine learning calculation, and transmit the execution result to peripheral devices through the I/O interface;
  • the multiple computing devices can be linked through a specific structure and transmit data
  • a plurality of the computing devices may be interconnected and transmit data through a PCIE bus to support larger-scale machine learning operations; the computing devices may share the same control system or have their own control systems; they may share memory or each have their own memory; and the interconnection mode of the computing devices may be any interconnection topology.
  • an embodiment of the present application provides a combined processing device, which includes the machine learning computing device described in the third aspect, a universal interconnection interface, and other processing devices.
  • the machine learning computing device interacts with the above-mentioned other processing devices to jointly complete the operation specified by the user.
  • the combined processing device may also include a storage device, which is respectively connected to the machine learning operation device and the other processing device, and is used to store data of the machine learning operation device and the other processing device.
  • embodiments of the present application provide a neural network chip, which includes the computing device described in the first aspect, the machine learning computing device described in the third aspect, or the combined processing device described in the fourth aspect.
  • embodiments of the present application provide a neural network chip packaging structure, which includes the neural network chip described in the fifth aspect;
  • an embodiment of the present application provides a board, which includes the neural network chip packaging structure described in the sixth aspect.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method steps described in the second aspect.
  • embodiments of the present application provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the method steps described in the second aspect.
  • an embodiment of the present application provides an electronic device, which includes the neural network chip described in the fifth aspect or the board described in the seventh aspect.
  • an artificial intelligence computing device includes a controller unit, a storage unit, and an execution unit; the storage unit is connected to an external storage device, and the execution unit includes a load execution unit, a calculation execution unit Unit and storage execution unit;
  • the controller unit is configured to obtain a first instruction set to be executed, where the first instruction set includes a first load instruction, a first calculation instruction, and a first storage instruction; determine whether there is an association relationship among the first load instruction, the first calculation instruction, and the first storage instruction; and, if there is no association relationship among them, send the first load instruction, the first calculation instruction, and the first storage instruction to the execution unit;
  • the execution unit is configured to execute the first load instruction, the first calculation instruction, and the first storage instruction in parallel in a first time slice; wherein the storage execution unit is configured to transmit, according to the first storage instruction, the first calculation result corresponding to the first input data in the first operation task from the storage unit to the external storage device; the calculation execution unit is configured to calculate, according to the first calculation instruction, the second input data in the second operation task to obtain a second calculation result; and the load execution unit is configured to transmit, according to the first load instruction, the third input data in the third operation task from the external storage device to the storage unit.
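The overlap described above is a three-stage software pipeline: in any given time slice, the store unit writes back the result of task i-1, the compute unit processes task i, and the load unit fetches task i+1. A hedged Python sketch (names and structure are assumptions for illustration, not the device's actual implementation):

```python
# Hypothetical sketch of the three-stage pipeline described above: in each
# time slice the store unit writes back an earlier task's result, the compute
# unit processes the current task, and the load unit fetches the next one.
def run_pipeline(tasks, compute):
    n = len(tasks)
    loaded, computed, stored = {}, {}, []
    # Time slice t: store result of task t-2, compute task t-1, load task t.
    for t in range(n + 2):
        if 0 <= t - 2 < n:                       # storage execution unit
            stored.append(computed[t - 2])
        if 0 <= t - 1 < n:                       # calculation execution unit
            computed[t - 1] = compute(loaded[t - 1])
        if t < n:                                # load execution unit
            loaded[t] = tasks[t]
    return stored

# Example: square each task's input data.
print(run_pipeline([1, 2, 3], lambda x: x * x))  # -> [1, 4, 9]
```

The point of the pipeline is that the three units are never idle waiting on one another: once the pipeline fills, every time slice performs a load, a calculation, and a store concurrently.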
  • the controller unit is specifically configured to:
  • the controller unit is specifically configured to:
  • the artificial intelligence computing device further includes a storage unit, the storage unit is connected to an external storage device, and the execution unit includes a load execution unit, a calculation execution unit, and a storage execution unit.
  • the storage execution unit is configured to correspond to the first input data in the first operation task according to the first storage instruction
  • the first calculation result is transmitted from the storage unit to the external storage device, and the calculation execution unit is configured to calculate the second input data in the second calculation task according to the first calculation instruction to obtain the second calculation result;
  • the load execution unit is configured to transmit the third input data in the third operation task from the external storage device to the storage unit according to the first load instruction.
  • the storage unit includes a first storage area and a second storage area, and transmitting the third input data in the third operation task from the external storage device to the storage unit according to the first load instruction includes:
  • the load execution unit is specifically configured to:
  • the controller unit is further configured to:
  • the second instruction set includes a second load instruction, a second calculation instruction, and a second store instruction; the second store instruction is an instruction used to transmit the second calculation result from the storage unit to the external storage device; the second calculation instruction is an instruction used to calculate the third input data in the third operation task to obtain a third calculation result; and the second load instruction is an instruction used to transmit the fourth input data in the fourth operation task from the external storage device to the storage unit;
  • the execution unit is further configured to execute the second load instruction, the second calculation instruction, and the second store instruction in parallel in a second time slice, where the second time slice is later than the first time slice; wherein the storage execution unit is configured to transmit the second calculation result from the storage unit to the external storage device according to the second storage instruction; the calculation execution unit is configured to, in the second time slice, acquire the third input data from the first storage area according to the second calculation instruction and perform calculation according to the third input data to obtain the third calculation result; and the load execution unit is configured to, in the second time slice, transmit the fourth input data from the external storage device to the second storage area in a ping-pong operation according to the second load instruction.
  • the third input data includes a plurality of third input sub-data
  • the third input data in the third operation task is transmitted from the external storage device to the first storage area in a ping-pong operation.
  • the loading execution unit is specifically configured to:
  • the multiple third input sub-data corresponding to the multiple target storage durations are transmitted to the first storage area in descending order of storage duration, and stored from both ends of the first storage area to the middle.
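One way to read "in descending order of storage duration, stored from both ends to the middle" is that the longest-lived sub-data occupy the outermost slots of the first storage area, alternating between the two ends. A hypothetical sketch (the placement policy shown is an illustrative interpretation, not the patent's exact algorithm):

```python
# Hypothetical sketch: place sub-data in the first storage area from both
# ends toward the middle, in descending order of target storage duration.
def place_both_ends(sub_data_durations, area_size):
    """sub_data_durations: list of (name, duration) pairs. Returns a list of
    area_size slots with the longest-lived items at the outer ends."""
    area = [None] * area_size
    ordered = sorted(sub_data_durations, key=lambda nd: nd[1], reverse=True)
    left, right = 0, area_size - 1
    for i, (name, _) in enumerate(ordered):
        if i % 2 == 0:       # even picks fill from the left end inward
            area[left] = name
            left += 1
        else:                # odd picks fill from the right end inward
            area[right] = name
            right -= 1
    return area

slots = place_both_ends([("a", 5), ("b", 9), ("c", 1), ("d", 7)], 4)
print(slots)  # -> ['b', 'a', 'c', 'd']
```

Keeping short-lived sub-data in the middle means the region that frees up first is contiguous, which simplifies reusing that space for the next transfer.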
  • the execution unit executes the second load instruction, the second calculation instruction, and the second store instruction in parallel in the second time slice.
  • the controller unit is further configured to:
  • if the loop body is formed between the first instruction set and the second instruction set, jump to the operation code storage area of the instruction corresponding to the first instruction set according to a jump instruction, obtain the operation code of the first load instruction from the operation code storage area, use the operation code as the operation code of the second load instruction, and obtain the operation field corresponding to the second load instruction; wherein the operation code includes the identifier of the first calculation instruction, and the operation field includes the storage address of the fourth input data.
  • the controller unit is specifically configured to:
  • preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain multiple pieces of preset instruction information, where the preset instruction information includes at least one of the following: instruction type, remaining execution count, and whether parity is flipped;
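A hypothetical sketch of how such preset instruction information might drive repeated execution, with the parity flag selecting between the two storage areas on alternate iterations (all field names here are illustrative assumptions):

```python
# Hypothetical sketch: preset instruction information drives repeated
# execution. "parity" selects which of the two storage areas (ping or pong)
# the instruction targets on each iteration. Field names are illustrative.
def run_preset(info):
    trace = []
    parity = 0
    while info["remaining"] > 0:
        area = parity if info["flip_parity"] else 0
        trace.append((info["type"], area))
        if info["flip_parity"]:
            parity ^= 1            # toggle between storage area 0 and 1
        info["remaining"] -= 1     # decrement the remaining execution count
    return trace

preset = {"type": "LOAD", "remaining": 4, "flip_parity": True}
print(run_preset(preset))
# -> [('LOAD', 0), ('LOAD', 1), ('LOAD', 0), ('LOAD', 1)]
```

Storing one preset record per instruction instead of one code section per repetition is what reduces the code amount for repeatedly executed instructions.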
  • the embodiments of the present application provide an artificial intelligence computing method, which is applied to an artificial intelligence computing device.
  • the artificial intelligence computing device includes a controller unit, a storage unit, and an execution unit; the storage unit is connected to an external storage device ,
  • the execution unit includes a load execution unit, a calculation execution unit, and a storage execution unit; the method includes:
  • the controller unit obtains a first instruction set to be executed, where the first instruction set includes a first load instruction, a first calculation instruction, and a first storage instruction; determines whether there is an association relationship among the first load instruction, the first calculation instruction, and the first storage instruction; and, if there is no association relationship among them, sends the first load instruction, the first calculation instruction, and the first storage instruction to the execution unit;
  • the execution unit executes the first load instruction, the first calculation instruction, and the first storage instruction in parallel in a first time slice; wherein the storage execution unit transmits, according to the first storage instruction, the first calculation result corresponding to the first input data in the first operation task from the storage unit to the external storage device; the calculation execution unit calculates, according to the first calculation instruction, the second input data in the second operation task to obtain a second calculation result; and the load execution unit transmits, according to the first load instruction, the third input data in the third operation task from the external storage device to the storage unit.
  • the determining whether there is an association relationship between the first load instruction, the first calculation instruction, and the first store instruction includes:
  • the determining whether there is an association relationship between the first load instruction, the first calculation instruction, and the first store instruction includes:
  • the artificial intelligence computing device includes a storage unit connected to an external storage device, and executing the first load instruction, the first calculation instruction, and the first storage instruction in parallel in a first time slice includes:
  • the third input data in the third operation task is transmitted from the external storage device to the storage unit according to the first load instruction.
  • the storage unit includes a first storage area and a second storage area, and transmitting the third input data in the third operation task from the external storage device to the storage unit according to the first load instruction includes:
  • the method further includes:
  • the second instruction set includes a second load instruction, a second calculation instruction, and a second store instruction; the second store instruction is an instruction used to transmit the second calculation result from the storage unit to the external storage device; the second calculation instruction is an instruction used to calculate the third input data in the third operation task to obtain a third calculation result; and the second load instruction is an instruction used to transmit the fourth input data in the fourth operation task from the external storage device to the storage unit;
  • the second load instruction, the second calculation instruction, and the second store instruction are executed in parallel in a second time slice, where the second time slice is later than the first time slice; wherein the storage execution unit transmits the second calculation result from the storage unit to the external storage device according to the second storage instruction; the calculation execution unit, in the second time slice, acquires the third input data from the first storage area according to the second calculation instruction and performs calculation according to the third input data to obtain the third calculation result; and the load execution unit, in the second time slice, transmits the fourth input data from the external storage device to the second storage area in a ping-pong operation according to the second load instruction.
  • the third input data includes a plurality of third input sub-data
  • transmitting the third input data in the third operation task from the external storage device to the first storage area in a ping-pong operation includes:
  • the multiple third input sub-data corresponding to the multiple target storage durations are transmitted to the first storage area in descending order of storage duration, and stored from both ends of the first storage area to the middle.
  • the execution unit executes the second load instruction, the second calculation instruction, and the second store instruction in parallel in the second time slice.
  • the method further includes:
  • if the loop body is formed between the first instruction set and the second instruction set, jump to the operation code storage area of the instruction corresponding to the first instruction set according to a jump instruction, obtain the operation code of the first load instruction from the operation code storage area, use the operation code as the operation code of the second load instruction, and obtain the operation field corresponding to the second load instruction; wherein the operation code includes the identifier of the first calculation instruction, and the operation field includes the storage address of the fourth input data.
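The code-size saving comes from storing the opcode once: on a detected loop body, the second load instruction reuses the cached opcode from the first set's opcode storage area and supplies only its own operation field. A hypothetical sketch (the layout and all names are assumptions for illustration):

```python
# Hypothetical sketch: when a loop body is detected, jump to the first
# instruction set's opcode storage area, fetch the cached opcode, and pair
# it with the second instruction's own operation field (the new address).
OPCODE_AREA = {0: "LOAD", 1: "CONV", 2: "STORE"}  # first set's opcodes

def build_second_instruction(slot, operation_field):
    opcode = OPCODE_AREA[slot]    # reuse the stored opcode; no re-encoding
    return {"opcode": opcode, "operand": operation_field}

# Reconstruct the second load instruction from the cached opcode plus the
# fourth input data's storage address (0x4000 is an illustrative address).
instr = build_second_instruction(0, 0x4000)
print(instr)  # -> {'opcode': 'LOAD', 'operand': 16384}
```

Only the per-iteration operation fields need fresh storage, so n loop iterations cost one opcode section plus n operand records rather than n full code sections.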
  • the determining whether a loop body is formed between the first instruction set and the second instruction set includes:
  • preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain multiple pieces of preset instruction information, where the preset instruction information includes at least one of the following: instruction type, remaining execution count, and whether parity is flipped;
  • an embodiment of the present application provides a machine learning computing device.
  • the machine learning computing device includes one or more artificial intelligence computing devices described in the eleventh aspect.
  • the machine learning computing device is used to obtain the data to be calculated and control information from other processing devices, execute the specified machine learning calculation, and transmit the execution result to peripheral devices through the I/O interface;
  • the multiple computing devices can be linked through a specific structure and transmit data
  • a plurality of the computing devices may be interconnected and transmit data through a PCIE bus to support larger-scale machine learning operations; the computing devices may share the same control system or have their own control systems; they may share memory or each have their own memory; and the interconnection mode of the computing devices may be any interconnection topology.
  • embodiments of the present application provide a combined processing device, which includes the machine learning computing device as described in the thirteenth aspect, a universal interconnection interface, and other processing devices.
  • the machine learning computing device interacts with the above-mentioned other processing devices to jointly complete the operation specified by the user.
  • the combined processing device may also include a storage device, which is respectively connected to the machine learning operation device and the other processing device, and is used to store data of the machine learning operation device and the other processing device.
  • an embodiment of the present application provides a neural network chip, which includes the artificial intelligence computing device described in the eleventh aspect, the machine learning computing device described in the thirteenth aspect, or the combined processing device described in the fourteenth aspect.
  • embodiments of the present application provide a neural network chip packaging structure, which includes the neural network chip described in the fifteenth aspect;
  • an embodiment of the present application provides a board, which includes the neural network chip packaging structure described in the sixteenth aspect.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method steps described in the twelfth aspect.
  • the embodiments of the present application provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the method steps described in the twelfth aspect.
  • an embodiment of the present application provides an electronic device, which includes the neural network chip according to the fifteenth aspect or the board according to the seventeenth aspect.
  • the electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, a headset, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
  • the vehicles include airplanes, ships, and/or cars;
  • the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and/or range hoods;
  • the medical devices include a nuclear magnetic resonance instrument, a B-mode ultrasound instrument, and/or an electrocardiograph.
  • an embodiment of the present application provides a method for processing a network offline model, where:
  • the function set corresponding to the version information in the machine learning library is called to generate an offline model corresponding to the version information.
  • the machine learning library includes an interface function, and the interface function is used to call the function set corresponding to different version information; calling, according to the model information and the version information, the function set corresponding to the version information in the machine learning library to generate an offline model corresponding to the version information includes:
  • an offline model corresponding to the version information is generated.
  • the machine learning library includes environment variables, and the environment variables are used to call the function set corresponding to different version information; the calling the function set corresponding to the version information in the machine learning library to generate an offline model corresponding to the version information includes:
  • an offline model corresponding to the version information is generated.
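A hypothetical sketch of the version-matched generation described above: the machine learning library is keyed by version information, and the generator picks the function set matching the runtime library's version to build the offline model (the library layout, version strings, and operator names are all illustrative assumptions):

```python
# Hypothetical sketch: a machine learning library keyed by version
# information. The interface function picks the function set matching the
# runtime library's version and generates an offline model for that version.
ML_LIBRARY = {
    "v1": {"general_ops": ["conv", "pool"], "function_ops": ["relu"]},
    "v2": {"general_ops": ["conv", "pool", "matmul"],
           "function_ops": ["relu", "gelu"]},
}

def generate_offline_model(model_info, version):
    funcs = ML_LIBRARY[version]            # function set for this version
    supported = set(funcs["general_ops"]) | set(funcs["function_ops"])
    # Keep only the layers the target runtime version can execute.
    layers = [op for op in model_info["layers"] if op in supported]
    return {"version": version, "layers": layers,
            "weights": model_info["weights"]}

model = generate_offline_model({"layers": ["conv", "gelu"],
                                "weights": b"\x00"}, "v2")
print(model["layers"])  # -> ['conv', 'gelu']
```

Selecting the function set by the runtime library's version ensures the generated offline model only references operators that the deployed runtime can actually execute.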
  • the method further includes:
  • the model information includes: model structure information, weight data, and input and output data.
  • the function set includes: a set of general operators and a set of function operators.
  • the method further includes:
  • an embodiment of the present application provides an offline model processing device, wherein:
  • the obtaining unit is used to obtain the version information of the runtime library running the offline model and the model information of the offline model;
  • the generating unit is configured to call the function set corresponding to the version information in the machine learning library according to the model information and the version information to generate an offline model corresponding to the version information.
  • the machine learning library includes interface functions, and the interface functions are used to call the function set corresponding to different version information; in the aspect of calling, according to the model information and the version information, the function set corresponding to the version information in the machine learning library to generate an offline model corresponding to the version information, the generating unit is specifically configured to: call the function set corresponding to the version information in the machine learning library through the interface function, and generate the offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
  • the machine learning library includes environment variables; the generating unit is specifically configured to call the function set corresponding to the version information in the machine learning library through the environment variable, and generate an offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
  • the generating unit is further configured to, according to the model information of the offline model, call the function set corresponding to the latest version information of the runtime library in the machine learning library to generate the offline model corresponding to the latest version information of the runtime library.
  • the model information includes: model structure information, weight data, and input and output data.
  • the function set includes: a general operator set and a function operator set.
  • the device further includes:
  • the running unit is configured to run the offline model based on the runtime library corresponding to the version information.
  • an embodiment of the present application provides an artificial intelligence processing device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the method described in the twenty-first aspect.
  • an embodiment of the present application provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the method described in the twenty-first aspect.
  • an embodiment of the present application provides a combined processing device, characterized in that the combined processing device includes the offline model processing device as described in the twenty-second aspect, a universal interconnection interface, and other processing devices;
  • the processing device of the offline model interacts with the other processing devices to jointly complete the calculation operation specified by the user.
  • Figure 1-1 is a schematic structural diagram of an artificial intelligence computing device provided by an embodiment of the present application;
  • Figure 1-2A is a schematic flowchart of an artificial intelligence calculation method provided by an embodiment of the present application;
  • Figure 1-2B is a schematic diagram of parallel execution of instructions in a neural network instruction set provided by an embodiment of the present application;
  • Figure 1-2C is a schematic diagram of arranging instructions in an instruction set in a tree structure provided by an embodiment of the present application;
  • Figure 1-3 is a structural diagram of a combined processing device provided by an embodiment of the present application;
  • Figure 1-4 is a structural diagram of another combined processing device provided by an embodiment of the present application;
  • Figure 1-5 is a schematic structural diagram of a board provided by an embodiment of the present application;
  • Figure 2-1 is a schematic structural diagram of an artificial intelligence computing device provided by an embodiment of the present application;
  • Figure 2-2A is a schematic flowchart of an artificial intelligence calculation method provided by an embodiment of the present application;
  • Figure 2-2B is a schematic diagram of parallel execution of instructions in a neural network instruction set provided by an embodiment of the present application;
  • Figure 2-2C is a schematic diagram of arranging instructions in an instruction set in a tree structure provided by an embodiment of the present application;
  • Figure 2-3 is a structural diagram of a combined processing device provided by an embodiment of the present application;
  • Figure 2-4 is a structural diagram of another combined processing device provided by an embodiment of the present application;
  • Figure 2-5 is a schematic structural diagram of a board provided by an embodiment of the present application;
  • Figure 3-1 is a schematic flowchart of an offline model processing method provided by an embodiment of the present application;
  • Figure 3-3 is a schematic structural diagram of an offline model processing device provided by an embodiment of the present application;
  • Figure 3-4 is a schematic structural diagram of an artificial intelligence processing device provided by an embodiment of the present application;
  • Figure 3-5 is a schematic structural diagram of a combined processing apparatus provided by an embodiment of the present application.
  • an artificial intelligence computing device which is used to perform machine learning calculations.
  • the computing device includes: a controller unit 11, a storage unit 10, and an execution unit 12, wherein the storage unit 10 is connected to an external storage device.
  • the execution unit 12 includes a load execution unit 121, a calculation execution unit 122, and a storage execution unit 123; among them,
  • the controller unit is configured to obtain a first instruction set to be executed; and, obtain a second instruction set;
  • the controller unit is also used to determine whether a loop body is formed between the first instruction set and the second instruction set;
  • the execution unit is configured to execute the instructions in the second instruction set according to the instruction information of the first instruction set when a loop body is formed between the first instruction set and the second instruction set.
  • the execution unit in terms of executing the instructions in the second instruction set according to the instruction information of the first instruction set, is specifically configured to:
  • the operation code is used as the operation code of the second instruction, wherein the operation code includes the identification of the first instruction.
  • the first instruction set includes the first load instruction, the first calculation instruction, and the first storage instruction of the first operation task; the second instruction set includes the second load instruction, the second calculation instruction, and the second storage instruction of the second operation task; in terms of determining whether a loop body is formed between the first instruction set and the second instruction set, the controller unit is specifically configured to:
  • obtain preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain multiple pieces of preset instruction information, where the preset instruction information includes at least one of the following: instruction type, remaining execution count, and whether parity is flipped;
  • the first instruction set includes a first storage instruction of a first computing task, a second computing instruction of a second computing task, and a third load instruction corresponding to the third computing task;
  • the second instruction set includes the second storage instruction of the second operation task, the third calculation instruction of the third operation task, and the fourth load instruction of the fourth operation task; in terms of determining whether a loop body is formed between the first instruction set and the second instruction set, the controller unit is specifically configured to:
  • obtain preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain multiple pieces of preset instruction information, where the preset instruction information includes at least one of the following: instruction type, remaining execution count, and whether parity is flipped;
  • controller unit is further used for:
  • the execution unit is further configured to, when there is no association relationship among the first storage instruction, the second calculation instruction, and the third load instruction, execute the first storage instruction, the second calculation instruction, and the third load instruction in parallel in the first time slice.
  • the controller unit is specifically configured to:
  • the controller unit is specifically configured to:
  • the artificial intelligence computing device further includes a storage unit connected to an external storage device;
  • the execution unit includes a load execution unit, a calculation execution unit, and a storage execution unit;
  • the storage execution unit is configured to transmit, according to the first storage instruction, the first calculation result corresponding to the first input data in the first operation task from the storage unit to the external storage device; the calculation execution unit is configured to calculate, according to the second calculation instruction, the second input data in the second operation task to obtain the second calculation result;
  • the load execution unit is configured to transmit, according to the third load instruction, the third input data in the third operation task from the external storage device to the storage unit.
  • the storage unit includes a first storage area and a second storage area; in terms of transmitting the third input data in the third operation task from the external storage device to the storage unit according to the third load instruction,
  • the loading execution unit is specifically configured to:
  • the third input data includes a plurality of third input sub-data
  • the third input data in the third operation task is transmitted from the external storage device to the first storage area through a ping-pong operation.
  • the loading execution unit is specifically configured to:
  • the multiple third input sub-data corresponding to the multiple target storage durations are transmitted to the first storage area in descending order of storage duration and stored from both ends of the first storage area toward the middle.
  • Figure 1-2A is a schematic flow diagram of an artificial intelligence computing method provided by an embodiment of the application, which is applied to an artificial intelligence computing device, which includes a controller unit, a storage unit, and Execution unit; the storage unit is connected to an external storage device, the execution unit includes a load execution unit, a calculation execution unit, and a storage execution unit; the method includes:
  • multiple instructions in the instruction set of the neural network can be divided into input and output instructions and calculation instructions.
  • the input and output instructions can be divided into load instructions and store instructions.
  • the execution unit of the artificial intelligence computing device transfers, according to the load instruction, input data from the external storage device to the storage unit on the artificial intelligence computing device; then directly obtains the input data from the storage unit according to the calculation instruction, calculates according to the input data to obtain a calculation result, and caches the calculation result in the storage unit; and finally transfers the calculation result from the storage unit to the external storage device according to the storage instruction.
  • the division of the instruction set of the neural network is not limited to the three stages of load instructions, calculation instructions, and storage instructions; instructions may also be divided according to other criteria, which is not limited in the embodiments of the present application.
  • the first instruction set may include the first load instruction, the first calculation instruction, and the first storage instruction of the first operation task; the second instruction set may include the second load instruction and the second calculation instruction of the second operation task And the second store instruction.
  • the first load instruction is used to transfer the first input data in the first operation task from the external storage device to the storage unit
  • the first calculation instruction is used to calculate the first input data in the first operation task and obtain the first calculation result.
  • the first storage instruction is used to transfer the first calculation result from the storage unit to the external storage device;
  • the second load instruction is used to transfer the second input data in the second calculation task from the external storage device To the storage unit, the second calculation instruction is used to calculate the second input data in the second calculation task and obtain the second calculation result, and the second storage instruction is used to transfer the second calculation result from the storage unit to the external storage device.
  • the first instruction set may include the first storage instruction of the first operation task, the second calculation instruction of the second operation task, and the third load instruction of the third operation task; the second instruction set may include the second storage instruction of the second operation task, the third calculation instruction of the third operation task, and the fourth load instruction of the fourth operation task.
  • the first storage instruction is used to transfer the first calculation result from the storage unit to the external storage device, and the second calculation instruction is used to calculate the second input data in the second calculation task and obtain the second calculation result.
  • the third load instruction is used to transfer the third input data in the third calculation task from the external storage device to the storage unit; the second storage instruction is used to transfer the second calculation result from the storage unit to the external storage device, the third calculation instruction is used to calculate the third input data in the third calculation task and obtain the third calculation result, and the fourth load instruction is used to transfer the fourth input data in the fourth calculation task from the external storage device to the storage unit.
  • the first instruction set includes a first load instruction, a first calculation instruction, and a first store instruction of a first computing task;
  • the second instruction set includes a second load instruction, a second calculation instruction, and a second storage instruction of a second operation task; in step 202, determining whether the first instruction set and the second instruction set constitute a loop body may include the following steps:
  • obtain preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain multiple pieces of preset instruction information, where the preset instruction information includes at least one of the following: instruction type, remaining execution count, and whether parity is flipped;
  • the preset instruction information may include at least one of the following information: instruction type, remaining execution times, and whether the parity is reversed.
  • the instruction type means that the instruction is a load instruction, a calculation instruction, or a store instruction, and when the instruction is a calculation instruction, the type of operator included in the calculation instruction.
  • the operator type may include at least one of the following: addition, subtraction, multiplication, division, convolution, and combinations of multiple of the above operators.
  • the remaining execution count refers to the number of remaining executions of a repeated operation that needs to be executed multiple times in an operation.
  • the first load instruction, the first calculation instruction, and the first storage instruction in the first operation task can be corresponding to the second load instruction, the second calculation instruction, and the second storage instruction of the second operation task.
  • for the calculation Y2 = w·x2 + b, the remaining calculation count of the corresponding second calculation instruction is 98. It can be seen that, between the instructions in the first instruction set corresponding to the first calculation task and those in the second instruction set corresponding to the second calculation task, the first load instruction and the second load instruction have the same instruction type but different remaining load counts, and the first storage instruction and the second storage instruction have the same type but different remaining storage counts.
  • the operator types in the first calculation instruction and the second calculation instruction both include addition and multiplication operators, the order of operations is the same, and only the remaining calculation counts differ; therefore, it can be determined that the first instruction set and the second instruction set form a loop body.
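The loop-body comparison described above can be sketched in Python. This is an illustrative model, not the patent's implementation: the `PresetInfo` record, the field names, and the rule that the later iteration must have a strictly smaller remaining count are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PresetInfo:
    kind: str                        # "load", "compute", or "store"
    remaining: int                   # remaining execution count
    operators: Tuple[str, ...] = ()  # operator sequence, for "compute" only

def forms_loop_body(first: List[PresetInfo], second: List[PresetInfo]) -> bool:
    """Two instruction sets form a loop body when corresponding instructions
    have the same type and operator sequence, differing only in remaining
    execution counts (the later iteration having fewer left)."""
    if len(first) != len(second):
        return False
    for a, b in zip(first, second):
        if a.kind != b.kind or a.operators != b.operators:
            return False
        if b.remaining >= a.remaining:   # later slice must have fewer left
            return False
    return True

# Example: first set for Y1 = w*x1 + b, second set for Y2 = w*x2 + b
first_set = [
    PresetInfo("load", 100),
    PresetInfo("compute", 99, ("multiply", "add")),
    PresetInfo("store", 100),
]
second_set = [
    PresetInfo("load", 99),
    PresetInfo("compute", 98, ("multiply", "add")),
    PresetInfo("store", 99),
]
```

With these inputs the check accepts the pair, since only the remaining counts differ between the two sets.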
  • the first instruction set includes the first storage instruction of the first operation task, the second calculation instruction of the second operation task, and the third load instruction of the third operation task;
  • the second instruction set includes the second storage instruction of the second operation task, the third calculation instruction of the third operation task, and the fourth load instruction of the fourth operation task; in step 202, determining whether the first instruction set and the second instruction set form a loop body may include the following steps:
  • obtain preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain multiple pieces of preset instruction information, where the preset instruction information includes at least one of the following: instruction type, remaining execution count, and whether parity is flipped;
  • the instructions in the instruction set of the neural network can be arranged in a tree structure.
  • Figure 1-2C is an example provided by this application of arranging the instructions in an instruction set in a tree structure.
  • the first level of numbers in the tree structure represents chip information; for example, "1" represents the first chip;
  • the second level of numbers represents time: "1" represents the first time slice and "2" represents the second time slice;
  • the third level of letters represents the load, calculation, and store instructions in each time slice: L stands for a load instruction, C stands for a calculation instruction, and S stands for a store instruction.
  • the loop body corresponding to the instruction set of each time slice can be analyzed in advance to obtain the preset instruction information of each node in the tree structure.
  • it can be determined whether the first instruction set corresponding to the first time slice and the second instruction set corresponding to the second time slice form a loop body. Specifically, the fifth preset instruction information corresponding to the first storage instruction of the first operation task is compared with the sixth preset instruction information corresponding to the second storage instruction of the second operation task; the seventh preset instruction information corresponding to the second calculation instruction of the second operation task is compared with the eighth preset instruction information corresponding to the third calculation instruction of the third operation task; and the ninth preset instruction information corresponding to the third load instruction of the third operation task is compared with the tenth preset instruction information corresponding to the fourth load instruction of the fourth operation task. If the remaining execution counts of the corresponding instructions differ, the remaining execution counts of the instructions corresponding to the second time slice are smaller, and the remaining information is exactly the same, the two instruction sets form a loop body.
  • the second instruction set in the second time slice also contains load instructions, calculation instructions and store instructions.
  • the calculation instructions include addition and multiplication operators, the remaining operation count of the load instruction is 4, the remaining operation count of the calculation instruction is 8, and the remaining operation count of the storage instruction is 2; it can thus be determined that the first instruction set corresponding to the first time slice and the second instruction set corresponding to the second time slice form a loop body.
  • if multiple instruction sets corresponding to multiple consecutive time slices form a loop body, it indicates that the instructions of the same type in those consecutive time slices are repeated instructions.
  • the starting point of the loop body is the time slice where the node with the largest remaining operation count is located, and the length of the loop body is the difference between the farthest time slice that satisfies the loop condition and the starting time slice.
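Assuming the per-slice remaining operation counts and a per-slice loop-condition flag are available, the starting-point and length rules above can be sketched as follows (all names are illustrative):

```python
def loop_bounds(remaining_per_slice, satisfies_loop):
    """Return (start, length): start is the time slice holding the largest
    remaining operation count; length is the distance from start to the
    farthest consecutive slice that still satisfies the loop condition."""
    start = max(range(len(remaining_per_slice)),
                key=lambda i: remaining_per_slice[i])
    end = start
    for i in range(start, len(satisfies_loop)):
        if satisfies_loop[i]:
            end = i
        else:
            break
    return start, end - start

remaining = [3, 10, 9, 8, 2]                  # remaining counts per time slice
satisfies = [False, True, True, True, False]  # loop condition per time slice
start, length = loop_bounds(remaining, satisfies)
```

Here the loop body starts at slice 1 (largest remaining count, 10) and extends to slice 3, giving a length of 2.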
  • the above-mentioned instruction information may include the operation code and operation field of the instruction.
  • the operation codes and operation domains of the instructions in the first instruction set are stored; then, when an instruction in the second instruction set is executed, execution jumps directly to the operation code of the corresponding instruction in the first instruction set, and the instruction in the second instruction set is executed according to that operation code.
  • the operation code corresponding to the first calculation instruction of the first time slice can be stored in the operation code storage area, and there is no need to repeatedly store the operation codes of the multiple instructions corresponding to the second through 100th Yi = w·xi + b operations.
  • a jump instruction can be used to jump to the operation code storage area so that the second instruction set obtains and reuses the operation codes of the corresponding instructions in the first instruction set. Reusing the operation codes in the operation code storage area saves operation code storage space, reduces the code size of each instruction in the instruction set of the second time slice, saves instruction storage space, and improves computing efficiency.
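The opcode-reuse idea can be illustrated with a small sketch: opcodes are written once to a shared storage area, and later iterations fetch them by address (the "jump") rather than carrying their own copies. The class and string encoding below are assumptions for illustration, not the patent's instruction format.

```python
class OpcodeStore:
    """Toy model of the operation code storage area."""

    def __init__(self):
        self._area = []              # the shared opcode storage area

    def store(self, opcode: str) -> int:
        """Store an opcode once and return its address in the area."""
        self._area.append(opcode)
        return len(self._area) - 1

    def jump(self, address: int) -> str:
        """A later instruction reuses a stored opcode via its address."""
        return self._area[address]

store = OpcodeStore()
# first time slice: store the opcodes of the load/compute/store instructions once
addr_load = store.store("LOAD")
addr_comp = store.store("COMPUTE")
addr_stor = store.store("STORE")
# second time slice: the next iteration's Ld, Cc, Sb reuse them by address
reused = [store.jump(addr_load), store.jump(addr_comp), store.jump(addr_stor)]
```

However many iterations run, the storage area holds each opcode only once, which is the space saving the passage describes.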
  • multiple pieces of preset instruction information corresponding to the operation Y1 = w·x1 + b;
  • for the corresponding operation Y2 = w·x2 + b, the remaining calculation count of the first calculation instruction is 98. It can be seen that, between the instructions in the first instruction set corresponding to the first time slice and those in the second instruction set of the second time slice, the first load instruction and the second load instruction have the same type but different remaining load counts, and the first storage instruction and the second storage instruction have the same type but different remaining storage counts.
  • the operator types in the first calculation instruction and the second calculation instruction both include addition and multiplication operators, the order of operations is the same, and only the remaining calculation counts differ; therefore, it can be determined that the first instruction set and the second instruction set form a loop body.
  • executing the instructions in the second instruction set according to the instruction information of the first instruction set may include the following steps:
  • the operation code is used as the operation code of the second instruction, wherein the operation code includes the identification of the first instruction.
  • Step A1: determine whether there is an association relationship among the first storage instruction, the second calculation instruction, and the third load instruction.
  • parallel execution is possible between a load instruction and a store instruction, between a load instruction and a calculation instruction, and between a store instruction and a calculation instruction; however, two load instructions, two calculation instructions, or two store instructions cannot be executed in parallel and must be executed serially.
  • if the execution of one instruction requires the data of another instruction, there is an association relationship between the two instructions; for example, if the execution of a calculation instruction requires the data loaded by a load instruction, the calculation instruction must be executed after the load instruction, so the load instruction and the calculation instruction have an association relationship. Therefore, the association relationships among the instructions to be executed can be determined; if multiple instructions to be executed have no association relationship, the load execution unit, the calculation execution unit, and the storage execution unit in the execution unit execute the two or three unassociated instructions in parallel.
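The association check can be modeled as a simple producer/consumer test: two instructions are associated when one consumes data the other produces. The set-based representation and the data names below are assumptions made for illustration:

```python
def has_association(producer_outputs, consumer_inputs):
    """Two instructions are associated when the consumer reads any data
    item the producer writes; associated instructions must run serially."""
    return bool(set(producer_outputs) & set(consumer_inputs))

# hypothetical example: load Lc writes x3, while calculation Cb reads x2, w, b
lc_outputs = {"x3"}
cb_inputs = {"x2", "w", "b"}
can_run_in_parallel = not has_association(lc_outputs, cb_inputs)
```

Since Cb does not read anything Lc produces, the two instructions can share a time slice; a calculation reading x3 would instead have to wait for Lc.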
  • Figure 1-2B, provided by an embodiment of this application, is a schematic diagram of parallel execution of the instructions in the instruction set of a neural network.
  • L stands for load instructions
  • C stands for calculation instructions
  • S stands for storage instructions.
  • each horizontal row of load, calculation, and store instructions corresponds to one operation task, which loads input data, computes the calculation result, and stores the result; each column of load, calculation, and store instructions corresponds to one time slice, indicating that unassociated load, calculation, and store instructions are executed in parallel. It can be seen that by executing unassociated instructions in parallel, multiple unassociated computing tasks can be executed in parallel, saving calculation time and improving calculation efficiency.
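The staggered pattern in the figure — time slice t running the load of task t, the calculation of task t-1, and the store of task t-2 in parallel — can be sketched as a schedule table. This is a simplified model that assumes no association relationships between the co-scheduled instructions:

```python
def pipeline_schedule(num_tasks: int):
    """Return, per time slice, the (load, compute, store) task indices
    that run in parallel; None marks a stage that is idle in that slice."""
    slices = []
    for t in range(num_tasks + 2):          # pipeline drains over 2 extra slices
        load = t if t < num_tasks else None
        compute = t - 1 if 0 <= t - 1 < num_tasks else None
        store = t - 2 if 0 <= t - 2 < num_tasks else None
        slices.append((load, compute, store))
    return slices

schedule = pipeline_schedule(4)
```

Each task still passes through load, calculation, and store in order, but once the pipeline fills, every time slice keeps all three execution units busy.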
  • determining whether there is an association relationship between the first store instruction, the second calculation instruction, and the third load instruction may include the following steps:
  • the artificial intelligence computing device further includes a storage unit connected to an external storage device; in step A2, executing the first storage instruction, the second calculation instruction, and the third load instruction in parallel in the first time slice may include the following steps:
  • the storage unit includes a first storage area and a second storage area.
  • transmitting the third input data in the third operation task from the external storage device to the storage unit according to the third load instruction may include the following steps:
  • the storage unit can be divided into a first storage area and a second storage area.
  • a ping-pong operation can be performed to transfer input data from the external storage device alternately to the first storage area and the second storage area for storage.
  • the third input data can be transferred to and stored in the first storage area according to the third load instruction, and in the second time slice the fourth input data can be stored in the second storage area according to the fourth load instruction.
  • the third calculation instruction can be executed in parallel: the third input data is obtained from the first storage area according to the third calculation instruction for calculation, and the calculation result is obtained; the next input data is then stored in the first storage area while the calculation instruction corresponding to the fourth load instruction is executed in parallel, and so on.
  • the storage space of the storage unit can be saved.
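The alternating use of the two storage areas can be sketched as a toy double buffer: while the calculation stage reads one area, the next load fills the other. The class below is an illustrative model that ignores timing and capacity:

```python
class PingPongBuffer:
    """Toy model of a storage unit split into two alternating areas."""

    def __init__(self):
        self.areas = {0: None, 1: None}  # first and second storage areas
        self._next = 0                   # which area the next load fills

    def load(self, data):
        """Simulate a load instruction filling the next area in turn."""
        area = self._next
        self.areas[area] = data
        self._next = 1 - area
        return area

    def read(self, area):
        """Simulate a calculation instruction reading its input area."""
        return self.areas[area]

buf = PingPongBuffer()
a3 = buf.load("x3")   # third load instruction -> first storage area
a4 = buf.load("x4")   # fourth load instruction -> second storage area
```

A calculation on `x3` can read area `a3` while the fourth load is still filling area `a4`, which is the overlap the passage describes.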
  • the third input data includes a plurality of third input sub-data
  • transferring the third input data in the third operation task from the external storage device to the first storage area through a ping-pong operation may specifically include the following steps:
  • determine the target storage duration of each third input sub-data, and then store the third input sub-data from both ends of the first storage area toward the middle in descending order of storage duration.
  • the reading time length of the third input sub-data corresponding to the larger target storage time length can be reduced, thereby improving the calculation efficiency.
  • input data transferred from the external storage device to the second storage area can likewise be stored from the two ends of the second storage area toward the middle in descending order of storage duration.
  • w and b are data that will be read repeatedly, so it can be determined that the storage durations corresponding to w and b are longer; w and b can be stored at the two ends of the first storage area or the second storage area, and xi can be stored in the middle, so that when data is read from the first or second storage area, the time required to read w and b each time is small, reducing the time consumed in reading data.
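The ends-toward-middle placement can be sketched as follows: sub-data with the longest target storage durations (such as w and b above) land at the two ends of the area, and short-lived sub-data ends up in the middle. The function and data names are assumptions made for illustration:

```python
def place_from_ends(sub_data_durations):
    """Place (name, duration) items into buffer slots so that items with
    longer target storage durations sit at the two ends of the area and
    shorter-lived items end up in the middle."""
    ordered = sorted(sub_data_durations, key=lambda nd: nd[1], reverse=True)
    n = len(ordered)
    buffer = [None] * n
    left, right = 0, n - 1
    for i, (name, _) in enumerate(ordered):
        if i % 2 == 0:          # alternate: longest-lived items fill the ends
            buffer[left] = name
            left += 1
        else:
            buffer[right] = name
            right -= 1
    return buffer

layout = place_from_ends([("w", 100), ("b", 100), ("x3", 1)])
```

For the example, w and b occupy the two ends while x3 sits in the middle, matching the placement described in the passage.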
  • each horizontal row of load instructions, calculation instructions, and store instructions corresponds to an operation task.
  • the first operation task may include the first load instruction La, the first calculation instruction Ca, and the first storage instruction Sa. The input data can be loaded from the external storage device to area a1 of the storage unit on the artificial intelligence computing device through the first load instruction La; the input data is then read from area a1 through the first calculation instruction Ca and calculated to obtain the calculation result, which is stored in area a2 of the storage unit; finally, the calculation result is read from area a2 through the first storage instruction Sa and transferred from the storage unit to the external storage device.
  • the second operation task may include the second load instruction Lb, the second calculation instruction Cb, and the second storage instruction Sb
  • the third operation task may include the third load instruction Lc, the third calculation instruction Cc, and the third storage instruction Sc, and the fourth operation task may include the fourth load instruction Ld, the fourth calculation instruction Cd, and the fourth storage instruction Sd. It can be seen that, in the first time slice, if there is no association relationship among the first storage instruction Sa of the first operation task, the second calculation instruction Cb of the second operation task, and the third load instruction Lc of the third operation task, then the first storage instruction Sa, the second calculation instruction Cb, and the third load instruction Lc can be executed in parallel in the first time slice.
  • the second storage instruction Sb of the second operation task, the third calculation instruction Cc of the third operation task, and the fourth load instruction Ld of the fourth operation task can also be executed in parallel in the second time slice.
  • the second instruction set composed of the second storage instruction Sb, the third calculation instruction Cc, and the fourth load instruction Ld forms the loop body; when the instructions in the instruction set corresponding to the second time slice are executed, execution jumps, according to a jump instruction, to the operation code storage area holding the operation codes of the instructions in the first instruction set. Specifically, the first operation code of the third load instruction Lc, the second operation code of the second calculation instruction Cb, and the third operation code of the first storage instruction Sa are obtained from the operation code storage area; then the first operation code is used as the operation code of the fourth load instruction Ld, the second operation code as the operation code of the third calculation instruction Cc, and the third operation code as the operation code of the second storage instruction Sb. In addition, the fourth load instruction Ld corresponds to a first operation domain and the third calculation instruction Cc corresponds to a second operation domain.
  • the technical solution provided by this application reduces the amount of code expanded by repeated instructions by folding the repeated instructions in the instruction set of the neural network, executing the repeated instructions through jump instructions, and storing the data in the neural network in different divided areas, improving the efficiency of obtaining data and thereby improving the computational efficiency of the neural network.
  • This application also discloses a machine learning computing device, which includes one or more artificial intelligence computing devices mentioned in this application, which is used to obtain data to be computed and control information from other processing devices, and perform specified machine learning computing.
  • the execution result is passed to the peripheral device through the I/O interface.
  • Peripheral equipment includes, for example, a camera, a monitor, a mouse, a keyboard, a network card, a wifi interface, and a server.
  • the artificial intelligence computing devices can be linked and transmit data through a specific structure, for example, interconnect and transmit data through a PCIE bus to support larger-scale machine learning operations.
• The devices can share the same control system or have separate control systems; they can share memory, or each accelerator can have its own memory.
  • the interconnection mode can be any interconnection topology.
  • the machine learning computing device has high compatibility and can be connected to various types of servers through the PCIE interface.
  • the application also discloses a combined processing device, which includes the above-mentioned machine learning computing device, a universal interconnection interface, and other processing devices.
  • the machine learning computing device interacts with other processing devices to jointly complete operations specified by the user.
  • Figures 1-3 are schematic diagrams of the combined processing device.
• Other processing devices include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor.
  • the number of processors included in other processing devices is not limited.
• Other processing devices serve as the interface between the machine learning computing device and external data and control, performing basic control such as data handling and starting or stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • the universal interconnection interface is used to transmit data and control instructions between the machine learning computing device and other processing devices.
• The machine learning computing device obtains the required input data from other processing devices and writes it to the on-chip storage device of the machine learning computing device; it can obtain control instructions from other processing devices and write them to the on-chip control buffer of the machine learning computing device; it can also read the data in the storage module of the machine learning computing device and transmit it to other processing devices.
  • the structure is shown in FIGS. 1-4, and may also include a storage device, which is respectively connected to the machine learning computing device and the other processing device.
  • the storage device is used to store data in the machine learning operation device and the other processing device, and is especially suitable for data that cannot be fully stored in the internal storage of the machine learning operation device or other processing device.
  • the combined processing device can be used as an SOC system-on-chip for mobile phones, robots, drones, video surveillance equipment and other equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
• The universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, monitor, mouse, keyboard, network card, or Wi-Fi interface.
  • a chip is also disclosed, which includes the aforementioned machine learning computing device or combination processing device.
  • a chip packaging structure which includes the aforementioned chip.
  • a board card which includes the chip packaging structure described above. Refer to Figure 1-5.
  • Figure 1-5 provides a board.
  • the board may also include other supporting components, including but not limited to: storage device 390, interface device 391 And control device 392;
  • the storage device 390 is connected to the chip in the chip packaging structure through a bus, and is used for storing data.
• The storage device may include multiple groups of storage units 393. Each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
• The storage device may include 4 groups of storage units. Each group of storage units may include a plurality of DDR4 chips. In an embodiment, the chip may include four 72-bit DDR4 controllers; in each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
  • each group of the storage unit includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transmit data twice in one clock cycle.
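As a quick arithmetic check of the 25600 MB/s figure quoted above (a sketch based only on the numbers in the text, not on any particular device's datasheet), the theoretical bandwidth of one 64-bit DDR4-3200 channel is the transfer rate multiplied by the data payload width:

```python
# DDR4-3200 performs 3200 million transfers per second; of the 72-bit
# interface, 64 bits carry data and 8 bits carry ECC, so only 64 bits
# count toward payload bandwidth.
transfers_per_second = 3200 * 10**6   # 3200 MT/s
payload_bits = 64                     # ECC bits excluded
bandwidth_mb_per_s = transfers_per_second * payload_bits // 8 // 10**6
print(bandwidth_mb_per_s)  # 25600
```

This matches the theoretical figure given above for each group of storage units.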
  • a controller for controlling the DDR is provided in the chip for controlling the data transmission and data storage of each storage unit.
  • the interface device is electrically connected with the chip in the chip packaging structure.
  • the interface device is used to implement data transmission between the chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
• The interface device may also be another interface. This application does not limit the specific form of the other interface, as long as the interface unit can realize the transfer function.
  • the calculation result of the chip is still transmitted by the interface device back to an external device (such as a server).
  • the control device is electrically connected with the chip.
  • the control device is used to monitor the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, MCU).
• The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the chip can be in different working states such as heavy load and light load.
• The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
• An electronic device is also provided, which includes the above board card.
• Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the transportation means include airplanes, ships and/or vehicles;
  • the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance, B-ultrasound and/or electrocardiograph.
• The artificial neural network is a powerful algorithm that in recent years has been applied to various fields such as image and language processing.
  • Artificial intelligence computing devices can enable neural networks to be supported by hardware and perform calculations more efficiently.
  • Artificial intelligence computing devices generally have their own instruction set.
• The instruction set may contain many instructions to be executed; executing all of them takes a long time and affects efficiency. The instruction set may also contain instructions that are executed repeatedly: for example, during data loading, if the data size is large, the data needs to be transported multiple times to complete the address space conversion; similarly, template operations contain repeated addition and multiplication operations. All of this results in a decrease in calculation efficiency.
  • an artificial intelligence computing device which is used to perform machine learning calculations.
• The computing device includes: a controller unit 21, a storage unit 20, and an execution unit 22, wherein the storage unit 20 is connected to an external storage device, and the execution unit 22 includes a load execution unit 221, a calculation execution unit 222, and a storage execution unit 223; among them,
• The controller unit is configured to obtain a first instruction set to be executed, the first instruction set including a first load instruction, a first calculation instruction, and a first storage instruction; to determine whether there is an association relationship among the first load instruction, the first calculation instruction, and the first storage instruction; and, if there is no association relationship among the first load instruction, the first calculation instruction, and the first storage instruction, to send the first load instruction, the first calculation instruction, and the first storage instruction to the execution unit;
  • the execution unit is configured to execute the first load instruction, the first calculation instruction, and the first store instruction in parallel in a first time slice.
  • the controller unit is specifically configured to:
  • the artificial intelligence computing device further includes a storage unit connected to an external storage device, and the execution unit includes a load execution unit, a calculation execution unit, and a storage execution unit.
• The storage execution unit is configured to transmit, according to the first storage instruction, the first calculation result corresponding to the first input data in the first operation task from the storage unit to the external storage device; the calculation execution unit is configured to calculate the second input data in the second operation task according to the first calculation instruction to obtain a second calculation result.
  • the load execution unit is configured to transmit the third input data in the third operation task from the external storage device to the storage unit according to the first load instruction.
• The storage unit includes a first storage area and a second storage area, and the third input data in the third operation task is transferred from the external storage device to the first storage area according to the first load instruction.
  • the loading execution unit is specifically configured to:
  • the controller unit is further configured to:
• The second instruction set includes a second load instruction, a second calculation instruction, and a second storage instruction; the second storage instruction is an instruction used to transfer the second calculation result from the storage unit to the external storage device; the second calculation instruction is an instruction used to calculate the third input data in the third operation task to obtain a third calculation result; the second load instruction is an instruction used to transfer the fourth input data in the fourth operation task from the external storage device to the storage unit.
• The execution unit is further configured to execute the second load instruction, the second calculation instruction, and the second store instruction in parallel in a second time slice, where the second time slice is later than the first time slice;
• The storage execution unit is configured to transmit the second calculation result from the storage unit to the external storage device according to the second storage instruction; the calculation execution unit is configured, in the second time slice, to acquire the third input data from the first storage area according to the second calculation instruction and to perform calculation according to the third input data to obtain the third calculation result; the load execution unit is configured, in the second time slice, to transmit the fourth input data from the external storage device to the second storage area in a ping-pong operation according to the second load instruction.
  • the third input data includes a plurality of third input sub-data
• The third input data in the third operation task is transmitted from the external storage device to the first storage area in a ping-pong operation.
  • the loading execution unit is specifically configured to:
  • the multiple third input sub-data corresponding to the multiple target storage durations are transmitted to the first storage area in descending order of storage duration, and stored from both ends of the first storage area to the middle.
• Before the execution unit executes the second load instruction, the second calculation instruction, and the second storage instruction, the controller unit is further configured to:
• If a loop body is formed between the first instruction set and the second instruction set, jump to the operation code storage area of the instructions corresponding to the first instruction set according to the jump instruction, obtain the operation code of the first load instruction from the operation code storage area, use that operation code as the operation code of the second load instruction, and obtain the operation domain corresponding to the second load instruction, wherein the operation code includes the identifier of the instruction and the operation domain includes the storage address of the fourth input data.
  • the controller unit is specifically configured to:
• Preset instruction information corresponding to each instruction in the first instruction set and the second instruction set is obtained, yielding multiple pieces of preset instruction information, where the preset instruction information includes at least one of the following: instruction type, remaining execution count, and whether parity is flipped;
  • Figure 2-2A is a schematic flowchart of an artificial intelligence calculation method provided by an embodiment of the application, which is applied to an artificial intelligence computing device, and the artificial intelligence computing device includes a controller unit, a storage unit, and Execution unit; the storage unit is connected to an external storage device, the execution unit includes a load execution unit, a calculation execution unit, and a storage execution unit; the method includes:
• Acquire a first instruction set to be executed, where the first instruction set includes a first load instruction, a first calculation instruction, and a first store instruction; determine whether there is an association relationship among the first load instruction, the first calculation instruction, and the first store instruction; if there is no association relationship among them, send the first load instruction, the first calculation instruction, and the first storage instruction to the execution unit;
  • multiple instructions in the instruction set of the neural network can be divided into input and output instructions and calculation instructions.
  • the input and output instructions can be divided into load instructions and store instructions.
• The execution unit of the artificial intelligence computing device transfers the input data from the external storage device to the storage unit on the artificial intelligence computing device according to the load instruction, then obtains the input data directly from the storage unit according to the calculation instruction, performs the calculation according to the input data to obtain the calculation result, caches the calculation result in the storage unit, and finally transfers the calculation result from the storage unit to the external storage device according to the storage instruction.
• A load instruction and a storage instruction, a load instruction and a calculation instruction, or a storage instruction and a calculation instruction can be executed in parallel; a load instruction and another load instruction, a calculation instruction and another calculation instruction, or a storage instruction and another storage instruction cannot be executed in parallel and must be executed serially.
• If the execution of one instruction requires the data of another instruction, there is an association relationship between the two instructions. For example, if the execution of a calculation instruction requires the data loaded by a load instruction, the calculation instruction must be executed after the load instruction, and it can be determined that the load instruction and the calculation instruction have an association relationship. The association relationships among the instructions to be executed can therefore be determined; if multiple instructions to be executed have no association relationship, the load execution unit, the calculation execution unit, and the storage execution unit in the execution unit execute two or three of these unrelated instructions in parallel.
• FIG. 2-2B is a demonstration diagram of parallel execution of instructions in the instruction set of a neural network provided by an embodiment of the application, as shown in Figure 2-2B.
  • L stands for load instructions
  • C stands for calculation instructions
  • S stands for storage instructions.
• Each horizontal row of load, calculation, and store instructions corresponds to one operation task, which loads input data, computes a calculation result, and stores the result; each column of load, calculation, and store instructions corresponds to a time slice, indicating that load, calculation, and store instructions without association relationships are executed in parallel. It can be seen that by executing instructions that have no association relationship in parallel, multiple operation tasks without association relationships can be executed in parallel, thereby saving calculation time and improving calculation efficiency.
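The column-wise parallelism described above can be sketched as follows (a conceptual illustration only; the function names, data values, and use of Python threads are assumptions, not the device's actual interface). In one time slice, the store, calculation, and load execution units each run one instruction taken from a different operation task, and none of the three touches another's data:

```python
from concurrent.futures import ThreadPoolExecutor

external = {"task3": [4, 5, 6]}   # stands in for the external storage device
on_chip = {"task2": [1, 2, 3]}    # stands in for the on-chip storage unit
results = {}

def store_Sa():                   # Sa: move task 1's result off-chip
    results["task1"] = 60

def compute_Cb():                 # Cb: calculate task 2 from on-chip data
    results["task2"] = sum(on_chip["task2"])

def load_Lc():                    # Lc: bring task 3's input on-chip
    on_chip["task3"] = external["task3"]

# one execution unit per instruction; no instruction reads what another writes
with ThreadPoolExecutor(max_workers=3) as pool:
    for unit in (store_Sa, compute_Cb, load_Lc):
        pool.submit(unit)

print(results["task2"], on_chip["task3"])  # 6 [4, 5, 6]
```

Because the three instructions have no association relationship, the order in which the units finish within the slice does not affect the result.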
• The division of the instruction set of the neural network is not limited to the three stages of load instructions, calculation instructions, and storage instructions; instructions may also be divided according to other standards, which is not limited in the embodiments of the present application.
  • determining whether there is an association relationship between the first load instruction, the first calculation instruction, and the first storage instruction may include the following steps:
• The controller unit extracts the first storage address interval of the data required by the first load instruction according to the first load instruction, extracts the second storage address interval of the data required by the first calculation instruction according to the first calculation instruction, and extracts the third storage address interval of the data required by the first storage instruction according to the first storage instruction;
  • determining whether there is an association relationship between the first load instruction, the first calculation instruction, and the first storage instruction may include the following steps:
• The controller unit extracts the first write area corresponding to the first load instruction according to the first load instruction, extracts the first read area and the second write area corresponding to the first calculation instruction according to the first calculation instruction, and extracts the second read area corresponding to the first storage instruction according to the first storage instruction;
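A minimal sketch of how such an association check might work (hypothetical; the half-open interval representation and the pairwise-overlap rule are assumptions consistent with the address-interval extraction described above):

```python
def overlaps(a, b):
    """True if half-open address intervals a and b share any address."""
    return a[0] < b[1] and b[0] < a[1]

def has_association(load_iv, calc_iv, store_iv):
    # any pairwise overlap means one instruction touches data that
    # another instruction reads or writes, so they must serialize
    return (overlaps(load_iv, calc_iv)
            or overlaps(load_iv, store_iv)
            or overlaps(calc_iv, store_iv))

# disjoint intervals -> no association -> may be issued in parallel
print(has_association((0x000, 0x100), (0x100, 0x200), (0x200, 0x300)))  # False
# the load writes into the region the calculation reads -> associated
print(has_association((0x000, 0x100), (0x080, 0x200), (0x200, 0x300)))  # True
```

When the three intervals are disjoint, the controller can send all three instructions to the execution unit for the same time slice.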
  • the artificial intelligence computing device includes a storage unit, and the storage unit is connected to an external storage device.
• Executing the first load instruction, the first calculation instruction, and the first storage instruction in parallel in the first time slice may include the following steps: the storage execution unit transmits, according to the first storage instruction, the first calculation result corresponding to the first input data in the first operation task from the storage unit to the external storage device.
  • the storage unit includes a first storage area and a second storage area.
• Transferring the third input data in the third operation task from the external storage device to the storage unit according to the first load instruction may include the following steps:
  • the method further includes:
• A22: Acquire a second instruction set, where the second instruction set includes a second load instruction, a second calculation instruction, and a second storage instruction; the second storage instruction is an instruction used to transmit the second calculation result from the storage unit to the external storage device; the second calculation instruction is an instruction used to calculate the third input data in the third operation task to obtain a third calculation result; the second load instruction is an instruction used to transfer the fourth input data in the fourth operation task from the external storage device to the storage unit;
• A23: Execute the second load instruction, the second calculation instruction, and the second store instruction in parallel in a second time slice, where the second time slice is later than the first time slice; the storage execution unit is configured to transmit the second calculation result from the storage unit to the external storage device according to the second storage instruction; the calculation execution unit is configured, in the second time slice, to obtain the third input data from the first storage area according to the second calculation instruction and to perform calculations according to the third input data to obtain the third calculation result; the load execution unit is configured, in the second time slice, to transmit the fourth input data from the external storage device to the second storage area in a ping-pong operation according to the second load instruction.
  • the storage unit can be divided into a first storage area and a second storage area.
  • a ping-pong operation can be performed to transfer input data from the external storage device to the first storage area and the second storage area in turn.
• In the first time slice, the third input data can be transferred and stored to the first storage area according to the first load instruction; in the second time slice, the fourth input data can be stored in the second storage area according to the second load instruction, while the second calculation instruction is executed in parallel, obtaining the third input data from the first storage area and performing calculation according to it to obtain the calculation result; then the next input data is stored in the first storage area while the next calculation instruction is executed in parallel, and so on. In this way, the storage space of the storage unit can be saved.
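The alternation between the two storage areas can be sketched as follows (an illustrative model, not the device's actual buffer manager; list-of-numbers batches and `sum` stand in for real input data and calculation):

```python
def ping_pong(batches):
    """Process a stream of input batches through two alternating areas."""
    areas = [None, None]            # first and second storage areas
    results = []
    for t, batch in enumerate(batches):
        write = t % 2               # area being loaded in this time slice
        read = 1 - write            # area loaded in the previous slice
        if areas[read] is not None:
            results.append(sum(areas[read]))   # "compute" on older data
        areas[write] = batch        # "load" happens in parallel, conceptually;
        # the previous contents of areas[write] were already consumed,
        # so they can be overwritten, halving the buffer requirement
    results.append(sum(areas[(len(batches) - 1) % 2]))  # drain last batch
    return results

print(ping_pong([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```

Each slice overwrites only the area whose previous contents have already been consumed, which is why two areas suffice for an arbitrarily long stream of operation tasks.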
  • the third input data includes a plurality of third input sub-data
• Transmitting the third input data in the third operation task from the external storage device to the first storage area in a ping-pong operation may include the following steps:
  • the load execution unit estimates the target storage duration of each third input sub-data in the first storage area among the plurality of third input sub-data to obtain multiple target storage durations;
  • A212 Transmit the multiple third input sub-data corresponding to the multiple target storage durations to the first storage area in descending order of storage duration, and store them from both ends of the first storage area to intermediate.
• That is, the target storage duration of each third input sub-data is estimated, and the sub-data are then stored from both ends of the first storage area toward the middle in descending order of storage duration.
• In this way, the time needed to read the third input sub-data with larger target storage durations can be reduced, thereby improving calculation efficiency.
  • the input data from the external storage device to the second storage area can also be stored from the two ends of the second storage area to the middle according to the storage duration in descending order.
• For example, in the operation Y_i = wx_i + b, w and b are data that will be read repeatedly, so it can be determined that the storage durations corresponding to w and b are longer; w and b can therefore be stored at both ends of the first storage area or the second storage area, and x_i in the middle, so that each time data is read from the first or second storage area the time needed to read w and b is small, which reduces the time spent reading data.
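A sketch of the ends-to-middle placement (the function and the duration values are illustrative assumptions): sub-data are sorted by estimated storage duration, the longest-lived items are placed alternately at the two ends, and the shortest-lived end up in the middle:

```python
def place_ends_to_middle(items_with_duration):
    """items_with_duration: list of (name, estimated_storage_duration)."""
    ordered = sorted(items_with_duration, key=lambda x: -x[1])  # longest first
    n = len(ordered)
    layout = [None] * n
    left, right = 0, n - 1
    for i, (name, _) in enumerate(ordered):
        if i % 2 == 0:              # alternate between the two ends,
            layout[left] = name     # filling toward the middle
            left += 1
        else:
            layout[right] = name
            right -= 1
    return layout

# w and b are read repeatedly (long residence); x is transient
print(place_ends_to_middle([("x", 1), ("w", 9), ("b", 8)]))  # ['w', 'x', 'b']
```

Here w and b, the repeatedly read operands from the Y_i = wx_i + b example, land at the two ends while the transient x sits in the middle, matching the placement described in the text.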
• Before the execution unit executes the second load instruction, the second calculation instruction, and the second storage instruction in parallel in the second time slice, the following steps may be further included:
  • the instructions in the instruction set of the neural network can be arranged in a tree structure.
  • the first layer of numbers in the tree structure is used to indicate chip information, for example, "1" indicates the first chip
  • the second layer of numbers is used to indicate time
  • “1" means the first time slice
  • "2" means the second time slice
  • the third level of letters means load instructions, calculation instructions, and store instructions in each time slice.
• The preset instruction information may include at least one of the following: instruction type, remaining execution count, and whether parity is flipped. The instruction type indicates whether the instruction is a load, calculation, or storage instruction and, for a calculation instruction, the type of operators it contains; the operator type may include at least one of addition, subtraction, multiplication, division, convolution, or a combination of several of these operators. The remaining execution count refers to the remaining number of times a repeated operation, which must be executed multiple times within an operation, still needs to be executed.
• The operation code corresponding to the first calculation instruction of the first time slice can be stored in the operation code storage area, so there is no need to repeatedly store the operation codes of the multiple instructions corresponding to the second through 100th Y_i = wx_i + b operations.
• When the second instruction set is executed, the jump instruction can be used to jump to the operation code storage area to obtain the operation codes of the instructions corresponding to the second instruction set; the operation codes in the operation code storage area can thus be reused, which saves operation code storage space, reduces the code size of each instruction in the instruction set of the second time slice, saves instruction storage space, and improves operation efficiency.
• the second storage instruction corresponds to the operation Y_2 = wx_2 + b
• the second calculation instruction corresponds to the operation Y_3 = wx_3 + b
  • the first load instruction and the second load instruction are of the same type, and the remaining load times are different.
• The first storage instruction and the second storage instruction are of the same type, but the remaining storage counts are different.
  • the operator types in the first calculation instruction and the second calculation instruction both include addition and multiplication operators, and the operation sequence is the same, only the remaining calculation times are different. Therefore, it can be determined that the first instruction set and the second instruction set are loop bodies.
  • determining whether a loop body is formed between the first instruction set and the second instruction set may include the following steps:
  • the loop body corresponding to the instruction set of each time slice can be parsed in advance to obtain the preset instruction information of each node in the tree structure.
• The preset instruction information is compared with the sixth preset instruction information corresponding to the second storage instruction; if the remaining execution counts differ, with the remaining execution count of the instruction corresponding to the second time slice being smaller, while the remaining information is exactly the same, it can be determined that the second instruction set corresponding to the second time slice and the first instruction set corresponding to the first time slice constitute a loop body.
• the operators included in the calculation instruction are addition and multiplication.
  • the remaining number of operations of the load instruction is 5 times
  • the remaining number of operations of the calculation instruction is 9 times
  • the remaining number of operations of the store instruction is 3 times.
  • the second instruction set in the second time slice also includes load Instructions, calculation instructions and storage instructions.
  • the calculation instructions include addition and multiplication.
  • the remaining operation times of the load instruction are 4 times
  • the remaining operation times of the calculation instruction are 8 times
  • the remaining operation times of the storage instruction are 2 times.
• It can be determined whether the first instruction set corresponding to the first time slice and the second instruction set corresponding to the second time slice constitute a loop body; likewise, it can be determined whether multiple instruction sets corresponding to multiple consecutive time slices constitute a loop body. If they do, the instructions of the same type in those consecutive time slices are instructions that are repeatedly executed.
• The starting point of the loop body is the time slice where the node with the largest remaining execution count is located, and the length of the loop body is the difference between the farthest time slice that satisfies the loop condition and the starting time slice.
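The loop-body test described above can be sketched as follows (a simplified model; the dict fields and the exact comparison rule are assumptions based on the example counts in the text). Two consecutive time slices form a loop body when each instruction pair has the same type and operators and the later slice's remaining execution count is strictly smaller:

```python
def is_loop_body(slice_a, slice_b):
    """Each slice: list of dicts with 'type', 'ops', 'remaining' fields."""
    if len(slice_a) != len(slice_b):
        return False
    return all(a["type"] == b["type"] and a["ops"] == b["ops"]
               and b["remaining"] < a["remaining"]        # count decremented
               for a, b in zip(slice_a, slice_b))

# counts taken from the example above: 5/9/3 in the first slice, 4/8/2 next
slice1 = [{"type": "load",  "ops": None,           "remaining": 5},
          {"type": "calc",  "ops": ("add", "mul"), "remaining": 9},
          {"type": "store", "ops": None,           "remaining": 3}]
slice2 = [{"type": "load",  "ops": None,           "remaining": 4},
          {"type": "calc",  "ops": ("add", "mul"), "remaining": 8},
          {"type": "store", "ops": None,           "remaining": 2}]
print(is_loop_body(slice1, slice2))  # True
```

Comparing a slice with itself fails the test, since the remaining counts are not strictly smaller, which is what stops the loop from extending past its last repetition.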
  • each row of load instructions, calculation instructions, and storage instructions in the horizontal direction corresponds to a calculation task.
  • the first calculation task may include load instructions La, calculation instructions Ca, and storage instructions Sa.
• The input data can be loaded from the external storage device to area a1 of the storage unit on the artificial intelligence computing device through the load instruction La; the input data is then read from area a1 through the calculation instruction Ca, the calculation is performed, and the calculation result is stored in area a2 of the storage unit on the artificial intelligence computing device; finally, the calculation result is read from area a2 through the storage instruction Sa and transferred from area a2 of the storage unit to the external storage device.
• The second operation task may include load instruction Lb, calculation instruction Cb, and storage instruction Sb.
  • the third operation task may include load instruction Lc, calculation instruction Cc, and storage instruction Sc.
• The fourth operation task may include load instruction Ld, calculation instruction Cd, and storage instruction Sd. It can be seen that, in the first time slice, if there is no association relationship among the storage instruction Sa of the first operation task, the calculation instruction Cb of the second operation task, and the load instruction Lc of the third operation task, the storage instruction Sa, the calculation instruction Cb, and the load instruction Lc can be executed in parallel on the chip.
  • Similarly, if the storage instruction Sb of the second operation task, the calculation instruction Cc of the third operation task, and the load instruction Ld of the fourth operation task have no dependency among them, they can also be executed in parallel in the second time slice.
  • the second instruction set constitutes the loop body.
  • The jump instruction is used to jump to the operation code storage area of the instructions corresponding to the first instruction set. Specifically, the first operation code of the load instruction Lc, the second operation code of the calculation instruction Cb, and the third operation code of the storage instruction Sa are acquired from the operation code storage area; then the first operation code is used as the operation code of the load instruction Ld, the second operation code as the operation code of the calculation instruction Cc, and the third operation code as the operation code of the storage instruction Sb. In addition, the first operation domain corresponding to the load instruction Ld, the second operation domain corresponding to the calculation instruction Cc, and the third operation domain corresponding to the storage instruction Sb can be obtained.
  • The technical solution provided by this application executes instructions that have no dependency relationship in parallel, reducing instruction execution time and improving the operating efficiency of the neural network; it also folds the repeated instructions in the instruction set of the neural network and executes them through jump instructions, reducing the amount of code expanded by the repeated instructions.
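As an illustrative sketch (not part of the disclosed device), the three-stage software pipeline described above — where each time slice executes the store of one task, the calculation of the next, and the load of the one after in parallel — can be modeled as follows; the function name and list representation are hypothetical:

```python
def schedule(tasks):
    """Return per-time-slice instruction sets for a 3-stage L/C/S pipeline.

    Time slice t runs, in parallel: the store (S) of task t-2, the
    calculation (C) of task t-1, and the load (L) of task t.
    """
    slices = []
    n = len(tasks)
    for t in range(n + 2):  # pipeline drains two slices after the last load
        slot = []
        if 0 <= t - 2 < n:
            slot.append("S" + tasks[t - 2])
        if 0 <= t - 1 < n:
            slot.append("C" + tasks[t - 1])
        if t < n:
            slot.append("L" + tasks[t])
        slices.append(slot)
    return slices

# Four operation tasks a..d, as in the example above.
print(schedule(["a", "b", "c", "d"]))
# [['La'], ['Ca', 'Lb'], ['Sa', 'Cb', 'Lc'], ['Sb', 'Cc', 'Ld'], ['Sc', 'Cd'], ['Sd']]
```

The steady-state slices (such as `['Sa', 'Cb', 'Lc']` and `['Sb', 'Cc', 'Ld']`) repeat the same instruction types, which is exactly what lets them be folded into a loop body replayed by a jump instruction.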
  • This application also discloses a machine learning computing device, which includes one or more of the artificial intelligence computing devices described in this application, and is used to obtain data to be computed and control information from other processing devices and perform specified machine learning computations.
  • the execution result is passed to the peripheral device through the I/O interface.
  • Peripheral devices include, for example, a camera, monitor, mouse, keyboard, network card, Wi-Fi interface, or server.
  • When multiple artificial intelligence computing devices are included, they can be linked and transmit data through a specific structure, for example, interconnected via a PCIE bus, to support larger-scale machine learning operations.
  • the same control system can be shared, or there can be separate control systems; the memory can be shared, or each accelerator can have its own memory.
  • the interconnection mode can be any interconnection topology.
  • the machine learning computing device has high compatibility and can be connected to various types of servers through the PCIE interface.
  • the application also discloses a combined processing device, which includes the above-mentioned machine learning computing device, a universal interconnection interface, and other processing devices.
  • the machine learning computing device interacts with other processing devices to jointly complete operations specified by the user.
  • Figure 2-3 is a schematic diagram of the combined processing device.
  • Other processing devices include one or more types of general-purpose/special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processor, etc.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the machine learning computing device and external data and control, performing data handling and basic control such as starting and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • the universal interconnection interface is used to transmit data and control instructions between the machine learning computing device and other processing devices.
  • The machine learning computing device obtains the required input data from other processing devices and writes it to the on-chip storage device of the machine learning computing device; it can obtain control instructions from other processing devices and write them to the on-chip control cache of the machine learning computing device; it can also read the data in the storage module of the machine learning computing device and transmit it to other processing devices.
  • the structure is shown in FIGS. 2-4, and may also include a storage device, which is respectively connected to the machine learning computing device and the other processing device.
  • the storage device is used to store data in the machine learning operation device and the other processing device, and is especially suitable for data that cannot be fully stored in the internal storage of the machine learning operation device or other processing device.
  • The combined processing device can be used as an SoC (system on chip) for mobile phones, robots, drones, video surveillance equipment, and other equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
  • In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, monitor, mouse, keyboard, network card, or Wi-Fi interface.
  • a chip is also disclosed, which includes the aforementioned machine learning computing device or combination processing device.
  • a chip packaging structure which includes the aforementioned chip.
  • A board card is also disclosed, which includes the chip packaging structure described above. Referring to Figures 2-5, a board card is provided; in addition to the chip 589 described above, the board card may also include other supporting components, including but not limited to: a storage device 590, an interface device 591, and a control device 592.
  • the storage device 590 is connected to the chip in the chip packaging structure through a bus for storing data.
  • The storage device may include multiple groups of storage units 593, and each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
  • In one embodiment, the storage device may include four groups of storage units, and each group may include multiple DDR4 particles (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers; of the 72 bits, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 particles are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
  • each group of the storage unit includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transmit data twice in one clock cycle.
  • a controller for controlling the DDR is provided in the chip for controlling the data transmission and data storage of each storage unit.
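The 25600 MB/s figure above follows from the DDR4-3200 transfer rate and the 64 data bits of the 72-bit bus. A quick check of the arithmetic (an interpretation of the stated numbers, not part of the disclosure):

```python
# DDR4-3200: 3200 mega-transfers per second; only 64 of the 72 bus bits
# carry payload data (the remaining 8 bits are used for ECC checking).
transfers_per_second = 3200e6
data_bytes_per_transfer = 64 // 8  # 8 bytes of payload per transfer

bandwidth_mb_s = transfers_per_second * data_bytes_per_transfer / 1e6
print(bandwidth_mb_s)  # 25600.0 MB/s per group of storage units
```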
  • the interface device is electrically connected with the chip in the chip packaging structure.
  • the interface device is used to implement data transmission between the chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
  • The interface device may also be another interface. This application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function.
  • the calculation result of the chip is still transmitted by the interface device back to an external device (such as a server).
  • the control device is electrically connected with the chip.
  • the control device is used to monitor the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, MCU).
  • The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the chip can be in different working states such as heavy load and light load.
  • The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
  • In some embodiments, an electronic device is provided, which includes the above board card.
  • Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the transportation means include airplanes, ships and/or vehicles;
  • the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • The medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound instruments, and/or electrocardiographs.
  • the artificial intelligence processing device can load the offline model corresponding to the neural network, and realize different neural network tasks by running the offline model.
  • The operating environments of artificial intelligence processing devices differ; for example, the version of the loaded runtime library differs, so an artificial intelligence processing device can only run an offline model of the corresponding version. If the artificial intelligence processing device cannot update its runtime library in time, it will not be able to run a higher-version offline model.
  • The artificial intelligence processing device in this application may include a server, a smart phone (such as an Android phone, iOS phone, or Windows Phone phone), a tablet computer, a handheld computer, a desktop computer, a notebook computer, a Mobile Internet Device (MID), or a wearable device.
  • The above devices are only examples, not exhaustive; artificial intelligence processing devices include but are not limited to those listed above.
  • The processor of the artificial intelligence processing device may include a general-purpose processor and an artificial intelligence processor.
  • The general-purpose processor may include one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), and/or an image processing unit (IPU).
  • The artificial intelligence processor includes a machine learning processing unit (MLU); multiple MLUs may be integrated to form a multi-core artificial intelligence processor.
  • FIG. 3-1 is a schematic flowchart of an offline model processing method provided by an embodiment of the application, and the method is applied to an artificial intelligence processing device. Specifically, the method includes the content shown in steps S101 to S102, wherein:
  • Step S101: Obtain the version information of the runtime library used to run the offline model, and the model information of the offline model.
  • the offline model includes compiled binary machine instructions, which can be directly run on the artificial intelligence processor.
  • the offline model may include model structure information, weight data, input and output data, and the offline model may also include model information such as version information of the offline model, version information of machine learning processing instructions, etc., which are not limited herein.
  • the model structure information may include the layer structure corresponding to the neural network model.
  • For example, the offline model includes a convolutional layer, a normalization layer, a scale layer, and a fully connected layer.
  • the weight data includes weights corresponding to each layer.
  • the input and output data may include the input and output data scale, for example, the input size of the image data is 50mm*50mm, and the pixel value range is (-1024, 3071).
  • the input and output data may also include input and output quantity information, that is, define several input data, several output data, and so on.
  • The offline model can be generated through a machine learning library (Machine Learning Library) and a runtime library (Runtime Library).
  • The machine learning library and the runtime library package the data and instructions used to execute neural network model calculations to generate an offline model.
  • the machine learning library is also used to accelerate various machine learning or deep learning algorithms on the artificial intelligence processor.
  • the machine learning library provides a set of efficient, general, flexible, and extensible programming interfaces.
  • The upper-layer machine learning applications can directly use the programming interfaces of various programming frameworks (such as TensorFlow, Caffe, MXNet, etc.), or program directly against the interfaces provided by the machine learning library.
  • the runtime library is also used to complete the interaction between the general-purpose processor and the artificial intelligence processor.
  • The runtime library provides a set of interfaces for the artificial intelligence processor. This application does not limit the interfaces of the runtime library; for example, the interface for loading an offline model can be called to make the artificial intelligence processor load the offline model.
  • the runtime library can be separated from the machine learning library and use the offline model file alone to complete the calculation of the neural network.
  • the mobile phone only includes a runtime library.
  • the offline model included in the artificial intelligence application is run through the runtime library in the mobile phone.
  • the offline model in the artificial intelligence application is generated through the runtime library and the machine learning library in the artificial intelligence processing device on the development end, and then the offline model is packaged into the artificial intelligence application through the runtime library.
  • the version information of the runtime library that runs the offline model is the version information of the runtime library of other artificial intelligence processing devices that are to run the offline model.
  • Step S102 According to the model information and the version information, call the function set corresponding to the version information in the machine learning library to generate an offline model corresponding to the version information.
  • the function set may include different arithmetic processing function sets.
  • the function set includes a general operator set and a function operator set.
  • The set of general operators includes the operation processing functions required by every version of the offline model, such as addition, multiplication, and other common operation processing functions.
  • The set of function operators includes the operation processing functions required by a specified version of the offline model, such as convolution, vector inner product, sorting, and other less common operation processing functions.
  • The function sets corresponding to different version information can be defined in advance. For example, the function set corresponding to version information 1 is the first function set, the function set corresponding to version information 2 is the second function set, and the function set corresponding to version information 3 is the third function set. If the version information is 2, the second function set in the machine learning library is called; combining the second function set with the model information of the offline model generates the offline model corresponding to the version information, thereby improving the generation efficiency of the offline model.
  • This application does not limit the method of generating the offline model corresponding to the version information.
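A minimal sketch of the version-to-function-set dispatch described above; the dictionary, function name, and return format are illustrative assumptions, not the actual machine learning library API:

```python
# Pre-defined mapping from runtime-library version information to function
# sets, as in the example above (names are hypothetical placeholders).
FUNCTION_SETS = {
    1: "first_function_set",
    2: "second_function_set",
    3: "third_function_set",
}

def generate_offline_model(model_info, version_info):
    """Pick the function set matching the target runtime-library version
    and combine it with the model information (structure, weights, I/O)."""
    function_set = FUNCTION_SETS[version_info]
    return {"version": version_info,
            "function_set": function_set,
            "model_info": model_info}

model = generate_offline_model({"layers": ["conv", "fc"]}, 2)
print(model["function_set"])  # second_function_set
```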
  • In a possible example, calling the function set in the machine learning library corresponding to the version information to generate an offline model corresponding to the version information includes: calling, through an interface function, the function set in the machine learning library corresponding to the version information; and generating an offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
  • the machine learning library includes interface functions, and the interface functions are used to call function sets corresponding to different version information.
  • For example, the function set corresponding to version information 1 is the first function set, the function set corresponding to version information 2 is the second function set, and the function set corresponding to version information 3 is the third function set. If the version information is 2, the second function set is called through an interface function (for example, cnrtGetModelLevelFromFile()), so that the offline model corresponding to the version information can be generated based on the second function set and the model information of the offline model.
  • In a possible example, invoking the set of functions in the machine learning library corresponding to the version information to generate an offline model corresponding to the version information includes: calling, through an environment variable, the function set corresponding to the version information; and generating an offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
  • the machine learning library includes environment variables, and the environment variables are used to call function sets corresponding to different version information.
  • the function set corresponding to each version information is called through the environment variable.
  • In this way, the function set corresponding to the target version can be directly called through the environment variable to improve the efficiency of generating the offline model.
  • For example, the function set corresponding to version information 1 is the first function set, the function set corresponding to version information 2 is the second function set, and the function set corresponding to version information 3 is the third function set, where the environment variable records the corresponding version information.
  • If the version information is 2, the function set in the machine learning library corresponding to the version information can be called according to the model information and the version information to generate an offline model corresponding to the version information.
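A hedged sketch of the environment-variable variant; the variable name `ML_TARGET_VERSION` and the selection logic are assumptions for illustration only, not the library's real environment variable:

```python
import os

FUNCTION_SETS = {1: "first_function_set",
                 2: "second_function_set",
                 3: "third_function_set"}

def select_function_set():
    """Read the target runtime-library version from an environment variable
    and return the matching function set (variable name is illustrative)."""
    version = int(os.environ.get("ML_TARGET_VERSION", "3"))
    return FUNCTION_SETS[version]

os.environ["ML_TARGET_VERSION"] = "2"
print(select_function_set())  # second_function_set
```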
  • the embodiment of the present application can generate corresponding offline models according to different versions of the runtime library running the offline models, which can improve the applicability of the generated offline models.
  • For example, suppose the version information of the machine learning library in the developer's artificial intelligence processing device is 8, the version information of its runtime library is 5, and the version information of the runtime library in the client is 2. It can be seen that if the version of the client's runtime library is lower than that of the developer's artificial intelligence processing device, the client cannot directly run an offline model corresponding to the latest runtime library version of that device.
  • With this application, an offline model corresponding to the version information of the client's runtime library can be generated. Since the version information of the client's runtime library equals the version information of the newly generated offline model, the client's runtime library can run the newly generated offline model, which improves the applicability of the generated offline model.
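The version-compatibility rule in the example above can be sketched as follows; the `compatible` predicate encodes the assumed rule that a runtime library can run offline models of its own or a lower version, which is an interpretation of the text rather than the library's actual check:

```python
def compatible(model_version, client_runtime_version):
    """Assumed rule: a runtime library can run offline models whose
    version does not exceed the runtime library's own version."""
    return model_version <= client_runtime_version

dev_runtime_version = 5     # developer-side runtime library (example above)
client_runtime_version = 2  # client-side runtime library

# A model built for the developer's latest runtime cannot run on the client:
assert not compatible(dev_runtime_version, client_runtime_version)

# Regenerating the offline model for the client's version makes it runnable:
regenerated_version = client_runtime_version
assert compatible(regenerated_version, client_runtime_version)
```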
  • the method further includes: running the offline model based on the runtime library corresponding to the version information.
  • The offline model is run using the runtime library corresponding to the version information; since the version information is consistent, the offline model can be used normally to complete the heterogeneous calculation of the neural network.
  • Figure 3-2 is a schematic flowchart of another offline model processing method provided by an embodiment of the application.
  • The method is applied to an artificial intelligence processing device. Specifically, the method includes the content shown in steps S3201 to S3202, wherein:
  • Step S3201: When the version information of the runtime library running the offline model is not given, obtain the model information of the offline model.
  • For the model information of the offline model, refer to the description of step S101, which will not be repeated here.
  • Step S3202: Call the function set corresponding to the latest version information of the runtime library in the machine learning library to generate an offline model corresponding to the latest version information of the runtime library.
  • the description of the function set can also refer to the description of step S102, which will not be repeated here.
  • the latest version information of the runtime library is the highest version of the runtime library in the current artificial intelligence processing device.
  • the method of generating the offline model corresponding to the latest version information of the runtime library is not limited, and the method of generating the offline model corresponding to the version information can be referred to, which will not be repeated here.
  • With the offline model processing method shown in Figure 3-2, when the version information of the runtime library running the offline model is not obtained, the function set corresponding to the latest version information of the runtime library in the machine learning library is called directly to generate an offline model corresponding to that version information. That is to say, by default, the offline model corresponding to the latest-version runtime library is generated directly, and this offline model can be used to improve operating efficiency.
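The default behavior of steps S3201–S3202 can be sketched as a simple fallback; the list of available versions and the function name are hypothetical:

```python
RUNTIME_VERSIONS = [1, 2, 3]  # runtime-library versions available on this device

def resolve_target_version(requested=None):
    """Use the requested version if given; otherwise fall back to the
    latest runtime-library version available (steps S3201-S3202)."""
    return requested if requested is not None else max(RUNTIME_VERSIONS)

print(resolve_target_version())   # 3 (no version given: default to latest)
print(resolve_target_version(2))  # 2 (explicit version requested)
```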
  • FIG. 3-3 shows a block diagram of a possible functional unit composition of the offline model processing device 300 involved in the above embodiment.
  • the offline model processing device 300 includes:
  • the obtaining unit 301 is configured to obtain the version information of the runtime library that runs the offline model, and the model information of the offline model;
  • the generating unit 302 is configured to call the function set corresponding to the version information in the machine learning library according to the model information and the version information to generate an offline model corresponding to the version information.
  • In a possible example, the machine learning library includes an interface function, and the interface function is used to call the function sets corresponding to different version information. In terms of calling, according to the model information and the version information, the function set in the machine learning library corresponding to the version information to generate an offline model corresponding to the version information, the generating unit 302 is specifically configured to: call, through the interface function, the function set in the machine learning library corresponding to the version information; and generate an offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
  • In a possible example, the machine learning library includes an environment variable, and the environment variable is used to call the function sets corresponding to different version information. In terms of calling, according to the model information and the version information, the function set in the machine learning library corresponding to the version information to generate an offline model corresponding to the version information, the generating unit 302 is specifically configured to: call, through the environment variable, the function set in the machine learning library corresponding to the version information; and generate an offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
  • In a possible example, the generating unit 302 is also configured to, when the version information of the runtime library running the offline model is not given, call, according to the model information of the offline model, the function set in the machine learning library corresponding to the latest version information of the runtime library to generate an offline model corresponding to the latest version information of the runtime library.
  • the model information includes: model structure information, weight data, and input and output data.
  • the function set includes: a general operator set and a function operator set.
  • the device 300 further includes:
  • the running unit 303 is configured to run the offline model based on the runtime library corresponding to the version information.
  • Figure 3-4 is a schematic structural diagram of an artificial intelligence processing device provided by an embodiment of the application.
  • The artificial intelligence processing device includes a processor, a memory, a communication interface, and one or more programs.
  • the aforementioned processors include general-purpose processors and artificial intelligence processors.
  • The foregoing one or more programs are different from the foregoing one or more application programs; the foregoing one or more programs are stored in the foregoing memory and configured to be executed by the foregoing processor, and the foregoing programs include instructions for executing the following steps: obtaining the version information of the runtime library running the offline model and the model information of the offline model; and, according to the model information and the version information, calling the function set corresponding to the version information in the machine learning library to generate an offline model corresponding to the version information.
  • In a possible example, the machine learning library includes an interface function, and the interface function is used to call the function sets corresponding to different version information. In terms of calling, according to the model information and the version information, the function set in the machine learning library corresponding to the version information to generate an offline model corresponding to the version information, the above program is specifically used to execute the instructions of the following steps: calling, through the interface function, the function set in the machine learning library corresponding to the version information; and generating an offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
  • In a possible example, the machine learning library includes an environment variable, and the environment variable is used to call the function sets corresponding to different version information. In terms of calling, according to the model information and the version information, the function set in the machine learning library corresponding to the version information to generate an offline model corresponding to the version information, the above program is specifically used to execute the instructions of the following steps: calling, through the environment variable, the function set in the machine learning library corresponding to the version information; and generating an offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
  • In a possible example, the above program is also used to execute the instructions of the following step: when the version information of the runtime library running the offline model is not given, calling, according to the model information of the offline model, the function set in the machine learning library corresponding to the latest version information of the runtime library to generate an offline model corresponding to the latest version information of the runtime library.
  • the model information includes: model structure information, weight data, and input and output data.
  • the function set includes: a general operator set and a function operator set.
  • In a possible example, the above program is also used to execute the instructions of the following step: running the offline model based on the runtime library corresponding to the version information.
  • The embodiments of the present application also provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement part or all of the steps of any offline model processing method described in the foregoing method embodiments.
  • The embodiments of the present application also provide a computer program product; the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps of any offline model processing method described in the foregoing method embodiments.
  • An embodiment of the present application also provides a combined processing device, which includes the aforementioned offline model processing device, a universal interconnection interface, and other processing devices.
  • the processing device of the offline model interacts with the other processing devices to jointly complete the operation specified by the user.
  • Figure 3-5 is a schematic diagram of the combined processing device.
  • Other processing devices include one or more types of general-purpose/special-purpose processors, such as a CPU, a GPU, and a neural network processor.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the processing device of the offline model and external data and control, performing data transfer and basic control of the processing device of the offline model such as starting and stopping; other processing devices can also cooperate with the processing device of the offline model to complete computing tasks.
  • the universal interconnection interface is used to transmit data and control instructions between the processing device of the offline model and other processing devices.
  • The processing device of the offline model obtains the required input data from other processing devices and writes it to the on-chip storage device of the processing device of the offline model; it can obtain control instructions from other processing devices and write them to the on-chip control cache of the processing device of the offline model; it can also read the data in the storage module of the processing device of the offline model and transmit it to other processing devices.
  • the combined processing device may further include a storage device, and the storage device is respectively connected to the processing device of the offline model and the other processing device.
  • The storage device is used to store the data in the processing device of the offline model and the other processing devices, and is especially suitable for data to be computed that cannot be fully saved in the internal storage of the processing device of the offline model or of the other processing devices.
  • the combined processing device can be used as an on-chip system for mobile phones, robots, drones, video surveillance equipment and other equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
  • the universal interconnection interface of the combined processing device is connected to some parts of the equipment. Some components such as camera, monitor, mouse, keyboard, network card, wireless fidelity (Wireless-Fidelity, Wi-Fi) interface.
  • The disclosed device may be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • The division into units is only a logical functional division; there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The mutual coupling, direct coupling, or communication connections shown or discussed may be indirect coupling or communication connections through interfaces, devices, or units, and may be electrical or in other forms.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • The units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above-mentioned integrated unit can be implemented in the form of hardware or of a software program module.
  • If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
  • The program can be stored in a computer-readable memory, and the memory can include a flash disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, etc.

Abstract

本申请提供了一种人工智能计算装置及相关产品，该人工智能计算装置用于执行机器学习计算。本申请实施例针对构成循环体的两个以上指令集中的指令，对重复的指令复用操作码存储区域中的同一操作码，节省操作码的存储空间，可缩减第二时间片中的指令集中各指令的代码量，也可节省指令存储空间，提高运算效率。

Description

人工智能计算装置及相关产品 技术领域
本申请涉及信息处理技术领域,具体涉及一种人工智能计算装置及相关产品。
背景技术
人工神经网络是一种功能强大的算法，近年来被应用于图像、语言等各种领域。而人工智能计算装置的出现可以使神经网络得到硬件的支持，更高效地进行计算。人工智能计算装置一般有自己的指令集，指令集中会包含较多的待执行指令，执行指令集中的所有指令耗时较长，效率受到影响，也会包含重复执行的指令，例如，在进行数据加载的过程中，若数据规模较大，则需要多次搬运才能完成地址空间转换，又例如，模板运算中重复的加法乘法运算等。这类重复计算在正常的操作中是直接进行展开计算，每一指令会对应一段执行代码，重复的指令对应的代码会占用较多的存储空间。
发明内容
本申请实施例提供了一种人工智能计算装置及相关产品,可减少指令的指令信息的代码量,提高指令计算效率。
第一方面,提供一种人工智能计算装置,所述人工智能计算装置包括控制器单元和执行单元;其中,
所述控制器单元,用于获取待执行的第一指令集;以及,获取第二指令集;
所述控制器单元,还用于确定所述第一指令集与所述第二指令集之间是否构成循环体;
所述执行单元,用于在所述第一指令集与所述第二指令集之间构成循环体时,根据所述第一指令集的指令信息执行所述第二指令集中的指令。
第二方面,本申请实施例提供了一种人工智能计算方法,应用于人工智能计算装置,所述方法包括:
获取待执行的第一指令集;以及,获取第二指令集;
确定所述第一指令集与所述第二指令集之间是否构成循环体;
在所述第一指令集与所述第二指令集之间构成循环体时,根据所述第一指令集的指令信息执行所述第二指令集中的指令。
第三方面，本申请实施例提供了一种机器学习运算装置，该机器学习运算装置包括一个或者多个第一方面所述的人工智能计算装置。该机器学习运算装置用于从其他处理装置中获取待运算数据和控制信息，并执行指定的机器学习运算，将执行结果通过I/O接口传递给外围设备；
当所述机器学习运算装置包含多个所述计算装置时,所述多个所述计算装置间可以通过特定的结构进行链接并传输数据;
其中，多个所述计算装置通过PCIE总线进行互联并传输数据，以支持更大规模的机器学习的运算；多个所述计算装置共享同一控制系统或拥有各自的控制系统；多个所述计算装置共享内存或者拥有各自的内存；多个所述计算装置的互联方式是任意互联拓扑。
第四方面,本申请实施例提供了一种组合处理装置,该组合处理装置包括如第三方面所述的机器学习运算装置、通用互联接口,和其他处理装置。该机器学习运算装置与上述其他处理装置进行交互,共同完成用户指定的操作。该组合处理装置还可以包括存储装置,该存储装置分别与所述机器学习运算装置和所述其他处理装置连接,用于保存所述机器学习运算装置和所述其他处理装置的数据。
第五方面,本申请实施例提供了一种神经网络芯片,该神经网络芯片包括上述第一方面所述的计算装置、上述第三方面所述的机器学习运算装置或者上述第四方面所述的组合处理装置。
第六方面,本申请实施例提供了一种神经网络芯片封装结构,该神经网络芯片封装结构包括上述第五方面所述的神经网络芯片;
第七方面,本申请实施例提供了一种板卡,该板卡包括上述第六方面所述的神经网络芯片封装结构。
第八方面,本申请实施例提供了一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如第二方面所述的方法步骤。
第九方面,本申请实施例提供了一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如第二方面所述的方法步骤。
第十方面,本申请实施例提供了一种电子装置,该电子装置包括上述第五方面所述的神经网络芯片或者上述第七方面所述的板卡。
第十一方面,提供一种人工智能计算装置,所述人工智能计算装置包括控制器单元、存储单元和执行单元;所述存储单元连接外部存储装置,所述执行单元包括加载执行单元、计算执行单元和存储执行单元;
所述控制器单元,用于获取待执行的第一指令集,所述第一指令集包含第一加载指令、第一计算指令和第一存储指令;确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间是否存在关联关系,若所述第一加载指令、所述第一计算指令和所述第一存储指令之间不存在关联关系,将所述第一加载指令、所述第一计算指令和所述第一存储指令发送至所述执行单元;
所述执行单元,用于在第一时间片内并行执行所述第一加载指令、所述第一计算指令和所述第一存储指令;其中,所述存储执行单元用于根据所述第一存储指令将第一运算任务中第一输入数据对应的第一计算结果从所述存储单元传输至所述外部存储装置,所述计算执行单元用于根据所述第一计算指令对第二运算任务中第二输入数据进行计算,得到第二计算结果;所述加载执行单元用于根据所述第一加载指令将第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元。
可选地,在所述确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间是否存在关联关系方面,所述控制器单元具体用于:
依据所述第一加载指令提取所述第一加载指令中所需数据的第一存储地址区间,依据所述第一计算指令提取所述第一计算指令中所需数据的第二存储地址区间,依据所述第一存储指令提取所述第一存储指令中所需数据的第三存储地址区间,若所述第一存储地址区间、所述第二存储地址区间和所述第三存储地址区间两两之间不具有重叠的区域,确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间不存在关联关系。
可选地,在所述确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间是否存在关联关系方面,所述控制器单元具体用于:
依据所述第一加载指令提取所述第一加载指令对应的第一写入区域,依据所述第一计算指令提取所述第一计算指令对应的第一读取区域和第二写入区域,依据所述第一存储指令提取所述第一存储指令对应的第二读取区域;
若所述第一写入区域、所述第一读取区域、所述第二写入区域和所述第二读取区域之间均不存在重叠区域,确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间不存在关联关系。
可选地,所述人工智能计算装置还包括存储单元,所述存储单元连接外部存储装置,所述执行单元包括加载执行单元、计算执行单元和存储执行单元,在所述在第一时间片内并行执行所述第一加载指令、所述第一计算指令和所述第一存储指令方面,所述存储执行单元用于根据所述第一存储指令将第一运算任务中第一输入数据对应的第一计算结果从所述存储单元传输至所述外部存储装置,所述计算执行单元用于根据所述第一计算指令对第二运算任务中第二输入数据进行计算,得到第二计算结果;所述加载执行单元用于根据所述第一加载指令将第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元。
可选地,所述存储单元包括第一存储区域和第二存储区域,在所述根据所述第一加载指令将第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元方面,所述加载执行单元具体用于:
在所述第一时间片内根据所述第一加载指令将所述第三运算任务中的第三输入数据进行乒乓操作从所述外部存储装置传输至所述第一存储区域;
所述执行单元在第一时间片内并行执行所述第一加载指令、所述第一计算指令和所述第一存储指令后,所述控制器单元还用于:
获取第二指令集,所述第二指令集包含第二加载指令、第二计算指令和第二存储指令,所述第二存储指令为用于将所述第二计算结果从所述存储单元传输至所述外部存储装置的指令,所述第二计算指令为用于对所述第三运算任务中的所述第三输入数据进行计算并得到第三计算结果的指令,所述第二加载指令为将第四运算任务中的第四输入数据从所述外部存储装置传输至所述存储单元的指令;
所述执行单元,还用于在第二时间片内并行执行所述第二加载指令、所述第二计算指令和所述第二存储指令,所述第二时间片晚于所述第一时间片;
其中，所述存储执行单元用于根据所述第二存储指令将所述第二计算结果从所述存储单元传输至所述外部存储装置；所述计算执行单元用于在所述第二时间片内根据所述第二计算指令从所述第一存储区域获取所述第三输入数据，并根据所述第三输入数据进行计算，得到所述第三计算结果；所述加载执行单元用于在所述第二时间片内根据所述第二加载指令将所述第四输入数据进行乒乓操作从所述外部存储装置传输至所述第二存储区域。
可选地,所述第三输入数据包括多个第三输入子数据,在将所述第三运算任务中的第三输入数据进行乒乓操作从所述外部存储装置传输至所述第一存储区域方面,所述加载执行单元具体用于:
预估所述多个第三输入子数据中每一第三输入子数据在所述第一存储区域的目标存储时长,得到多个目标存储时长;
按照存储时长从大到小的顺序将所述多个目标存储时长对应的所述多个第三输入子数据传输至第一存储区域,并从所述第一存储区域的两端存储至中间。
可选地,在所述控制器单元获取第二加载指令、第二计算指令和第二存储指令之后,所述执行单元在第二时间片内并行执行所述第二加载指令、所述第二计算指令和所述第二存储指令之前,所述控制器单元还用于:
确定所述第一指令集与所述第二指令集之间是否构成循环体;
若所述第一指令集与所述第二指令集之间构成循环体,根据跳转指令跳转至所述第一指令集对应的指令的操作码存储区域,从所述操作码存储区域获取所述第一加载指令的操作码,将所述操作码作为所述第二加载指令的操作码,并获取所述第二加载指令对应的操作域,其中,所述操作码包括所述第一计算指令的标识;所述操作域包括所述第四输入数据的存储地址。
可选地,在所述确定所述第一指令集与所述第二指令集之间是否构成循环体方面,所述控制器单元具体用于:
获取所述第一指令集和所述第二指令集中每一指令对应的预设指令信息,得到多个预设指令信息,所述预设指令信息包括以下至少一种:指令类型、剩余执行次数、是否奇偶性翻转;
将所述第一加载指令对应的第一预设指令信息与所述第二加载指令对应的第二预设指令信息进行比对;将所述第一计算指令对应的第三预设指令信息与所述第二计算指令对应的第四预设指令信息进行比对;将所述第一存储指令对应的第五预设指令信息与所述第二存储指令对应的第六预设指令信息进行比对;
若所述第一预设指令信息与所述第二预设指令信息之间仅存在操作次数的差异,所述第三预设指令信息与第四预设指令信息之间仅存在操作次数的差异,且所述第五预设指令信息与所述第六预设指令信息之间仅存在操作次数的差异,确定所述第一指令集与所述第二指令集之间构成循环体。
第十二方面,本申请实施例提供了一种人工智能计算方法,应用于人工智能计算装置,所述人工智能计算装置包括控制器单元、存储单元和执行单元;所述存储单元连接外部存储装置,所述执行单元包括加载执行单元、计算执行单元和存储执行单元;所述方法包括:
所述控制器单元获取待执行的第一指令集，所述第一指令集包含第一加载指令、第一计算指令和第一存储指令；确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间是否存在关联关系，若所述第一加载指令、所述第一计算指令和所述第一存储指令之间不存在关联关系，将所述第一加载指令、所述第一计算指令和所述第一存储指令发送至所述执行单元；
所述执行单元在第一时间片内并行执行所述第一加载指令、所述第一计算指令和所述第一存储指令;其中,所述存储执行单元根据所述第一存储指令将第一运算任务中第一输入数据对应的第一计算结果从所述存储单元传输至所述外部存储装置,所述计算执行单元根据所述第一计算指令对第二运算任务中第二输入数据进行计算,得到第二计算结果;所述加载执行单元根据所述第一加载指令将第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元。
可选地,所述确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间是否存在关联关系,包括:
依据所述第一加载指令提取所述第一加载指令中所需数据的第一存储地址区间,依据所述第一计算指令提取所述第一计算指令中所需数据的第二存储地址区间,依据所述第一存储指令提取所述第一存储指令中所需数据的第三存储地址区间;
若所述第一存储地址区间、所述第二存储地址区间和所述第三存储地址区间两两之间不具有重叠的区域,确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间不存在关联关系。
可选地,所述确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间是否存在关联关系,包括:
依据所述第一加载指令提取所述第一加载指令对应的第一写入区域,依据所述第一计算指令提取所述第一计算指令对应的第一读取区域和第二写入区域,依据所述第一存储指令提取所述第一存储指令对应的第二读取区域;
若所述第一写入区域、所述第一读取区域、所述第二写入区域和所述第二读取区域之间均不存在重叠区域,确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间不存在关联关系。
可选地,所述人工智能计算装置包括存储单元,所述存储单元连接外部存储装置,所述在第一时间片内并行执行所述第一加载指令、所述第一计算指令和所述第一存储指令,包括:
根据所述第一存储指令将第一运算任务中第一输入数据对应的第一计算结果从所述存储单元传输至所述外部存储装置;
根据所述第一计算指令对第二运算任务中第二输入数据进行计算,得到第二计算结果;
根据所述第一加载指令将第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元。
可选地,所述存储单元包括第一存储区域和第二存储区域,所述根据所述第一加载指令将第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元,包括:
在所述第一时间片内根据所述第一加载指令将所述第三运算任务中的第三输入数据进行乒乓操作从所述外部存储装置传输至所述第一存储区域;
在第一时间片内并行执行所述第一加载指令、所述第一计算指令和所述第一存储指令后,所述方法还包括:
获取第二指令集,所述第二指令集包含第二加载指令、第二计算指令和第二存储指令,所述第二存储指令为用于将所述第二计算结果从所述存储单元传输至所述外部存储装置的指令,所述第二计算指令为用于对所述第三运算任务中的所述第三输入数据进行计算并得到第三计算结果的指令,所述第二加载指令为将第四运算任务中的第四输入数据从所述外部存储装置传输至所述存储单元的指令;
在第二时间片内并行执行所述第二加载指令、所述第二计算指令和所述第二存储指令,所述第二时间片晚于所述第一时间片;其中,所述存储执行单元用于根据所述第二存储指令将所述第二计算结果从所述存储单元传输至所述外部存储装置;所述计算执行单元用于在所述第二时间片内根据所述第二计算指令从所述第一存储区域获取所述第三输入数据,并根据所述第三输入数据进行计算,得到所述第三计算结果;所述加载执行单元用于在所述第二时间片内根据所述第二加载指令将所述第四输入数据进行乒乓操作从所述外部存储装置传输至所述第二存储区域。
可选地,所述第三输入数据包括多个第三输入子数据,所述将所述第三运算任务中的第三输入数据进行乒乓操作从所述外部存储装置传输至所述第一存储区域,包括:
预估所述多个第三输入子数据中每一第三输入子数据在所述第一存储区域的目标存储时长,得到多个目标存储时长;
按照存储时长从大到小的顺序将所述多个目标存储时长对应的所述多个第三输入子数据传输至第一存储区域,并从所述第一存储区域的两端存储至中间。
可选地,在所述控制器单元获取第二加载指令、第二计算指令和第二存储指令之后,所述执行单元在第二时间片内并行执行所述第二加载指令、所述第二计算指令和所述第二存储指令之前,所述方法还包括:
确定所述第一指令集与所述第二指令集之间是否构成循环体;
若所述第一指令集与所述第二指令集之间构成循环体，根据跳转指令跳转至所述第一指令集对应的指令的操作码存储区域，从所述操作码存储区域获取所述第一加载指令的操作码，将所述操作码作为所述第二加载指令的操作码，并获取所述第二加载指令对应的操作域，其中，所述操作码包括所述第一计算指令的标识；所述操作域包括所述第四输入数据的存储地址。
可选地,所述确定所述第一指令集与所述第二指令集之间是否构成循环体,包括:
获取所述第一指令集和所述第二指令集中每一指令对应的预设指令信息,得到多个预设指令信息,所述预设指令信息包括以下至少一种:指令类型、剩余执行次数、是否奇偶性翻转;
将所述第一加载指令对应的第一预设指令信息与所述第二加载指令对应的第二预设指令信息进行比对;将所述第一计算指令对应的第三预设指令信息与所述第二计算指令对应的第四预设指令信息进行比对;以及将所述第一存储指令对应的第五预设指令信息与所述第二存储指令对应的第六预设指令信息进行比对;
若所述第一预设指令信息与所述第二预设指令信息之间仅存在操作次数的差异,所述第三预设指令信息与第四预设指令信息之间仅存在操作次数的差异,且所述第五预设指令信息与所述第六预设指令信息之间仅存在操作次数的差异,确定所述第一指令集与所述第二指令集之间构成循环体。
第十三方面，本申请实施例提供了一种机器学习运算装置，该机器学习运算装置包括一个或者多个第十一方面所述的人工智能计算装置。该机器学习运算装置用于从其他处理装置中获取待运算数据和控制信息，并执行指定的机器学习运算，将执行结果通过I/O接口传递给外围设备；
当所述机器学习运算装置包含多个所述计算装置时,所述多个所述计算装置间可以通过特定的结构进行链接并传输数据;
其中，多个所述计算装置通过PCIE总线进行互联并传输数据，以支持更大规模的机器学习的运算；多个所述计算装置共享同一控制系统或拥有各自的控制系统；多个所述计算装置共享内存或者拥有各自的内存；多个所述计算装置的互联方式是任意互联拓扑。
第十四方面,本申请实施例提供了一种组合处理装置,该组合处理装置包括如第十三方面所述的机器学习运算装置、通用互联接口,和其他处理装置。该机器学习运算装置与上述其他处理装置进行交互,共同完成用户指定的操作。该组合处理装置还可以包括存储装置,该存储装置分别与所述机器学习运算装置和所述其他处理装置连接,用于保存所述机器学习运算装置和所述其他处理装置的数据。
第十五方面,本申请实施例提供了一种神经网络芯片,该神经网络芯片包括上述第一方面所述的计算装置、上述第十三方面所述的机器学习运算装置或者上述第十四方面所述的组合处理装置。
第十六方面,本申请实施例提供了一种神经网络芯片封装结构,该神经网络芯片封装结构包括上述第十五方面所述的神经网络芯片;
第十七方面,本申请实施例提供了一种板卡,该板卡包括上述第十六方面所述的神经网络芯片封装结构。
第十八方面,本申请实施例提供了一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如第十二方面所述的方法步骤。
第十九方面,本申请实施例提供了一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如第十二方面所述的方法步骤。
第二十方面,本申请实施例提供了一种电子装置,该电子装置包括上述第十五方面所述的神经网络芯片或者上述第十七方面所述的板卡。
在一些实施例中,所述电子装置包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
在一些实施例中,所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、 空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
第二十一方面,本申请实施例提供了一种网络离线模型的处理方法,其中:
获取运行离线模型的运行时库的版本信息,以及所述离线模型的模型信息;
根据所述模型信息和所述版本信息,调用机器学习库中与所述版本信息对应的功能集合,生成与所述版本信息对应的离线模型。
结合本发明实施例第二十一方面,在本发明实施例第二十一方面的第一种可能的实现方式中,所述机器学习库中包括接口函数,所述接口函数用于调用不同版本信息对应的功能集合,所述根据所述模型信息和所述版本信息,调用机器学习库中与所述版本信息对应的功能集合,生成与所述版本信息对应的离线模型,包括:
通过所述接口函数调用所述机器学习库中与所述版本信息对应的功能集合;
根据所述版本信息对应的功能集合和所述模型信息,生成与所述版本信息对应的离线模型。
结合本发明实施例第二十一方面,在本发明实施例第二十一方面的第二种可能的实现方式中,所述机器学习库中包括环境变量,所述环境变量用于调用不同版本信息对应的功能集合,所述调用机器学习库中与所述版本信息对应的功能集合,生成与所述版本信息对应的离线模型,包括:
通过所述环境变量调用所述机器学习库中与所述版本信息对应的功能集合;
根据所述版本信息对应的功能集合和所述模型信息,生成与所述版本信息对应的离线模型。
结合本发明实施例第二十一方面,在本发明实施例第二十一方面的第三种可能的实现方式中,所述方法还包括:
当不给出运行离线模型的运行时库的版本信息时,根据离线模型的模型信息,调用机器学习库中与运行时库的最新版本信息对应的功能集合,生成与运行时库的最新版本信息对应的离线模型。
结合本发明实施例第二十一方面,在本发明实施例第二十一方面的第四种可能的实现方式中,所述模型信息包括:模型结构信息、权值数据、输入输出数据。
结合本发明实施例第二十一方面,在本发明实施例第二十一方面的第五种可能的实现方式中,所述功能集合包括:通用算子集合、功能算子集合。
结合本发明实施例第二十一方面,在本发明实施例第二十一方面的第六种可能的实现方式中,所述方法还包括:
基于所述版本信息对应的运行时库,运行所述离线模型。
第二十二方面,本申请实施例提供一种离线模型的处理装置,其中:
获取单元,用于获取运行离线模型的运行时库的版本信息,以及所述离线模型的模型信息;
生成单元,用于根据所述模型信息和所述版本信息,调用机器学习库中与所述版本信息对应的功能集合,生成与所述版本信息对应的离线模型。
结合本发明实施例第二十二方面,在本发明实施例第二十二方面的第一种可能的实现方式中,所述机器学习库中包括接口函数,所述接口函数用于调用不同版本信息对应的功能集合,在所述根据所述模型信息和所述版本信息,调用机器学习库中与所述版本信息对应的功能集合,生成与所述版本信息对应的离线模型方面,所述生成单元,具体用于通过所述接口函数调用所述机器学习库中与所述版本信息对应的功能集合;根据所述版本信息对应的功能集合和所述模型信息,生成与所述版本信息对应的离线模型。
结合本发明实施例第二十二方面,在本发明实施例第二十二方面的第二种可能的实现方式中,在所述根据所述模型信息和所述版本信息,调用机器学习库中与所述版本信息对应的功能集合,生成与所述版本信息对应的离线模型方面,所述生成单元,具体用于通过所述环境变量调用所述机器学习库中与所述版本信息对应的功能集合;根据所述版本信息对应的功能集合和所述模型信息,生成与所述版本信息对应的离线模型。
结合本发明实施例第二十二方面,在本发明实施例第二十二方面的第三种可能的实现方式中,所述生成单元,还用于当不给出运行离线模型的运行时库的版本信息时,根据离线模型的模型信息,调用机器学习库中与运行时库的最新版本信息对应的功能集合,生成与运行时库的最新版本信息对应的离线模型。
结合本发明实施例第二十二方面,在本发明实施例第二十二方面的第四种可能的实现方式中,所述模型信息包括:模型结构信息、权值数据、输入输出数据。
结合本发明实施例第二十二方面,在本发明实施例第二十二方面的第五种可能的实现方式中,所述功能集合包括:通用算子集合、功能算子集合。
结合本发明实施例第二十二方面,在本发明实施例第二十二方面的第六种可能的实现方式中,所述装置还包括:
运行单元,用于基于所述版本信息对应的运行时库,运行所述离线模型。
第二十三方面,本申请实施例提供一种人工智能处理装置,包括处理器、存储器、通信接口以及一个或多个程序,其中,所述一个或多个程序被存储在所述存储器中,并且被配置由所述处理器执行,所述程序包括用于执行如第二十一方面所述的方法。
第二十四方面,本申请实施例提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现如第二十一方面所述的方法。
第二十五方面，本申请实施例提供一种组合处理装置，其特征在于，所述组合处理装置包括如第二十二方面所述的离线模型的处理装置，通用互联接口和其它处理装置；
所述离线模型的处理装置与所述其它处理装置进行交互,共同完成用户指定的计算操作。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1-1是本申请实施例提供的一种人工智能计算装置的结构示意图;
图1-2A是本申请实施例提供的一种人工智能计算方法的流程示意图;
图1-2B是本申请实施例提供的一种并行执行神经网络的指令集中的指令的演示示意图;
图1-2C是本申请实施例提供的一种将指令集中的指令按照树型结构进行排布的演示示意图;
图1-3是本申请实施例提供的一种组合处理装置的结构图;
图1-4是本申请实施例提供的另一种组合处理装置的结构图;
图1-5为本申请实施例提供的一种板卡的结构示意图;
图2-1是本申请实施例提供的一种人工智能计算装置的结构示意图;
图2-2A是本申请实施例提供的一种人工智能计算方法的流程示意图;
图2-2B是本申请实施例提供的一种并行执行神经网络的指令集中的指令的演示示意图;
图2-2C是本申请实施例提供的一种将指令集中的指令按照树型结构进行排布的演示示意图;
图2-3是本申请实施例提供的一种组合处理装置的结构图;
图2-4是本申请实施例提供的另一种组合处理装置的结构图;
图2-5为本申请实施例提供的一种板卡的结构示意图;
图3-1为本申请实施例提供的一种离线模型的处理方法的流程示意图;
图3-2为本申请实施例提供的另一种离线模型的处理方法的流程示意图;
图3-3为本申请实施例提供的一种离线模型的处理装置的结构示意图;
图3-4为本申请实施例提供的一种人工智能处理装置的结构示意图;
图3-5为本申请实施例提供的一种组合处理装置的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象，而不是用于描述特定顺序。此外，术语“包括”和“具有”以及它们任何变形，意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元，而是可选地还包括没有列出的步骤或单元，或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着，结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例，也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是，本文所描述的实施例可以与其它实施例相结合。
首先介绍本申请使用的计算装置。参阅图1-1,提供了一种人工智能计算装置,该人工智能计算装置用于执行机器学习计算,该计算装置包括:控制器单元11、存储单元10和执行单元12,其中,所述存储单元10连接外部存储装置,所述执行单元12包括加载执行单元121、计算执行单元122和存储执行单元123;其中,
所述控制器单元,用于获取待执行的第一指令集;以及,获取第二指令集;
所述控制器单元,还用于确定所述第一指令集与所述第二指令集之间是否构成循环体;
所述执行单元,用于在所述第一指令集与所述第二指令集之间构成循环体时,根据所述第一指令集的指令信息执行所述第二指令集中的指令。
在一个可能的实施例中,在所述根据所述第一指令集的指令信息执行所述第二指令集中的指令方面,所述执行单元具体用于:
根据跳转指令跳转至所述第一指令集中与所述第二指令集中的第二指令对应的第一指令的操作码存储区域,从所述操作码存储区域获取所述第一指令的操作码,将所述操作码作为所述第二指令的操作码,其中,所述操作码包括所述第一指令的标识。
在一个可能的实施例中,所述第一指令集包含第一运算任务的第一加载指令、第一计算指令和第一存储指令;所述第二指令集包含第二运算任务的第二加载指令、第二计算指令和第二存储指令;在所述确定所述第一指令集与所述第二指令集之间是否构成循环体方面,所述控制器单元具体用于:
获取所述第一指令集和所述第二指令集中每一指令对应的预设指令信息,得到多个预设指令信息,所述预设指令信息包括以下至少一种:指令类型、剩余执行次数、是否奇偶性翻转;
将所述第一加载指令对应的第一预设指令信息与所述第二加载指令对应的第二预设指令信息进行比对;将所述第一计算指令对应的第三预设指令信息与所述第二计算指令对应的第四预设指令信息进行比对;将所述第一存储指令对应的第五预设指令信息与所述第二存储指令对应的第六预设指令信息进行比对;
若所述第一预设指令信息与所述第二预设指令信息之间仅存在操作次数的差异,所述第三预设指令信息与第四预设指令信息之间仅存在操作次数的差异,且所述第五预设指令信息与所述第六预设指令信息之间仅存在操作次数的差异,确定所述第一指令集与所述第二指令集之间构成循环体。
在一个可能的实施例中,所述第一指令集包含第一运算任务的第一存储指令,第二运算任务的第二计算指令和第三运算任务对应的第三加载指令;所述第二指令集包含第二运算任务的第二存储指令,第三运算任务的第三计算指令和第四运算任务的第四加载指令;在所述确定所述第一指令集与所述第二指令集之间是否构成循环体方面,所述控制器单元具体用于:
获取所述第一指令集和所述第二指令集中每一指令对应的预设指令信息,得到多个预设指令信息,所述预设指令信息包括以下至少一种:指令类型、剩余执行次数、是否奇偶性翻转;
将所述第一存储指令对应的第五预设指令信息与所述第二存储指令对应的第六预设指令信息进行比对;将所述第二计算指令对应的第七预设指令信息与所述第三计算指令对应的第八预设指令信息进行比对;将所述第三加载指令对应的第九预设指令信息与所述第四加载指令对应的第十预设指令信息进行比对;
若所述第五预设指令信息与所述第六预设指令信息之间仅存在操作次数的差异,所述第七预设指令信息与第八预设指令信息之间仅存在操作次数的差异,且所述第九预设指令信息与所述第十预设指令信息之间仅存在操作次数的差异,确定所述第一指令集与所述第二指令集之间构成循环体。
在一个可能的实施例中,所述控制器单元还用于:
确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间是否存在关联关系;
所述执行单元,还用于在所述第一存储指令、所述第二计算指令和所述第三加载指令之间不存在关联关系时,在第一时间片内并行执行所述第一存储指令、所述第二计算指令和所述第三加载指令。
在一个可能的实施例中,在所述确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间是否存在关联关系方面,所述控制器单元具体用于:
提取所述第一存储指令中所需数据的第一存储地址区间,提取所述第二计算指令中所需数据的第二存储地址区间,提取所述第三加载指令中所需数据的第三存储地址区间,若所述第一存储地址区间、所述第二存储地址区间和所述第三存储地址区间两两之间不具有重叠的区域,确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间不存在关联关系。
在一个可能的实施例中,在所述确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间是否存在关联关系方面,所述控制器单元具体用于:
提取所述第一存储指令对应的第一写入区域,提取所述第二计算指令对应的第二读取区域和第二写入区域,提取所述第三加载指令对应的第三读取区域;
若所述第一写入区域、所述第二读取区域、所述第二写入区域和所述第三读取区域之间均不存在重叠区域,确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间不存在关联关系。
在一个可能的实施例中,所述人工智能计算装置还包括存储单元,所述存储单元与外部存储装置连接;所述执行单元包括加载执行单元、计算执行单元和存储执行单元;
在所述在第一时间片内并行执行所述第一存储指令、所述第二计算指令和所述第三加载指令方面,所述存储执行单元用于根据所述第一存储指令将所述第一运算任务中第一输入数据对应的第一计算结果从所述存储单元传输至所述外部存储装置,所述计算执行单元用于根据所述第二计算指令对所述第二运算任务中第二输入数据进行计算,得到第二计算结果;所述加载执行单元用于根据所述第三加载指令将所述第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元。
在一个可能的实施例中，所述存储单元包括第一存储区域和第二存储区域，在所述根据所述第三加载指令将所述第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元方面，所述加载执行单元具体用于：
在所述第一时间片内根据所述第三加载指令将所述第三运算任务中的第三输入数据进行乒乓操作,从所述外部存储装置传输至所述第一存储区域。
在一个可能的实施例中,所述第三输入数据包括多个第三输入子数据,在将所述第三运算任务中的第三输入数据进行乒乓操作,从所述外部存储装置传输至所述第一存储区域方面,所述加载执行单元具体用于:
预估所述多个第三输入子数据中每一第三输入子数据在所述第一存储区域的目标存储时长,得到多个目标存储时长;
按照存储时长从大到小的顺序将所述多个目标存储时长对应的所述多个第三输入子数据传输至所述第一存储区域,并从所述第一存储区域的两端存储至中间。
如图1-2A所示,图1-2A为本申请实施例提供的一种人工智能计算方法的流程示意图,应用于人工智能计算装置,所述人工智能计算装置包括控制器单元、存储单元和执行单元;所述存储单元连接外部存储装置,所述执行单元包括加载执行单元、计算执行单元和存储执行单元;所述方法包括:
201、获取待执行的第一指令集;以及,获取第二指令集。
本申请实施例中,可将神经网络的指令集中的多个指令划分为输入输出指令和计算指令,输入输出指令可划分为加载指令和存储指令,其中,人工智能计算装置的执行单元用于根据加载指令将输入数据从外部存储装置传输到人工智能计算装置上的存储单元,然后根据计算指令从存储单元直接获取输入数据,并根据输入数据进行计算,得到计算结果,将计算结果缓存至存储单元,最后根据存储指令将计算结果从存储单元传输到外部存储装置。
其中,神经网络的指令集的划分可以不局限于加载指令、计算指令和存储指令三个阶段的划分,还可以其他标准划分指令,本申请实施例不做限定。
可选地,第一指令集可包括第一运算任务的第一加载指令、第一计算指令和第一存储指令;第二指令集可包括第二运算任务的第二加载指令、第二计算指令和第二存储指令。其中,第一加载指令用于将第一运算任务中的第一输入数据从所述外部存储装置传输至存储单元,第一计算指令用于对第一运算任务中的第一输入数据进行计算并得到第一计算结果,第一存储指令用于将第一计算结果从存储单元传输至外部存储装置;第二加载指令用于将第二运算任务中的第二输入数据从所述外部存储装置传输至存储单元,第二计算指令用于对第二运算任务中的第二输入数据进行计算并得到第二计算结果,第二存储指令用于将第二计算结果从存储单元传输至外部存储装置。
可选地,第一指令集可包括第一运算任务的第一存储指令,第二运算任务的第二计算指令和第三运算任务的第三加载指令;第二指令集包含第二运算任务的第二存储指令,第三运算任务的第三计算指令和第四运算任务的第四加载指令。其中,第一存储指令用于将第一计算结果从存储单元传输至外部存储装置,第二计算指令用于对第二运算任务中的第二输入数据进行计算并得到第二计算结果,第三加载指令用于将第三运算任务中的第三输 入数据从所述外部存储装置传输至存储单元;第二存储指令用于将第二计算结果从存储单元传输至外部存储装置,第三计算指令用于对第三运算任务中的第三输入数据进行计算并得到第三计算结果,第四加载指令用于将第四运算任务中的第四输入数据从所述外部存储装置传输至存储单元。
202、确定所述第一指令集与所述第二指令集之间是否构成循环体。
可选地,所述第一指令集包含第一运算任务的第一加载指令、第一计算指令和第一存储指令;所述第二指令集包含第二运算任务的第二加载指令、第二计算指令和第二存储指令;上述步骤202中,确定所述第一指令集与所述第二指令集之间是否构成循环体,可包括以下步骤:
获取所述第一指令集和所述第二指令集中每一指令对应的预设指令信息,得到多个预设指令信息,所述预设指令信息包括以下至少一种:指令类型、剩余执行次数、是否奇偶性翻转;
将所述第一加载指令对应的第一预设指令信息与所述第二加载指令对应的第二预设指令信息进行比对;将所述第一计算指令对应的第三预设指令信息与所述第二计算指令对应的第四预设指令信息进行比对;将所述第一存储指令对应的第五预设指令信息与所述第二存储指令对应的第六预设指令信息进行比对;
若所述第一预设指令信息与所述第二预设指令信息之间仅存在操作次数的差异,所述第三预设指令信息与第四预设指令信息之间仅存在操作次数的差异,且所述第五预设指令信息与所述第六预设指令信息之间仅存在操作次数的差异,确定所述第一指令集与所述第二指令集之间构成循环体。
其中,预设指令信息中可包括以下至少一种信息:指令类型、剩余执行次数、是否奇偶性翻转。指令类型是指该指令为加载指令、计算指令或者存储指令,以及当指令为计算指令时,计算指令中包含的运算符类型,运算符类型可包括以下至少一种:加、减、乘、除、卷积、以及上述多种运算符之间的组合等等,剩余执行次数是指针对一个运算中需要执行多次的重复运算的剩余执行次数。
本申请实施例中，可将第一运算任务中的第一加载指令、第一计算指令和第一存储指令与第二运算任务的第二加载指令、第二计算指令和第二存储指令对应的预设指令信息进行比对，确定第一指令集与第二指令集之间是否构成循环体。例如，在运算Y_i=∑(w·x_i+b)（i=1,2,3,...,100）时，假定Y_1=w·x_1+b为第一运算任务，Y_2=w·x_2+b为第二运算任务，Y_1=w·x_1+b运算的第一加载指令、第一计算指令和第一存储指令对应第一指令集，Y_2=w·x_2+b运算的第二加载指令、第二计算指令和第二存储指令对应第二指令集。其中，Y_1=w·x_1+b运算对应的第一计算指令的剩余计算次数为99次，Y_2=w·x_2+b运算对应的第二计算指令的剩余计算次数为98次。可见，第一运算任务对应的第一指令集与第二运算任务对应的第二指令集中的指令之间，第一加载指令与第二加载指令类型相同，剩余加载次数不同；第一存储指令与第二存储指令类型相同，剩余存储次数不同；第一计算指令与第二计算指令中的运算符类型都包括加法和乘法运算符，且运算顺序都相同，仅仅是剩余计算次数不同。因此，可确定第一指令集与第二指令集构成循环体。
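上述“除剩余执行次数外信息完全相同则构成循环体”的比对逻辑，可用如下Python草图示意（仅为示意性实现，PresetInfo、is_loop_body等名称为本示例假设，并非本申请的实际数据结构或接口）：

```python
from dataclasses import dataclass

@dataclass
class PresetInfo:
    """预设指令信息：指令类型、运算符序列、剩余执行次数"""
    kind: str          # "load" / "compute" / "store"
    operators: tuple   # 计算指令包含的运算符序列，如 ("mul", "add")
    remaining: int     # 剩余执行次数

def only_differs_in_count(a: PresetInfo, b: PresetInfo) -> bool:
    """两条指令的预设信息除剩余执行次数外完全相同，且后者次数更小"""
    return a.kind == b.kind and a.operators == b.operators and b.remaining < a.remaining

def is_loop_body(set1: list, set2: list) -> bool:
    """逐条比对两个指令集：若每对指令之间仅存在执行次数差异，则构成循环体"""
    return len(set1) == len(set2) and all(
        only_differs_in_count(a, b) for a, b in zip(set1, set2))

# 示例：Y_i = w·x_i + b（i=1..100）的第1、2次迭代对应的两个指令集
set1 = [PresetInfo("load", (), 99), PresetInfo("compute", ("mul", "add"), 99),
        PresetInfo("store", (), 99)]
set2 = [PresetInfo("load", (), 98), PresetInfo("compute", ("mul", "add"), 98),
        PresetInfo("store", (), 98)]
```

如上例，两个指令集的指令类型与运算符完全相同、剩余执行次数分别为99与98，故判定为循环体。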
可选地，所述第一指令集包含第一运算任务的第一存储指令，第二运算任务的第二计算指令和第三运算任务的第三加载指令；所述第二指令集包含第二运算任务的第二存储指令，第三运算任务的第三计算指令和第四运算任务的第四加载指令；上述步骤202中，确定所述第一指令集与所述第二指令集之间是否构成循环体，可包括以下步骤：
获取所述第一指令集和所述第二指令集中每一指令对应的预设指令信息,得到多个预设指令信息,所述预设指令信息包括以下至少一种:指令类型、剩余执行次数、是否奇偶性翻转;
将所述第一存储指令对应的第五预设指令信息与所述第二存储指令对应的第六预设指令信息进行比对;将所述第二计算指令对应的第七预设指令信息与所述第三计算指令对应的第八预设指令信息进行比对;将所述第三加载指令对应的第九预设指令信息与所述第四加载指令对应的第十预设指令信息进行比对;
若所述第五预设指令信息与所述第六预设指令信息之间仅存在操作次数的差异,所述第七预设指令信息与第八预设指令信息之间仅存在操作次数的差异,且所述第九预设指令信息与所述第十预设指令信息之间仅存在操作次数的差异,确定所述第一指令集与所述第二指令集之间构成循环体。
本申请实施例中，可将神经网络的指令集中的指令按照树型结构进行排布，请参阅图1-2C，图1-2C为本申请实施例提供的一种将指令集中的指令按照树型结构进行排布的演示示意图。如图1-2C所示，树型结构中第一层数字用于表示芯片信息，例如，“1”表示第一个芯片；第二层数字用于表示时间片，例如“1”表示第一时间片，“2”表示第二时间片，以此类推；第三层字母表示每一时间片内的加载指令、计算指令、存储指令，其中，L代表加载指令、C代表计算指令、S代表存储指令，每一指令对应一个预设指令信息。例如，在运算Y_i=∑(w·x_i+b)（i=1,2,3,...,100）时，i的取值会从1变化到100，则该运算要重复执行Y_i=w·x_i+b的总次数为100，每一次都要执行加法和乘法运算，因此，可确定该运算中100次Y_i=w·x_i+b的运算为一个循环体。
其中，可预先对各个时间片的指令集对应的循环体进行解析，得到树型结构中每一节点的预设指令信息。针对紧邻的第一时间片和第二时间片，可判断第一时间片对应的第一指令集与第二时间片对应的第二指令集是否构成循环体。具体地，将第一运算任务的第一存储指令对应的第五预设指令信息与第二运算任务的第二存储指令对应的第六预设指令信息进行比对；将第二运算任务的第二计算指令对应的第七预设指令信息与第三运算任务的第三计算指令对应的第八预设指令信息进行比对；以及将第三运算任务的第三加载指令对应的第九预设指令信息与第四运算任务的第四加载指令对应的第十预设指令信息进行比对。若除了剩余执行次数不同（且第二时间片对应的指令的剩余执行次数较小）之外，其余信息完全相同，则可确定第二时间片对应的第二指令集与第一时间片对应的第一指令集构成循环体。例如，若第一时间片中包含加载指令、计算指令和存储指令，计算指令包括的运算符为加法和乘法，加载指令的剩余操作次数为5次，计算指令的剩余操作次数为9次，存储指令的剩余操作次数为3次；第二时间片中的第二指令集中也包含加载指令、计算指令和存储指令，计算指令包括的运算符为加法和乘法，加载指令的剩余操作次数为4次，计算指令的剩余操作次数为8次，存储指令的剩余操作次数为2次，则可确定第一时间片对应的第一指令集与第二时间片对应的第二指令集构成循环体。
进一步地,可确定连续的多个时间片对应的多个指令集是否构成循环体,若连续的多个时间片对应的多个指令集构成循环体,表明该连续多个时间片中类型相同的指令为重复执行的指令,在该循环体中,循环体的起点为剩余操作次数最大的节点所在的时间片,循环体的长度为满足循环条件的最远时间片与起始时间片的差值。
203、在所述第一指令集与所述第二指令集之间构成循环体时,根据所述第一指令集的指令信息执行所述第二指令集中的指令。
其中,上述指令信息可包括指令的操作码和操作域,具体实现中,若第一指令集与所述第二指令集之间构成循环体时,可将第一指令集中的指令的操作码和操作域进行存储,然后,在执行第二指令集中的指令时,直接跳转至第一指令集中与第二指令集中的指令相对应的指令的操作码,进而根据第一指令集的指令的操作码执行第二指令集中的指令。
例如，在运算Y_i=∑(w·x_i+b)（i=1,2,3,...,100）时，i的取值会从1变化到100，则该运算要重复执行Y_i=w·x_i+b的总次数为100，每一次都要执行加法和乘法运算，因此，可确定该运算中100次Y_i=w·x_i+b的运算为一个循环体。本申请实施例中，可将第一时间片对应的第一计算指令的操作码存储在操作码存储区域，无需重复存储100次Y_i=w·x_i+b运算对应的多个指令的操作码。在执行第二时间片的过程中，可通过跳转指令跳转至操作码存储区域，获取第二指令集对应的第一指令集的指令的操作码，从而可重复使用操作码存储区域的操作码，节省操作码的存储空间，可缩减第二时间片中的指令集中各指令的代码量，也可节省指令存储空间，提高运算效率。
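操作码折叠与跳转复用的思路可用如下Python草图示意（opcode_region、run_instruction等名称为本示例假设，真实装置中操作码存储区域为硬件存储区；此处仅演示“同一份操作码被多个时间片复用、操作域各自独立”的效果）：

```python
# 操作码存储区域：循环体内的三类指令只存一份操作码
opcode_region = {0: "LOAD", 1: "MULADD", 2: "STORE"}

def run_instruction(pc: int, operand: str) -> str:
    """模拟跳转指令：按 pc 对区域长度取模跳回同一操作码，再拼上本次的操作域"""
    opcode = opcode_region[pc % len(opcode_region)]
    return f"{opcode} {operand}"

# 两个时间片共6条指令，复用同一份操作码，仅操作域（数据地址）不同
operands = ["0x100", "0x200", "0x300", "0x110", "0x210", "0x310"]
trace = [run_instruction(pc, addr) for pc, addr in enumerate(operands)]
```

可见无论循环执行多少次，操作码存储区域中始终只保存三条操作码，指令的代码量不随循环次数增长。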
可选地，本申请实施例中，假定Y_1=w·x_1+b为第一运算任务，Y_2=w·x_2+b为第二运算任务，Y_3=w·x_3+b为第三运算任务，第一指令集包括Y_1=w·x_1+b运算对应的第一存储指令、Y_2=w·x_2+b运算对应的第一计算指令和Y_3=w·x_3+b运算对应的第一加载指令，第二指令集包括Y_2=w·x_2+b运算对应的第二存储指令、Y_3=w·x_3+b运算对应的第二计算指令以及Y_4=w·x_4+b运算对应的第二加载指令。其中，Y_1=w·x_1+b运算对应的计算指令的剩余计算次数为99次；Y_2=w·x_2+b运算对应的第一计算指令的剩余计算次数为98次。可见，第一时间片对应的第一指令集与第二时间片内的第二指令集中的指令之间，第一加载指令与第二加载指令类型相同，剩余加载次数不同；第一存储指令与第二存储指令类型相同，剩余存储次数不同；第一计算指令与第二计算指令中的运算符类型都包括加法和乘法运算符，且运算顺序都相同，仅仅是剩余计算次数不同。因此，可确定第一指令集与第二指令集为循环体。
可选地,上述步骤203中,根据所述第一指令集的指令信息执行所述第二指令集中的指令,可包括以下步骤:
根据跳转指令跳转至所述第一指令集中与所述第二指令集中的第二指令对应的第一指令的操作码存储区域,从所述操作码存储区域获取所述第一指令的操作码,将所述操作码作为所述第二指令的操作码,其中,所述操作码包括所述第一指令的标识。
可选地,本申请实施例中,还可包括以下步骤:
A1、确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间是否存在关联关系；
A2、在所述第一存储指令、所述第二计算指令和所述第三加载指令之间不存在关联关系时,在第一时间片内并行执行所述第一存储指令、所述第二计算指令和所述第三加载指令。
本申请实施例中,加载指令与存储指令之间、加载指令与计算指令之间、存储指令与计算指令之间可以并行执行,加载指令与加载指令之间、计算指令与计算指令之间、存储指令与存储指令之间不可并行执行,需要串行执行。
其中，在执行指令的过程中，在两条指令之间，若执行一条指令需要用到另一条指令的数据，表明该两条指令之间存在关联关系，例如，若执行一条计算指令需要用到一条加载指令加载的数据，表明该计算指令需要在该加载指令执行完才能执行，可确定该加载指令与该计算指令具有关联关系。因此，可确定待执行的指令之间的关联关系，若确定待执行的多条指令不存在关联关系，则通过执行单元中的加载执行单元、计算执行单元和存储执行单元并行执行不存在关联关系的两条或者三条指令。本申请实施例中，可并行执行指令的情况包括以下几种：加载指令与存储指令之间可并行执行、加载指令与计算指令之间可并行执行、存储指令与计算指令之间可并行执行、加载指令、计算指令与存储指令三者之间可并行执行。因此，本申请实施例中，可将神经网络的指令集中的多个指令按照流水线的方式进行排布，请参阅图1-2B，图1-2B为本申请实施例提供的一种并行执行神经网络的指令集中的指令的演示示意图。如图1-2B所示，L代表加载指令、C代表计算指令、S代表存储指令，其中，横向的每一行加载指令、计算指令和存储指令对应一个运算任务，可对输入数据进行加载、计算得到计算结果，将结果进行存储；纵向的每一列加载指令、计算指令和存储指令对应一个时间片，表示将不存在关联关系的加载指令、计算指令和存储指令进行并行执行。可见，通过将不存在关联关系的指令进行并行执行，可以让不存在关联关系的多个运算任务并行执行，从而节省了计算时间，提高了计算效率。
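上述流水线排布可用如下Python草图示意（schedule为本示例假设的函数名；第t个时间片并行给出第t个任务的加载、第t-1个任务的计算与第t-2个任务的存储，对应前述“横向为运算任务、纵向为时间片”的排布方式）：

```python
def schedule(tasks):
    """按流水线排布：每个时间片并行给出 L(t)、C(t-1)、S(t-2) 三条无关联关系的指令"""
    slices = []
    n = len(tasks)
    for t in range(n + 2):
        stage = []
        if t < n:
            stage.append(("L", tasks[t]))        # 加载第 t 个任务的输入数据
        if 0 <= t - 1 < n:
            stage.append(("C", tasks[t - 1]))    # 计算第 t-1 个任务
        if 0 <= t - 2 < n:
            stage.append(("S", tasks[t - 2]))    # 存储第 t-2 个任务的计算结果
        slices.append(stage)
    return slices

slices = schedule(["a", "b", "c", "d"])
# slices[2] 即一个完整时间片：[("L", "c"), ("C", "b"), ("S", "a")]
```

n个任务串行需约3n个阶段，流水线排布后仅需n+2个时间片即可完成。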
可选地,上述步骤A1中,确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间是否存在关联关系,可包括以下步骤:
A11、提取所述第一存储指令中所需数据的第一存储地址区间,提取所述第二计算指令中所需数据的第二存储地址区间,提取所述第三加载指令中所需数据的第三存储地址区间;
A12、若所述第一存储地址区间、所述第二存储地址区间和所述第三存储地址区间两两之间不具有重叠的区域,确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间不存在关联关系。
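步骤A11、A12中“地址区间两两之间不重叠则不存在关联关系”的判断，可用如下Python草图示意（区间以[起,止)半开区间表示，intervals_disjoint为本示例假设的函数名）：

```python
def intervals_disjoint(intervals):
    """判断多个 [起, 止) 存储地址区间两两之间是否互不重叠"""
    ordered = sorted(intervals)
    return all(prev_end <= start
               for (_, prev_end), (start, _) in zip(ordered, ordered[1:]))

# 第一存储指令、第二计算指令、第三加载指令所需数据的地址区间
store_iv, compute_iv, load_iv = (0x000, 0x100), (0x100, 0x200), (0x200, 0x300)
no_dependency = intervals_disjoint([store_iv, compute_iv, load_iv])  # 不存在关联关系，可并行
```

排序后只需比较相邻区间即可覆盖两两组合，复杂度为O(n log n)。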
可选地,上述步骤A1中,确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间是否存在关联关系,可包括以下步骤:
A13、提取所述第一存储指令对应的第一写入区域,提取所述第二计算指令对应的第二读取区域和第二写入区域,提取所述第三加载指令对应的第三读取区域;
A14、若所述第一写入区域、所述第二读取区域、所述第二写入区域和所述第三读取区域之间均不存在重叠区域,确定所述第一存储指令、所述第二计算指令和所述第三加载指令之间不存在关联关系。
可选地,所述人工智能计算装置还包括存储单元,所述存储单元与外部存储装置连接;上述步骤A2中,在第一时间片内并行执行所述第一存储指令、所述第二计算指令和所述第三加载指令,可包括以下步骤:
B1、根据所述第一存储指令将所述第一运算任务中第一输入数据对应的第一计算结果从所述存储单元传输至所述外部存储装置;
B2、根据所述第二计算指令对所述第二运算任务中第二输入数据进行计算,得到第二计算结果;
B3、根据所述第三加载指令将所述第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元。
可选地,所述存储单元包括第一存储区域和第二存储区域,上述步骤B3中,根据所述第三加载指令将所述第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元,可包括以下步骤:
在所述第一时间片内根据所述第三加载指令将所述第三运算任务中的第三输入数据进行乒乓操作,从所述外部存储装置传输至所述第一存储区域。
其中,可将存储单元划分为第一存储区域和第二存储区域,在执行神经网络的指令集中的加载指令时,可进行乒乓操作轮流将输入数据从外部存储装置传输到第一存储区域和第二存储区域进行存储,具体地,在第一时间片内,可根据第三加载指令将第三输入数据传存储至第一存储区域,在第二时间片内,可根据第四加载指令将第四输入数据存储至第二存储区域,此时可并行执行第三计算指令,根据第三计算指令从第一存储区域获取第三输入数据进行计算,得到计算结果,在下一时间片,可将下一输入数据存储至第一存储区域,且并行执行第四加载指令对应的下一计算指令,如此循环。从而,可以节省存储单元的存储空间。
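上述乒乓操作可用如下Python草图示意（pingpong_load为本示例假设的函数名，仅演示按时间片奇偶性在两个存储区域之间轮流装入输入数据的效果）：

```python
def pingpong_load(inputs):
    """乒乓操作：偶数时间片装入第一存储区域，奇数时间片装入第二存储区域"""
    region_a, region_b = [], []
    for t, data in enumerate(inputs):
        if t % 2 == 0:
            region_a.append(data)   # 本片加载到区域一，区域二中上一片的数据可同时被计算
        else:
            region_b.append(data)   # 奇偶性翻转：改装入区域二
    return region_a, region_b

region_a, region_b = pingpong_load(["x3", "x4", "x5", "x6"])
# region_a == ["x3", "x5"]，region_b == ["x4", "x6"]
```

加载与计算始终落在不同区域，二者互不冲突，因此可并行执行且无需额外缓冲空间。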
可选地,所述第三输入数据包括多个第三输入子数据,将所述第三运算任务中的第三输入数据进行乒乓操作,从所述外部存储装置传输至所述第一存储区域,具体可包括以下步骤:
C1、预估所述多个第三输入子数据中每一第三输入子数据在所述第一存储区域的目标存储时长,得到多个目标存储时长;
C2、按照存储时长从大到小的顺序将所述多个目标存储时长对应的所述多个第三输入子数据传输至所述第一存储区域，并从所述第一存储区域的两端存储至中间。
其中，将输入数据存储在第一存储区域中时，存储的位置越靠近中间，计算时读取输入数据所需要的时间越长，因此，可在存储上述多个第三输入子数据的过程中，先确定每一第三输入子数据的目标存储时长，然后按照存储时长从大到小的顺序从所述第一存储区域的两端存储至中间，如此，在获取第三输入数据进行计算的过程中，可减少较大的目标存储时长对应的第三输入子数据的读取时长，进而提高运算效率。
类似地,在将输入数据从外部存储装置传输至第二存储区域的过程中,也可按照存储时长从大到小的顺序从所述第二存储区域的两端存储至中间。
举例说明，执行运算Y_i=∑(w·x_i+b)的过程中，w和b为会被重复读取的数据，可确定w和b对应的存储时长较长，可将w和b存储在第一存储区域或者第二存储区域的两端，将x_i存储在第一存储区域或者第二存储区域的中间。从而，在从第一存储区域或者第二存储区域读取数据时，每次读取w和b的时长较小，从而可减少读取数据的耗时。
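步骤C1、C2中“按目标存储时长从大到小、从区域两端向中间摆放”的策略可用如下Python草图示意（place_by_duration为本示例假设的函数名；w、b因被反复读取而预估存储时长最长，落在两端，x_i落在中间）：

```python
def place_by_duration(subdata):
    """按预估存储时长从大到小，将子数据从存储区域两端交替向中间摆放"""
    ordered = sorted(subdata, key=lambda kv: kv[1], reverse=True)
    layout = [None] * len(ordered)
    left, right = 0, len(ordered) - 1
    for i, (name, _) in enumerate(ordered):
        if i % 2 == 0:
            layout[left] = name     # 时长最长的优先占据左端
            left += 1
        else:
            layout[right] = name    # 次长的占据右端，依次向中间收拢
            right -= 1
    return layout

layout = place_by_duration([("w", 100), ("b", 90), ("x1", 10), ("x2", 10)])
# layout == ["w", "x1", "x2", "b"]：w、b 在两端，x_i 在中间
```

读取次数多的数据落在访问延迟小的两端，整体读取耗时随之降低。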
再举例说明，如图1-2B所示，横向的每一行加载指令、计算指令和存储指令对应一个运算任务，例如，第一个运算任务可包括第一加载指令La、第一计算指令Ca和第一存储指令Sa，可通过第一加载指令La从外部存储装置将输入数据加载到人工智能计算装置上存储单元的a1区域；然后通过第一计算指令Ca从a1区域读取输入数据，对输入数据进行计算，得到计算结果，将计算结果存储在人工智能计算装置上存储单元的a2区域；最后，通过第一存储指令Sa从a2区域读取计算结果，并将计算结果从存储单元的a2区域传输至外部存储装置，类似地，第二个运算任务可包括第二加载指令Lb、第二计算指令Cb和第二存储指令Sb，第三个运算任务可包括第三加载指令Lc、第三计算指令Cc和第三存储指令Sc，第四个运算任务可包括第四加载指令Ld、第四计算指令Cd和第四存储指令Sd。可以看出，在第一时间片内，若第一运算任务的第一存储指令Sa、第二运算任务的第二计算指令Cb和第三运算任务的第三加载指令Lc之间不存在关联关系，可在第一时间片内并行执行第一存储指令Sa、第二计算指令Cb以及第三加载指令Lc，此外，若第二运算任务、第三运算任务与第四运算任务之间不具有关联关系，还可在第二时间片内并行执行第二运算任务的第二存储指令Sb、第三个运算任务的第三计算指令Cc以及第四个运算任务的第四加载指令Ld。
进一步地,若第一时间片内并行执行的第一存储指令Sa、第二计算指令Cb以及第三加载指令Lc构成的第一指令集与第二时间片内并行执行的第二存储指令Sb、第三计算指令Cc、第四加载指令Ld构成的第二指令集构成循环体,可在执行第二时间片对应的指令集中的指令时,根据跳转指令跳转至第一指令集对应的指令的操作码存储区域,具体地,从操作码存储区域获取第三加载指令Lc的第一操作码、第二计算指令Cb的第二操作码、第一存储指令Sa的第三操作码;然后,将第一操作码作为第四加载指令Ld的操作码,将第二操作码作为第三计算指令Cc的操作码,将第三操作码作为第二存储指令Sb的操作码;此外,可获取第四加载指令Ld对应的第一操作域,第三计算指令Cc对应的第二操作域,第二存储指令Sb对应的第三操作域。
本申请提供的技术方案通过对神经网络的指令集中重复的指令进行折叠,通过跳转指令执行重复的指令,减少了重复指令展开的代码量,通过将神经网络中的数据存储在划分的不同区域,提高了获取数据的效率,从而提高了神经网络的运算效率。
本申请还揭露了一个机器学习运算装置，其包括一个或多个在本申请中提到的人工智能计算装置，用于从其他处理装置中获取待运算数据和控制信息，执行指定的机器学习运算，执行结果通过I/O接口传递给外围设备。外围设备譬如摄像头，显示器，鼠标，键盘，网卡，wifi接口，服务器。当包含一个以上人工智能计算装置时，人工智能计算装置间可以通过特定的结构进行链接并传输数据，譬如，通过PCIE总线进行互联并传输数据，以支持更大规模的机器学习的运算。此时，可以共享同一控制系统，也可以有各自独立的控制系统；可以共享内存，也可以每个加速器有各自的内存。此外，其互联方式可以是任意互联拓扑。
该机器学习运算装置具有较高的兼容性,可通过PCIE接口与各种类型的服务器相连接。
本申请还揭露了一个组合处理装置,其包括上述的机器学习运算装置,通用互联接口,和其他处理装置。机器学习运算装置与其他处理装置进行交互,共同完成用户指定的操作。图1-3为组合处理装置的示意图。
其他处理装置,包括中央处理器CPU、图形处理器GPU、神经网络处理器等通用/专用处理器中的一种或以上的处理器类型。其他处理装置所包括的处理器数量不做限制。其他处理装置作为机器学习运算装置与外部数据和控制的接口,包括数据搬运,完成对本机器学习运算装置的开启、停止等基本控制;其他处理装置也可以和机器学习运算装置协作共同完成运算任务。
通用互联接口,用于在所述机器学习运算装置与其他处理装置间传输数据和控制指令。该机器学习运算装置从其他处理装置中获取所需的输入数据,写入机器学习运算装置片上的存储装置;可以从其他处理装置中获取控制指令,写入机器学习运算装置片上的控制缓存;也可以读取机器学习运算装置的存储模块中的数据并传输给其他处理装置。
可选的,该结构如图1-4所示,还可以包括存储装置,存储装置分别与所述机器学习运算装置和所述其他处理装置连接。存储装置用于保存在所述机器学习运算装置和所述其他处理装置的数据,尤其适用于所需要运算的数据在本机器学习运算装置或其他处理装置的内部存储中无法全部保存的数据。
该组合处理装置可以作为手机、机器人、无人机、视频监控设备等设备的SOC片上系统，有效降低控制部分的核心面积，提高处理速度，降低整体功耗。此情况时，该组合处理装置的通用互联接口与设备的某些部件相连接。某些部件譬如摄像头，显示器，鼠标，键盘，网卡，wifi接口。
在一些实施例里,还公开了一种芯片,其包括了上述机器学习运算装置或组合处理装置。
在一些实施例里,公开了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,公开了一种板卡,其包括了上述芯片封装结构。参阅图1-5,图1-5提供了一种板卡,上述板卡除了包括上述芯片389以外,还可以包括其他的配套部件,该配套部件包括但不限于:存储器件390、接口装置391和控制器件392;
所述存储器件390与所述芯片封装结构内的芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元393。每一组所述存储单元与所述芯片通过总线连接。可以理解,每一组所述存储单元可以是DDR SDRAM(英文:Double Data Rate SDRAM,双倍速率同步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中，所述存储装置可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒（芯片）。在一个实施例中，所述芯片内部可以包括4个72位DDR4控制器，上述72位DDR4控制器中64bit用于传输数据，8bit用于ECC校验。可以理解，当每一组所述存储单元中采用DDR4-3200颗粒时，数据传输的理论带宽可达到25600MB/s。
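上文25600MB/s的理论带宽可按如下方式验算（DDR4-3200表示每秒3200M次数据传输，64bit数据位宽即每次传输8字节，8bit ECC校验位不计入数据带宽）：

```python
# DDR4-3200 理论带宽验算：传输率(MT/s) × 每次传输的字节数
transfers_mt_s = 3200          # DDR4-3200：每秒 3200M 次传输
data_width_bytes = 64 // 8     # 64bit 数据位宽 = 8 字节（8bit ECC 不计入）
bandwidth_mb_s = transfers_mt_s * data_width_bytes
# bandwidth_mb_s == 25600，与正文中的理论带宽一致
```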
在一个实施例中,每一组所述存储单元包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于对每个所述存储单元的数据传输与数据存储的控制。
所述接口装置与所述芯片封装结构内的芯片电连接。所述接口装置用于实现所述芯片与外部设备(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以为标准PCIE接口。比如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。优选的,当采用PCIE 3.0 X 16接口传输时,理论带宽可达到16000MB/s。在另一个实施例中,所述接口装置还可以是其他的接口,本申请并不限制上述其他的接口的具体表现形式,所述接口单元能够实现转接功能即可。另外,所述芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
所述控制器件与所述芯片电连接。所述控制器件用于对所述芯片的状态进行监控。具体的，所述芯片与所述控制器件可以通过SPI接口电连接。所述控制器件可以包括单片机（Micro Controller Unit，MCU）。如所述芯片可以包括多个处理芯片、多个处理核或多个处理电路，可以带动多个负载。因此，所述芯片可以处于多负载和轻负载等不同的工作状态。通过所述控制器件可以实现对所述芯片中多个处理芯片、多个处理核和/或多个处理电路的工作状态的调控。
在一些实施例里,申请了一种电子装置,其包括了上述板卡。
电子装置包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
在信息处理技术领域,人工神经网络是一种功能强大的算法,近年来被应用于图像、语言等各种领域。而人工智能计算装置的出现可以使神经网络得到硬件的支持,更高效地进行计算。人工智能计算装置一般有自己的指令集,指令集中会包含较多的待执行指令,执行指令集中的所有指令耗时较长,效率受到影响,也会包含重复执行的指令,例如,在进行数据加载的过程中,若数据规模较大,则需要多次搬运才能完成地址空间转换,又例如,模板运算中重复的加法乘法运算等,从而导致计算效率降低。
为了解决上述的问题,我们提出了如下的方案。
首先介绍本申请使用的计算装置。参阅图2-1,提供了一种人工智能计算装置,该人工智能计算装置用于执行机器学习计算,该计算装置包括:控制器单元21、存储单元20和执行单元22,其中,所述存储单元20连接外部存储装置,所述执行单元22包括加载执行单元221、计算执行单元222和存储执行单元223;其中,
所述控制器单元，用于获取待执行的第一指令集，所述第一指令集包含第一加载指令、第一计算指令和第一存储指令；确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间是否存在关联关系，若所述第一加载指令、所述第一计算指令和所述第一存储指令之间不存在关联关系，将所述第一加载指令、所述第一计算指令和所述第一存储指令发送至所述执行单元；
所述执行单元,用于在第一时间片内并行执行所述第一加载指令、所述第一计算指令和所述第一存储指令。
在一个可能的实施例中,在所述确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间是否存在关联关系方面,所述控制器单元具体用于:
依据所述第一加载指令提取所述第一加载指令中所需数据的第一存储地址区间,依据所述第一计算指令提取所述第一计算指令中所需数据的第二存储地址区间,依据所述第一存储指令提取所述第一存储指令中所需数据的第三存储地址区间,若所述第一存储地址区间、所述第二存储地址区间和所述第三存储地址区间两两之间不具有重叠的区域,确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间不存在关联关系。
在一个可能的实施例中,在所述确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间是否存在关联关系方面,所述控制器单元具体用于:
依据所述第一加载指令提取所述第一加载指令对应的第一写入区域,依据所述第一计算指令提取所述第一计算指令对应的第一读取区域和第二写入区域,依据所述第一存储指令提取所述第一存储指令对应的第二读取区域;
若所述第一写入区域、所述第一读取区域、所述第二写入区域和所述第二读取区域之间均不存在重叠区域,确定所述第一加载指令、所述第一计算指令和所述第一存储指令之间不存在关联关系。
在一个可能的实施例中,所述人工智能计算装置还包括存储单元,所述存储单元连接外部存储装置,所述执行单元包括加载执行单元、计算执行单元和存储执行单元,在所述在第一时间片内并行执行所述第一加载指令、所述第一计算指令和所述第一存储指令方面,所述存储执行单元用于根据所述第一存储指令将第一运算任务中第一输入数据对应的第一计算结果从所述存储单元传输至所述外部存储装置,所述计算执行单元用于根据所述第一计算指令对第二运算任务中第二输入数据进行计算,得到第二计算结果;所述加载执行单元用于根据所述第一加载指令将第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元。
在一个可能的实施例中,所述存储单元包括第一存储区域和第二存储区域,在所述根据所述第一加载指令将第三运算任务中的第三输入数据从所述外部存储装置传输至所述存储单元方面,所述加载执行单元具体用于:
在所述第一时间片内根据所述第一加载指令将所述第三运算任务中的第三输入数据进行乒乓操作从所述外部存储装置传输至所述第一存储区域;
所述执行单元在第一时间片内并行执行所述第一加载指令、所述第一计算指令和所述第一存储指令后,所述控制器单元还用于:
获取第二指令集，所述第二指令集包含第二加载指令、第二计算指令和第二存储指令，所述第二存储指令为用于将所述第二计算结果从所述存储单元传输至所述外部存储装置的指令，所述第二计算指令为用于对所述第三运算任务中的所述第三输入数据进行计算并得到第三计算结果的指令，所述第二加载指令为将第四运算任务中的第四输入数据从所述外部存储装置传输至所述存储单元的指令；
所述执行单元,还用于在第二时间片内并行执行所述第二加载指令、所述第二计算指令和所述第二存储指令,所述第二时间片晚于所述第一时间片;
其中,所述存储执行单元用于根据所述第二存储指令将所述第二计算结果从所述存储单元传输至所述外部存储装置;所述计算执行单元用于在所述第二时间片内根据所述第二计算指令从所述第一存储区域获取所述第三输入数据,并根据所述第三输入数据进行计算,得到所述第三计算结果;所述加载执行单元用于在所述第二时间片内根据所述第二加载指令将所述第四输入数据进行乒乓操作从所述外部存储装置传输至所述第二存储区域。
在一个可能的实施例中,所述第三输入数据包括多个第三输入子数据,在将所述第三运算任务中的第三输入数据进行乒乓操作从所述外部存储装置传输至所述第一存储区域方面,所述加载执行单元具体用于:
预估所述多个第三输入子数据中每一第三输入子数据在所述第一存储区域的目标存储时长,得到多个目标存储时长;
按照存储时长从大到小的顺序将所述多个目标存储时长对应的所述多个第三输入子数据传输至第一存储区域,并从所述第一存储区域的两端存储至中间。
在一个可能的实施例中,在所述控制器单元获取第二加载指令、第二计算指令和第二存储指令之后,所述执行单元在第二时间片内并行执行所述第二加载指令、所述第二计算指令和所述第二存储指令之前,所述控制器单元还用于:
确定所述第一指令集与所述第二指令集之间是否构成循环体;
若所述第一指令集与所述第二指令集之间构成循环体,根据跳转指令跳转至所述第一指令集对应的指令的操作码存储区域,从所述操作码存储区域获取所述第一加载指令的操作码,将所述操作码作为所述第二加载指令的操作码,并获取所述第二加载指令对应的操作域,其中,所述操作码包括所述第一计算指令的标识;所述操作域包括所述第四输入数据的存储地址。
在一个可能的实施例中,在所述确定所述第一指令集与所述第二指令集之间是否构成循环体方面,所述控制器单元具体用于:
获取所述第一指令集和所述第二指令集中每一指令对应的预设指令信息,得到多个预设指令信息,所述预设指令信息包括以下至少一种:指令类型、剩余执行次数、是否奇偶性翻转;
将所述第一加载指令对应的第一预设指令信息与所述第二加载指令对应的第二预设指令信息进行比对;将所述第一计算指令对应的第三预设指令信息与所述第二计算指令对应的第四预设指令信息进行比对;将所述第一存储指令对应的第五预设指令信息与所述第二存储指令对应的第六预设指令信息进行比对;
若所述第一预设指令信息与所述第二预设指令信息之间仅存在操作次数的差异，所述第三预设指令信息与第四预设指令信息之间仅存在操作次数的差异，且所述第五预设指令信息与所述第六预设指令信息之间仅存在操作次数的差异，确定所述第一指令集与所述第二指令集之间构成循环体。
As shown in FIG. 2-2A, FIG. 2-2A is a schematic flowchart of an artificial intelligence computing method provided by an embodiment of the present application, applied to an artificial intelligence computing apparatus including a controller unit, a storage unit and an execution unit; the storage unit is connected to an external storage device, and the execution unit includes a load execution unit, a calculation execution unit and a storage execution unit; the method includes:
2201: obtaining a first instruction set to be executed, the first instruction set including a first load instruction, a first calculation instruction and a first storage instruction; determining whether a dependency exists among the first load instruction, the first calculation instruction and the first storage instruction, and if no dependency exists among them, sending the first load instruction, the first calculation instruction and the first storage instruction to the execution unit;
2202: executing the first load instruction, the first calculation instruction and the first storage instruction in parallel within a first time slice.
In the embodiments of the present application, the instructions of a neural network instruction set may be divided into input/output instructions and calculation instructions, and the input/output instructions may be further divided into load instructions and storage instructions. The execution unit of the artificial intelligence computing apparatus transfers input data from the external storage device to the storage unit on the apparatus according to a load instruction; it then obtains the input data directly from the storage unit according to a calculation instruction, performs calculation on it to obtain a calculation result, and caches the result in the storage unit; finally it transfers the result from the storage unit to the external storage device according to a storage instruction.
A load instruction and a storage instruction, a load instruction and a calculation instruction, or a storage instruction and a calculation instruction may be executed in parallel; two load instructions, two calculation instructions, or two storage instructions cannot be executed in parallel and must be executed serially.
During instruction execution, if executing one instruction requires data produced by another instruction, a dependency exists between the two instructions. For example, if a calculation instruction needs data loaded by a load instruction, the calculation instruction can only execute after the load instruction completes, so the load instruction and the calculation instruction have a dependency. Therefore, the dependencies among the instructions to be executed can be determined first; if multiple instructions to be executed have no dependency, the load execution unit, the calculation execution unit and the storage execution unit in the execution unit execute the two or three independent instructions in parallel. In the embodiments of the present application, instructions can be executed in parallel in the following cases: a load instruction with a storage instruction; a load instruction with a calculation instruction; a storage instruction with a calculation instruction; or a load instruction, a calculation instruction and a storage instruction together. Accordingly, the instructions of the neural network instruction set can be arranged in a pipelined manner. Referring to FIG. 2-2B, FIG. 2-2B is a schematic demonstration of executing the instructions of a neural network instruction set in parallel; as shown in FIG. 2-2B, L denotes a load instruction, C a calculation instruction and S a storage instruction. Each horizontal row of load, calculation and storage instructions corresponds to one operation task: loading the input data, computing a result, and storing the result. Each vertical column of load, calculation and storage instructions corresponds to one time slice, in which load, calculation and storage instructions without dependencies are executed in parallel. It can be seen that, by executing independent instructions in parallel, multiple independent operation tasks can run concurrently, saving computation time and improving computation efficiency.
The division of the neural network instruction set is not limited to the three stages of load, calculation and storage instructions; instructions may also be divided by other criteria, which is not limited in the embodiments of the present application.
Optionally, in step 2201 above, determining whether a dependency exists among the first load instruction, the first calculation instruction and the first storage instruction may include the following steps:
211: the controller unit extracts, from the first load instruction, a first storage address interval of the data it requires; extracts, from the first calculation instruction, a second storage address interval of the data it requires; and extracts, from the first storage instruction, a third storage address interval of the data it requires;
212: if no two of the first storage address interval, the second storage address interval and the third storage address interval have an overlapping region, determining that no dependency exists among the first load instruction, the first calculation instruction and the first storage instruction.
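Steps 211-212 amount to a pairwise interval-overlap test. A minimal sketch (assuming half-open [start, end) address intervals; the function names are illustrative, not part of the apparatus):

```python
def intervals_overlap(a, b):
    # half-open [start, end) address intervals overlap iff each starts
    # before the other ends
    return a[0] < b[1] and b[0] < a[1]

def no_dependency(load_iv, compute_iv, store_iv):
    """The three instructions are independent iff no two of their
    address intervals overlap (steps 211-212)."""
    ivs = [load_iv, compute_iv, store_iv]
    return all(not intervals_overlap(ivs[i], ivs[j])
               for i in range(3) for j in range(i + 1, 3))
```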
Optionally, in step 2201 above, determining whether a dependency exists among the first load instruction, the first calculation instruction and the first storage instruction may include the following steps:
213: the controller unit extracts a first write region corresponding to the first load instruction, extracts a first read region and a second write region corresponding to the first calculation instruction, and extracts a second read region corresponding to the first storage instruction;
214: if no overlapping region exists among the first write region, the first read region, the second write region and the second read region, determining that no dependency exists among the first load instruction, the first calculation instruction and the first storage instruction.
Optionally, the artificial intelligence computing apparatus includes a storage unit connected to an external storage device. In step 2202 above, executing the first load instruction, the first calculation instruction and the first storage instruction in parallel within the first time slice may include the following steps:
221: transferring, according to the first storage instruction, a first calculation result corresponding to first input data of a first operation task from the storage unit to the external storage device;
performing calculation on second input data of a second operation task according to the first calculation instruction to obtain a second calculation result;
222: transferring third input data of a third operation task from the external storage device to the storage unit according to the first load instruction.
Optionally, the storage unit includes a first storage region and a second storage region. In step 222 above, transferring the third input data of the third operation task from the external storage device to the storage unit according to the first load instruction may include the following step:
A21: within the first time slice, transferring the third input data of the third operation task from the external storage device to the first storage region in a ping-pong manner according to the first load instruction;
After the first load instruction, the first calculation instruction and the first storage instruction are executed in parallel within the first time slice, the method further includes:
A22: obtaining a second instruction set including a second load instruction, a second calculation instruction and a second storage instruction, where the second storage instruction is an instruction for transferring the second calculation result from the storage unit to the external storage device, the second calculation instruction is an instruction for performing calculation on the third input data of the third operation task to obtain a third calculation result, and the second load instruction is an instruction for transferring fourth input data of a fourth operation task from the external storage device to the storage unit;
A23: executing the second load instruction, the second calculation instruction and the second storage instruction in parallel within a second time slice, the second time slice being later than the first time slice; where the storage execution unit is configured to transfer the second calculation result from the storage unit to the external storage device according to the second storage instruction; the calculation execution unit is configured to, within the second time slice, obtain the third input data from the first storage region according to the second calculation instruction and perform calculation on it to obtain the third calculation result; and the load execution unit is configured to, within the second time slice, transfer the fourth input data from the external storage device to the second storage region in a ping-pong manner according to the second load instruction.
Here, the storage unit may be divided into a first storage region and a second storage region. When executing the load instructions of the neural network instruction set, input data can be transferred from the external storage device to the first storage region and the second storage region alternately in a ping-pong manner. Specifically, within the first time slice, the third input data may be stored into the first storage region according to the first load instruction; within the second time slice, the fourth input data may be stored into the second storage region according to the second load instruction, while the second calculation instruction executes in parallel, obtaining the third input data from the first storage region and computing a result. In the next time slice, the next input data is stored into the first storage region while the calculation instruction following the second calculation instruction executes in parallel, and so on cyclically. In this way, storage space in the storage unit can be saved.
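The ping-pong alternation between the two storage regions can be sketched as follows (an illustrative model, not the hardware implementation): while the compute side reads the region filled in the previous time slice, the load side fills the other region, flipping each slice.

```python
class PingPongBuffer:
    """Alternate loads between two halves of on-chip storage: while the
    compute unit reads slice t's data from one region, the load unit
    writes slice t+1's data into the other region."""
    def __init__(self):
        self.regions = {0: None, 1: None}
        self.write_idx = 0  # region the next load writes into

    def load(self, data):
        self.regions[self.write_idx] = data
        self.write_idx ^= 1            # flip regions for the next slice

    def read_for_compute(self):
        # compute reads the region that was loaded in the previous slice
        return self.regions[self.write_idx ^ 1]
```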
Optionally, in step A21 above, the third input data includes a plurality of pieces of third input sub-data, and transferring the third input data of the third operation task from the external storage device to the first storage region in a ping-pong manner may include the following steps:
A211: the load execution unit estimates a target storage duration in the first storage region for each piece of the third input sub-data to obtain a plurality of target storage durations;
A212: transferring the pieces of third input sub-data corresponding to the plurality of target storage durations to the first storage region in descending order of storage duration, storing them from the two ends of the first storage region toward the middle.
When input data is stored in the first storage region, the closer to the middle a piece of data is stored, the longer it takes to read it during calculation. Therefore, when storing the pieces of third input sub-data, the target storage duration of each piece may be determined first, and the pieces then stored from the two ends of the first storage region toward the middle in descending order of storage duration. In this way, when the third input data is fetched for calculation, the read time of the sub-data with the longer target storage durations is reduced, improving operation efficiency.
Similarly, when transferring input data from the external storage device to the second storage region, the data may also be stored from the two ends of the second storage region toward the middle in descending order of storage duration.
For example, when performing the operation Y_i = ∑(w·x_i + b), w and b are data that are read repeatedly, so their storage durations can be determined to be long; w and b may therefore be stored at the two ends of the first or second storage region, and x_i in the middle. Then, when reading data from the first or second storage region, each read of w and b takes less time, reducing the time spent reading data.
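The ends-to-middle placement can be sketched as below (a hypothetical helper; the residency durations are supplied by the caller): sub-data is sorted by estimated storage duration and assigned slots alternating from the two ends of the region inward, so long-lived data such as w and b lands at the fast ends.

```python
def place_ends_to_middle(subdata_with_duration, region_size):
    """Sort sub-data by estimated residency duration (descending) and
    assign slots alternating from the two ends of the region toward the
    middle, so frequently re-read data (e.g. weights w, b) sits at the
    ends where reads are fastest (steps A211-A212)."""
    order = sorted(subdata_with_duration, key=lambda kv: kv[1], reverse=True)
    region = [None] * region_size
    left, right = 0, region_size - 1
    use_left = True
    for name, _duration in order:
        if use_left:
            region[left] = name
            left += 1
        else:
            region[right] = name
            right -= 1
        use_left = not use_left
    return region
```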
Optionally, after the controller unit obtains the second load instruction, the second calculation instruction and the second storage instruction, and before the execution unit executes the second load instruction, the second calculation instruction and the second storage instruction in parallel within the second time slice, the following steps may be included:
B21: determining whether the first instruction set and the second instruction set constitute a loop body;
B22: if the first instruction set and the second instruction set constitute a loop body, jumping, according to a jump instruction, to an opcode storage region of the instructions corresponding to the first instruction set, obtaining the opcode of the first load instruction from the opcode storage region, using that opcode as the opcode of the second load instruction, and obtaining an operation field corresponding to the second load instruction, where the opcode includes an identifier of the first calculation instruction, and the operation field includes a storage address of the fourth input data.
In the embodiments of the present application, the instructions of the neural network instruction set may be arranged in a tree structure. Referring to FIG. 2-2C, FIG. 2-2C is a schematic demonstration of arranging the instructions of an instruction set in a tree structure. As shown in FIG. 2-2C, the first-level numbers of the tree denote chip information, e.g. "1" denotes the first chip; the second-level numbers denote time slices, e.g. "1" denotes the first time slice, "2" the second time slice, and so on; the third-level letters denote the load, calculation and storage instructions within each time slice. Each instruction corresponds to a piece of preset instruction information, which may include at least one of the following: instruction type, remaining execution count, and whether parity is flipped. The instruction type indicates whether the instruction is a load, calculation or storage instruction and, when it is a calculation instruction, the operator types it contains; the operator types may include at least one of addition, subtraction, multiplication, division, convolution, and combinations of these operators. The remaining execution count is the remaining number of executions of a repeated operation within one computation. For example, in the operation Y_i = ∑(w·x_i + b), i = 1, 2, 3, ..., 100, the value of i ranges from 1 to 100, so Y_i = w·x_i + b must be executed a total of 100 times, each execution involving an addition and a multiplication; the 100 executions of Y_i = w·x_i + b can therefore be determined to be one loop body. In the embodiments of the present application, the opcode of the first calculation instruction corresponding to the first time slice can be stored in the opcode storage region, without repeatedly storing the opcodes of the instructions of all 100 executions of Y_i = w·x_i + b. When executing the second time slice, a jump instruction can jump to the opcode storage region to obtain the opcodes of the instructions of the second instruction set. The opcodes in the opcode storage region are thus reused, saving opcode storage space, shrinking the code size of the instructions in the instruction set of the second time slice, saving instruction storage space and improving operation efficiency.
In the embodiments of the present application, suppose Y_1 = w·x_1 + b is a first operation task, Y_2 = w·x_2 + b a second operation task, and Y_3 = w·x_3 + b a third operation task. If no dependency exists among the first storage instruction corresponding to the operation Y_1 = w·x_1 + b, the first calculation instruction corresponding to the operation Y_2 = w·x_2 + b, and the first load instruction corresponding to Y_3 = w·x_3 + b, the first load instruction, the first calculation instruction and the first storage instruction can be executed in parallel within the first time slice; among the pieces of preset instruction information corresponding to the operation Y_1 = w·x_1 + b, the remaining calculation count of its calculation instruction is 99. Further, within the second time slice, the second storage instruction corresponding to the operation Y_2 = w·x_2 + b, the second calculation instruction corresponding to Y_3 = w·x_3 + b, and the second load instruction corresponding to the operation Y_4 = w·x_4 + b, which have no dependency, can be executed in parallel; the remaining calculation count of the calculation instruction corresponding to the operation Y_2 = w·x_2 + b is 98. It can be seen that, between the first instruction set of the first time slice and the second instruction set of the second time slice, the first load instruction and the second load instruction have the same type but different remaining load counts; the first storage instruction and the second storage instruction have the same type but different remaining store counts; and the first calculation instruction and the second calculation instruction both contain addition and multiplication operators in the same operation order, differing only in remaining calculation count. Therefore, the first instruction set and the second instruction set can be determined to be a loop body.
Optionally, in step B21 above, determining whether the first instruction set and the second instruction set constitute a loop body may include the following steps:
C21: obtaining preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain a plurality of pieces of preset instruction information, the preset instruction information including at least one of: instruction type, remaining execution count, and whether parity is flipped;
C22: comparing first preset instruction information corresponding to the first load instruction with second preset instruction information corresponding to the second load instruction; comparing third preset instruction information corresponding to the first calculation instruction with fourth preset instruction information corresponding to the second calculation instruction; and comparing fifth preset instruction information corresponding to the first storage instruction with sixth preset instruction information corresponding to the second storage instruction;
C23: if the first preset instruction information and the second preset instruction information differ only in operation count, the third preset instruction information and the fourth preset instruction information differ only in operation count, and the fifth preset instruction information and the sixth preset instruction information differ only in operation count, determining that the first instruction set and the second instruction set constitute a loop body.
In the embodiments of the present application, the loop bodies corresponding to the instruction sets of the time slices can be parsed in advance to obtain the preset instruction information of each node of the tree structure. For an adjacent first time slice and second time slice, it can be judged whether the first instruction set corresponding to the first time slice and the second instruction set corresponding to the second time slice constitute a loop body. Specifically, the first preset instruction information corresponding to the first load instruction is compared with the second preset instruction information corresponding to the second load instruction; the third preset instruction information corresponding to the first calculation instruction is compared with the fourth preset instruction information corresponding to the second calculation instruction; and the fifth preset instruction information corresponding to the first storage instruction is compared with the sixth preset instruction information corresponding to the second storage instruction. If, apart from the remaining execution counts differing, with the counts of the instructions of the second time slice being smaller, all other information is identical, the second instruction set corresponding to the second time slice and the first instruction set corresponding to the first time slice can be determined to constitute a loop body. For example, if the first time slice contains a load instruction, a calculation instruction whose operators are addition and multiplication, and a storage instruction, with remaining operation counts of 5, 9 and 3 respectively, and the second instruction set of the second time slice likewise contains a load instruction, a calculation instruction whose operators are addition and multiplication, and a storage instruction, with remaining operation counts of 4, 8 and 2 respectively, the first instruction set corresponding to the first time slice and the second instruction set corresponding to the second time slice can be determined to constitute a loop body. In this way, it can be determined whether the multiple instruction sets of multiple consecutive time slices constitute a loop body; if they do, the instructions of the same type in those consecutive time slices are repeatedly executed instructions. In such a loop body, the starting point of the loop body is the time slice containing the node with the largest remaining operation count, and the length of the loop body is the difference between the farthest time slice satisfying the loop condition and the starting time slice.
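Steps C21-C23 can be sketched as a comparison of per-instruction metadata (an illustrative model; the tuple layout for the preset instruction information is an assumption): two instruction sets form a loop body when corresponding instructions match in type and parity flag and differ only in remaining count, with the later slice's count smaller.

```python
def forms_loop_body(set1, set2):
    """set1/set2: {role: (instr_type, remaining_count, parity_flip)}.
    Two instruction sets constitute a loop body when corresponding
    instructions differ only in remaining execution count, the second
    slice's count being smaller (steps C21-C23)."""
    if set1.keys() != set2.keys():
        return False
    for role in set1:
        t1, n1, p1 = set1[role]
        t2, n2, p2 = set2[role]
        # same type and parity flag, strictly decreasing remaining count
        if t1 != t2 or p1 != p2 or not n2 < n1:
            return False
    return True
```

With the counts from the example above (5/9/3 in the first slice, 4/8/2 in the second), the two sets are recognized as a loop body.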
As another example, as shown in FIG. 2-2B, each horizontal row of load, calculation and storage instructions corresponds to one operation task. For example, a first operation task may include load instruction La, calculation instruction Ca and storage instruction Sa: input data is loaded from the external storage device into region a1 of the storage unit on the artificial intelligence computing apparatus by load instruction La; the input data is then read from region a1 by calculation instruction Ca and calculated to obtain a result, which is stored in region a2 of the storage unit on the apparatus; finally, the result is read from region a2 by storage instruction Sa and transferred from region a2 of the storage unit to the external storage device. Similarly, a second operation task may include load instruction Lb, calculation instruction Cb and storage instruction Sb; a third operation task may include load instruction Lc, calculation instruction Cc and storage instruction Sc; and a fourth operation task may include load instruction Ld, calculation instruction Cd and storage instruction Sd. It can be seen that, within the first time slice, if no dependency exists among storage instruction Sa of the first operation task, calculation instruction Cb of the second operation task and load instruction Lc of the third operation task, Sa, Cb and Lc can be executed in parallel within the first time slice. In addition, if no dependency exists among the second, third and fourth operation tasks, storage instruction Sb of the second operation task, calculation instruction Cc of the third operation task and load instruction Ld of the fourth operation task can also be executed in parallel within the second time slice.
Further, if the first instruction set formed by storage instruction Sa, calculation instruction Cb and load instruction Lc executed in parallel within the first time slice and the second instruction set formed by storage instruction Sb, calculation instruction Cc and load instruction Ld executed in parallel within the second time slice constitute a loop body, then, when executing the instructions of the instruction set corresponding to the second time slice, a jump instruction can jump to the opcode storage region of the instructions corresponding to the first instruction set. Specifically, a first opcode of load instruction Lc, a second opcode of calculation instruction Cb and a third opcode of storage instruction Sa are obtained from the opcode storage region; the first opcode is then used as the opcode of load instruction Ld, the second opcode as the opcode of calculation instruction Cc, and the third opcode as the opcode of storage instruction Sb. In addition, a first operation field corresponding to load instruction Ld, a second operation field corresponding to calculation instruction Cc and a third operation field corresponding to storage instruction Sb can be obtained.
The technical solution provided by the present application reduces instruction execution time and improves the operation efficiency of the neural network by executing instructions that have no dependency in parallel; it folds repeated instructions in the neural network instruction set and executes them via jump instructions, reducing the code size of unrolled repeated instructions; and it stores the data of the neural network in different divided regions, improving data-access efficiency and thus the operation efficiency of the neural network.
The present application also discloses a machine learning operation apparatus, which includes one or more of the artificial intelligence computing apparatuses mentioned in the present application; it obtains data to be operated on and control information from other processing apparatuses, performs specified machine learning operations, and passes the execution results to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces and servers. When more than one artificial intelligence computing apparatus is included, the apparatuses can be linked and transfer data through a specific structure, for example interconnected via a PCIe bus, to support larger-scale machine learning operations. In this case, they may share one control system or have their own independent control systems; they may share memory, or each accelerator may have its own memory. Moreover, their interconnection may be any interconnection topology.
The machine learning operation apparatus has high compatibility and can be connected to various types of servers through a PCIe interface.
The present application also discloses a combined processing apparatus, which includes the above machine learning operation apparatus, a universal interconnection interface, and other processing apparatuses. The machine learning operation apparatus interacts with the other processing apparatuses to jointly complete operations specified by the user. FIG. 2-3 is a schematic diagram of the combined processing apparatus.
The other processing apparatuses include one or more processor types among general-purpose/special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs) and neural network processors. The number of processors included in the other processing apparatuses is not limited. The other processing apparatuses serve as the interface between the machine learning operation apparatus and external data and control, including data transfer, and perform basic control of the machine learning operation apparatus such as starting and stopping; the other processing apparatuses may also cooperate with the machine learning operation apparatus to jointly complete operation tasks.
The universal interconnection interface is used for transferring data and control instructions between the machine learning operation apparatus and the other processing apparatuses. The machine learning operation apparatus obtains the required input data from the other processing apparatuses and writes it into on-chip storage of the machine learning operation apparatus; it may obtain control instructions from the other processing apparatuses and write them into an on-chip control cache of the machine learning operation apparatus; and it may also read data from the storage module of the machine learning operation apparatus and transfer it to the other processing apparatuses.
Optionally, as shown in FIG. 2-4, the structure may further include a storage apparatus connected to both the machine learning operation apparatus and the other processing apparatuses. The storage apparatus is used to store data of the machine learning operation apparatus and the other processing apparatuses, and is especially suitable for data to be operated on that cannot be entirely held in the internal storage of the machine learning operation apparatus or the other processing apparatuses.
The combined processing apparatus can serve as the SoC (system-on-chip) of devices such as mobile phones, robots, drones and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed and lowering overall power consumption. In this case, the universal interconnection interface of the combined processing apparatus is connected to certain components of the device, such as cameras, displays, mice, keyboards, network cards and WiFi interfaces.
In some embodiments, a chip is also disclosed, which includes the above machine learning operation apparatus or combined processing apparatus.
In some embodiments, a chip package structure is disclosed, which includes the above chip.
In some embodiments, a board card is disclosed, which includes the above chip package structure. Referring to FIG. 2-5, FIG. 2-5 provides a board card; in addition to the above chip 589, the board card may include other supporting components, including but not limited to: a storage device 590, an interface apparatus 591 and a control device 592;
The storage device 590 is connected via a bus to the chip in the chip package structure and is used for storing data. The storage device may include multiple groups of storage units 593, each group connected to the chip via a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random-Access Memory).
DDR doubles the speed of SDRAM without raising the clock frequency: DDR allows data to be read on both the rising and the falling edge of the clock pulse, making DDR twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, each group including multiple DDR4 chips (dies). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transfer and 8 bits for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical data-transfer bandwidth can reach 25600 MB/s.
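The 25600 MB/s figure can be checked with a short calculation: DDR4-3200 performs 3200 mega-transfers per second, and only the 64 data bits of the 72-bit channel carry payload (the 8 ECC bits are excluded).

```python
# Theoretical bandwidth of one DDR4-3200 channel with a 64-bit data path
transfers_per_second = 3200 * 10**6        # DDR4-3200: 3200 MT/s
bytes_per_transfer = 64 // 8               # 64 data bits = 8 bytes (ECC excluded)
bandwidth_mb_s = transfers_per_second * bytes_per_transfer // 10**6
```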
In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random-access memories arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for controlling the DDR is provided in the chip to control data transfer and data storage for each storage unit.
The interface apparatus is electrically connected to the chip in the chip package structure. The interface apparatus is used to implement data transfer between the chip and an external device (e.g. a server or a computer). For example, in one embodiment, the interface apparatus may be a standard PCIe interface: data to be processed is passed from a server to the chip through the standard PCIe interface, realizing data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transfer, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface apparatus may be another interface; the present application does not limit the specific form of such other interfaces, as long as the interface unit can realize the relaying function. In addition, the calculation results of the chip are still transferred back to the external device (e.g. a server) by the interface apparatus.
The control device is electrically connected to the chip. The control device is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected via an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include multiple processing chips, multiple processing cores or multiple processing circuits and may drive multiple loads; it can therefore be in different working states such as multi-load and light-load. Through the control device, the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip can be regulated.
In some embodiments, an electronic apparatus is claimed, which includes the above board card.
The electronic apparatus includes data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, dashboard cameras, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
Vehicles include aircraft, ships and/or cars; household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs.
In the field of information processing technology, an artificial intelligence processing apparatus can load an offline model corresponding to a neural network and perform different neural network tasks by running the offline model. Differences in the running environment of the artificial intelligence processing apparatus itself, such as different versions of the loaded runtime library, mean that the apparatus can only run offline models of the corresponding version. If the artificial intelligence processing apparatus cannot update its runtime library in time, it will be unable to run offline models of a higher version.
To solve the above problem, we propose the following solution.
The artificial intelligence processing apparatus in the present application may include servers, smartphones (e.g. Android phones, iOS phones, Windows Phone phones), tablets, palmtop computers, desktop computers, laptops, mobile internet devices (MIDs) or wearable devices; the above electronic devices are merely examples rather than an exhaustive list, which includes but is not limited to the above artificial intelligence processing apparatuses.
The processor of the artificial intelligence processing apparatus may include a general-purpose processor and an artificial intelligence processor.
The general-purpose processor may include one or a combination of a central processing unit (CPU), a graphics processing unit (GPU) and/or an image processing unit (IPU).
The artificial intelligence processor includes a machine learning processing unit (MLU); multiple MLUs may be integrated to form one multi-core artificial intelligence processor.
The specific flow of the offline model processing method provided by the embodiments of the present application is further described below.
Referring to FIG. 3-1, FIG. 3-1 is a schematic flowchart of an offline model processing method provided by an embodiment of the present application, applied to an artificial intelligence processing apparatus. Specifically, the method includes the content shown in steps S101-S102, where:
Step S101: obtaining version information of a runtime library that will run an offline model, and model information of the offline model.
In the present application, an offline model includes compiled binary machine instructions and can run directly on the artificial intelligence processor. The offline model may include model structure information, weight data and input/output data; it may also include model information such as version information of the offline model and version information of machine learning processing instructions, which is not limited here.
The model structure information may include the layer structure corresponding to the neural network model. For example, the offline model contains a convolution layer, a normalization layer, a scaling layer and a fully connected layer.
The weight data includes the weights corresponding to each layer.
The input/output data may include the input/output data scale; for example, the input size of image data is 50mm*50mm with a pixel value range of (-1024, 3071). The input/output data may also include input/output quantity information, i.e. how many input data and how many output data are defined, and so on.
In the present application, an offline model can be generated by a machine learning library (Machine Learning Library) and a runtime library (Runtime Library). Specifically, the machine learning library and the runtime library package the series of data, such as the data and instructions used to execute the neural network model computation, into an offline model.
The machine learning library is also used to accelerate various machine learning or deep learning algorithms on the artificial intelligence processor. The machine learning library provides an efficient, general, flexible and extensible set of programming interfaces; upper-layer machine learning applications can directly use the programming interfaces of various programming frameworks (e.g. TensorFlow, Caffe, MXNet), or program directly with the interfaces provided by the machine learning library.
In the present application, the runtime library is also used to complete the interaction between the general-purpose processor and the artificial intelligence processor. The runtime library provides a set of interfaces oriented toward the artificial intelligence processor; the present application does not limit these interfaces either, for example an interface for loading an offline model, the calling of which causes the artificial intelligence processor to load the offline model, and so on.
It should be noted that the runtime library can be detached from the machine learning library and complete the computation of the neural network using the offline model file alone. Taking a mobile phone among artificial intelligence processing apparatuses as an example, the phone includes only the runtime library; when the phone downloads an artificial intelligence application, the runtime library in the phone runs the offline model included in the application. The offline model in the application is generated by the runtime library and the machine learning library in the developer-side artificial intelligence processing apparatus, and the offline model is then packaged into the application by the runtime library.
In the present application, the version information of the runtime library that will run the offline model is the version information of the runtime library of the other artificial intelligence processing apparatus that is to run the offline model.
Step S102: according to the model information and the version information, calling a function set in the machine learning library corresponding to the version information to generate an offline model corresponding to the version information.
In the present application, the function set may include different sets of operation processing functions.
In a possible example, the function set includes a general operator set and a functional operator set.
The general operator set includes the operation processing functions needed by every version of the offline model, e.g. processing functions for common operations such as addition and multiplication. The functional operator set includes the operation processing units needed by specified versions of the offline model, e.g. processing units for less common operations such as convolution, vector inner product and sorting.
In the present application, the function sets corresponding to different pieces of version information can be defined in advance; for example, version information 1 corresponds to a first function set, version information 2 to a second function set, and version information 3 to a third function set. If the version information is 2, the second function set in the machine learning library can be called; the second function set, combined with the model information of the offline model, can then generate the offline model corresponding to that version information, improving the efficiency of offline model generation.
The present application does not limit the method of generating the offline model corresponding to the version information. In a possible example, calling, according to the model information and the version information, the function set in the machine learning library corresponding to the version information to generate the offline model corresponding to the version information includes: calling the function set corresponding to the version information in the machine learning library through an interface function; and generating the offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
The machine learning library includes interface functions, which are used to call the function sets corresponding to different version information.
For example, assume version information 1 corresponds to the first function set, version information 2 to the second function set, and version information 3 to the third function set. If the version information is 2, the second function set is called through an interface function (e.g. cnrtGetModelLevelFromFile()), so that the offline model corresponding to that version information can be generated based on the second function set and the model information of the offline model. It can be understood that directly calling, via an interface function, the function set corresponding to the version information of the runtime library that will run the offline model can improve the efficiency of generating the offline model.
In another possible example, calling the function set in the machine learning library corresponding to the version information to generate the offline model corresponding to the version information includes: calling the function set corresponding to the version information in the machine learning library through an environment variable; and generating the offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
The machine learning library includes environment variables, which are used to call the function sets corresponding to different version information.
That is, the function set corresponding to each piece of version information is called through an environment variable; in this way, the function set corresponding to the target version can be called directly, improving the efficiency of generating the offline model.
For example, assume version information 1 corresponds to the first function set, version information 2 to the second function set, and version information 3 to the third function set, with environment variables corresponding to the version information. If the version information is 2, the second function set can be called through an environment variable (e.g. CNML_MODEL_LEVEL=2), so that the offline model corresponding to that version information can be generated based on the second function set and the model information of the offline model. It can be understood that directly calling, via an environment variable, the function set corresponding to the version information of the runtime library that will run the offline model can improve the efficiency of generating the offline model.
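The version-to-function-set dispatch described in these examples can be sketched as follows (FUNCTION_SETS, the operator names and generate_offline_model are hypothetical illustrations, not the real CNML/CNRT API): each runtime-library version maps to a predefined operator set, and the generator pairs that set with the model information; when no version is given, the latest is used, matching the default behavior described below.

```python
# Hypothetical version-to-function-set table; operator names are illustrative.
FUNCTION_SETS = {
    1: {"add", "mul"},                          # version 1: general operators only
    2: {"add", "mul", "conv"},                  # version 2 adds convolution
    3: {"add", "mul", "conv", "dot", "sort"},   # version 3 adds functional operators
}

def generate_offline_model(model_info, version=None):
    """Pick the function set matching the target runtime-library version
    (defaulting to the newest) and pair it with the model info."""
    if version is None:
        version = max(FUNCTION_SETS)   # no version given: use the latest
    return {"version": version,
            "ops": FUNCTION_SETS[version],
            "model": model_info}
```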
In the offline model processing method shown in FIG. 3-1, after the model information of the offline model and the version information of the offline model to be generated are obtained, the function set in the machine learning library corresponding to the version information can be called according to the model information and the version information to generate an offline model corresponding to the version information. In this way, the embodiments of the present application can generate offline models corresponding to different versions of the runtime library that will run them, improving the applicability of the generated offline models.
For example, assume the version information of the machine learning library in the developer's artificial intelligence processing apparatus is 8 and that of its runtime library is 5, while the version information of the runtime library on the client is 2. It can be seen that the client's runtime library version is lower than that of the artificial intelligence processing apparatus, so the client cannot directly run an offline model corresponding to the latest version information of the runtime library in that apparatus. After implementing this embodiment, an offline model corresponding to the version information of the runtime library on the client can be generated. Since the version information of the client's runtime library equals the version information of the newly generated offline model, the client's runtime library can run the newly generated offline model, improving the applicability of the generated offline model.
In a possible example, the method further includes: running the offline model based on the runtime library corresponding to the version information.
It can be understood that, when the runtime library corresponding to the version information is used to run the offline model, since the version information is consistent, the offline model can be used normally to complete neural network heterogeneous computing.
Referring to FIG. 3-2, FIG. 3-2 is a schematic flowchart of another offline model processing method provided by an embodiment of the present application, applied to an artificial intelligence processing apparatus. Specifically, the method includes the content shown in steps S3201-S3202:
Step S3201: when no version information of the runtime library that will run the offline model is given, obtaining the model information of the offline model.
For the model information of the offline model, reference may be made to the description of step S101, which is not repeated here.
Step S3202: calling the function set in the machine learning library corresponding to the latest version information of the runtime library to generate an offline model corresponding to the latest version information of the runtime library.
For the description of the function set, reference may likewise be made to the description of step S102, which is not repeated here.
In the present application, the latest version information of the runtime library is the highest version of the runtime library in the current artificial intelligence processing apparatus. The method of generating the offline model corresponding to the latest version information of the runtime library is not limited; reference may be made to the method of generating the offline model corresponding to the version information, which is also not repeated here.
In the offline model processing method shown in FIG. 3-2, when the version information of the runtime library that will run the offline model is not obtained, the function set in the machine learning library corresponding to the latest version information of the runtime library is directly called to generate an offline model corresponding to that version information. That is, by default, an offline model corresponding to the runtime library of the latest version information is generated directly; using the offline model corresponding to the latest version information can improve running efficiency.
Referring to FIG. 3-3, FIG. 3-3 shows a block diagram of a possible functional unit composition of the offline model processing apparatus 300 involved in the above embodiments. The offline model processing apparatus 300 includes:
an obtaining unit 301, configured to obtain version information of a runtime library that will run an offline model, and model information of the offline model;
a generating unit 302, configured to call, according to the model information and the version information, a function set in a machine learning library corresponding to the version information to generate an offline model corresponding to the version information.
In a possible example, the machine learning library includes interface functions used to call the function sets corresponding to different version information. In terms of calling, according to the model information and the version information, the function set in the machine learning library corresponding to the version information to generate the offline model corresponding to the version information, the generating unit 302 is specifically configured to call the function set corresponding to the version information in the machine learning library through the interface function, and to generate the offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
In a possible example, the machine learning library includes environment variables used to call the function sets corresponding to different version information. In terms of calling, according to the model information and the version information, the function set in the machine learning library corresponding to the version information to generate the offline model corresponding to the version information, the generating unit 302 is specifically configured to call the function set corresponding to the version information in the machine learning library through the environment variable, and to generate the offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
In a possible example, the generating unit 302 is further configured to, when no version information of the runtime library that will run the offline model is given, call, according to the model information of the offline model, the function set in the machine learning library corresponding to the latest version information of the runtime library to generate an offline model corresponding to the latest version information of the runtime library.
In a possible example, the model information includes: model structure information, weight data, and input/output data.
In a possible example, the function set includes: a general operator set and a functional operator set.
In a possible example, the apparatus 300 further includes:
a running unit 303, configured to run the offline model based on the runtime library corresponding to the version information.
Referring to FIG. 3-4, FIG. 3-4 is a schematic structural diagram of an artificial intelligence processing apparatus provided by an embodiment of the present application. As shown in FIG. 3-4, the artificial intelligence processing apparatus includes a processor, a memory, a communication interface, and one or more programs. The processor includes a general-purpose processor and an artificial intelligence processor. The one or more programs are different from the one or more application programs, are stored in the memory, and are configured to be executed by the processor; the programs include instructions for performing the following steps:
obtaining version information of a runtime library that will run an offline model, and model information of the offline model;
calling, according to the model information and the version information, a function set in a machine learning library corresponding to the version information to generate an offline model corresponding to the version information.
In a possible example, the machine learning library includes interface functions used to call the function sets corresponding to different version information; in terms of calling, according to the model information and the version information, the function set in the machine learning library corresponding to the version information to generate the offline model corresponding to the version information, the programs specifically include instructions for performing the following steps:
calling the function set corresponding to the version information in the machine learning library through the interface function;
generating the offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
In a possible example, the machine learning library includes environment variables used to call the function sets corresponding to different version information; in terms of calling, according to the model information and the version information, the function set in the machine learning library corresponding to the version information to generate the offline model corresponding to the version information, the programs specifically include instructions for performing the following steps:
calling the function set corresponding to the version information in the machine learning library through the environment variable;
generating the offline model corresponding to the version information according to the function set corresponding to the version information and the model information.
In a possible example, the programs further include instructions for performing the following step:
when no version information of the runtime library that will run the offline model is given, calling, according to the model information of the offline model, the function set in the machine learning library corresponding to the latest version information of the runtime library to generate an offline model corresponding to the latest version information of the runtime library.
In a possible example, the model information includes: model structure information, weight data, and input/output data.
In a possible example, the function set includes: a general operator set and a functional operator set.
In a possible example, the programs further include instructions for performing the following step:
running the offline model based on the runtime library corresponding to the version information.
An embodiment of the present application also provides a computer-readable storage medium, which stores a computer program; the computer program is executed by a processor to implement some or all of the steps of any offline model processing method described in the above method embodiments.
An embodiment of the present application also provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program; the computer program is operable to cause a computer to perform some or all of the steps of any offline model processing method described in the above method embodiments.
An embodiment of the present application also provides a combined processing apparatus, which includes the above offline model processing apparatus, a universal interconnection interface and other processing apparatuses. The offline model processing apparatus interacts with the other processing apparatuses to jointly complete operations specified by the user. FIG. 3-5 is a schematic diagram of the combined processing apparatus.
The other processing apparatuses include one or more processor types among general-purpose/special-purpose processors such as CPUs, GPUs and neural network processors. The number of processors included in the other processing apparatuses is not limited. The other processing apparatuses serve as the interface between the offline model processing apparatus and external data and control, including data transfer, and perform basic control of the offline model processing apparatus such as starting and stopping; the other processing apparatuses may also cooperate with the offline model processing apparatus to jointly complete operation tasks.
The universal interconnection interface is used for transferring data and control instructions between the offline model processing apparatus and the other processing apparatuses. The offline model processing apparatus obtains the required input data from the other processing apparatuses and writes it into on-chip storage of the offline model processing apparatus; it may obtain control instructions from the other processing apparatuses and write them into an on-chip control cache of the offline model processing apparatus; and it may also read data from the storage module of the offline model processing apparatus and transfer it to the other processing apparatuses.
In a possible example, as shown in FIG. 3-5, the combined processing apparatus may further include a storage apparatus connected to both the offline model processing apparatus and the other processing apparatuses. The storage apparatus is used to store data of the offline model processing apparatus and the other processing apparatuses, and is especially suitable for data to be operated on that cannot be entirely held in the internal storage of the offline model processing apparatus or the other processing apparatuses.
The combined processing apparatus can serve as the system-on-chip of devices such as mobile phones, robots, drones and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed and lowering overall power consumption. In this case, the universal interconnection interface of the combined processing apparatus is connected to certain components of the device, such as cameras, displays, mice, keyboards, network cards and wireless fidelity (Wi-Fi) interfaces.
It should be noted that, for brevity of description, each of the foregoing method embodiments is expressed as a series of action combinations; however, those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application, certain steps may be performed in other orders or simultaneously. Moreover, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is merely a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks and optical discs.
Those of ordinary skill in the art can understand that all or some of the steps in the various methods of the above embodiments can be completed by a program instructing relevant hardware; the program may be stored in a computer-readable memory, and the memory may include flash drives, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, etc.
The embodiments of the present application have been introduced in detail above; specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. An artificial intelligence computing apparatus, characterized in that the artificial intelligence computing apparatus comprises a controller unit and an execution unit; wherein,
    the controller unit is configured to obtain a first instruction set to be executed, and to obtain a second instruction set;
    the controller unit is further configured to determine whether the first instruction set and the second instruction set constitute a loop body;
    the execution unit is configured to, when the first instruction set and the second instruction set constitute a loop body, execute the instructions in the second instruction set according to instruction information of the first instruction set.
  2. The apparatus according to claim 1, characterized in that, in terms of executing the instructions in the second instruction set according to the instruction information of the first instruction set, the execution unit is specifically configured to:
    jump, according to a jump instruction, to an opcode storage region of a first instruction in the first instruction set that corresponds to a second instruction in the second instruction set, obtain the opcode of the first instruction from the opcode storage region, and use the opcode as the opcode of the second instruction, wherein the opcode comprises an identifier of the first instruction.
  3. The apparatus according to claim 1 or 2, characterized in that the first instruction set comprises a first load instruction, a first calculation instruction and a first storage instruction of a first operation task; the second instruction set comprises a second load instruction, a second calculation instruction and a second storage instruction of a second operation task; in terms of determining whether the first instruction set and the second instruction set constitute a loop body, the controller unit is specifically configured to:
    obtain preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain a plurality of pieces of preset instruction information, the preset instruction information comprising at least one of: instruction type, remaining execution count, and whether parity is flipped;
    compare first preset instruction information corresponding to the first load instruction with second preset instruction information corresponding to the second load instruction; compare third preset instruction information corresponding to the first calculation instruction with fourth preset instruction information corresponding to the second calculation instruction; compare fifth preset instruction information corresponding to the first storage instruction with sixth preset instruction information corresponding to the second storage instruction;
    if the first preset instruction information and the second preset instruction information differ only in operation count, the third preset instruction information and the fourth preset instruction information differ only in operation count, and the fifth preset instruction information and the sixth preset instruction information differ only in operation count, determine that the first instruction set and the second instruction set constitute a loop body.
  4. The apparatus according to claim 1 or 2, characterized in that the first instruction set comprises a first storage instruction of a first operation task, a second calculation instruction of a second operation task and a third load instruction corresponding to a third operation task; the second instruction set comprises a second storage instruction of the second operation task, a third calculation instruction of the third operation task and a fourth load instruction of a fourth operation task; in terms of determining whether the first instruction set and the second instruction set constitute a loop body, the controller unit is specifically configured to:
    obtain preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain a plurality of pieces of preset instruction information, the preset instruction information comprising at least one of: instruction type, remaining execution count, and whether parity is flipped;
    compare fifth preset instruction information corresponding to the first storage instruction with sixth preset instruction information corresponding to the second storage instruction; compare seventh preset instruction information corresponding to the second calculation instruction with eighth preset instruction information corresponding to the third calculation instruction; compare ninth preset instruction information corresponding to the third load instruction with tenth preset instruction information corresponding to the fourth load instruction;
    if the fifth preset instruction information and the sixth preset instruction information differ only in operation count, the seventh preset instruction information and the eighth preset instruction information differ only in operation count, and the ninth preset instruction information and the tenth preset instruction information differ only in operation count, determine that the first instruction set and the second instruction set constitute a loop body.
  5. The apparatus according to claim 4, characterized in that the controller unit is further configured to:
    determine whether a dependency exists among the first storage instruction, the second calculation instruction and the third load instruction;
    the execution unit is further configured to, when no dependency exists among the first storage instruction, the second calculation instruction and the third load instruction, execute the first storage instruction, the second calculation instruction and the third load instruction in parallel within a first time slice.
  6. The apparatus according to claim 5, characterized in that, in terms of determining whether a dependency exists among the first storage instruction, the second calculation instruction and the third load instruction, the controller unit is specifically configured to:
    extract a first storage address interval of the data required by the first storage instruction, extract a second storage address interval of the data required by the second calculation instruction, and extract a third storage address interval of the data required by the third load instruction; if no two of the first storage address interval, the second storage address interval and the third storage address interval have an overlapping region, determine that no dependency exists among the first storage instruction, the second calculation instruction and the third load instruction.
  7. The apparatus according to claim 5, characterized in that, in terms of determining whether a dependency exists among the first storage instruction, the second calculation instruction and the third load instruction, the controller unit is specifically configured to:
    extract a first write region corresponding to the first storage instruction, extract a second read region and a second write region corresponding to the second calculation instruction, and extract a third read region corresponding to the third load instruction;
    if no overlapping region exists among the first write region, the second read region, the second write region and the third read region, determine that no dependency exists among the first storage instruction, the second calculation instruction and the third load instruction.
  8. The apparatus according to any one of claims 5-7, characterized in that the artificial intelligence computing apparatus further comprises a storage unit connected to an external storage device; the execution unit comprises a load execution unit, a calculation execution unit and a storage execution unit;
    in terms of executing the first storage instruction, the second calculation instruction and the third load instruction in parallel within the first time slice, the storage execution unit is configured to transfer, according to the first storage instruction, a first calculation result corresponding to first input data of the first operation task from the storage unit to the external storage device; the calculation execution unit is configured to perform calculation on second input data of the second operation task according to the second calculation instruction to obtain a second calculation result; and the load execution unit is configured to transfer third input data of the third operation task from the external storage device to the storage unit according to the third load instruction.
  9. The apparatus according to claim 8, characterized in that the storage unit comprises a first storage region and a second storage region; in terms of transferring the third input data of the third operation task from the external storage device to the storage unit according to the third load instruction, the load execution unit is specifically configured to:
    within the first time slice, transfer the third input data of the third operation task from the external storage device to the first storage region in a ping-pong manner according to the third load instruction.
  10. The apparatus according to claim 9, characterized in that the third input data comprises a plurality of pieces of third input sub-data; in terms of transferring the third input data of the third operation task from the external storage device to the first storage region in a ping-pong manner, the load execution unit is specifically configured to:
    estimate a target storage duration in the first storage region for each piece of the third input sub-data to obtain a plurality of target storage durations;
    transfer the pieces of third input sub-data corresponding to the plurality of target storage durations to the first storage region in descending order of storage duration, storing them from the two ends of the first storage region toward the middle.
  11. An artificial intelligence computing method, characterized in that it is applied to an artificial intelligence computing apparatus, the method comprising:
    obtaining a first instruction set to be executed, and obtaining a second instruction set;
    determining whether the first instruction set and the second instruction set constitute a loop body;
    when the first instruction set and the second instruction set constitute a loop body, executing the instructions in the second instruction set according to instruction information of the first instruction set.
  12. The method according to claim 11, characterized in that executing the instructions in the second instruction set according to the instruction information of the first instruction set comprises:
    jumping, according to a jump instruction, to an opcode storage region of the instructions corresponding to the first instruction set, obtaining the opcode of the first load instruction from the opcode storage region, using the opcode as the opcode of the second load instruction, and obtaining an operation field corresponding to the second load instruction, wherein the opcode comprises an identifier of the first calculation instruction, and the operation field comprises a storage address of the fourth input data.
  13. The method according to claim 11 or 12, characterized in that the first instruction set comprises a first load instruction, a first calculation instruction and a first storage instruction of a first operation task; the second instruction set comprises a second load instruction, a second calculation instruction and a second storage instruction of a second operation task; determining whether the first instruction set and the second instruction set constitute a loop body comprises:
    obtaining preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain a plurality of pieces of preset instruction information, the preset instruction information comprising at least one of: instruction type, remaining execution count, and whether parity is flipped;
    comparing first preset instruction information corresponding to the first load instruction with second preset instruction information corresponding to the second load instruction; comparing third preset instruction information corresponding to the first calculation instruction with fourth preset instruction information corresponding to the second calculation instruction; comparing fifth preset instruction information corresponding to the first storage instruction with sixth preset instruction information corresponding to the second storage instruction;
    if the first preset instruction information and the second preset instruction information differ only in operation count, the third preset instruction information and the fourth preset instruction information differ only in operation count, and the fifth preset instruction information and the sixth preset instruction information differ only in operation count, determining that the first instruction set and the second instruction set constitute a loop body.
  14. The method according to claim 11 or 12, characterized in that the first instruction set comprises a first storage instruction of a first operation task, a second calculation instruction of a second operation task and a third load instruction of a third operation task; the second instruction set comprises a second storage instruction of the second operation task, a third calculation instruction of the third operation task and a fourth load instruction of a fourth operation task; determining whether the first instruction set and the second instruction set constitute a loop body comprises:
    obtaining preset instruction information corresponding to each instruction in the first instruction set and the second instruction set to obtain a plurality of pieces of preset instruction information, the preset instruction information comprising at least one of: instruction type, remaining execution count, and whether parity is flipped;
    comparing fifth preset instruction information corresponding to the first storage instruction with sixth preset instruction information corresponding to the second storage instruction; comparing seventh preset instruction information corresponding to the second calculation instruction with eighth preset instruction information corresponding to the third calculation instruction; comparing ninth preset instruction information corresponding to the third load instruction with tenth preset instruction information corresponding to the fourth load instruction;
    if the fifth preset instruction information and the sixth preset instruction information differ only in operation count, the seventh preset instruction information and the eighth preset instruction information differ only in operation count, and the ninth preset instruction information and the tenth preset instruction information differ only in operation count, determining that the first instruction set and the second instruction set constitute a loop body.
  15. The method according to claim 14, characterized in that the method further comprises:
    determining whether a dependency exists among the first storage instruction, the second calculation instruction and the third load instruction;
    when no dependency exists among the first storage instruction, the second calculation instruction and the third load instruction, executing the first storage instruction, the second calculation instruction and the third load instruction in parallel within a first time slice.
  16. The method according to claim 15, characterized in that determining whether a dependency exists among the first storage instruction, the second calculation instruction and the third load instruction comprises:
    extracting a first storage address interval of the data required by the first storage instruction, extracting a second storage address interval of the data required by the second calculation instruction, and extracting a third storage address interval of the data required by the third load instruction; if no two of the first storage address interval, the second storage address interval and the third storage address interval have an overlapping region, determining that no dependency exists among the first storage instruction, the second calculation instruction and the third load instruction.
  17. The method according to claim 15, characterized in that determining whether a dependency exists among the first storage instruction, the second calculation instruction and the third load instruction comprises:
    extracting a first write region corresponding to the first storage instruction, extracting a second read region and a second write region corresponding to the second calculation instruction, and extracting a third read region corresponding to the third load instruction;
    if no overlapping region exists among the first write region, the second read region, the second write region and the third read region, determining that no dependency exists among the first storage instruction, the second calculation instruction and the third load instruction.
  18. The method according to any one of claims 15-17, characterized in that the artificial intelligence computing apparatus further comprises a storage unit connected to an external storage device; executing the first storage instruction, the second calculation instruction and the third load instruction in parallel within the first time slice comprises:
    transferring, according to the first storage instruction, a first calculation result corresponding to first input data of the first operation task from the storage unit to the external storage device;
    performing calculation on second input data of the second operation task according to the second calculation instruction to obtain a second calculation result;
    transferring third input data of the third operation task from the external storage device to the storage unit according to the third load instruction.
  19. The method according to claim 18, characterized in that the storage unit comprises a first storage region and a second storage region; transferring the third input data of the third operation task from the external storage device to the storage unit according to the third load instruction comprises:
    within the first time slice, transferring the third input data of the third operation task from the external storage device to the first storage region in a ping-pong manner according to the third load instruction.
  20. The method according to claim 19, characterized in that the third input data comprises a plurality of pieces of third input sub-data; transferring the third input data of the third operation task from the external storage device to the first storage region in a ping-pong manner comprises:
    estimating a target storage duration in the first storage region for each piece of the third input sub-data to obtain a plurality of target storage durations;
    transferring the pieces of third input sub-data corresponding to the plurality of target storage durations to the first storage region in descending order of storage duration, storing them from the two ends of the first storage region toward the middle.
PCT/CN2020/080447 2019-03-22 2020-03-20 Artificial intelligence computing apparatus and related product WO2020192587A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/440,529 US11983535B2 (en) 2019-03-22 2020-03-20 Artificial intelligence computing device and related product

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN201910226552.7 2019-03-22
CN201910226552.7A CN111723920B (zh) Artificial intelligence computing apparatus and related product
CN201910226678.4A CN111723921B (zh) Artificial intelligence computing apparatus and related product
CN201910226678.4 2019-03-22
CN201910316537.1 2019-04-18
CN201910316537.1A CN110070176A (zh) Offline model processing method, offline model processing apparatus and related product

Publications (1)

Publication Number Publication Date
WO2020192587A1 true WO2020192587A1 (zh) 2020-10-01

Family

ID=72610907

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/080447 WO2020192587A1 (zh) Artificial intelligence computing apparatus and related product

Country Status (2)

Country Link
US (1) US11983535B2 (zh)
WO (1) WO2020192587A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6282633B1 (en) * 1998-11-13 2001-08-28 Tensilica, Inc. High data density RISC processor
US20120079303A1 (en) * 2010-09-24 2012-03-29 Madduri Venkateswara R Method and apparatus for reducing power consumption in a processor by powering down an instruction fetch unit
CN107992329A (zh) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 一种计算方法及相关产品

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1103467C (zh) Macro-instruction-set symmetric parallel architecture microprocessor
EP1031076A1 (en) 1997-10-13 2000-08-30 Institute for the Development of Emerging Architectures, L.L.C. Method and apparatus for optimizing execution of load and store instructions
US7089075B2 (en) 2001-05-04 2006-08-08 Tokyo Electron Limited Systems and methods for metrology recipe and model generation
US8505002B2 (en) * 2006-09-29 2013-08-06 Arm Limited Translation of SIMD instructions in a data processing system
US20080162399A1 (en) * 2006-12-31 2008-07-03 Think Passenger, Inc. Consumer marketing platform
US8479185B2 (en) * 2010-12-09 2013-07-02 Oracle International Corporation Method and system for utilizing parallelism across loops
CN103957463A (zh) High-definition animation playback system for early childhood education
CN104866341B (zh) Component upgrading method, apparatus and terminal
US9443192B1 (en) * 2015-08-30 2016-09-13 Jasmin Cosic Universal artificial intelligence engine for autonomous computing devices and software applications
CN108734288B (zh) Operation method and apparatus
CN108764487B (zh) Method and apparatus for generating a model, and method and apparatus for recognizing information
CN108897587B (zh) Pluggable machine learning algorithm running method, apparatus and readable storage medium
CN109255234B (zh) Machine learning model processing method, apparatus, medium and electronic device
CN109634843B (zh) Distributed automated software testing method and platform for AI chip platforms

Also Published As

Publication number Publication date
US20220156077A1 (en) 2022-05-19
US11983535B2 (en) 2024-05-14


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20779830

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20779830

Country of ref document: EP

Kind code of ref document: A1