WO2019001418A1 - Data sharing system and data sharing method thereof - Google Patents

Data sharing system and data sharing method thereof

Info

Publication number
WO2019001418A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
neural network
task
unit
module
Prior art date
Application number
PCT/CN2018/092829
Other languages
English (en)
French (fr)
Inventor
陈天石
杜子东
刘少礼
王在
胡帅
周徐达
周聖元
郝一帆
高钰峰
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201710497394.XA external-priority patent/CN109117415B/zh
Priority claimed from CN201710515517.8A external-priority patent/CN109214616B/zh
Priority claimed from CN201710721049.XA external-priority patent/CN109426553A/zh
Priority claimed from CN201810407185.6A external-priority patent/CN110413551B/zh
Priority claimed from CN201810467383.1A external-priority patent/CN110502330A/zh
Priority claimed from CN201810641721.9A external-priority patent/CN110619390A/zh
Priority to EP18824582.3A priority Critical patent/EP3637272A4/en
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Publication of WO2019001418A1 publication Critical patent/WO2019001418A1/zh
Priority to US16/694,056 priority patent/US11687467B2/en
Priority to US16/693,999 priority patent/US11656910B2/en
Priority to US16/694,176 priority patent/US11726844B2/en
Priority to US16/693,918 priority patent/US10901815B2/en
Priority to US16/693,956 priority patent/US11537843B2/en
Priority to US16/694,124 priority patent/US20200089535A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/167Interprocessor communication using a common memory, e.g. mailbox
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22Microcontrol or microprogram arrangements
    • G06F9/223Execution means for microinstructions irrespective of the microinstruction function, e.g. decoding of microinstructions and nanoinstructions; timing of microinstructions; programmable logic arrays; delays and fan-out problems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45Caching of specific data in cache memory
    • G06F2212/452Instruction code
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/54Store-and-forward switching systems 
    • H04L12/56Packet switching systems
    • H04L12/5601Transfer mode dependent, e.g. ATM
    • H04L2012/5686Use of neural networks

Definitions

  • the present disclosure relates to a sharing system, and more particularly to a data sharing system and a data sharing method thereof.
  • machine learning which can be used for deep learning or other
  • ASIC module: application specific integrated circuit module
  • SRAM: static random access memory
  • DRAM: dynamic random access memory
  • SRAM-like cache: SRAM used as a cache (Cache)
  • AXI: Advanced eXtensible Interface
  • the main object of the present disclosure is to provide a data sharing system and a data sharing method thereof for solving at least one of the above technical problems.
  • the present disclosure proposes a data sharing system including a storage module and at least two processing modules, wherein:
  • At least two processing modules share a storage module
  • At least two processing modules communicate through preset rules to achieve data sharing.
  • the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
  • the foregoing communicating through the preset rules includes: the at least two processing modules include a first processing module and a second processing module; the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module returns a valid signal and the data to the first processing module according to the request signal and the corresponding data address, so as to implement data sharing.
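For readers unfamiliar with request/valid handshaking, the exchange described in the bullet above can be pictured with a minimal Python sketch; the class and signal names are illustrative, not taken from the disclosure:

```python
# Minimal sketch of the request/valid handshake between two processing
# modules that share data directly (names are illustrative, not from the patent).

class ProcessingModule:
    def __init__(self, name):
        self.name = name
        self.local_data = {}          # data held privately by this module

    def handle_request(self, request_signal, data_address):
        """Second module's side: on a request, return (valid signal, data)."""
        if request_signal and data_address in self.local_data:
            return True, self.local_data[data_address]   # valid signal + data
        return False, None                               # request cannot be served

    def read_shared(self, other, data_address):
        """First module's side: send request signal + data address, wait for valid + data."""
        valid, data = other.handle_request(True, data_address)
        if valid:
            self.local_data[data_address] = data         # data is now shared locally
        return valid, data


if __name__ == "__main__":
    m1, m2 = ProcessingModule("module_1"), ProcessingModule("module_2")
    m2.local_data[0x40] = [1.0, 2.0, 3.0]                # data owned by module 2
    print(m1.read_shared(m2, 0x40))                      # -> (True, [1.0, 2.0, 3.0])
```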
  • the at least two processing modules described above comprise a physical processor.
  • the physical processor described above includes a neural network processor.
  • the neural network processor described above includes means for performing an artificial neural network forward operation.
  • the apparatus for performing an artificial neural network forward operation includes an instruction cache unit and a direct memory access unit, wherein:
  • the instruction cache unit is configured to read in an instruction through a direct memory access unit and cache the read instruction.
  • the foregoing apparatus for performing an artificial neural network forward operation further includes:
  • controller unit for reading an instruction from the instruction cache unit and decoding the instruction into a microinstruction.
  • the apparatus for performing an artificial neural network forward operation further includes an H-tree module, a main operation module, and a plurality of slave operation modules; the H-tree module may include a branch processing module, wherein:
  • the main operation module is connected to the branch processing module, and the branch processing module is connected to the plurality of slave operation modules;
  • the branch processing module is configured to forward data or instructions between the main operation module and the slave operation modules.
  • the direct memory access unit is further configured to write data from an external address space to the corresponding data cache units of the main operation module and of each slave operation module, or to read data from the data cache units to the external address space.
  • the at least two processing modules include two processors of mutually different structures; one of the two differently structured processors is a neural network processor.
  • the at least two processing modules include at least two processor cores of the processor; the at least two processor cores are processor cores of the same/different structure.
  • the at least two processing modules include at least two arithmetic units of the processor core; the at least two arithmetic units are arithmetic units of the same/different structure.
  • the sharing system further includes:
  • At least two storage units are respectively connected to at least one of the at least two operation units, and any one of the at least two operation units is connected to the one or more storage units; and at least two storage units share the storage module.
  • the at least two operation units share the same storage unit, or each exclusively uses its own storage unit, or some of them share the same storage unit while the others exclusively use their own storage units.
  • the at least two processing modules include three arithmetic units of the processor core, and the storage units number two, wherein two of the arithmetic units are both connected to one of the storage units, and the other arithmetic unit is connected to the other storage unit.
  • the present disclosure proposes a data sharing method including the following steps:
  • At least two processing modules communicate through preset rules to implement data sharing
  • the two processing modules share a storage module.
  • the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
  • the foregoing communicating through the preset rules includes: the at least two processing modules include a first processing module and a second processing module; the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module returns a valid signal and the data to the first processing module according to the request signal and the corresponding data address, so as to implement data sharing.
  • the at least two processing modules described above comprise a physical processor.
  • the physical processor described above includes a neural network processor.
  • the neural network processor described above includes means for performing an artificial neural network forward operation.
  • the apparatus for performing an artificial neural network forward operation includes an instruction cache unit and a direct memory access unit, wherein:
  • the instruction cache unit reads the instruction through the direct memory access unit and caches the read instruction.
  • the apparatus for performing an artificial neural network forward operation further includes a controller unit that reads an instruction from the instruction cache unit and decodes the instruction generation microinstruction.
  • the apparatus for performing an artificial neural network forward operation further includes an H-tree module, a main operation module, and a plurality of slave operation modules, wherein:
  • at the stage where each layer of the neural network reverse training begins calculation, the main operation module transmits the input neuron vector of that layer to all slave operation modules through the H-tree module; after the calculation of the slave operation modules is completed, the H-tree module progressively combines the output neuron values of the respective slave operation modules into an intermediate result vector;
  • the main operation module uses the intermediate result vector to complete the subsequent calculation.
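The progressive combination performed by the H-tree module can be pictured with a minimal sketch, assuming each slave operation module delivers one partial output vector; the merge rule shown (concatenation, level by level) is an illustrative assumption:

```python
# Sketch of how an H-tree module could progressively combine the output neuron
# values of the slave operation modules into one intermediate result vector.
# "Combine" is modelled here as concatenation of partial vectors, level by level;
# the disclosure only states that the values are progressively combined.

def htree_combine(slave_outputs):
    """Combine a list of per-slave output vectors pairwise until one vector remains."""
    level = slave_outputs
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(level[i] + level[i + 1])   # merge two children
            else:
                next_level.append(level[i])                   # odd node passes through
        level = next_level
    return level[0]                                           # intermediate result vector


if __name__ == "__main__":
    slaves = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    print(htree_combine(slaves))   # one vector containing all slave outputs
```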
  • the direct memory access unit further writes data from the external address space to the corresponding data buffer unit of the main operation module and the respective slave operation modules, or reads data from the data buffer unit to the external address space.
  • the at least two processing modules include two processors of mutually different structures; one of the two differently structured processors is a neural network processor.
  • the at least two processing modules include at least two processor cores of the processor; the at least two processor cores are processor cores of the same/different structure.
  • the at least two processing modules include at least two arithmetic units of the processor core; the at least two arithmetic units are arithmetic units of the same/different structure.
  • the data sharing method further adopts:
  • At least two storage units are respectively connected to at least one of the at least two operation units, and any one of the at least two operation units is connected to the one or more storage units; and at least two storage units share the storage module.
  • the at least two operation units share the same storage unit, or each exclusively uses its own storage unit, or some of them share the same storage unit while the others exclusively use their own storage units.
  • the at least two processing modules include three arithmetic units of the processor core, and the storage units number two, wherein two of the arithmetic units are both connected to one of the storage units, and the other arithmetic unit is connected to the other storage unit.
  • An aspect of the present disclosure provides an information processing apparatus including a storage module and a data processing module, wherein the storage module is configured to receive and store input data, instructions, and output data, the input data including one or more key features, and the data processing module is configured to judge the key features included in the input data and to score the input data in the storage module according to the judgment result.
  • the input data is original input data, or data preprocessed on the original input data.
  • the data processing module determines the key features included in the input data, including: the data processing module calculates a confidence level of the key features included in the input data, and the confidence is the determination result.
  • the storage module stores data and instructions; the data includes input data, input neurons, weights, output neurons, and output data; the input data is transmitted to each input neuron in the artificial neural network so as to participate in the subsequent operation; the values of the output neurons, that is, the judgment result and the score, serve as the output data.
  • the data processing module includes an operation module, configured to perform a corresponding calculation on the data stored in the storage module according to the instruction stored in the storage module, and output the operation result to the storage module.
  • the operation module is configured to perform corresponding calculation on the data stored in the storage module according to the instruction stored in the storage module.
  • the operation of the operation module includes:
  • the first part is a multiplier
  • the second part is one or more adders
  • the third part is the activation function unit
  • the fourth part is the vector processing unit.
  • the plurality of adders constitute an addition tree.
  • the activation function is sigmoid, tanh, relu, softmax.
  • the fourth part is a vector processing unit, and the vector processing unit performs a pooling operation.
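As a rough, purely illustrative sketch of the four-part operation path listed above (multipliers, an adder tree, an activation function unit, and a vector unit performing pooling), written with hypothetical function names:

```python
import math

# Illustrative sketch of the four parts of the operation module:
# 1) multipliers, 2) an adder tree, 3) an activation function unit, 4) a vector
# (pooling) unit.  All names are hypothetical; only the structure follows the text.

def multiply(inputs, weights):                    # part 1: element-wise multipliers
    return [x * w for x, w in zip(inputs, weights)]

def adder_tree(values):                           # part 2: adders arranged as a tree
    while len(values) > 1:
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

def activation(x, kind="sigmoid"):                # part 3: sigmoid / tanh / relu
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(-x))
    if kind == "tanh":
        return math.tanh(x)
    if kind == "relu":
        return max(0.0, x)
    raise ValueError(kind)

def max_pool(vector, window):                     # part 4: vector unit doing pooling
    return [max(vector[i:i + window]) for i in range(0, len(vector), window)]


if __name__ == "__main__":
    neuron = activation(adder_tree(multiply([0.5, -1.0, 2.0], [0.2, 0.4, 0.1])))
    print(neuron, max_pool([1.0, 3.0, 2.0, 0.5], window=2))
```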
  • the data processing module further includes an instruction cache and a neural network data cache; the instruction cache is used for caching instructions; the neural network data cache is used for caching the weight data, input neurons, and output neurons in the storage module.
  • the neural network data cache includes a weight cache, an input neuron cache, and an output neuron cache; the weight cache is used for caching weight data; the input neuron cache is used for caching input neurons; and the output neuron cache is used for caching the operation result output by the operation module, that is, the judgment result and/or the score.
  • the data processing module further includes a direct memory access (DMA) unit, which serves as a bridge between the storage module and each cache, for reading and writing the data and/or instructions stored in the storage module; it stores the read instructions into the instruction cache, stores the read weights into the weight cache, stores the read input neurons, i.e., the input data, into the input neuron cache, and receives the output neurons, i.e., the judgment result and/or the score, from the output neuron cache and stores them to the storage module.
  • the data processing module further includes a control unit, configured to read an instruction from the instruction cache, decode the instruction into an instruction executable by the operation module, and output the instruction to the operation module.
  • the data processing module further includes a scoring unit; when the artificial neural network running in the information processing device obtains the judgment result and then further obtains the score, the scoring unit does not participate in the data processing; when the artificial neural network running in the information processing device only obtains the judgment result without obtaining the score, the scoring unit is used to obtain the score according to the judgment result.
  • the judgment result is the values of the output neurons of the final output layer of the artificial neural network running in the information processing device; the value of an output neuron is the confidence of the occurrence of a key feature, the confidence being a natural number within a certain range;
  • the score may be obtained by adding a new final output layer after the final output layer of the artificial neural network running in the information processing device, the input neuron values of the new final output layer being the confidences of the occurrence of each key feature; this layer has only one output neuron, whose value is the score, and the weights in the operation of the new final output layer correspond to the importance of each key feature; alternatively, the layer has N+1 output neurons and the score value range is [0, N].
  • alternatively, the score is obtained as follows: the final output layer of the artificial neural network running in the information processing device obtains the confidence of the occurrence of each key feature, which is used as the input of the scoring unit, and the scoring unit obtains the score accordingly.
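The scoring variants described above can be illustrated with a short sketch, assuming the key-feature confidences have already been produced by the final output layer; the weights and the [0, N] rule below are illustrative assumptions, not values from the disclosure:

```python
# Sketch of the scoring step: the confidences of the key features (outputs of the
# original final output layer) are fed to one extra layer or to a scoring unit.
# Variant A: a new final output layer with a single output neuron whose weights
#            encode the importance of each key feature.
# Variant B: a scoring unit mapping the confidences into a score in [0, N].
# The weight values and the thresholding rule below are illustrative assumptions.

def score_single_neuron(confidences, importance_weights):
    """Variant A: one output neuron; score = weighted sum of feature confidences."""
    return sum(c * w for c, w in zip(confidences, importance_weights))

def score_range_0_to_N(confidences, threshold=0.5):
    """Variant B (one possible rule): score in [0, N] = number of key features
    whose confidence exceeds a threshold, N being the number of key features."""
    return sum(1 for c in confidences if c > threshold)


if __name__ == "__main__":
    confidences = [0.9, 0.2, 0.7]            # confidence that each key feature appears
    print(score_single_neuron(confidences, [0.5, 0.3, 0.2]))   # -> 0.65
    print(score_range_0_to_N(confidences))                     # -> 2 (out of N = 3)
```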
  • the information processing device is an artificial neural network chip.
  • Another aspect of the present disclosure provides an information processing method using the above information processing apparatus, including:
  • the storage module receives and stores input data, instructions, and output data, wherein the input data includes one or more key features
  • the data processing module judges the key features included in the input data, and scores the input data in the storage module according to the judgment result.
  • the input data uses original input data or data preprocessed with original input data.
  • the data processing module determines the key features included in the input data, including: the data processing module calculates a confidence level of the key features included in the input data, and the confidence is the determination result.
  • the storage module stores data and instructions; the data includes input data, input neurons, weights, output neurons, and output data; the input data is transmitted to each input neuron in the artificial neural network so as to participate in the subsequent operation; the values of the output neurons, that is, the judgment result and the score, serve as the output data.
  • the data processing module includes an operation module, and the operation module performs a corresponding calculation on the data stored in the storage module according to the instruction stored in the storage module, and outputs the operation result to the storage module.
  • the operation module performs a corresponding calculation on the data stored in the storage module according to the instruction stored in the storage module.
  • the operation of the operation module includes:
  • the first part is a multiplier
  • the second part is one or more adders
  • the third part is the activation function unit
  • the fourth part is the vector processing unit.
  • the plurality of adders constitute an addition tree.
  • the activation function adopts sigmoid, tanh, relu, and softmax.
  • the fourth part is a vector processing unit, and the vector processing unit performs a pooling operation.
  • the data processing module further includes an instruction cache and a neural network data cache; the instruction cache is used to cache instructions, and the neural network data cache is used to cache the weight data, input neurons, and output neurons in the storage module.
  • the neural network data cache includes a weight cache, an input neuron cache, and an output neuron cache; the weight cache is used to cache weight data, the input neuron cache is used to cache input neurons, and the output neuron cache is used to cache the operation result output by the operation module, that is, the judgment result and/or the score.
  • the data processing module further includes a direct memory access (DMA) unit, which serves as a bridge between the storage module and each cache and reads and writes the data and/or instructions stored in the storage module; it stores the read instructions into the instruction cache, stores the read weights into the weight cache, stores the read input neurons, i.e., the input data, into the input neuron cache, and receives the output neurons, i.e., the judgment result and/or the score, from the output neuron cache and stores them to the storage module.
  • the data processing module further includes a control unit that reads an instruction in the instruction cache, decodes the instruction into an instruction that can be executed by the operation module, and outputs the instruction to the operation module.
  • the data processing module further includes a scoring unit; when the artificial neural network running in the information processing device obtains the judgment result and then further obtains the score, the scoring unit does not participate in the data processing; when the artificial neural network running in the information processing device only obtains the judgment result without obtaining the score, the scoring unit obtains the score according to the judgment result.
  • the judgment result is the values of the output neurons of the final output layer of the artificial neural network running in the information processing device; the value of an output neuron is the confidence of the occurrence of a key feature, the confidence being a natural number within a certain range;
  • the score may be obtained by adding a new final output layer after the final output layer of the artificial neural network running in the information processing device, the input neuron values of the new final output layer being the confidences of the occurrence of each key feature; this layer has only one output neuron, whose value is the score, and the weights in the operation of the new final output layer correspond to the importance of each key feature; alternatively, the layer has N+1 output neurons and the score value range is [0, N].
  • alternatively, the score is obtained as follows: the final output layer of the artificial neural network running in the information processing device obtains the confidence of the occurrence of each key feature, which is used as the input of the scoring unit, and the scoring unit obtains the score accordingly.
  • the information processing device used in the information processing method is an artificial neural network chip.
  • Still another aspect of the present disclosure also provides an information processing system including an information acquiring apparatus, the information processing apparatus, an interaction interface, and a control apparatus, wherein:
  • An information acquiring device configured to acquire external data and transmit the data to the information processing device
  • An information processing device configured to perform arithmetic processing on external data received from the information acquiring device, and output the operation processing result to the interaction interface;
  • An interactive interface for displaying an operation result received from the information processing device, and transmitting an operation or command received from the outside to the control device;
  • control device configured to control the operation of the information acquiring device, the information processing device, and the interaction interface according to the operation or command received from the interactive interface.
  • the information acquiring device includes a character recognition device, an image recognition device, and a voice recognition device;
  • the character recognition device is configured to acquire text information in external data
  • the image recognition device is configured to acquire image or video information in external data
  • the voice recognition device is configured to acquire audio information in external data.
  • the interactive interface is a display screen of a mobile phone, a computer, a notebook or a tablet.
  • a further aspect of the present disclosure also provides an information processing method, using the information processing system, including:
  • the information acquiring device acquires external data and transmits it to the information processing device directly or after preprocessing;
  • the information processing device performs arithmetic processing on the external data received from the information acquiring device or the preprocessed external data, and outputs the operation processing result to the interactive interface;
  • the interactive interface displays the result of the operation received from the information processing device.
  • the information acquiring device includes a character recognition device, an image recognition device, and a voice recognition device, and the information acquisition device acquires external data, including:
  • the information acquiring device uses the character recognition device to acquire the text information in the external data
  • the information acquiring device uses the image recognition device to acquire picture or video information in the external data
  • the information acquisition device uses the voice recognition device to acquire audio information in the external data.
  • a task segmentation apparatus including: a granular task segmentation unit for segmenting a task at at least one granularity to form subtasks; and a task segmentation granularity selection unit for choosing the granularity to be adopted.
  • the task segmentation device is used in a neural network
  • the granular task segmentation unit includes at least one of the following units:
  • a first granularity task segmentation unit, for treating the task as a whole as one subtask;
  • a second granularity task segmentation unit, for segmenting the task by taking the calculation of a part of the samples in the task as a subtask;
  • a third granularity task segmentation unit, for performing task segmentation according to the layer types of the neural network, with the calculation of layers of the same type as a subtask;
  • a fourth granularity task segmentation unit, for performing task segmentation according to the inter-layer structure of the neural network, with the calculation of several adjacent layers as a subtask;
  • a fifth granularity task segmentation unit, for performing task segmentation according to the intra-layer structure of the neural network, splitting the calculation within a neural network layer into subtasks.
  • the task segmentation granularity selection unit selects at least one of the first to fifth granularity task segmentation units to perform task segmentation, based on at least one of the number of samples that the neural network needs to process, the topology of the neural network, and the calculation amount of each layer.
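As a rough illustration of how a granularity selection unit might pick one of the five segmentation granularities from the quantities named above (number of samples, topology, per-layer computation); the thresholds and flops estimates are purely illustrative assumptions:

```python
# Illustrative selection among the five task segmentation granularities.
# The thresholds and flops estimates are assumptions of this sketch; the
# disclosure only states that the choice may depend on the number of samples to
# process, the topology of the neural network, and the per-layer computation.

def select_granularity(num_samples, num_layers, max_layer_flops, flops_budget=1e9):
    if num_samples > 1:
        return 2                    # a batch of samples: split by sample subsets
    if num_layers * max_layer_flops < flops_budget:
        return 1                    # small network, one sample: keep the task whole
    if max_layer_flops < flops_budget:
        return 4                    # split between (groups of adjacent) layers;
                                    # granularity 3 (by layer type) is a variant of this
    return 5                        # a single layer is too large: split inside the layer


def split_by_samples(sample_ids, samples_per_subtask):
    """Granularity 2: each subtask covers the computation of part of the samples."""
    return [sample_ids[i:i + samples_per_subtask]
            for i in range(0, len(sample_ids), samples_per_subtask)]


if __name__ == "__main__":
    print(select_granularity(num_samples=64, num_layers=20, max_layer_flops=5e8))  # -> 2
    print(select_granularity(num_samples=1, num_layers=20, max_layer_flops=5e8))   # -> 4
    print(split_by_samples(list(range(10)), samples_per_subtask=4))
```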
  • performing task segmentation according to an intra-layer structure of a neural network includes performing task segmentation on a convolutional layer calculation of a neural network, a fully connected layer calculation, a pooled layer calculation, or an activation layer calculation.
  • the segmenting of the convolutional layer calculation of the neural network comprises: when the convolutional layer input neurons of the neural network are three-dimensional matrices (Nfin, Nxin, Nyin), the weights are four-dimensional matrices (Nfout, Nfin, Kx, Ky), and the output neurons are three-dimensional matrices (Nfout, Nxout, Nyout), where Nfin is the number of input feature images, (Nxin, Nyin) is the input feature image size, and Nfout is the number of output feature images.
  • (Kx, Ky) is the convolution kernel size
  • (Nxout, Nyout) is the output feature image size
  • Nfin, Nxin, Nyin, Kx, Ky, Nfout, Nxout, Nyout are positive integers
  • the output neurons are divided into blocks according to a block size of (Bfout, Bxout, Byout), and the weights are divided into blocks according to a block size of (Bfout, Bfin, Bx, By), where Bfout, Bxout, Byout, Bfin, Bx, and By are positive integers.
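A minimal sketch of this intra-layer segmentation of a convolutional layer: the output neuron space (Nfout, Nxout, Nyout) is tiled into blocks of size (Bfout, Bxout, Byout), and each block forms one subtask; the subtask record format below is an assumption of the sketch:

```python
# Sketch: enumerate the subtasks obtained by blocking the output neurons of a
# convolutional layer into (Bfout, Bxout, Byout) blocks.  Each subtask is
# described by the half-open ranges of output feature maps and output pixels it
# covers; the corresponding weight block spans (Bfout, Bfin, Bx, By).  The record
# layout used here is an illustrative assumption.

def ceil_div(a, b):
    return -(-a // b)

def split_conv_output(Nfout, Nxout, Nyout, Bfout, Bxout, Byout):
    subtasks = []
    for f in range(0, Nfout, Bfout):
        for x in range(0, Nxout, Bxout):
            for y in range(0, Nyout, Byout):
                subtasks.append({
                    "fout": (f, min(f + Bfout, Nfout)),
                    "xout": (x, min(x + Bxout, Nxout)),
                    "yout": (y, min(y + Byout, Nyout)),
                })
    assert len(subtasks) == (ceil_div(Nfout, Bfout)
                             * ceil_div(Nxout, Bxout)
                             * ceil_div(Nyout, Byout))
    return subtasks


if __name__ == "__main__":
    tasks = split_conv_output(Nfout=64, Nxout=28, Nyout=28, Bfout=16, Bxout=14, Byout=14)
    print(len(tasks), tasks[0])    # 16 subtasks; first covers fout 0..16, a 14x14 tile
```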
  • a task processing apparatus including: a task segmentation apparatus; and a task scheduling apparatus, the task scheduling apparatus comprising: a task queue unit for buffering unscheduled tasks; a monitoring unit for monitoring the working state of each core of a multi-core processor in real time; and a task scheduling unit for selecting a task to be scheduled from the unscheduled tasks and allocating the selected task to a target core according to the working state of each core.
  • the task scheduling unit allocates the task to be scheduled to the target core by using at least one of the following methods: counting the number of tasks in each core's private task queue and selecting the core with the fewest tasks in its private task queue as the target core; counting the time for each core to complete all tasks in its private task queue and selecting the core with the shortest completion time as the target core; counting the distribution across all cores of the resources required by the task to be scheduled and selecting the core holding the largest amount of such resources as the target core; and using a heuristic algorithm to assign the task to be scheduled to the target core.
  • the heuristic algorithm includes at least one of a genetic algorithm, an ant colony algorithm, and a simulated annealing algorithm.
  • the task scheduling unit performs task scheduling every time period T, and the task to be scheduled is selected by using at least one of the following methods: randomly selecting an unscheduled task; selecting the unscheduled task with the longest expected execution time; selecting the unscheduled task with the shortest expected execution time; selecting the unscheduled task occupying the most resources; or selecting the unscheduled task occupying the least resources.
  • the core operating states include at least one of a utilization rate, a workload, an operating frequency, a number of tasks in a private task queue in the core, and a task completion time in the core.
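Two of the target-core selection rules listed above (fewest tasks in the private task queue, shortest estimated completion time) can be sketched as follows; the core state fields are assumptions of this sketch:

```python
# Illustrative target-core selection for the task scheduling unit.
# Each core is modelled by its private task queue and an estimated time per task;
# these fields and the selection helpers are assumptions of this sketch.

class Core:
    def __init__(self, core_id, queued_tasks, est_time_per_task):
        self.core_id = core_id
        self.queued_tasks = queued_tasks            # tasks already in the private queue
        self.est_time_per_task = est_time_per_task  # rough per-task completion time

    def queue_length(self):
        return len(self.queued_tasks)

    def completion_time(self):
        return self.queue_length() * self.est_time_per_task


def pick_core_fewest_tasks(cores):
    """Rule 1: the core with the fewest tasks in its private queue is the target."""
    return min(cores, key=Core.queue_length)

def pick_core_shortest_time(cores):
    """Rule 2: the core that will finish its private queue soonest is the target."""
    return min(cores, key=Core.completion_time)


if __name__ == "__main__":
    cores = [Core(0, ["t1", "t2"], 3.0), Core(1, ["t3"], 8.0), Core(2, [], 5.0)]
    print(pick_core_fewest_tasks(cores).core_id)    # -> 2
    print(pick_core_shortest_time(cores).core_id)   # -> 2
```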
  • a multi-core processor comprising: J processing cores, J being a positive integer; and task processing means.
  • the topology between the processing cores is at least one of a one-dimensional linear, two-dimensional mesh, a two-dimensional star, and a three-dimensional cube.
  • the processing core comprises a neural network processing core, the neural network processing core comprising: a storage unit for storing the neurons, weights, and instructions of the neural network; a selecting unit for receiving input neurons and non-zero weight position information and selecting the neurons corresponding to the non-zero weights; an operation unit for receiving the selected neurons corresponding to the non-zero weights and the corresponding non-zero weights and completing the neural network training operation; and a control unit for receiving the instructions of the neural network and generating control information after decoding, to control the selecting unit and the operation unit.
  • the instructions include at least one of a control instruction, a data transfer instruction, an operation instruction, and a logic instruction.
  • the operation instruction is used to complete an arithmetic operation of the neural network and includes at least one of a matrix operation instruction, a vector operation instruction, a scalar operation instruction, a convolutional neural network operation instruction, a fully connected neural network operation instruction, a pooling neural network operation instruction, an RBM neural network operation instruction, an LRN neural network operation instruction, an LCN neural network operation instruction, an LSTM neural network operation instruction, an RNN neural network operation instruction, a RELU neural network operation instruction, a PRELU neural network operation instruction, a SIGMOID neural network operation instruction, a TANH neural network operation instruction, and a MAXOUT neural network operation instruction.
  • a task segmentation method for a neural network, in which at least one of the following task segmentation modes is selected for task segmentation: treating the task as a whole as one subtask; segmenting the task by taking the calculation of a part of the samples in the task as a subtask; performing task segmentation according to the layer types of the neural network, with the calculation of layers of the same type as a subtask; performing task segmentation according to the inter-layer structure of the neural network, with the calculation of several adjacent layers as a subtask; and performing task segmentation according to the intra-layer structure of the neural network, splitting the calculation within a neural network layer into subtasks.
  • At least one of the task segmentation modes is selected for task segmentation based on at least one of the number of samples that the neural network needs to process, the topology of the neural network, and the calculation amount of each layer.
  • performing task segmentation according to an intra-layer structure of a neural network includes performing task segmentation on a convolutional layer calculation of a neural network, a fully connected layer calculation, a pooled layer calculation, or an activation layer calculation.
  • the segmenting of the convolutional layer calculation of the neural network comprises: when the convolutional layer input neurons of the neural network are three-dimensional matrices (Nfin, Nxin, Nyin), the weights are four-dimensional matrices (Nfout, Nfin, Kx, Ky), and the output neurons are three-dimensional matrices (Nfout, Nxout, Nyout), where Nfin is the number of input feature images, (Nxin, Nyin) is the input feature image size, and Nfout is the number of output feature images.
  • (Kx, Ky) is the convolution kernel size
  • (Nxout, Nyout) is the output feature image size
  • Nfin, Nxin, Nyin, Kx, Ky, Nfout, Nxout, Nyout are positive integers
  • the output neurons are divided into blocks according to a block size of (Bfout, Bxout, Byout), and the weights are divided into blocks according to a block size of (Bfout, Bfin, Bx, By), where Bfout, Bxout, Byout, Bfin, Bx, and By are positive integers.
  • a task processing method includes: a task segmentation method; and a task scheduling method, the task scheduling method including: caching unscheduled tasks, the tasks including the subtasks divided by the task segmentation apparatus described above; monitoring the working state of each core of the multi-core processor in real time; and selecting the task to be scheduled from the unscheduled tasks and allocating the selected task to a target core according to the working state of each core.
  • the allocating of the task to be scheduled to the target core is performed by at least one of the following methods: counting the number of tasks in each core's private task queue and selecting the core with the fewest tasks in its private task queue as the target core; counting the time for each core to complete all tasks in its private task queue and selecting the core with the shortest completion time as the target core; counting the distribution across all cores of the resources required by the task to be scheduled and selecting the core holding the largest amount of such resources as the target core; and using a heuristic algorithm to assign the task to be scheduled to the target core.
  • the heuristic algorithm includes at least one of a genetic algorithm, an ant colony algorithm, and a simulated annealing algorithm.
  • task scheduling is performed every time period T, and the task to be scheduled is selected by at least one of the following methods: randomly selecting an unscheduled task; selecting the unscheduled task with the longest expected execution time; selecting the unscheduled task with the shortest expected execution time; selecting the unscheduled task occupying the most resources; or selecting the unscheduled task occupying the least resources.
  • the core operating states include at least one of a utilization rate, a workload, an operating frequency, a number of tasks in a private task queue in the core, and a task completion time in the core.
  • a processor including:
  • a task segmentation device for performing task segmentation according to task segmentation granularity
  • the hardware resource dividing device is configured to divide the hardware resources of the processor according to the task segmentation result.
  • the processor further includes a plurality of computing units, and the hardware resource dividing device is configured to divide the plurality of computing units of the processor according to the task segmentation result, that is, the plurality of computing units are divided into a plurality of computing groups according to the task segmentation result, to respectively compute different forward and backward paths in a batch, or to serve requests of different services.
  • the processor dynamically adjusts the grouping of the plurality of computing units according to the task segmentation result during operation.
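The division of the processor's computing units into groups according to the task segmentation result can be sketched as below; the proportional allocation policy is an illustrative assumption, the disclosure only states that the units are grouped to compute different batch paths or serve different service requests:

```python
# Sketch of a hardware resource dividing step: the processor's computing units
# are partitioned into groups, one group per subtask (e.g. per forward/backward
# path of a batch slice or per service request).  Allocating units in proportion
# to a subtask "weight" is an illustrative policy, not taken from the disclosure.

def divide_compute_units(num_units, subtask_weights):
    total = sum(subtask_weights)
    # give each subtask at least one unit, then distribute roughly proportionally
    shares = [max(1, round(num_units * w / total)) for w in subtask_weights]
    while sum(shares) > num_units:                 # trim if rounding over-allocated
        shares[shares.index(max(shares))] -= 1
    groups, next_unit = [], 0
    for share in shares:
        groups.append(list(range(next_unit, next_unit + share)))
        next_unit += share
    return groups                                   # unit indices of each computing group


if __name__ == "__main__":
    # e.g. 8 computing units shared by three subtasks of different size
    print(divide_compute_units(8, subtask_weights=[4, 2, 2]))
    # -> [[0, 1, 2, 3], [4, 5], [6, 7]]
```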
  • the task segmentation device comprises:
  • a task segmentation granularity selection unit for selecting a granularity to be employed
  • a granular task segmentation unit is configured to segment a task into at least one granularity to form a subtask.
  • the granular task segmentation unit comprises at least one of the following elements:
  • a first granularity task segmentation unit for using the task as a subtask as a whole;
  • a second granularity task segmentation unit configured to divide a part of the sample in the selected task as a subtask to segment the task
  • the third granularity task segmentation unit is configured to perform task segmentation according to a layer type of the neural network, and calculate the same type layer as a subtask;
  • the fourth granularity task segmentation unit is configured to perform task segmentation according to the inter-layer structure of the neural network, and calculate the calculation of several adjacent layers as a sub-task;
  • the fifth granularity task segmentation unit is configured to perform task segmentation according to the intra-layer structure of the neural network, and divide the calculation in the neural network layer into sub-tasks.
  • the task segmentation granularity selection unit selects at least one of the first to fifth granularity task segmentation units to perform task segmentation, based on at least one of the number of samples that the neural network needs to process, the topology of the neural network, and the calculation amount of each layer.
  • the processor further includes: a task scheduling device; wherein the processor is a multi-core processor; the task scheduling device includes:
  • a task queue unit for caching unscheduled tasks
  • a monitoring unit for monitoring the working status of each core in real time
  • the task scheduling unit is configured to select a task to be scheduled from the unscheduled tasks and to allocate the selected task to the target core according to the working state of each core.
  • the task scheduling unit allocates the task to be scheduled to the target core in at least one of the following manners: counting the number of tasks in each core's private task queue and selecting the core with the fewest tasks in its private task queue as the target core; counting the time for each core to complete all tasks in its private task queue and selecting the core with the shortest completion time as the target core; counting the distribution across all cores of the resources required by the task to be scheduled and selecting the core holding the largest amount of such resources as the target core; and using a heuristic algorithm to assign the task to be scheduled to the target core.
  • a combined processing apparatus includes the processor, a universal interconnect interface, and other processing devices, which interact with one another to perform user-specified computing operations.
  • a neural network chip comprising the processor or the combined processing device.
  • an electronic device wherein the electronic device includes the chip.
  • a processing method comprising:
  • the task segmentation device performs task segmentation according to the task segmentation granularity
  • the hardware resource dividing device divides the hardware resources of the processor according to the task segmentation result.
  • in the step of the hardware resource dividing device dividing the hardware resources of the processor according to the task segmentation result: the hardware resource dividing device divides the plurality of computing units of the processor according to the task segmentation result, that is, the plurality of computing units are divided into a plurality of computing groups according to the task segmentation result, to respectively compute the forward and backward paths of different batches, or to serve requests of different services.
  • the processor dynamically adjusts the grouping of the plurality of computing units according to the task segmentation result during operation.
  • the step of the task segmentation device performing task segmentation according to the task segmentation granularity includes:
  • the task segmentation granularity selection unit selects the task segmentation granularity
  • the granular task segmentation unit splits the tasks of the divided hardware resources into at least one of the granularities to form a subtask.
  • the task segmentation granularity selection unit selects at least one of the granularity task segmentation units to split the task, based on at least one of the number of samples that the neural network needs to process, the topology of the neural network, and the calculation amount of each layer.
  • the processing method further includes: after the task is split, scheduling the task, including:
  • the tasks to be scheduled are selected from the unscheduled tasks, and the scheduled tasks to be scheduled are allocated to the target core according to the working states of the cores.
  • At least one of the following manners is used to allocate the task to be scheduled to the target core: counting the number of tasks in each core's private task queue and selecting the core with the fewest tasks in its private task queue as the target core; counting the time for each core to complete all tasks in its private task queue and selecting the core with the shortest completion time as the target core; counting the distribution across all cores of the resources required by the task to be scheduled and selecting the core holding the largest amount of such resources as the target core; and using a heuristic algorithm to assign the task to be scheduled to the target core.
  • an information processing apparatus including: a storage module for acquiring information data, the information data including at least one key feature, the storage module pre-storing a true confidence corresponding to the key feature;
  • an operation circuit for determining, according to the information data, a prediction confidence corresponding to the key feature, and determining whether the prediction confidence of the key feature exceeds a preset threshold range of the true confidence corresponding to the key feature; and a control circuit for, when the prediction confidence exceeds the preset threshold range of the true confidence, controlling the storage module to modify the key feature or issuing a modification signal to the outside.
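A minimal sketch of the control decision described above: the prediction confidence produced by the operation circuit is compared with the pre-stored true confidence, and a modification is triggered when the deviation exceeds a preset threshold (interpreting the threshold range as an absolute difference is an assumption of this sketch):

```python
# Sketch of the prediction-confidence check performed by the control circuit.
# "Exceeds the preset threshold range of the true confidence" is interpreted here
# as |prediction - true| > threshold; that interpretation is an assumption.

def check_key_feature(prediction_conf, true_conf, threshold):
    """Return True when the key feature should be modified (or a modification
    signal issued), i.e. when the prediction deviates too far from the truth."""
    return abs(prediction_conf - true_conf) > threshold

def process_key_features(predictions, true_confidences, threshold=0.2):
    signals = {}
    for feature, pred in predictions.items():
        signals[feature] = check_key_feature(pred, true_confidences[feature], threshold)
    return signals                                   # feature -> modify? (True/False)


if __name__ == "__main__":
    preds = {"feature_a": 0.95, "feature_b": 0.40}
    truth = {"feature_a": 0.90, "feature_b": 0.90}
    print(process_key_features(preds, truth))        # feature_b flagged for modification
```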
  • the storage module includes a direct memory access (DMA) unit electrically connected to the operation circuit, for storing the prediction confidence determined by the operation of the operation circuit, and for feeding the true confidence and the prediction confidence into the operation circuit for comparison.
  • the storage module further includes a storage unit for acquiring information data from outside the information processing device and passing the direct storage access DMA for the operation circuit to call.
  • the storage module is further configured to store neural network specific instructions, input neurons in the neural network, output neurons, and weights
  • the information processing apparatus further comprising:
  • an instruction cache for buffering the dedicated instructions from the storage module for the control circuit to call; an input neuron cache for buffering the input neurons from the storage module for the operation circuit to use; a weight cache for buffering the weights from the storage module for the operation circuit to call; and an output neuron cache for storing the output neurons obtained from the operation of the operation circuit.
  • the operation circuit is further configured to score the information data according to the determination result of each key feature, or the operation circuit is further configured to perform adaptive training on the neural network.
  • the determining, according to the information data, of the prediction confidence corresponding to the key feature comprises: performing the neural network operation with the information data as the input of the neural network, the prediction confidence being the output of the neural network.
  • the information data comprises at least one of the following: a picture, a text, an audio, a video frame, and a video.
  • the information processing apparatus further includes a pre-processing module, configured to pre-process external raw information data and then pass it to the storage module; preferably, the pre-processing includes splitting of the original information data, Gaussian filtering, binarization, regularization, and/or normalization, to obtain data that conforms to the neural network input format.
  • an interaction apparatus comprising: information acquiring means for acquiring external information data; and the information processing apparatus described above, for processing the information data to obtain the prediction confidence of the key features and, when the prediction confidence exceeds the preset threshold range of the true confidence, modifying the key feature or issuing a modification signal;
  • preferably, the interaction apparatus further comprises an interaction interface for receiving the modified key feature or the modification signal and showing the modification content to the user.
  • the interaction device further includes a pre-processing module for pre-processing the information data acquired by the information acquisition device and sending the information data to the information processing device.
  • a controller is further included for controlling the information acquisition device, the information processing device, and/or the interactive interface.
  • the interactive interface is further configured to modify the preset threshold in response to a user's operation or command.
  • an information processing method includes: acquiring information data by a storage module, the information data including at least one key feature, the storage module pre-storing a true confidence corresponding to the key feature; determining, by an operation circuit according to the information data, a prediction confidence corresponding to the key feature, and determining whether the prediction confidence of the key feature exceeds a preset threshold range of the true confidence corresponding to the key feature; and, when the prediction confidence exceeds the preset threshold range of the true confidence, controlling, by a control circuit, the storage module to modify the key feature or to issue a modification signal.
  • the storage module includes a direct memory access (DMA) unit, and the method further comprises the steps of: using the direct memory access unit to store the prediction confidence determined by the operation circuit, and feeding the true confidence and the prediction confidence into the operation circuit for comparison.
  • the obtaining the information data by the storage module comprises: acquiring the information data from the outside using the storage unit, and passing the direct storage access DMA for the operation circuit to invoke.
  • the method further includes the steps of: storing the neural network dedicated instructions by using the storage module; buffering the dedicated instructions from the storage module with the instruction cache, for the control circuit to call;
  • using the storage module to store the input neurons, output neurons, and weights in the neural network; using the input neuron cache to buffer the input neurons from the storage module for the operation circuit to call; using the weight cache to buffer the weights from the storage module;
  • and using the output neuron cache to store the output neurons obtained from the operation of the operation circuit.
  • the method further includes the step of: using an arithmetic circuit to score the information data according to the determination result of each key feature, or adaptively training the neural network through an operation circuit.
  • the operation circuit determines, according to the information data, the prediction confidence corresponding to the key feature by: performing the neural network operation with the information data as the input of the neural network, the prediction confidence being the output of the neural network.
  • the method further includes the step of: pre-processing the external raw information data by the pre-processing module and transmitting the original information data to the storage module.
  • a processing apparatus for performing a generated confrontation network including:
  • a memory for receiving input data, the input data including random noise and reference data, and storing discriminator neural network parameters and generator neural network parameters;
  • the operator is used for inputting the random noise input data into the generator neural network for operation to obtain a noise generation result; it is also used for inputting the noise generation result and the reference data into the discriminator neural network for calculation to obtain a discrimination result, and for updating the discriminator neural network parameters and the generator neural network parameters according to the discrimination result.
  • a method for applying machine creation using the above processing apparatus including:
  • the operator inputs the random noise input data into the generator neural network to perform the operation, and obtains a noise generation result;
  • the operator inputs the noise generation result and the reference data into the discriminator neural network for calculation to obtain a discrimination result;
  • the discriminator neural network parameters and the generator neural network parameters are updated by the operator according to the discrimination result.
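One training step of the generator/discriminator arrangement described above can be sketched with tiny scalar stand-ins for the two neural networks; the update rule is a toy illustration of the data flow (noise → generator → discriminator, then parameter updates from the discrimination result), not the disclosure's actual operator:

```python
import math
import random

# Data-flow sketch of the processing apparatus for a generative adversarial
# network: random noise -> generator -> noise generation result; the noise
# generation result and the reference data -> discriminator -> discrimination
# result; both parameter sets are then updated from that result.  The scalar
# "networks" and the toy update rule are illustrative stand-ins only.

class TinyGAN:
    def __init__(self):
        self.gen_w = 0.0   # generator neural network parameter (scalar stand-in)
        self.dis_w = 0.1   # discriminator neural network parameter (scalar stand-in)
        self.lr = 0.05     # learning rate for the toy updates

    def generator(self, noise):
        # noise generation result: generator applied to the random noise input data
        return [self.gen_w * z + 1.0 for z in noise]

    def discriminator(self, samples):
        # discrimination result: squashed score per sample ("looks like reference data")
        return [1.0 / (1.0 + math.exp(-self.dis_w * s)) for s in samples]

    def train_step(self, noise, reference):
        fake = self.generator(noise)
        d_fake = self.discriminator(fake)          # discrimination result on generated data
        d_real = self.discriminator(reference)     # discrimination result on reference data
        avg_fake = sum(d_fake) / len(d_fake)
        avg_real = sum(d_real) / len(d_real)
        # toy updates driven by the discrimination result (illustrative only):
        self.dis_w += self.lr * (avg_real - avg_fake)   # discriminator: separate real/fake
        self.gen_w += self.lr * (1.0 - avg_fake)        # generator: make fakes score higher
        return avg_fake, avg_real


if __name__ == "__main__":
    gan = TinyGAN()
    noise = [random.gauss(0.0, 1.0) for _ in range(8)]        # random noise input data
    reference = [random.gauss(2.0, 0.5) for _ in range(8)]    # reference data
    for _ in range(3):
        print(gan.train_step(noise, reference))
```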
  • an electronic device including the processing device described above.
  • At least two processing modules in the disclosure can communicate directly with each other through preset rules to implement data sharing; therefore, data does not need to pass through the shared storage module, which reduces the overhead of storage communication and effectively reduces the latency of data access;
  • the present disclosure may include processors of different structures, and cores within processors of different structures, so that external storage modules of the same or different structures, as well as core-external storage modules corresponding to each core, can be maintained; without reducing the original storage efficiency or increasing the original storage cost, each storage unit can allow direct access by one or more arithmetic units, the specific number need not be fixed or agreed in advance, and asymmetric structures are supported, allowing configuration and adjustment according to requirements, thereby reducing the number of on-chip/off-chip accesses and reducing power consumption; the private storage module exclusively used by an arithmetic unit allows data to be transferred to other arithmetic units while protecting the privacy of the data, enables fast data interaction, improves data utilization, avoids the waste of resources caused by storing the same data repeatedly on chip and the memory access overhead of repeatedly reading the same data, and further improves memory access speed while reducing memory access power consumption.
  • FIG. 1 is a schematic structural diagram of a data processing system in the prior art
  • FIG. 2 is a schematic structural diagram of a data sharing system according to an embodiment of the present disclosure
  • FIG. 3 is a schematic structural diagram of a processor in the system of FIG. 2;
  • FIG. 4 is a schematic structural view of the H-tree module of Figure 3;
  • FIG. 5 is a schematic structural diagram of a main operation module in FIG. 3;
  • FIG. 6 is a schematic structural diagram of a slave arithmetic module in FIG. 3;
  • FIG. 7 is a schematic structural diagram of a data sharing system according to another embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a data sharing system according to another embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of a data sharing system according to another embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram of an information processing apparatus in an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of an information processing apparatus including an arithmetic module in an embodiment of the present disclosure
  • FIG. 12 is a schematic structural diagram of an information processing apparatus including an instruction cache and a neural network data cache in an embodiment of the present disclosure
  • FIG. 13 is a schematic structural diagram of a neural network data cache in an embodiment of the present disclosure.
  • FIG. 14 is a schematic structural diagram of an information processing apparatus including a direct memory access and control unit in an embodiment of the present disclosure
  • FIG. 15 is a schematic diagram of a specific structure of an information processing apparatus in an embodiment of the present disclosure.
  • FIG. 16 is a flowchart of an information processing method of an information processing apparatus in an embodiment of the present disclosure.
  • FIG. 17 is a schematic structural diagram of an information processing system in an embodiment of the present disclosure.
  • FIG. 18 is a flowchart of an information processing method of an information processing system according to an embodiment of the present disclosure.
  • FIG. 19 is a structural block diagram of a task segmentation apparatus according to an embodiment of the present disclosure.
  • FIG. 20 is a structural block diagram of a task scheduling apparatus according to an embodiment of the present disclosure.
  • FIG. 21 is a block diagram showing the structure of a multi-core processor according to still another embodiment of the present disclosure.
  • FIG. 22 is a structural block diagram of each neural network processing core processed by a neural network in still another embodiment of the present disclosure.
  • FIG. 23 is a structural block diagram of a processor according to an embodiment of the present disclosure.
  • FIG. 24 is a structural block diagram of a processor according to another embodiment of the present disclosure.
  • FIG. 25 is a structural block diagram of a processor according to another embodiment of the present disclosure.
  • FIG. 26 is a structural block diagram of a processor according to another embodiment of the present disclosure.
  • FIG. 27 is a structural block diagram of a processor according to another embodiment of the present disclosure.
  • FIG. 29 is a structural block diagram of a task scheduling apparatus according to an embodiment of the present disclosure.
  • FIG. 30 is a block diagram showing the structure of a multi-core processor according to an embodiment of the present disclosure.
  • FIG. 31 is a structural block diagram of each neural network processing core processed by a neural network in an embodiment of the present disclosure
  • FIG. 32 is a block diagram showing the structure of a combination processing device according to an embodiment of the present disclosure.
  • FIG. 34 is a schematic structural diagram of a computing unit after being divided according to an embodiment of the present disclosure.
  • FIG. 35 is a schematic structural diagram of a computing unit after being divided according to another embodiment of the present disclosure.
  • FIG. 36 is a schematic structural diagram of a computing unit after being divided according to another embodiment of the present disclosure.
  • Figure 37 is a block diagram of an information processing apparatus of an embodiment of the present disclosure.
  • Figure 38 is a block diagram of an information processing apparatus according to another embodiment of the present disclosure.
  • Figure 39 is a block diagram of an information processing apparatus according to still another embodiment of the present disclosure.
  • Figure 40 is a block diagram of an information processing apparatus of an embodiment of the present disclosure.
  • FIG. 41 is a flowchart of an information processing method according to an embodiment of the present disclosure.
  • FIG. 42 is a basic block diagram of a processing apparatus for executing a generated confrontation network in accordance with an embodiment of the present disclosure
  • FIG. 43 is a basic block diagram of a processing apparatus for executing a generated confrontation network according to still another embodiment of the present disclosure.
  • FIG. 44 is a flow chart of a method of performing machine authoring in an embodiment of the present disclosure.
  • the present disclosure proposes a method in which a machine learning ASIC arithmetic unit can directly access the SoC on-chip memory module and implement fast data interaction with other modules in the SoC.
  • the method can effectively improve data interaction efficiency and greatly reduce interaction delay.
  • for a common storage module at each level, it can be accessed by the access units that have permission;
  • the access units can complete data interaction and access directly, or through some rule or a certain protocol.
  • the present disclosure proposes a data sharing system including a storage module and at least two processing modules, wherein:
  • At least two processing modules share a storage module
  • At least two processing modules communicate through preset rules to implement data sharing.
  • the data sharing system of the present disclosure supports heterogeneous multi-processor scenarios.
  • There are external storage modules outside the processors which serve as common storage modules of multiple processors; these processors can be identical, entirely different, or partially identical.
  • the at least two processing modules may include processors of the same or different structures, processor cores of the same or different structures within those processors, and arithmetic units of the same or different structures within those processor cores.
  • the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
  • the foregoing communicating by the preset rule includes: the at least two processing modules include a first processing module and a second processing module; the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module returns a valid signal and data to the first processing module according to the request signal and the corresponding data address, to implement data sharing.
  • at least two processing modules herein are not limited to include the first processing module and the second processing module, and may further include a third processing module, and any two of the three modules may be used. Communication is performed using the above preset rules.
  • the present disclosure also proposes a data sharing method, including the following steps:
  • At least two processing modules communicate through preset rules to implement data sharing
  • the two processing modules share a storage module.
  • At least two processing modules include two processors, for example, processor 1 and processor 2, and communication between the two processors refers to communication between their internal storage modules.
  • the external storage module allows the processor 1 and the processor 2 to directly read the required data into the corresponding locations of the internal storage module 1 and the internal storage module 2, respectively.
  • the consistency of data between the external storage module and the internal storage module of the processor is maintained by a certain consistency protocol.
  • when the processor 1 changes data in its internal storage module, a "write-through" policy is used: the data at the corresponding location in the internal storage module 1 is changed, and the data in the external storage module is changed at the same time.
  • the processor 2 may send a request signal and a corresponding data address to the processor 1 through some preset rule; after receiving the request signal, the processor 1 responds with a valid signal and data to complete the data interaction; therefore, for a structure with multiple processors, the same storage space can be maintained, and multiple processors can directly communicate with each other through defined rules, thereby reducing storage communication overhead and reducing data access latency.
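  • purely as an illustration of this request/valid handshake, the following sketch models two processing modules that exchange data by request signal and data address; the class layout and dictionary-based storage are assumptions:

```python
# Illustrative sketch of the request/valid handshake between two processing modules.
class Processor:
    def __init__(self, name):
        self.name = name
        self.internal_storage = {}           # address -> data

    def handle_request(self, address):
        """Reply with (valid, data) for a request received from another processor."""
        if address in self.internal_storage:
            return True, self.internal_storage[address]
        return False, None

    def request_data(self, peer, address):
        """Send a request signal and data address; keep the data on a valid reply."""
        valid, data = peer.handle_request(address)
        if valid:
            self.internal_storage[address] = data
        return valid, data

cpu = Processor("processor 1")       # e.g. a general purpose CPU
npu = Processor("processor 2")       # e.g. an artificial neural network processor
cpu.internal_storage[0x10] = 42
print(npu.request_data(cpu, 0x10))   # (True, 42): shared without going through the external module
```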
  • the processor 1, the processor 2, and the like involved in the embodiment may be the same processor or different processors, and the scheme can be applied to cooperation between a new type of artificial neural network processor and a conventional general purpose processor; for example, it can be assumed that processor 1 is a general purpose CPU and processor 2 is an artificial neural network processor.
  • the artificial neural network processor may include a structure for performing an artificial neural network forward operation, and the structure for performing the artificial neural network forward operation includes an instruction cache unit 1, a controller unit 2, a direct memory access unit 3, an H-tree module 4, a main operation module 5, and a plurality of slave operation modules 6.
  • the instruction cache unit 1, the controller unit 2, the direct memory access unit 3, the H-tree module 4, the main operation module 5, and the slave operation module 6 can all be implemented by a hardware circuit (for example, an application-specific integrated circuit ASIC).
  • the instruction cache unit 1 reads instructions through the direct memory access unit 3 and caches the read instructions; the controller unit 2 reads instructions from the instruction cache unit 1 and translates them into microinstructions that control the behavior of other modules, where the other modules may be the direct memory access unit 3, the main operation module 5, the slave operation modules 6, and so on; the direct memory access unit 3 can access an external address space, directly read and write data to each cache unit inside the processor, and complete data loading and storage.
  • the H-tree module may include a branch processing module 103; the specific connection structure is as shown in FIG. 4, where
  • the main operation module 101 is connected to the branch processing module 103, and the branch processing module 103 is connected to the plurality of slave processing modules 102;
  • the branch processing module 103 is configured to perform forwarding of data or instructions between the main operation module 101 and the slave processing module 102.
  • taking the fully connected operation y = f(wx + b) as an example, f is the activation function, which can be any of the sigmoid, tanh, relu, and softmax functions; this example assumes a binary tree structure with eight slave processing circuits.
  • the method of implementation can be:
  • the controller unit acquires the input neuron matrix x, the weight matrix w and the fully connected operation instruction from the storage module, and transmits the input neuron matrix x, the weight matrix w and the fully connected operation instruction to the main operation module;
  • the main operation module splits the input neuron matrix x into 8 sub-matrices, and then distributes the 8 sub-matrices through the tree module to the 8 slave processing modules, and broadcasts the weight matrix w to the 8 slave processing modules;
  • the main operation module is configured to combine the 8 intermediate results to obtain the operation result of wx, perform the operation of the offset b on the operation result, perform the activation operation to obtain the final result y, and send the final result y to the controller unit; the controller unit outputs the final result y or stores it to the storage module.
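  • purely as an illustration of the flow above, the following numpy sketch splits the input neuron matrix x into 8 sub-matrices, broadcasts the weight matrix w, and lets a main-module step combine the 8 intermediate results, add the offset b, and apply the activation; the matrix sizes, the splitting axis, and the sigmoid activation are assumptions:

```python
# Minimal numpy sketch of the distributed fully connected forward operation.
import numpy as np

def sigmoid(z):                       # one possible activation function f
    return 1.0 / (1.0 + np.exp(-z))

def fully_connected(x, w, b, num_slaves=8):
    sub_matrices = np.array_split(x, num_slaves, axis=1)   # distribute x to slave modules
    intermediates = [w @ sub for sub in sub_matrices]      # each slave computes w @ x_i
    wx = np.concatenate(intermediates, axis=1)             # main module combines the 8 results
    return sigmoid(wx + b)                                 # add offset b, then activate

x = np.random.randn(32, 64)           # 32 input neurons, 64 samples (illustrative)
w = np.random.randn(16, 32)           # 16 output neurons
b = np.random.randn(16, 1)
y = fully_connected(x, w, b)
print(y.shape)                        # (16, 64)
```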
  • a block diagram of a configuration example of the main operation module 5 includes an operation unit 51, a data dependency determination unit 52, and a neuron buffer unit 53.
  • the neuron buffer unit 53 is configured to buffer the input data and output data used by the main operation module 5 in the calculation process; the operation unit 51 performs the various operation functions of the main operation module 5; the data dependency determination unit 52 is the port through which the operation unit 51 reads and writes the neuron buffer unit 53, and at the same time it ensures read/write consistency of the data in the neuron cache unit.
  • the data dependency determining unit 52 is also used to transmit the read data to the slave computing module 6 through the H-tree module 4, and the output data from the computing module 6 is directly sent to the computing unit 51 through the H-tree module 4.
  • the command output from the controller unit 2 is sent to the calculation unit 51 and the data dependency determination unit 52 to control its behavior.
  • each slave arithmetic module 6 includes an arithmetic unit 61, a data dependency determining unit 62, a neuron buffer unit 63, and a weight buffer unit 64.
  • the operation unit 61 is configured to receive the microinstruction issued by the controller unit 2 and perform an arithmetic logic operation;
  • the data dependency determination unit 62 is configured to perform a read and write operation on the neuron buffer unit 63 in the calculation process. Before the data dependency determination unit 62 performs the read and write operations, it first ensures that there is no read/write consistency conflict between the data used between the instructions.
  • all the microinstructions sent to the data dependency unit 62 are stored in an instruction queue inside the data dependency unit 62; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction earlier in the queue, the read instruction can be executed only after the write instruction on which it depends has been executed.
  • the neuron buffer unit 63 buffers the input neuron vector data and the output neuron value data of the slave arithmetic module 6.
  • the weight buffer unit 64 buffers the weight data required by the slave operation module 6 in the calculation process. For each slave operation module 6, only the weights between all input neurons and part of the output neurons are stored. Taking the fully connected layer as an example, the output neurons are segmented according to the number N of slave operation units, and the weights corresponding to the n-th output neuron of each segment are stored in the n-th slave operation unit.
  • the slave operation modules 6 realize the parallel arithmetic operations in the forward operation process of each layer of the artificial neural network.
  • each of the slave arithmetic modules 6 calculates an output neuron value, and all of the output neuron values are combined into a final intermediate result vector in the H-tree module 4. Therefore, each slave arithmetic module 6 only needs to calculate the value of the output neuron corresponding to the present module in the intermediate result vector y.
  • the H-tree module 4 combines all the neuron values output from the slave operation modules 6 to obtain the final intermediate result vector y.
  • the main operation module 5 performs subsequent calculations based on the intermediate result vector y, such as adding offset, pooling (for example, MAXPOOLING or AVGPOOLING, etc.), performing activation and sampling.
  • a common external storage module is shared by the CPU and the artificial neural network processor; both processors can directly access the data and cache it into the cache of the CPU and the cache unit of the artificial neural network processor, respectively.
  • the "write-through" method is used: when the data at the corresponding position in the CPU cache is changed, the data at the corresponding position in the external storage module is changed at the same time, and an invalidation signal is sent for the corresponding data in the artificial neural network processor.
  • when the artificial neural network processor uses the data and finds the invalidation signal, the new value is read from the external storage module and written to the corresponding location of the cache unit in the artificial neural network processor.
  • the artificial neural network processor can send a request signal and a corresponding data address to the CPU through a well-defined rule. After receiving the request signal, the CPU responds with a valid signal and data to complete the data interaction. Therefore, for the heterogeneous multi-processor structure, the data sharing system proposed in this embodiment can reduce the storage communication overhead and reduce the data access delay by maintaining the same storage space.
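  • a minimal software sketch of this write-through plus invalidation behaviour is given below; the class layout, the dictionary-based storage, and the explicit invalidation set are assumptions used only to illustrate the protocol:

```python
# Illustrative write-through + invalidation sketch for two caches over one external storage.
class ExternalStorage:
    def __init__(self):
        self.data = {}

class Cache:
    def __init__(self, storage):
        self.storage = storage
        self.lines = {}               # address -> value
        self.invalid = set()          # addresses marked by an invalidation signal

    def write_through(self, address, value, peers=()):
        self.lines[address] = value
        self.storage.data[address] = value     # write-through: external copy updated too
        for peer in peers:                     # send invalidation signal to other caches
            peer.invalid.add(address)

    def read(self, address):
        if address in self.invalid or address not in self.lines:
            # stale or missing: fetch the new value from the external storage module
            self.lines[address] = self.storage.data[address]
            self.invalid.discard(address)
        return self.lines[address]

ext = ExternalStorage()
cpu_cache, npu_cache = Cache(ext), Cache(ext)
cpu_cache.write_through(0x20, 7, peers=[npu_cache])
print(npu_cache.read(0x20))           # 7, re-read from external storage after invalidation
```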
  • Each processor has multiple cores, and each core has a core internal storage module and a core external storage module.
  • the data of the core external storage module can be directly accessed by several or all cores.
  • a data sharing system is proposed in which the at least two processing modules are two processor cores, and data sharing between them is performed through their core internal storage modules;
  • the storage module here refers to the core external storage module.
  • if core 1 needs to access the core internal storage module of core 2, the access can be performed through a communication protocol.
  • the core external storage module allows the core 1 and the core 2 to access, and then the core 1 and the core 2 respectively read the required data to the corresponding positions of the core internal storage module 1 and the core internal storage module 2.
  • when the core 1 changes data in its core internal storage module,
  • the "write-back" mode is used: only the data at the corresponding position in the core internal storage module 1 is changed, and the core external storage module sends an invalidation signal to the core internal storage module 2.
  • when this part of the data in the core internal storage module 1 is swapped out, or when the data is to be used by the core 2 and the invalidation signal is found, the new value is read from the core external storage module and written to the corresponding position in the core internal storage module 2.
  • the core 2 can also use some defined rule, such as sending a request signal and the corresponding data address to the core 1; the core 1, after receiving the request signal, replies with a valid signal and data to complete the data interaction.
  • the types of the cores can be the same, such as neural network cores, or different, such as a neural network core and a CPU core; this allows data to be shared between cores of the same or different structures while maintaining certain data protection, and at the same time the memory access overhead is reduced and the access latency is reduced.
  • Each neural network core includes a plurality of neural network operation units. Therefore, as shown in FIG. 8, in some embodiments of the present disclosure, a data sharing system is proposed, wherein at least two processing modules refer to three operations. Units, the three arithmetic units can directly access the core internal storage module, and can directly transfer related data in a certain direction, thereby facilitating the transmission of data between the arithmetic units, reducing the number of accesses to the storage module, thereby reducing Power consumption and access latency.
  • the arithmetic unit 1 reads out n and w from the core internal storage module and directly performs an operation to obtain out1; the arithmetic unit 2 reads m from the core internal storage module, receives the synapse value w transmitted from the arithmetic unit 1, and performs a corresponding operation to obtain out2; the arithmetic unit 3 reads q from the core internal storage module, receives the synapse value w transmitted from the arithmetic unit 1, and performs a corresponding operation to obtain out3.
  • the number of accesses to the internal storage module of the core is reduced, the delay and power consumption are reduced, the operation speed is improved, and the operation energy consumption is saved.
  • one or more storage units may be added in the core, allowing one storage unit to be shared by several arithmetic units, or one storage unit to be private to one arithmetic unit.
  • the shared system includes two storage units, and the storage unit 1 is shared by the operation unit 1 and the operation unit 2.
  • the operation unit 1 and the operation unit 2 can directly access the storage unit 1, and the operation unit 3 cannot directly access it;
  • the storage unit 2 is private to the operation unit 3, the operation unit 3 can be directly accessed, and the operation unit 1 and the operation unit 2 cannot be directly accessed.
  • if the operation unit 1 wants to access the operation result in the operation unit 3, the result can be obtained directly from the operation unit 3, without the long process in which the storage unit 2 first updates the core internal storage module, the core internal storage module then updates the storage unit 1, and only then the operation unit 1 accesses the storage unit 1; in this way, the data is effectively protected, because other unprivileged operation units (such as the operation unit 1) cannot arbitrarily change a storage unit (such as the storage unit 2).
  • the number of accesses can be greatly reduced, and the waste of on-chip storage resources caused by storing the same data on the chip several times is avoided, thereby reducing delay and power consumption, further improving the operation speed, and saving computational energy consumption.
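  • the following sketch illustrates, under assumed names and a simple permission table, how a storage unit shared by operation units 1 and 2 and a storage unit private to operation unit 3 could behave:

```python
# Sketch of shared vs. private storage units among arithmetic units; the
# permission table and the error handling are illustrative assumptions.
class StorageUnit:
    def __init__(self, allowed_units):
        self.allowed = set(allowed_units)     # arithmetic units with direct access
        self.data = {}

    def access(self, unit_id, address):
        if unit_id not in self.allowed:
            raise PermissionError(f"unit {unit_id} has no direct access")
        return self.data.get(address)

storage1 = StorageUnit(allowed_units=[1, 2])  # shared by arithmetic units 1 and 2
storage2 = StorageUnit(allowed_units=[3])     # private to arithmetic unit 3

storage2.data[0x0] = "out3"
print(storage2.access(3, 0x0))                # unit 3 reads its private unit directly
try:
    storage2.access(1, 0x0)                   # unprivileged access is rejected
except PermissionError as e:
    print(e)
```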
  • the apparatus includes a storage module and a data processing module; the storage module is configured to receive and store input data, instructions, and output data, wherein the input data includes one or more key features;
  • the input data is the original input data or data obtained by preprocessing the original input data; the data processing module is configured to judge the key features included in the input data, that is, the data processing module calculates the confidence of the key features included in the input data; the confidence is the judgment result, and the input data in the storage module is scored according to the judgment result.
  • the storage module stores data and instructions; the data includes input data, input neurons, weights, output neurons, and output data; the input data is transmitted to each input neuron in the artificial neural network to participate in subsequent operations; the values of the output neurons are the judgment result and/or the score, which serve as the output data.
  • FIG. 11 is a schematic structural diagram of an information processing apparatus including an arithmetic module in an embodiment of the present disclosure, wherein the data processing module includes an arithmetic module, configured to perform corresponding calculation on data stored in the storage module according to an instruction stored in the storage module, and Output the operation result to the storage module.
  • the operation module performs operations including neural network calculation, and the operation module includes but is not limited to: a first part, a multiplier; a second part, one or more adders (more specifically, the adders of the second part constitute an addition tree); a third part, an activation function unit; and/or a fourth part, a vector processing unit. More specifically, the vector processing unit can process vector operations and/or pooling operations.
  • the first part multiplies the input data 1 (in1) and the input data 2 (in2) to obtain the multiplied output (out).
  • the third part converts the input data (in) to the active output data (out) by the activation function (active).
  • the operations of the above several parts can freely select one or more parts to be combined in different orders, thereby realizing the operation of various functions.
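  • as an illustration of how the several parts can be freely combined, the following numpy sketch stands in for the multiplier, addition tree, activation function unit, and vector (pooling) unit; the chosen tanh activation and pooling window are assumptions:

```python
# Minimal sketch of the four operation parts and one possible combination of them.
import numpy as np

def multiply(in1, in2):               # first part: out = in1 * in2
    return in1 * in2

def addition_tree(values):            # second part: adders organised as a tree
    return np.sum(values)

def activate(x, fn=np.tanh):          # third part: activation function unit
    return fn(x)

def max_pool(vec, window=2):          # fourth part: vector / pooling operations
    return np.max(vec.reshape(-1, window), axis=1)

neurons = np.array([0.5, -1.0, 2.0, 0.25])
weights = np.array([0.1, 0.2, 0.3, 0.4])
out = activate(addition_tree(multiply(neurons, weights)))   # multiply -> add tree -> activate
print(out, max_pool(neurons))
```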
  • FIG. 12 is a schematic structural diagram of an information processing apparatus including an instruction cache and a neural network data cache in an embodiment of the present disclosure; as shown in FIG. 12, the data processing module of the information processing apparatus further includes an instruction cache and a neural network data cache; Instruction cache for caching instructions; neural network data cache for buffering weight data, input neurons, and output neurons in a storage module.
  • the neural network data cache includes a weight cache, an input neuron cache, and an output neuron cache.
  • FIG. 14 is a schematic structural diagram of an information processing apparatus including a direct memory access and control unit in an embodiment of the present disclosure; as shown in FIG. 14, the data processing module of the information processing apparatus further includes a direct memory access unit, which serves as a bridge between the storage module and each cache; it is used to read and write data and/or instructions stored in the storage module, store the read instructions into the instruction cache, and store the read weights into the weight cache.
  • the instruction cache is used to cache the instructions read by the direct memory access unit;
  • the weight cache is used to cache the weight data read by the direct memory access unit;
  • the input neuron cache is used to cache the input neurons read by the direct memory access unit.
  • the data processing module of the information processing apparatus further includes a control unit for reading an instruction from the instruction cache, decoding the instruction into an instruction executable by the operation module, and outputting the instruction to the operation module;
  • the output neuron cache is used to cache the operation result output by the operation module, that is, the judgment result and/or the score, and output it to the direct memory access unit.
  • the data processing module may further include a scoring unit: when the artificial neural network running in the information processing device directly obtains the score, this unit does not participate in the data processing; when the artificial neural network running in the information processing device only obtains the judgment result without obtaining the score, this unit is used to obtain the score according to the judgment result.
  • the judgment result is the value of the output neurons of the final output layer of the artificial neural network running in the information processing device, and the value of an output neuron is the confidence of occurrence of a key feature; the confidence is a natural number within a certain range, for example: a confidence in [0, 1], indicating the probability of occurrence of the key feature; or a binarized confidence in {0, 1}, where 0 means the key feature does not appear and 1 means the key feature appears, or 1 means the key feature does not appear and 0 means the key feature appears.
  • the representation of confidence is not limited to the above two.
  • for the score, a layer is added after the final output layer of the artificial neural network running in the information processing device as a new final output layer, and the input neuron values of the new final output layer are the confidences of occurrence of each key feature;
  • this layer has only one output neuron, whose value is the score, and the weights in the new final output layer operation correspond to the importance of each key feature; or the layer has N+1 output neurons and the score ranges over [0, N]; if the output neurons of the layer are numbered 0, 1, 2, ..., N, then the value of the i-th output neuron is the confidence P_i that the score equals i, and the final score is the value i0 with the highest confidence, i.e. score = i0 where P_i0 = max_i P_i.
  • the score may also be obtained as follows: after the final output layer of the artificial neural network running in the information processing device obtains the confidence of occurrence of each key feature, the confidences are used as the input of the scoring unit, and the scoring unit scores accordingly.
  • There are many ways for the scoring unit to score: it can use a complex machine learning algorithm or simple data processing. For example, the scoring unit may simply average the confidences, whose values lie in [0, 1], and then multiply by 100 to obtain a score on a 100-point scale.
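  • a minimal sketch of two such simple scorers is shown below: averaging the confidences in [0, 1] and multiplying by 100, and picking the score whose confidence is highest; both are illustrative only:

```python
# Two illustrative scorers over a list of key-feature confidences in [0, 1].
def simple_score(confidences):
    # average the confidences and scale to a 100-point scale
    return 100.0 * sum(confidences) / len(confidences)

def argmax_score(confidences):
    # pick the score i0 whose confidence P_i is highest
    return max(range(len(confidences)), key=lambda i: confidences[i])

print(simple_score([0.9, 0.4, 0.8]))   # 70.0
print(argmax_score([0.1, 0.7, 0.2]))   # 1
```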
  • the information processing device as described above is an artificial neural network chip.
  • FIG. 16 is a flowchart of an information processing method of an information processing apparatus in an embodiment of the present disclosure.
  • the storage module receives and stores input data, where the input data includes one or more key features;
  • the data processing module determines the key features included in the input data, and scores the input data in the storage module according to the determination result, wherein the score may be obtained by the artificial neural network running in the information processing device, or may be obtained by the scoring unit in the data processing module.
  • FIG. 17 is a schematic structural diagram of an information processing system in an embodiment of the present disclosure, where the information processing system includes:
  • An information acquiring device configured to acquire external data and transmit the data to the information processing device
  • An information processing device configured to perform arithmetic processing on external data received from the information acquiring device, and output the operation processing result to the interaction interface;
  • An interactive interface for displaying an operation result received from the information processing device, and transmitting an operation or command received from the outside to the control device;
  • control device configured to control the operation of the information acquiring device, the information processing device, and the interaction interface according to the operation or command received from the interactive interface.
  • An information acquiring device configured to acquire external data and transmit the external data, directly or after preprocessing, to the information processing device;
  • the external data includes text, a picture, an audio, and/or a video;
  • the information acquiring device includes at least a character recognition device, an image recognition device, and a voice recognition device; the character recognition device is used for acquiring text information in the external data, the text information being a combination of one or more language characters and/or symbols, and the combination of one or more language characters and/or symbols is at least the answers to test papers in subjects such as language, mathematics, and physics.
  • the image recognition device is for acquiring picture or video information in the external data, the image recognition device is a camera;
  • the picture is a two-dimensional image and/or a two-dimensional perspective view, and the two-dimensional image and/or the two-dimensional perspective view is at least the answer to test papers in subjects such as art and drawing;
  • the voice recognition device is used to acquire audio information in the external data, and the voice recognition device is a microphone.
  • the pre-processing operation makes the input data more suitable for artificial neural network processing, removing noise and redundancy in the input data and improving classification and recognition accuracy.
  • the information processing device is configured to perform arithmetic processing on the external data received from the information acquiring device or the preprocessed external data, and output the operation result to the interactive interface; in the disclosed embodiment, the information processing device is implemented by an artificial neural network chip.
  • the result of the calculation is the judgment result or the score.
  • the judgment result is the value of the output neurons of the final output layer of the artificial neural network running in the information processing device, and the value of an output neuron is the confidence of occurrence of a key feature; the confidence is a natural number within a certain range, for example: a confidence in [0, 1], indicating the probability of occurrence of the key feature; or a binarized confidence in {0, 1}, where 0 means the key feature does not appear and 1 means the key feature appears, or 1 means the key feature does not appear and 0 means the key feature appears.
  • the representation of confidence is not limited to the above two.
  • the score is: after the final output layer of the artificial neural network running in the information processing device, a layer is added as a new final output layer, and the input neuron value of the new final output layer is the confidence level of each key feature; The layer has only one output neuron, and its value is the score.
  • the score may also be obtained as follows: after the final output layer of the artificial neural network running in the information processing device obtains the confidence of occurrence of each key feature, the confidences are used as the input of the scoring unit, and the scoring unit obtains the score accordingly.
  • There are many ways for the scoring unit to score: it can use a complex machine learning algorithm or simple data processing. For example, the scoring unit may simply average the confidences, whose values lie in [0, 1], and then multiply by 100 to obtain a score on a 100-point scale.
  • the information processing apparatus of the specific embodiment of the present disclosure employs an artificial neural network chip.
  • the artificial neural network chip can be adaptively trained.
  • the chip accumulates the user's data self-learning, and gradually adapts to the user's handwriting, idioms, writing errors, posture features, habitual actions, continuously improving the accuracy and improving the user's movement/posture adjustment ability.
  • the artificial neural network chip has powerful computing power and supports offline operation of the neural network; automatic scoring can be realized offline at the user terminal/front end without the cloud server assisting the calculation.
  • the use of the artificial neural network chip to automatically score handwriting, text, and picture actions instead of manual scoring is more accurate and faster than manual scoring; the evaluation of subjective questions is more objective, ignoring the influence of people's preferences and of the testers' calligraphy level.
  • An interactive interface for displaying an output received from the information processing device and transmitting an operation or command received from the outside to the control device.
  • the user interaction interface is a display screen of a mobile phone, a computer, a notebook, a tablet, and the like.
  • control device configured to control the operation of the information acquiring device, the information processing device, and the interaction interface according to the operation or command received from the interactive interface.
  • FIG. 18 is a flowchart of an information processing method of an information processing system according to an embodiment of the present disclosure. As shown in the figure, the information processing method includes:
  • the information acquiring device acquires external data, and directly or preprocesses the external data to the information processing device;
  • the information processing device performs an operation process on the external data received from the information acquiring device or the preprocessed external data, and outputs the operation result to the interaction interface;
  • S203 An interactive interface, configured to display an operation result received from the information processing device.
  • the information acquiring device acquires external data, and the external data is transmitted, directly or after preprocessing, to the information processing device; the external input data includes text, pictures, audio, and/or video, and preprocessing produces data matched to the information processing device;
  • preprocessing includes segmentation, Gaussian filtering, binarization, regularization, or normalization; preprocessing can make the input data more suitable for artificial neural network processing, removing noise and redundancy in the input data and improving classification and recognition accuracy.
  • the artificial neural network chip can be adaptively trained.
  • the chip accumulates the user's data self-learning, and gradually adapts to the user's handwriting, idioms, writing errors, posture features, habitual actions, continuously improving the accuracy and improving the user's movement/posture adjustment ability.
  • the artificial neural network chip can support the offline operation of the neural network, and the user terminal/front-end offline can realize the automatic scoring and monitoring work without the cloud server assisting the calculation; when the chip is networked and the cloud server is assisted to calculate, the chip computing capability is improved.
  • the use of the artificial neural network chip to automatically score handwriting, text, and picture actions instead of manual scoring is more accurate and faster than manual scoring; the evaluation of subjective questions is more objective, ignoring the influence of people's preferences and of the testers' calligraphy level.
  • the information processing apparatus of the embodiment is configured to score a set of test papers containing one or more key features acquired by the character recognition device in the information acquisition device, and the key features in the test paper include keywords, and operations through an artificial neural network chip.
  • the output neuron of the final output layer of the artificial neural network chip outputs the judgment result, and the judgment result is the confidence degree of the key feature of the test paper, such as the confidence degree of the keyword occurrence, and the confidence is between [0, 1], indicating the key The probability of occurrence of a feature; where the higher the confidence, the greater the probability that the keyword will appear.
  • the binarized confidence is in {0, 1}, where 0 means the key feature does not appear and 1 means the key feature appears, or 1 means the key feature does not appear and 0 means the key feature appears; the representation of the confidence is not limited to the above two.
  • a layer is added to the final output layer of the artificial neural network chip as a new final output layer, and the input neuron value of the new final output layer is the confidence level of each key feature.
  • the score may be: adding a layer as a new final output layer after the final output layer of the artificial neural network running in the information processing device, and the input neuron value of the new final output layer is a confidence of each key feature. degree. There is only one output neuron in this layer, and its value is the score value.
  • the weights in the new final output layer operation correspond to the importance of each key feature; or the layer has N+1 output neurons, and the score ranges over [0, N]; if the output neurons of the layer are numbered 0, 1, 2, ..., N, then the value of the i-th output neuron is the confidence P_i that the score equals i.
  • the score may also be obtained as follows: after the final output layer of the artificial neural network running in the information processing device obtains the confidence of occurrence of each key feature, the confidences are used as the input of the scoring unit, and the scoring unit scores accordingly.
  • There are many ways for the scoring unit to score: it can use a complex machine learning algorithm or simple data processing. For example, the scoring unit may simply average the confidences, whose values lie in [0, 1], and then multiply by 100 to obtain a score on a 100-point scale.
  • the keywords and the probability of occurrence are given by the operation of the artificial neural network, and then the score of the test paper is given by adding a new layer of the final output layer or using it as the input of the scoring unit.
  • the score is displayed on the display of mobile phones, computers, notebooks, tablets, etc. The user can get the score of the test paper through the display screen.
  • the specific processing process of keywords in the artificial neural network chip is:
  • Step 1: the external data acquired by the character recognition device, the image recognition device, and the voice recognition device in the information acquisition device is preprocessed or directly transmitted to the storage module of the artificial neural network chip; the external data is preprocessed to make it more suitable for artificial neural network processing, removing noise and redundancy in the input data and improving classification and recognition accuracy.
  • Step 2: direct memory access (DMA) transfers the data in the storage module in batches to the corresponding on-chip caches (i.e., the instruction cache, the input neuron cache, and the weight cache); in the artificial neural network chip, dedicated on-chip caches (i.e., the instruction cache, input neuron cache, output neuron cache, and weight cache) and dedicated artificial neural network operation and memory access instructions are used, which can effectively improve computation and memory access efficiency.
  • Step 3 the control unit reads the instruction from the instruction cache, decodes it and sends it to the operation module;
  • Step 4 The arithmetic module performs a corresponding operation according to the instruction.
  • the operation module performs operations including but not limited to: a first part, a multiplier; a second part, one or more adders (more specifically, the adders of the second part constitute an addition tree); a third part, an activation function unit; and/or a fourth part, a vector processing unit. More specifically, the vector processing unit can process vector operations and/or pooling operations.
  • the first part multiplies the input data 1 (in1) and the input data 2 (in2) to obtain the multiplied output (out).
  • the third part converts the input data (in) to the active output data (out) by the activation function (active).
  • the operations of the above several parts can freely select one or more parts to be combined in different orders, thereby realizing the operation of various functions.
  • the addition tree operation used by the arithmetic module can process multiple sets of weights and input neurons in parallel, which can improve the operation efficiency.
  • Step 5 Repeat steps 2 through 4 until all the data in the storage module has been calculated, and the final result of the functional requirements is obtained.
  • the final result is obtained by the output neuron of the last layer of the neural network, outputted from the arithmetic module to the output neuron cache, and then returned to the storage module via the DMA.
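  • the repetition of steps 2 through 4 can be illustrated by the following host-side sketch, in which data is moved batch by batch from the storage module to an on-chip buffer and processed until everything has been computed; the batch size and the compute function are assumptions:

```python
# Illustrative batch loop standing in for repeated DMA transfer + computation.
import numpy as np

def process_all(storage_data, batch_size, compute):
    results = []
    for start in range(0, len(storage_data), batch_size):
        on_chip_batch = storage_data[start:start + batch_size]   # DMA transfer (step 2)
        results.append(compute(on_chip_batch))                   # decode + operate (steps 3-4)
    return np.concatenate(results)                               # final result returned (step 5)

data = np.random.randn(1000, 8)
out = process_all(data, batch_size=256, compute=lambda b: b.sum(axis=1))
print(out.shape)                      # (1000,)
```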
  • the scoring may also be performed as follows: after the final output layer of the artificial neural network running in the information processing device obtains the confidence of occurrence of each key feature, the confidences are used as the input of the scoring unit, and the scoring unit scores accordingly.
  • There are many ways for the scoring unit to score: it can use a complex machine learning algorithm or simple data processing. For example, the scoring unit may simply average the confidences, whose values lie in [0, 1], and then multiply by 100 to obtain a score on a 100-point scale.
  • Embodiment 2:
  • the information processing apparatus of this embodiment is configured to score a video, that is, a set of pictures including one or more key features.
  • the storage module in the artificial neural network chip prestores one or more key pictures; the storage module acquires the video from the outside and transmits it to the operation module; through the operation of the artificial neural network chip, the output neurons of the final output layer of the artificial neural network chip output the judgment result, and the judgment result is the similarity between each input picture and each key picture. In detail, if there are N input pictures and M key pictures, N × M similarities are obtained.
  • the similarity of this embodiment is the confidence; the confidence is a natural number within a certain range, for example a confidence in [0, 1], indicating the probability of occurrence of a key feature, or a binarized confidence in {0, 1}, where 0 means the key feature does not appear and 1 means the key feature appears, or 1 means the key feature does not appear and 0 means the key feature appears; the representation of the confidence is not limited to the above two.
  • the input neuron values of the new final output layer are the confidences of each key feature, and the confidence is the similarity between the input picture and each key picture.
  • the weight in the new final output layer operation corresponds to the importance of each similarity.
  • the layer has N+1 output neurons, the score ranges over [0, N], and if the output neurons of the layer are numbered 0, 1, 2, ..., N, then the value of the i-th output neuron is the confidence P_i that the score equals i.
  • the score may also be obtained as follows: after the final output layer of the artificial neural network running in the information processing device obtains the confidence of occurrence of each key feature, the confidences are used as the input of the scoring unit, and the scoring unit scores accordingly.
  • There are many ways for the scoring unit to score: it can use a complex machine learning algorithm or simple data processing. For example, the scoring unit may simply average the confidences, whose values lie in [0, 1], and then multiply by 100 to obtain a score on a 100-point scale.
  • the score is displayed on the display of mobile phones, computers, notebooks, tablets, etc.
  • the user can get a rating on the video through the display.
  • the video also includes audio, the audio is divided into multiple segments of audio, and the multi-segment audio corresponds to multiple images.
  • the chip can compare the similarity of all pictures in the video with each key picture, and/or compare the similarity of each waveform and key waveform obtained by all audio decomposition in the video, and score the video.
  • each output neuron of the final output layer of the neural network corresponds to an input picture, and the value of the output neuron is the similarity between the input picture and the key picture most similar to it; consistent with the previous example, the layer has a total of N output neurons.
  • each output neuron of the final output layer of the neural network corresponds to a key picture
  • the value of the output neuron is the similarity between the key picture and the input picture most similar to it; consistent with the previous example, the layer has a total of M output neurons.
  • the specific processing process of the video data in the artificial neural network chip is as follows. Step 1: the external data acquired by the character recognition device, the image recognition device, and the voice recognition device in the information acquisition device is preprocessed or directly transmitted to the storage module of the artificial neural network chip; the preprocessing module in the device can make the input data more suitable for artificial neural network processing, removing noise and redundancy in the input data and improving classification and recognition accuracy.
  • Step 2: direct memory access (DMA) transfers the data in the storage module in batches to the corresponding on-chip caches, that is, the instruction cache, the input neuron cache, and the weight cache; in the artificial neural network chip, dedicated on-chip caches (i.e., the instruction cache, input neuron cache, output neuron cache, and weight cache) and dedicated artificial neural network operation and memory access instructions are used, which can effectively improve computation and memory access efficiency.
  • Step 3 the control unit reads the instruction from the instruction cache, decodes it and sends it to the operation module;
  • Step 4: the operation module performs a corresponding operation according to the instruction in each layer of the neural network;
  • the operation module performs operations including neural network calculations;
  • the operation module includes but is not limited to: a first part, a multiplier; a second part, one or more adders (more specifically, the adders of the second part constitute an addition tree); a third part, an activation function unit; and/or a fourth part, a vector processing unit. More specifically, the vector processing unit can process vector operations and/or pooling operations.
  • the first part multiplies the input data 1 (in1) and the input data 2 (in2) to obtain the multiplied output (out).
  • the third part converts the input data (in) to the active output data (out) by the activation function (active).
  • the operations of the above several parts can freely select one or more parts to be combined in different orders, thereby realizing the operation of various functions.
  • the addition tree operation of the operation module can process multiple sets of weights and input neurons in parallel, thereby effectively improving the operation efficiency.
  • Step 5 Repeat steps 2 through 4 until all the data in the memory module has been calculated, and the final result of the functional requirements is obtained.
  • the final result is obtained by the output neuron of the last layer of the neural network, outputted from the arithmetic module to the output neuron cache, and then returned to the storage module via the DMA.
  • the values of the output neurons of the last layer of the neural network are the similarity values; if scoring is required, a layer is added after the last output layer as a new final output layer, and the input neuron values of the new final output layer are the similarity values; the new final output layer includes one output neuron whose value is the video score; the weights of the new final output layer operation correspond to the degree of importance of each similarity value.
  • the layer has N+1 output neurons, the score ranges over [0, N], and if the output neurons of the layer are numbered 0, 1, 2, ..., N, then the value of the i-th output neuron is the confidence P_i that the score equals i.
  • the scoring may also be performed as follows: after the final output layer of the artificial neural network running in the information processing device obtains the confidence of occurrence of each key feature, the confidences are used as the input of the scoring unit, and the scoring unit scores accordingly.
  • There are many ways for the scoring unit to score: it can use a complex machine learning algorithm or simple data processing. For example, the scoring unit may simply average the confidences, whose values lie in [0, 1], and then multiply by 100 to obtain a score on a 100-point scale.
  • the artificial neural network chip has powerful computing power and supports offline operation of the neural network; in the absence of the cloud server to assist in the calculation, the user terminal/front end can realize the automatic scoring and monitoring work offline; when the chip is networked and the cloud server assists the calculation, the computing capability of the chip is further improved.
  • the artificial neural network chip automatically scores the motion of the pictures in the video, instead of manual, and the relative manual score is more accurate and fast; the subjective question evaluation is more objective, ignoring the influence of people's preferences.
  • the device and method of the embodiment instantly monitors the user's action/posture, automatically issues a reminder to adjust the user's action/posture, instead of manual coaching and monitoring work, and is more accurate and instant than manual.
  • the adaptive training of the artificial neural network chip enables the chip to accumulate the user's data, self-learning, and gradually adapt to the user's handwriting, such as writing mistakes, posture characteristics, customary movements, continuously improving the accuracy and improving the user's movement/posture adjustment. ability.
  • All of the modules in the embodiments of the present disclosure may be hardware structures, and physical implementations of the hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and DNA computers.
  • FIG. 19 is a structural block diagram of a task segmentation device according to an embodiment of the present disclosure.
  • the task segmentation device 100 includes a granularity task segmentation unit 10 and a task segmentation granularity selection unit 20.
  • the granularity task segmentation unit 10 divides a task at one or more granularities to form subtasks, providing multi-granularity task segmentation choices for the neural network application;
  • the task segmentation granularity selection unit 20 selects the granularity to be used for task division and guides the neural network application to select the most appropriate task segmentation granularity, so that the subtasks after segmentation can meet the real-time requirements of the system.
  • the granularity task segmentation unit 10 includes a first granularity task segmentation unit 11, a second granularity task segmentation unit 12, a third granularity task segmentation unit 13, a fourth granularity task segmentation unit 14, and a fifth granularity task segmentation unit 15.
  • N is a positive integer greater than zero.
  • the first granularity task segmentation unit 11 treats the task as a subtask as a whole, and specifically, completes M sample calculations as a subtask. This task splitting method only generates one subtask, and there is no dependency between subtasks.
  • the second granularity task segmentation unit 12 will perform several sample calculations as one subtask.
  • the third-grained task segmentation unit 13 can perform task segmentation on the neural network application according to the layer type of the neural network, and the calculation of the same type of layer serves as a task.
  • the layer types of the neural network include, but are not limited to, a convolutional layer, a fully connected layer, an LSTM layer, a pooling layer, an active layer, an LRN layer, and a BN layer. There are complex dependencies between subtasks in this task segmentation mode.
  • the fourth granularity task segmentation unit 14 may perform task segmentation on the neural network application according to the inter-layer structure of the neural network, and the calculation of adjacent layers is performed as a sub-task.
  • the neural network application is divided into n subtasks.
  • the first subtask completes the calculation of the first layer to the N1-th layer of the neural network, a total of N1 layers;
  • the second subtask completes the calculation of the (N1+1)-th layer to the (N1+N2)-th layer, a total of N2 layers;
  • the i-th subtask completes the calculation of the (N1+...+N(i-1)+1)-th layer to the (N1+...+Ni)-th layer, a total of Ni layers.
  • the i-th subtask is the predecessor task of the (i+1)-th subtask,
  • the (i+1)-th subtask is the successor task of the i-th subtask,
  • and the (i+1)-th subtask must wait for the i-th subtask to complete before it can begin execution.
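  • a small sketch of this inter-layer segmentation is shown below: consecutive layers are grouped into subtasks and each subtask records its predecessor; the layer counts are illustrative assumptions:

```python
# Sketch of the fourth segmentation granularity: group consecutive layers into
# subtasks with a simple predecessor/successor dependency chain.
def split_by_layers(num_layers, group_sizes):
    assert sum(group_sizes) == num_layers
    subtasks, first = [], 1
    for i, n in enumerate(group_sizes):
        subtasks.append({
            "id": i,
            "layers": list(range(first, first + n)),   # layers N1+...+N(i-1)+1 .. N1+...+Ni
            "depends_on": i - 1 if i > 0 else None,    # predecessor subtask
        })
        first += n
    return subtasks

for task in split_by_layers(num_layers=10, group_sizes=[4, 3, 3]):
    print(task)
```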
  • the fifth granularity task segmentation unit 15 can perform task segmentation on the neural network application according to the intra-layer structure of the neural network, and the calculation in the neural network layer can be further divided into sub-tasks.
  • the segmentation according to the calculations within the neural network layer includes, but is not limited to, task segmentation for a convolution layer calculation, fully connected layer calculation, pooling layer calculation, or activation layer calculation of the neural network.
  • the convolutional layer input neurons form a three-dimensional matrix (Nfin, Nxin, Nyin), the weights form a four-dimensional matrix (Nfout, Nfin, Kx, Ky), and the output neurons form a three-dimensional matrix (Nfout, Nxout, Nyout), where Nfin is the number of input feature images, (Nxin, Nyin) is the input feature image size, Nfout is the number of output feature images, (Kx, Ky) is the convolution kernel size, and (Nxout, Nyout) is the output feature image size.
  • Completing one output neuron requires Nfin × Kx × Ky multiply-add operations, and the number of output neurons is Nfout × Nxout × Nyout, so the entire convolutional layer requires Nfout × Nxout × Nyout × Nfin × Kx × Ky multiply-add operations.
  • the output neurons are segmented into blocks of size (Bfout, Bxout, Byout) and the weights are segmented into blocks of size (Bfout, Bfin, Bx, By); each subtask then uses a (Bfout, Bfin, Bx, By) weight block to calculate the intermediate results of Bfout × Bxout × Byout output neurons. The intermediate result of each output neuron requires Bfin × Bx × By multiply-add operations, so each subtask completes Bfout × Bxout × Byout × Bfin × Bx × By multiply-add operations in total; a worked sketch follows the parameter constraints below.
  • Bfout is a positive integer greater than 0 and less than or equal to Nfout
  • Bxout is a positive integer greater than 0 and less than or equal to Nxout
  • Byout is a positive integer greater than 0 and less than or equal to Nyout
  • Bfin is a positive integer greater than 0 and less than or equal to Nfin
  • Bx is a positive integer greater than 0 and less than or equal to Kx
  • By is a positive integer greater than 0 and less than or equal to Ky.
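  • Purely as an illustrative sketch of the convolution-layer block splitting above (the helper name count_conv_subtasks and the example sizes are assumptions, not data from the disclosure), the following Python code counts the blocks and the multiply-add operations per subtask.

    import math

    def count_conv_subtasks(Nfout, Nxout, Nyout, Nfin, Kx, Ky,
                            Bfout, Bxout, Byout, Bfin, Bx, By):
        """Number of subtasks and multiply-adds per full subtask for block-wise splitting."""
        # Blocks along each dimension; partial blocks at the edges still count as one block.
        blocks = (math.ceil(Nfout / Bfout) * math.ceil(Nxout / Bxout) * math.ceil(Nyout / Byout)
                  * math.ceil(Nfin / Bfin) * math.ceil(Kx / Bx) * math.ceil(Ky / By))
        macs_per_subtask = Bfout * Bxout * Byout * Bfin * Bx * By
        return blocks, macs_per_subtask

    # Example with assumed layer sizes: 64 output maps of 32x32, 3x3 kernel over 32 input maps.
    print(count_conv_subtasks(64, 32, 32, 32, 3, 3, 16, 8, 8, 32, 3, 3))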
  • Task segmentation is performed on a fully connected layer calculation of the neural network.
  • the fully connected layer input neurons are Nin
  • the weights are two-dimensional matrices (Nout, Nin)
  • the output neurons are Nout, where Nin is the number of input neurons and Nout is the number of output neurons. Completing one output neuron requires Nin multiply-add operations, and the number of output neurons is Nout, so a total of Nout × Nin multiply-add operations are required to complete the entire fully connected layer.
  • the output neurons are segmented into blocks of size Bout and the weights are segmented into blocks of size (Bout, Bin); each subtask uses a (Bout, Bin) weight matrix to calculate the intermediate results of Bout output neurons. The intermediate result of each output neuron requires Bin multiply-add operations, so each subtask completes Bout × Bin multiply-add operations in total; see the sketch after this parameter list.
  • Bout is a positive integer greater than 0 and less than or equal to Nout
  • Bin is a positive integer greater than 0 and less than or equal to Nin.
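  • A similarly hedged Python sketch of the fully connected layer splitting (the helper name fc_subtasks is hypothetical): it enumerates the (output block, input block) pairs, each corresponding to one subtask that performs at most Bout × Bin multiply-add operations.

    def fc_subtasks(Nout, Nin, Bout, Bin):
        """Yield (output_block, input_block) index ranges for fully connected layer subtasks."""
        for o in range(0, Nout, Bout):
            for i in range(0, Nin, Bin):
                # Each subtask performs at most Bout * Bin multiply-add operations.
                yield (o, min(o + Bout, Nout)), (i, min(i + Bin, Nin))

    print(list(fc_subtasks(Nout=8, Nin=6, Bout=4, Bin=3)))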
  • the pooling layer input neurons are Nin and the output neurons are Nout, where Nin and Nout are positive integers greater than 0; the pooling operations include but are not limited to average pooling, maximum pooling, and median pooling.
  • the output neurons are segmented according to the block size of Bout, and each subtask completes the calculation of Bout output neurons.
  • Bout is a positive integer greater than 0 and less than or equal to Nout
  • Bin is a positive integer greater than 0 and less than or equal to Nin.
  • Task segmentation is performed on an activation layer calculation of the neural network: the activation layer input neurons are Nin and the output neurons are Nout, where Nin and Nout are positive integers greater than 0; the activation functions include but are not limited to sigmoid, tanh, relu, and softmax.
  • the output neurons are segmented according to the block size of Bout, and each subtask completes the calculation of Bout output neurons.
  • Bout is a positive integer greater than 0 and less than or equal to Nout. There is no dependency between subtasks in this task splitting mode.
  • the task segmentation granularity selection unit 20 selects the granularity used for task partitioning; it is not limited to selecting only one of the granularities described above, and may also select a combination of multiple granularities.
  • For example, a neural network application may combine the splitting methods of the fourth granularity task segmentation unit and the fifth granularity task segmentation unit: the neural network application is first divided into n subtasks according to the splitting method of the fourth granularity task segmentation unit 14, and then p of these subtasks are each further split according to the splitting method of the fifth granularity task segmentation unit 15 (a sketch of such a combination is given below).
  • the granular task segmentation unit 10 may include at least one of the first to fifth granular task segmentation units, and does not necessarily include all of the first to fifth granularity task segmentation units.
  • the granularity task segmentation unit 10 may further include a hybrid granularity task segmentation unit for combining the segmentation manners of the first to fifth granularity task segmentation units for selection by the task segmentation granularity selection unit 20.
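  • The combination of granularities can be pictured with the following minimal Python sketch, which first groups layers into coarse subtasks (fourth granularity) and then further splits the first p of them with a caller-supplied intra-layer splitter (fifth granularity); the function names and the halving splitter are assumptions for illustration, not the claimed implementation.

    def combine_granularities(layer_counts, p, intra_layer_split):
        """Fourth-granularity split into layer groups, then fifth-granularity split
        of the first p groups using a caller-supplied intra-layer splitter."""
        coarse = [list(range(sum(layer_counts[:i]) + 1, sum(layer_counts[:i + 1]) + 1))
                  for i in range(len(layer_counts))]               # subtask i = its layer indices
        fine = [intra_layer_split(group) for group in coarse[:p]]  # further split the first p subtasks
        return fine + [[group] for group in coarse[p:]]

    # Example: split 2+3+1 layers into three groups, then halve the first group.
    halve = lambda layers: [layers[:len(layers) // 2], layers[len(layers) // 2:]]
    print(combine_granularities([2, 3, 1], p=1, intra_layer_split=halve))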
  • FIG. 20 is a structural block diagram of a task scheduling apparatus according to an embodiment of the present disclosure.
  • the task scheduling apparatus 300 includes a task queue unit 30, a monitoring unit 40, and a task scheduling unit 50.
  • the neural network task scheduling apparatus 300 can comprehensively consider the dependencies between tasks, the locality of tasks, the task segmentation granularity, and the running frequency and load of each core when scheduling tasks, thereby improving quality of service, improving core utilization, ensuring that tasks are balanced between cores, and reducing energy consumption.
  • the task queue unit 30 caches all unscheduled neural network tasks, and may optionally store, for each task to be scheduled, its estimated execution time, the task dependency graph, and the distribution of the task's resources across the cores. The neural network tasks are, for example, the subtasks split in the foregoing embodiments.
  • the monitoring unit 40 detects the overall service quality of the multi-core neural network processor and the working status of each core in real time, for example, the utilization rate of each core, the workload, the working frequency, the number of tasks in the private task queue in the core, and the task completion time.
  • the task scheduling unit 50 selects a task to be scheduled from the unscheduled tasks, determines a mapping relationship between the task to be scheduled and a target core according to the information of the task to be scheduled and the working state of each core, and allocates the task to be scheduled to the target core.
  • the task scheduling unit 50 may schedule the unscheduled tasks in the task queue once every time period T, where T is a real number greater than zero. If an unscheduled task t has dependencies on other tasks and its predecessor tasks have not completed, the task scheduling unit 50 does not schedule task t.
  • when selecting the task to be scheduled from the unscheduled tasks, the task scheduling unit 50 may adopt at least one of the following methods: randomly selecting a task, selecting the task with the longest expected execution time, selecting the task with the shortest expected execution time, selecting the task that occupies the most resources, or selecting the task that occupies the least resources.
  • the task scheduling unit 50 may allocate the task to be scheduled to the target core by using at least one of the following scheduling modes.
  • the first scheduling mode is: counting the number of tasks in each core private task queue, selecting the core with the least task in the private task queue as the target core, and assigning the task to be scheduled to the target core;
  • the second scheduling mode is: counting the time for each core to complete all tasks in the private task queue, selecting the core with the shortest task time as the target core, and assigning the task to be scheduled to the target core;
  • the third scheduling mode: counting the distribution across all cores of the resources required by the task to be scheduled, selecting the core that holds the most of these resources as the target core, and assigning the task to be scheduled to the target core;
  • the fourth scheduling mode: a heuristic algorithm is used to assign the task to be scheduled to the target core; heuristic algorithms include but are not limited to genetic algorithms, ant colony algorithms, and simulated annealing algorithms. A minimal scheduling sketch is given below.
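  • To make the first two scheduling modes concrete, here is a minimal, non-authoritative Python sketch (the class, field, and function names are assumptions) that skips tasks whose predecessors have not finished and picks a target core either by fewest queued tasks or by shortest estimated completion time.

    from dataclasses import dataclass, field

    @dataclass
    class Core:
        queue: list = field(default_factory=list)   # tasks already assigned to this core
        est_finish_time: float = 0.0                # estimated time to drain the private queue

    def schedule(tasks, cores, mode="fewest_tasks"):
        """Assign each ready task (all predecessors done) to a target core."""
        for task in tasks:
            if not all(pred.get("done", False) for pred in task.get("preds", [])):
                continue                             # dependency not satisfied: do not schedule
            if mode == "fewest_tasks":
                target = min(cores, key=lambda c: len(c.queue))
            else:                                    # "shortest_finish": least loaded in time
                target = min(cores, key=lambda c: c.est_finish_time)
            target.queue.append(task)
            target.est_finish_time += task.get("exec_time", 1.0)

    cores = [Core(), Core()]
    schedule([{"exec_time": 2.0}, {"exec_time": 1.0}, {"exec_time": 3.0}], cores)
    print([len(c.queue) for c in cores], [c.est_finish_time for c in cores])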
  • FIG. 21 is a structural block diagram of a multi-core processor according to still another embodiment of the present disclosure.
  • the multi-core neural network processor 1000 includes J processing cores, where J is a positive integer greater than 1, together with the task segmentation apparatus 100 and the task scheduling apparatus 300 of the foregoing embodiments.
  • the task segmentation device 100 splits the input neural network application so that the subtasks obtained after segmentation can satisfy the real-time requirements of the system, and the task scheduling device 300 schedules the neural network subtasks, which can improve quality of service, improve the utilization of the processing cores, ensure task balancing between the processing cores, and reduce energy consumption.
  • the neural network processing cores are used for neural network operations and complete the neural network subtasks.
  • the topological structures between the J neural network processing cores include but are not limited to one-dimensional linear, two-dimensional mesh, two-dimensional star, three-dimensional cube, and the like.
  • FIG. 22 is a structural block diagram of each neural network processing core according to still another embodiment of the present disclosure.
  • the neural network processing core 500 includes a storage unit 501, a control unit 502, a selection unit 503, and an operation unit 504.
  • the storage unit 501 is configured to store neurons, weights, and instructions of the neural network; when the neural network subtask processes the sparse neural network, the stored weights are non-zero weights and location information of non-zero weights.
  • the control unit 502 is configured to receive neural network dedicated instructions and, after decoding them, generate control information to control the selection unit and the operation unit;
  • the neural network specific instructions include all instructions dedicated to performing artificial neural network operations.
  • Neural network specific instructions include, but are not limited to, control instructions, data transfer instructions, arithmetic instructions, and logic instructions.
  • the control command controls the execution process of the neural network.
  • Data transfer instructions complete the transfer of data between different storage media, including but not limited to matrices, vectors, and scalars.
  • the arithmetic instruction completes the arithmetic operation of the neural network, including but not limited to the matrix operation instruction, the vector operation instruction, the scalar operation instruction, the convolutional neural network operation instruction, the fully connected neural network operation instruction, the pooled neural network operation instruction, the RBM neural network operation Instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, MAXOUT neural network operation instructions.
  • Logic instructions complete the logical operations of the neural network, including but not limited to vector logic operations instructions and scalar logic operation instructions.
  • the RBM neural network operation instruction is used to implement the Restricted Boltzmann Machine (RBM) neural network operation.
  • the LRN neural network operation instruction is used to implement the Local Response Normalization (LRN) neural network operation.
  • the LSTM neural network operation instruction is used to implement Long Short-Term Memory (LSTM) neural network operation.
  • the RNN neural network operation instruction is used to implement Recurrent Neural Networks (RNN) neural network operation.
  • the RELU neural network operation instruction is used to implement the Rectified linear unit (RELU) neural network operation.
  • the PRELU neural network operation instruction is used to implement the Parametric Rectified Linear Unit (PRELU) neural network operation.
  • SIGMOID neural network operation instruction is used to realize S-type growth curve (SIGMOID) neural network operation
  • the TANH neural network operation instruction is used to implement the hyperbolic tangent function (TANH) neural network operation.
  • the MAXOUT neural network operation instruction is used to implement the Maxout neural network operation.
  • More specifically, the neural network dedicated instructions include the Cambricon instruction set.
  • the Cambricon instruction set is characterized in that each instruction in the instruction set has a length of 64 bits, and the instruction consists of an operation code and an operand.
  • the instruction set contains four types of instructions, namely, control instructions, data transfer instructions, computational instructions, and logical instructions.
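  • The opcode-plus-operand layout of a 64-bit instruction can be visualized with the following purely hypothetical Python packing; the field widths, field names, and opcode value below are assumptions chosen for illustration and are not the actual Cambricon encoding.

    # Hypothetical 64-bit word: 8-bit opcode in the top bits, the rest treated as one operand field.
    OPCODE_BITS, OPERAND_BITS = 8, 56

    def encode(opcode: int, operand: int) -> int:
        """Pack an opcode and operand into a single 64-bit instruction word (illustrative only)."""
        assert 0 <= opcode < (1 << OPCODE_BITS) and 0 <= operand < (1 << OPERAND_BITS)
        return (opcode << OPERAND_BITS) | operand

    def decode(word: int):
        """Unpack the hypothetical word back into (opcode, operand)."""
        return word >> OPERAND_BITS, word & ((1 << OPERAND_BITS) - 1)

    word = encode(opcode=0x12, operand=0xABCDEF)
    print(hex(word), decode(word))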
  • Control instructions are used to control the execution process.
  • Control instructions include jump instructions and conditional branch instructions.
  • the data transfer instruction is used to complete data transfer between different storage media.
  • the data transfer instructions include a load instruction, a store instruction, and a move instruction.
  • the load instruction is used to load data from the main memory to the cache
  • the store instruction is used to store data from the cache to the main memory
  • the move instruction is used to transfer data between caches, between a cache and a register, or between registers.
  • Data transfer instructions support three different ways of organizing data, including matrices, vectors, and scalars.
  • the arithmetic instructions are used to perform neural network arithmetic operations.
  • the arithmetic instructions include matrix operation instructions, vector operation instructions, and scalar operation instructions.
  • the matrix operation instruction completes the matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix.
  • the vector operation instruction completes the vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, dot product, random vector generation, and maximum/minimum of a vector. The vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; the vector transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.
  • scalar operations complete scalar operations in neural networks, including scalar elementary arithmetics and scalar transcendental functions.
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division.
  • the scalar transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.
  • Logical operations include vector logic operations instructions and scalar logic operation instructions.
  • the vector logic operation instructions include vector compare, vector logical operation, and vector greater-than-merge instructions, where vector compare includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to, and the vector logical operations include AND, OR, and NOT.
  • scalar logic operations include scalar comparison, scalar logical operations.
  • the scalar comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to.
  • Scalar logic operations include AND, OR, and NOT.
  • the selection unit 503 is configured to receive the input neurons and the non-zero weight position information and to select the neurons corresponding to the non-zero weights; that is, for each output neuron, the selection unit discards the input neuron data that has no corresponding non-zero weight for that output neuron.
  • the operation unit 504 is configured to receive the input neurons corresponding to the non-zero weights and the corresponding non-zero weights, complete the neural network training operation, and transmit the output neurons back to the storage unit.
  • the arithmetic unit 504 performs a corresponding operation on the data according to an instruction stored in the storage unit.
  • the arithmetic unit 504 includes but is not limited to three parts, the first part is a multiplier, the second part is one or more adders, and the third part is an activation function unit.
  • the one or more adders of the second portion constitute an addition tree.
  • the first part multiplies the input data 1 (in1) and the input data 2 (in2) to obtain the output (out) after multiplication.
  • Pooling operations include, but are not limited to, average pooling, maximum pooling, and median pooling, where the input data in is the data in the pooling kernel associated with the output out.
  • the operations performed by the operation unit include, but are not limited to: the first part multiplies input data 1 and input data 2 to obtain the multiplied data; the second part performs the addition tree operation, in which input data 1 is accumulated stage by stage through the addition tree, or input data 1 is added to input data 2, to obtain the output data; the third part performs the activation function operation, applying an activation function (active) to the input data to obtain the output data.
  • the operations of the above parts can be freely combined to realize the operation of various functions.
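  • As a hedged sketch of the three-part operation unit described above (multiplier, addition tree, activation function unit), the following Python code chains the parts in the spirit of the description; the function names and the example values are assumptions for illustration only.

    import math

    def multiply(in1, in2):                      # first part: element-wise multiplication
        return [a * b for a, b in zip(in1, in2)]

    def adder_tree(values):                      # second part: pairwise, stage-by-stage accumulation
        while len(values) > 1:
            values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                      for i in range(0, len(values), 2)]
        return values[0]

    def activate(x, fn="sigmoid"):               # third part: activation function
        return 1.0 / (1.0 + math.exp(-x)) if fn == "sigmoid" else max(0.0, x)

    # Freely combined: multiply inputs by weights, accumulate through the tree, then activate.
    out = activate(adder_tree(multiply([1.0, 2.0, 3.0], [0.5, -0.25, 0.1])))
    print(out)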
  • the neural network processing core 500 can also include a pre-processing module 505, as shown in FIG. 4, which pre-processes the raw data, including dicing, Gaussian filtering, binarization, regularization, normalization, and the like.
  • the neural network processing core 500 can also include an instruction cache 506, a non-zero weight buffer 507, a non-zero weight location cache 508, an input neuron cache 509, and an output neuron cache 510.
  • Instruction cache 506 for storing dedicated instructions; non-zero weight cache 507 for caching non-zero weight data; non-zero weight location cache 508 for caching non-zero weight location data and, based on the non-zero weight location data, mapping each weight in the input data one-to-one to the corresponding input neuron; input neuron cache 509 for caching the input neurons; and output neuron cache 510 for caching the output neurons output by the operation unit.
  • the non-zero weight position data indicates whether each input neuron datum and each output neuron datum are connected by corresponding non-zero weight data.
  • one form of the one-to-one correspondence in the non-zero weight position cache is: 1 indicates a connection, 0 indicates no connection, and the connection states of each group of output neurons with all input neurons form a string of 0s and 1s that represents the connection relationship of those output neurons.
  • another form of the one-to-one correspondence is: 1 indicates a connection, 0 indicates no connection, and the connection states of each group of input neurons with all output neurons form a string of 0s and 1s that represents the connection relationship of those input neurons.
  • yet another form of the one-to-one correspondence expresses the connections as distances: the distance from the input neuron of the first connection of a group of output neurons to the first input neuron, the distance from the input neuron of the second connection to the input neuron of the previous connection, and so on, until all connections of that group of output neurons are listed, to represent the connection relationship of the output neurons.
  • in the above, a connection means that the input neuron datum and the output neuron datum have corresponding non-zero weight data, and no connection means that they have no corresponding non-zero weight data; a minimal encoding sketch is given below.
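  • The 1/0 connection encoding, the distance re-encoding, and the selection step can be sketched as follows; this is an illustration under the stated conventions, with hypothetical helper names, not the patented hardware behavior.

    def select_inputs(input_neurons, connection_bits):
        """Keep only the input neurons whose position bit is 1 (i.e. has a non-zero weight)."""
        return [x for x, bit in zip(input_neurons, connection_bits) if bit == 1]

    def bits_to_distances(connection_bits):
        """Re-express the 0/1 string as distances: first connection from input 0,
        then each further connection from the previous one."""
        positions = [i for i, bit in enumerate(connection_bits) if bit == 1]
        return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])] if positions else []

    bits = [1, 0, 0, 1, 1, 0]                              # connections of one output neuron to six inputs
    print(select_inputs([10, 20, 30, 40, 50, 60], bits))   # -> [10, 40, 50]
    print(bits_to_distances(bits))                         # -> [0, 3, 1]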
  • the neural network processing core 500 may also include a direct memory access (DMA) unit 512.
  • the DMA is used to read or write data or instructions in the storage unit, the instruction cache, the non-zero weight cache, the non-zero weight location cache, the input neuron cache, and the output neuron cache.
  • a chip that includes the neural network processor described above.
  • a chip package structure that includes the chip described above.
  • a board that includes the chip package structure described above.
  • an electronic device that includes the above-described board.
  • Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
  • the vehicle includes an airplane, a ship, and/or a vehicle;
  • the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, a range hood;
  • the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
  • a further embodiment of the present disclosure provides a task segmentation method for a neural network, in which at least one of the following five granularities of task segmentation is selected for task segmentation.
  • the first granularity task segmentation method takes the task as a subtask as a whole, and specifically, the M sample calculations are completed as a subtask.
  • This task splitting method only generates one subtask, and there is no dependency between subtasks.
  • the second granularity task segmentation method will complete several sample calculations as a subtask.
  • the third granularity task segmentation method can perform task segmentation on the neural network application according to the layer types of the neural network, with the calculation of all layers of the same type forming one subtask.
  • the layer types of the neural network include, but are not limited to, a convolutional layer, a fully connected layer, an LSTM layer, a pooling layer, an activation layer, an LRN layer, and a BN layer. There are complex dependencies between subtasks in this task segmentation mode.
  • the fourth granularity task segmentation method can perform task segmentation on the neural network application according to the inter-layer structure of the neural network, and the calculation of adjacent layers is used as a sub-task.
  • the neural network application is divided into n subtasks.
  • the first subtask completes the calculation of layers 1 through N1 of the neural network, N1 layers in total; the second subtask completes layers N1+1 through N1+N2, N2 layers in total; and, in general, the i-th subtask completes layers N1+...+N(i-1)+1 through N1+...+Ni, Ni layers in total.
  • the i-th subtask is the predecessor task of the (i+1)-th subtask, and the (i+1)-th subtask is the successor task of the i-th subtask; the (i+1)-th subtask must wait for the i-th subtask to complete before it can begin execution.
  • the fifth granularity task segmentation method performs task segmentation according to the intra-layer structure of the neural network, and the calculation within a neural network layer is further divided into subtasks.
  • the segmentation according to the calculations within a neural network layer includes, but is not limited to, task segmentation of a convolution layer calculation, a fully connected layer calculation, a pooling layer calculation, or an activation layer calculation of the neural network.
  • a further embodiment of the present disclosure provides a task scheduling method, which can comprehensively consider the dependencies between tasks, the locality of tasks, the task segmentation granularity, and the running frequency and load of each core when scheduling tasks, thereby improving quality of service, improving core utilization, ensuring that tasks are balanced between cores, and reducing energy consumption.
  • the task scheduling method includes the following steps:
  • caching all unscheduled neural network tasks, and optionally storing, for each task to be scheduled, its estimated execution time, the task dependency graph, and the distribution of the task's resources across the cores, the neural network tasks being, for example, the subtasks split in the foregoing embodiments;
  • monitoring in real time the overall service quality of the multi-core neural network processor and the working state of each core, for example, the utilization rate, workload, working frequency, number of tasks in the private task queue, and task completion time of each core;
  • the task to be scheduled is selected from the unscheduled tasks, and the mapping relationship between the task to be scheduled and the target core is determined according to the information of the task to be scheduled and the working state of each core, and the task to be scheduled is allocated to the target core.
  • the task scheduling may schedule unscheduled tasks in the task queue every time T, and T is a real number greater than zero. If the unscheduled task t has a dependency relationship with other tasks and the predecessor task is not completed, the task t is not scheduled.
  • Scheduling the task to be scheduled to the target core may adopt at least one of the following scheduling modes: the first scheduling mode: counting the number of tasks in each core private task queue, and selecting the core with the least task in the private task queue as the target core. Assigning the task to be scheduled to the target core;
  • the second scheduling mode is: counting the time for each core to complete all tasks in the private task queue, selecting the core with the shortest task time as the target core, and assigning the task to be scheduled to the target core;
  • the third scheduling mode: counting the distribution across all cores of the resources required by the task to be scheduled, selecting the core that holds the most of these resources as the target core, and assigning the task to be scheduled to the target core;
  • the fourth scheduling mode: a heuristic algorithm is used to assign the task to be scheduled to the target core; heuristic algorithms include but are not limited to genetic algorithms, ant colony algorithms, and simulated annealing algorithms.
  • the processor includes:
  • a task segmentation device for performing task segmentation according to task segmentation granularity
  • the hardware resource dividing device is configured to divide the hardware resources of the processor according to the task segmentation result.
  • the hardware resource partitioning device may include a distribution configuration module for distributing the configuration information.
  • the configuration information may include configuration information that divides the hardware resources determined according to the task segmentation result. At this time, the corresponding configuration information is determined according to the task segmentation result, and the hardware resources are divided according to the configuration information.
  • the processor further includes a calculation module, the calculation module includes a plurality of calculation units, and the hardware resource dividing device is configured to divide the plurality of calculation units of the processor according to the task segmentation result; that is, the plurality of calculation units are divided into a plurality of calculation groups according to the task segmentation result, so as to compute different forward and backward paths within a batch, or to serve different service requests. A minimal sketch of such a division is given below.
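  • The following Python sketch is an assumption-laden illustration of dividing calculation units into calculation groups according to a task segmentation result; the function name, group-size input, and handling of spare units are illustrative choices, not the patented configuration format.

    def divide_compute_units(num_units, group_sizes):
        """Partition calculation units 0..num_units-1 into groups whose sizes follow
        the task segmentation result (e.g. one group per forward/backward path or service)."""
        assert sum(group_sizes) <= num_units, "segmentation result asks for more units than exist"
        groups, start = [], 0
        for size in group_sizes:
            groups.append(list(range(start, start + size)))
            start += size
        spare = list(range(start, num_units))        # unassigned units stay idle or join a group later
        return groups, spare

    groups, spare = divide_compute_units(num_units=8, group_sizes=[4, 2])
    print(groups, spare)   # -> [[0, 1, 2, 3], [4, 5]] [6, 7]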
  • the processor further includes: an external storage module, an internal storage module, and a control module.
  • An external storage module for storing data information of the computing module, the internal storage module, the control module, and the distribution configuration module.
  • the data information includes: weight data, neuron data (including input), instruction data, configuration information, and the like.
  • the external storage module can provide a read/write interface to an external memory, and can configure related registers to flexibly implement operations on different external memories.
  • An internal storage module for storing data for use by the computing module, including: weights, neurons (including inputs), instruction data, and the like.
  • the internal storage module also provides a read/write interface with an external storage module to complete data exchange between the internal storage module and the external storage module.
  • the control module provides an interface for controlling signal exchange with the external storage module for accepting and parsing the external control signal, thereby completing control of other modules.
  • the control module also provides a signal exchange interface with the calculation module for configuring and controlling the calculation module to perform different calculations.
  • the control module further provides a handshake interface with the distribution configuration module of the hardware resource partitioning device for transmitting the configuration signal to the distribution configuration module, thereby controlling the functions performed by the distribution configuration.
  • the control module may include a storage unit, or a storage unit may be configured outside it, for storing different control information.
  • the control module also provides a signal exchange interface with the task segmentation device for controlling the task segmentation device for task segmentation.
  • the distribution module provides a handshake interface with the computing module to distribute configuration information for configuring functions and data connections in the computing module to support the computing module to complete batch and multi-service requests.
  • the functions mainly refer to calculation functions such as inner product operation, outer product operation, nonlinear function operation, and transcendental function operation; correspondingly, the data connection refers to the connection state required by the calculation module for a given calculation function, for example, into how many calculation groups the multiple calculation units included in the calculation module are divided.
  • the distribution configuration module may include a storage unit, and a storage unit may be externally configured to store different configuration information.
  • the task segmentation device provides a signal exchange interface with the calculation module to perform task division on the calculation module.
  • the task segmentation device may divide the task on all computing units of the computing module, or may selectively divide the task on a part of the computing unit of the computing module.
  • the computing module includes a plurality of processing elements (PEs).
  • the plurality of computing units can be divided into a plurality of computing groups to perform different operations. Further, the plurality of computing units may be the same computing unit, that is, an isomorphic mode; or may be different computing units, that is, a heterogeneous mode.
  • the computing unit may be a unit that performs simple operations, such as scalar multiplication, scalar addition, and similar scalar operations; it may be a unit that performs vector operations, such as vector multiplication, vector addition, and vector inner product; or it may be a hybrid computing unit, such as a matrix computing unit for operations like matrix multiplication, a hybrid unit combining vector inner product calculation with nonlinear calculation, or a systolic-array-based multiply-accumulate unit.
  • the processor includes: an external storage module and a control module; and further includes: a weight buffer unit, an input neuron buffer unit, an output neuron buffer unit, and an instruction cache unit.
  • the instruction cache unit is configured to cache an instruction
  • the weight buffer unit is configured to cache weight data
  • the input neuron buffer unit is configured to cache input neuron data
  • the output neuron buffer unit is configured to cache an operation result output by the calculation module, and output the result to an external storage module.
  • control module is configured to read an instruction from the instruction cache, decode it into an instruction that the calculation module can execute, and output the instruction to the calculation module.
  • Other modules and functions in this embodiment may be the same as in the previous embodiment, and details are not described herein again.
  • the input data of the processor includes pictures, video, audio, text, and the like.
  • the output data of the processor is numerical data whose meaning includes, but is not limited to, classification results and generation results.
  • the control module of the processor controls the calculation module, the hardware resource dividing device, and the task segmentation device according to control signals; the control modes include direct control and parsed control. In the direct control mode, the control signal is input directly into the other modules without being parsed by the control module; in the parsed control mode, the control signal is first parsed by the control module, and the parsed control signal is then input into the other modules for configuration and control.
  • the task segmentation device includes a granular task segmentation unit and a task segmentation granularity selection unit.
  • the granular task segmentation unit uses at least one granularity to segment the tasks to form subtasks, and provides multi-granular task segmentation selection for neural network applications.
  • the task segmentation granularity selection unit selects the granularity used for task division, guiding the neural network application to choose the appropriate task segmentation granularity so that the subtasks obtained after segmentation can meet the real-time requirements of the system.
  • the granularity task segmentation unit may include a first granularity task segmentation unit, a second granularity task segmentation unit, a third granularity task segmentation unit, a fourth granularity task segmentation unit, and a fifth granularity task segmentation unit.
  • M, used below, is a positive integer greater than zero.
  • the first granularity task segmentation unit takes the task as a subtask as a whole, and specifically, completes M sample calculations as a subtask.
  • This task splitting method only generates one subtask, and there is no dependency between subtasks.
  • the second granularity task segmentation unit takes the calculation of several samples as one subtask.
  • the third granularity task segmentation unit can perform task segmentation on the neural network application according to the layer types of the neural network, with the calculation of all layers of the same type forming one subtask.
  • the layer types of the neural network include, but are not limited to, a convolutional layer, a fully connected layer, an LSTM layer, a pooling layer, an activation layer, an LRN layer, and a BN layer. There are complex dependencies between subtasks in this task segmentation mode.
  • the fourth granularity task segmentation unit can perform task segmentation on the neural network application according to the inter-layer structure of the neural network, and the calculation of adjacent layers is performed as a sub-task.
  • the neural network application is divided into n subtasks.
  • the first subtask completes the calculation of layers 1 through N1 of the neural network, N1 layers in total; the second subtask completes layers N1+1 through N1+N2, N2 layers in total; and, in general, the i-th subtask completes layers N1+...+N(i-1)+1 through N1+...+Ni, Ni layers in total.
  • the i-th subtask is the predecessor task of the (i+1)-th subtask, and the (i+1)-th subtask is the successor task of the i-th subtask; the (i+1)-th subtask must wait for the i-th subtask to complete before it can begin execution.
  • the fifth granularity task segmentation unit can perform task segmentation on the neural network application according to the intra-layer structure of the neural network, and the calculation in the neural network layer can be further divided into sub-tasks.
  • the segmentation according to the calculations within a neural network layer includes, but is not limited to, task segmentation of a convolution layer calculation, a fully connected layer calculation, a pooling layer calculation, or an activation layer calculation of the neural network.
  • the task segmentation functions mentioned above may be implemented by independent hardware units, for example, by the first granularity task segmentation unit, the second granularity task segmentation unit, the third granularity task segmentation unit, the fourth granularity task segmentation unit, and the fifth granularity task segmentation unit respectively; alternatively, the same hardware unit may be used to implement all of the above functions.
  • the convolutional layer input neurons form a three-dimensional matrix (Nfin, Nxin, Nyin), the weights form a four-dimensional matrix (Nfout, Nfin, Kx, Ky), and the output neurons form a three-dimensional matrix (Nfout, Nxout, Nyout), where Nfin is the number of input feature images, (Nxin, Nyin) is the input feature image size, Nfout is the number of output feature images, (Kx, Ky) is the convolution kernel size, and (Nxout, Nyout) is the output feature image size.
  • Completing one output neuron requires Nfin × Kx × Ky multiply-add operations, and the number of output neurons is Nfout × Nxout × Nyout, so the entire convolutional layer requires Nfout × Nxout × Nyout × Nfin × Kx × Ky multiply-add operations.
  • the output neurons are segmented into blocks of size (Bfout, Bxout, Byout) and the weights are segmented into blocks of size (Bfout, Bfin, Bx, By); each subtask then uses a (Bfout, Bfin, Bx, By) weight block to calculate the intermediate results of Bfout × Bxout × Byout output neurons. The intermediate result of each output neuron requires Bfin × Bx × By multiply-add operations, so each subtask completes Bfout × Bxout × Byout × Bfin × Bx × By multiply-add operations in total.
  • Bfout is a positive integer greater than 0 and less than or equal to Nfout
  • Bxout is a positive integer greater than 0 and less than or equal to Nxout
  • Byout is a positive integer greater than 0 and less than or equal to Nyout
  • Bfin is a positive integer greater than 0 and less than or equal to Nfin
  • Bx is a positive integer greater than 0 and less than or equal to Kx
  • By is a positive integer greater than 0 and less than or equal to Ky.
  • Task segmentation is performed on a fully connected layer calculation of the neural network.
  • the fully connected layer input neurons are Nin
  • the weights are two-dimensional matrices (Nout, Nin)
  • the output neurons are Nout, where Nin is the number of input neurons and Nout is the number of output neurons. Completing one output neuron requires Nin multiply-add operations, and the number of output neurons is Nout, so a total of Nout × Nin multiply-add operations are required to complete the entire fully connected layer.
  • the output neurons are segmented into blocks of size Bout and the weights are segmented into blocks of size (Bout, Bin); each subtask uses a (Bout, Bin) weight matrix to calculate the intermediate results of Bout output neurons. The intermediate result of each output neuron requires Bin multiply-add operations, so each subtask completes Bout × Bin multiply-add operations in total.
  • Bout is a positive integer greater than 0 and less than or equal to Nout
  • Bin is a positive integer greater than 0 and less than or equal to Nin.
  • the pooling layer input neurons are Nin and the output neurons are Nout, where Nin and Nout are positive integers greater than 0; the pooling operations include but are not limited to average pooling, maximum pooling, and median pooling.
  • the output neurons are segmented according to the block size of Bout, and each subtask completes the calculation of Bout output neurons.
  • Bout is a positive integer greater than 0 and less than or equal to Nout
  • Bin is a positive integer greater than 0 and less than or equal to Nin.
  • Task segmentation is performed on an activation layer calculation of the neural network: the activation layer input neurons are Nin and the output neurons are Nout, where Nin and Nout are positive integers greater than 0; the activation functions include but are not limited to sigmoid, tanh, relu, and softmax.
  • the output neurons are segmented according to the block size of Bout, and each subtask completes the calculation of Bout output neurons.
  • Bout is a positive integer greater than 0 and less than or equal to Nout. There is no dependency between subtasks in this task splitting mode.
  • the task segmentation granularity selection unit selects the granularity used for task partitioning; it is not limited to selecting only one of the granularities described above, and may also select a combination of multiple granularities.
  • For example, a neural network application may combine the splitting methods of the fourth granularity task segmentation unit and the fifth granularity task segmentation unit: the neural network application is first divided into n subtasks according to the splitting method of the fourth granularity task segmentation unit, and then p of these subtasks are each further split according to the splitting method of the fifth granularity task segmentation unit.
  • the granular task segmentation unit may include at least one of the first to fifth granular task segmentation units, and does not necessarily include all of the first to fifth granularity task segmentation units.
  • the granular task segmentation unit may further include a hybrid granularity task segmentation unit for combining the segmentation manners of the first to fifth granularity task segmentation units for selection by the task segmentation granularity selection unit.
  • the processor may be a multi-core processor, which further includes a task scheduling device.
  • the task scheduling device includes a task queue unit, a monitoring unit, and a task scheduling unit.
  • the neural network task scheduling device can comprehensively consider the dependencies between tasks, the locality of tasks, the task segmentation granularity, and the running frequency and load of each core when scheduling tasks, thereby improving quality of service, improving core utilization, ensuring that tasks are balanced between cores, and reducing energy consumption.
  • the task queue unit caches all unscheduled neural network tasks, and may optionally store, for each task to be scheduled, its estimated execution time, the task dependency graph, and the distribution of the task's resources across the cores. The neural network tasks are, for example, the subtasks split in the foregoing embodiments.
  • the monitoring unit detects the overall service quality of the multi-core neural network processor and the working status of each core in real time, such as the utilization rate of each core, the workload, the working frequency, the number of tasks in the private task queue in the core, and the task completion time.
  • the task scheduling unit selects a task to be scheduled from the unscheduled tasks, determines a mapping relationship between the task to be scheduled and a target core according to the information of the task to be scheduled and the working state of each core, and allocates the task to be scheduled to the target core.
  • the task scheduling unit may schedule unscheduled tasks in the task queue every time T, and T is a real number greater than zero. If the unscheduled task t has a dependency relationship with other tasks and the predecessor task is not completed, the task scheduling unit does not schedule the task t.
  • when selecting the task to be scheduled from the unscheduled tasks, the task scheduling unit may adopt at least one of the following methods: randomly selecting a task, selecting the task with the longest expected execution time, selecting the task with the shortest expected execution time, selecting the task that occupies the most resources, or selecting the task that occupies the least resources.
  • the task scheduling unit may allocate the task to be scheduled to the target core by using at least one of the following scheduling modes.
  • the first scheduling mode is: counting the number of tasks in each core private task queue, selecting the core with the least task in the private task queue as the target core, and assigning the task to be scheduled to the target core;
  • the second scheduling mode is: counting the time for each core to complete all tasks in the private task queue, selecting the core with the shortest task time as the target core, and assigning the task to be scheduled to the target core;
  • the third scheduling mode: counting the distribution across all cores of the resources required by the task to be scheduled, selecting the core that holds the most of these resources as the target core, and assigning the task to be scheduled to the target core;
  • the fourth scheduling mode: a heuristic algorithm is used to assign the task to be scheduled to the target core; heuristic algorithms include but are not limited to genetic algorithms, ant colony algorithms, and simulated annealing algorithms.
  • the processor is a multi-core processor, such as a multi-core neural network processor.
  • the multi-core neural network processor includes J processing cores, where J is a positive integer greater than 1, together with the task segmentation device and the task scheduling device of the foregoing embodiments.
  • the task segmentation device splits the input neural network application so that the subtasks obtained after segmentation can satisfy the real-time requirements of the system, and the task scheduling device schedules the neural network subtasks, which can improve quality of service, improve the utilization of the processing cores, ensure task balancing between the processing cores, and reduce energy consumption.
  • the neural network processing cores are used for neural network operations and complete the neural network subtasks.
  • the topological structures between the J neural network processing cores include but are not limited to one-dimensional linear, two-dimensional mesh, two-dimensional star, three-dimensional cube, and the like.
  • the neural network processing core includes a storage unit, a control unit, a selection unit, and an operation unit.
  • the storage unit is configured to store neurons, weights, and instructions of the neural network; when the neural network subtask processes the sparse neural network, the stored weights are non-zero weights and location information of non-zero weights.
  • the control unit is configured to receive neural network dedicated instructions and, after decoding them, generate control information to control the selection unit and the operation unit;
  • the neural network specific instructions include all instructions dedicated to performing artificial neural network operations.
  • Neural network specific instructions include, but are not limited to, control instructions, data transfer instructions, arithmetic instructions, and logic instructions.
  • the control command controls the execution process of the neural network.
  • Data transfer instructions complete the transfer of data between different storage media, including but not limited to matrices, vectors, and scalars.
  • the arithmetic instruction completes the arithmetic operation of the neural network, including but not limited to the matrix operation instruction, the vector operation instruction, the scalar operation instruction, the convolutional neural network operation instruction, the fully connected neural network operation instruction, the pooled neural network operation instruction, the RBM neural network operation Instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, MAXOUT neural network operation instructions.
  • Logic instructions complete the logical operations of the neural network, including but not limited to vector logic operations instructions and scalar logic operation instructions.
  • the RBM neural network operation instruction is used to implement the Restricted Boltzmann Machine (RBM) neural network operation.
  • the LRN neural network operation instruction is used to implement the Local Response Normalization (LRN) neural network operation.
  • the LSTM neural network operation instruction is used to implement Long Short-Term Memory (LSTM) neural network operation.
  • the RNN neural network operation instruction is used to implement Recurrent Neural Networks (RNN) neural network operation.
  • the RELU neural network operation instruction is used to implement the Rectified linear unit (RELU) neural network operation.
  • the PRELU neural network operation instruction is used to implement the Parametric Rectified Linear Unit (PRELU) neural network operation.
  • SIGMOID neural network operation instruction is used to realize S-type growth curve (SIGMOID) neural network operation
  • the TANH neural network operation instruction is used to implement the hyperbolic tangent function (TANH) neural network operation.
  • the MAXOUT neural network operation instruction is used to implement the Maxout neural network operation.
  • More specifically, the neural network dedicated instructions include the Cambricon instruction set.
  • Each instruction in the Cambricon instruction set has a length of 64 bits, and the instruction consists of an operation code and an operand.
  • the instruction set contains four types of instructions, namely, control instructions, data transfer instructions, computational instructions, and logical instructions.
  • Control instructions are used to control the execution process.
  • Control instructions include jump instructions and conditional branch instructions.
  • the data transfer instruction is used to complete data transfer between different storage media.
  • the data transfer instructions include a load instruction, a store instruction, and a move instruction.
  • the load instruction is used to load data from the main memory to the cache
  • the store instruction is used to store data from the cache to the main memory
  • the move instruction is used to transfer data between caches, between a cache and a register, or between registers.
  • Data transfer instructions support three different ways of organizing data, including matrices, vectors, and scalars.
  • the arithmetic instructions are used to perform neural network arithmetic operations.
  • the arithmetic instructions include matrix operation instructions, vector operation instructions, and scalar operation instructions.
  • the matrix operation instruction completes the matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix.
  • the vector operation instruction completes the vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, dot product, random vector generation, and maximum/minimum of a vector. The vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; the vector transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.
  • scalar operations complete scalar operations in neural networks, including scalar elementary arithmetics and scalar transcendental functions.
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division.
  • the scalar transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.
  • Logical operations include vector logic operations instructions and scalar logic operation instructions.
  • the vector logic operation instructions include vector compare, vector logical operation, and vector greater-than-merge instructions, where vector compare includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to, and the vector logical operations include AND, OR, and NOT.
  • scalar logic operations include scalar comparison, scalar logical operations.
  • the scalar comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to.
  • Scalar logic operations include AND, OR, and NOT.
  • the selection unit is configured to receive the input neurons and the non-zero weight position information and to select the neurons corresponding to the non-zero weights; that is, for each output neuron, the selection unit discards the input neuron data that has no corresponding non-zero weight for that output neuron.
  • the operation unit is configured to receive the input neurons corresponding to the non-zero weights and the corresponding non-zero weights, complete the neural network training operation, and transmit the output neurons back to the storage unit.
  • the arithmetic unit performs a corresponding operation on the data according to an instruction stored in the storage unit.
  • the arithmetic unit includes but is not limited to three parts, the first part is a multiplier, the second part is one or more adders, and the third part is an activation function unit.
  • the one or more adders of the second portion constitute an addition tree.
  • the first part multiplies the input data 1 (in1) and the input data 2 (in2) to obtain the output (out) after multiplication.
  • Pooling operations include, but are not limited to, average pooling, maximum pooling, and median pooling, where the input data in is the data in the pooling kernel associated with the output out.
  • the operations performed by the operation unit include, but are not limited to: the first part multiplies input data 1 and input data 2 to obtain the multiplied data; the second part performs the addition tree operation, in which input data 1 is accumulated stage by stage through the addition tree, or input data 1 is added to input data 2, to obtain the output data; the third part performs the activation function operation, applying an activation function (active) to the input data to obtain the output data.
  • the operations of the above parts can be freely combined to realize the operation of various functions.
  • the neural network processing core may also include a pre-processing module, as shown in FIG. 31, which pre-processes the original data, including dicing, Gaussian filtering, binarization, regularization, normalization, and the like.
  • the neural network processing core may also include an instruction cache, a non-zero weight buffer, a non-zero weight location cache, an input neuron cache, and an output neuron cache.
  • Instruction cache, for storing dedicated instructions; non-zero weight buffer, for caching non-zero weight data; non-zero weight location buffer, for caching non-zero weight location data, which maps each weight in the input data one by one to the corresponding input neuron; input neuron buffer, for buffering the input neurons; and output neuron buffer, for buffering the output neurons output by the arithmetic unit.
  • the non-zero weight position data indicates whether each input neuron data and each output neuron data have corresponding non-zero weight data.
  • in one case, the one-to-one correspondence in the non-zero weight position buffer is represented as follows: 1 means that there is a connection, 0 means no connection, and the connection states of each group of output neurons with all input neurons form a string of 0s and 1s that represents the connection relationship of that output neuron.
  • in another case, 1 means that there is a connection, 0 means no connection, and the connection states of each group of input neurons with all output neurons form a string of 0s and 1s that represents the connection relationship of that input neuron.
  • in yet another case, the one-to-one correspondence in the non-zero weight position buffer is represented by distances: the distance from the first input neuron to the input neuron of the first connection of a group of output neurons, then the distance from each subsequent connection to the previous connection, and so on.
  • in the above, a connection means that the input neuron data and the output neuron data have corresponding non-zero weight data, and no connection means that the input neuron data and the output neuron data have no corresponding non-zero weight data.
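  • A small Python sketch of the two position encodings just described follows; the helper names are illustrative and do not reflect the chip's actual data layout.

```python
# Hypothetical sketch of non-zero weight position encodings.
def connection_string(weights_per_output):
    # 1 = connected (non-zero weight), 0 = no connection
    return [1 if w != 0 else 0 for w in weights_per_output]

def distance_encoding(weights_per_output):
    positions = [i for i, w in enumerate(weights_per_output) if w != 0]
    if not positions:
        return []
    # first entry: distance from the first input neuron to the first connection;
    # following entries: distance from each connection to the previous one
    return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]

weights = [0.0, 0.7, 0.0, 0.0, 0.3, 0.9]
print(connection_string(weights))   # [0, 1, 0, 0, 1, 1]
print(distance_encoding(weights))   # [1, 3, 1]
```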
  • the neural network processing core may also include direct data access unit DMA (direct memory access).
  • the DMA is used to read or write data or instructions in the storage unit, the instruction cache, the non-zero weight cache, the non-zero weight location cache, the input neuron cache, and the output neuron cache.
  • the present disclosure further provides a combined processing device.
  • the combined processing device includes the processor, which interacts with other processing devices through a universal interconnect interface to jointly complete a calculation operation specified by the user.
  • the other processing device includes one or more types of general-purpose/dedicated processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processor, or the like.
  • the number of processors included in other processing devices is not limited.
  • the other processing device serves as an interface between the neural network computing device and external data and control, including data handling, and completes basic control such as starting and stopping of the neural network computing device; the other processing device may also cooperate with the neural network computing device to complete the computing task.
  • a universal interconnect interface for transmitting data and control commands between the neural network computing device and other processing devices.
  • the neural network computing device acquires the required input data from the other processing device and writes it into the on-chip storage device of the neural network computing device; the control command can be obtained from the other processing device and written into an on-chip control cache of the neural network computing device;
  • the data in the storage module of the neural network computing device can be read and transmitted to other processing devices.
  • the combined processing device can be used as an SoC (system on chip) for mobile phones, robots, drones, video monitoring devices, and the like, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption.
  • the universal interconnect interface of the combined processing device is coupled to certain components of the device, such as a camera, a monitor, a mouse, a keyboard, a network card, or a WiFi interface.
  • the present disclosure further provides a processing method, as shown in FIG. 33, the processing method includes:
  • the task segmentation device performs task segmentation according to the task segmentation granularity
  • the hardware resource dividing device divides the hardware resources of the processor according to the task segmentation result.
  • in the step of dividing, by the hardware resource dividing device, the hardware resources of the processor according to the task segmentation result:
  • the input data and control signal sequences are stored to an external storage module for use;
  • the control module parses the control signal
  • the distribution configuration module parses the distribution configuration signal; for example, after the corresponding configuration information is determined from the task segmentation result during execution, the control signal parsed by the control module includes the instruction and the configuration information (the configuration information may also be carried in the instruction).
  • the method is as follows: if the control module determines that the signal is configuration information, the configuration information is sent to the distribution configuration module, and the distribution configuration module further sends the configuration information to the calculation module; the processor schedules each module according to the different signal meanings to complete the corresponding operation; for example, when performing a multi-batch operation, the processor schedules the distribution configuration module to distribute configuration information, schedules the calculation module to group and perform calculations, and schedules the storage module to transmit or receive data, and the like.
  • the configuration information may be directly sent by the external storage module to the distribution configuration module under the control of the control module, in addition to being sent by the external storage module to the distribution configuration module via the control module;
  • the corresponding calculation results are output from the calculation module to the internal storage module and then transferred to the external storage module for subsequent or other use.
  • each of the forward paths in the batch can be executed in parallel when performing batch calculation of the neural network, including the training process and the test process, wherein each forward path calculation performed in parallel is independent (in particular, the weights may or may not be shared).
  • the device divides the computing unit into N independent computing groups according to the configuration to independently calculate different forward paths in the batch.
  • the device can calculate the optimal configuration and complete the configuration offline, wherein the optimal configuration may be the number of calculation groups, for example, for a specific calculation scenario, into how many calculation groups the multiple computing units in the calculation module should be divided to achieve the optimal calculation effect; the configuration may also be dynamically adjusted during execution to achieve an optimal process.
  • for example, when the convolutional layer is executed, the independent calculation groups respectively calculate different output images, and when the fully connected layer is calculated, they are configured into one calculation group, that is, all calculation units are used to calculate the same layer.
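  • The following Python sketch, under assumed naming, illustrates this kind of per-layer grouping of computing units: several independent groups for a convolutional layer and a single group for a fully connected layer.

```python
# Simplified sketch of grouping computing units by layer type (illustrative only).
def configure_groups(num_units, layer_type, batch_size):
    if layer_type == "conv":
        # one group per forward path in the batch, bounded by available units
        num_groups = min(batch_size, num_units)
    else:  # e.g. fully connected: all units compute the same layer
        num_groups = 1
    units = list(range(num_units))
    # split the unit indices into num_groups roughly equal groups
    return [units[i::num_groups] for i in range(num_groups)]

print(configure_groups(8, "conv", batch_size=4))  # 4 groups of 2 units
print(configure_groups(8, "fc", batch_size=4))    # 1 group of 8 units
```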
  • the device can be divided into multiple groups to complete the gradient calculations corresponding to different input samples in the batch, and the device is configured online.
  • the update calculation of the weights can then be performed quickly (in particular, the device can also be configured online to complete the gradient calculation of the corresponding input samples in the batch).
  • the inputs and weights required for different services may be different or the same.
  • the device needs to be configured in different independent groups to run requests corresponding to different services.
  • the computing load corresponding to different services may be completely different, and the required computing resource requirements are different.
  • the device dynamically adjusts the grouping of the computing units during operation to meet the quality-of-service requirements of multiple services.
  • the PEs are organized in a one-dimensional array, and multiple PEs can be configured into different groups, and different groups can be used to calculate different inputs.
  • the convolutional layer forward calculation in the convolutional neural network is taken as an example to describe in detail how the processor of the embodiment and the corresponding PE configuration calculate the batch of the convolutional neural network.
  • the processor loads a new batch of inputs and assigns them to different groups to continue the calculation.
  • the PEs are organized in a two-dimensional array, and a plurality of adjacent PEs may be configured in different groups, and different groups may be used to calculate different inputs.
  • the computing unit performs an operation including a neural network calculation.
  • the calculation module includes: a multiplier for multiplying data input thereto to obtain an output after multiplication; and/or one or more adders for adding data input thereto to obtain output data.
  • the plurality of adders may constitute an addition tree for performing an addition tree operation, that is, the data input thereto is added step by step to obtain output data.
  • the calculation module includes, but is not limited to: a first part being a multiplier, a second part being an addition tree, a third part being an activation function unit, and/or a fourth part being a pooling unit.
  • the first part multiplies the input data 1 (in1) and the input data 2 (in2) to obtain the multiplied output (out).
  • the second part adds the input data (in1) step by step through the addition tree to obtain the output data (out), where in1 is a vector of length N, N greater than 1, the process being: out = in1[1] + in1[2] + ... + in1[N]; and/or
  • the input data (in1) is accumulated through the addition tree and then added to the input data (in2) to obtain the output data (out).
  • the operations of the above several parts can freely select one or more parts to be combined in different orders, thereby realizing the operation of various functions.
  • the arithmetic components of the above several parts can freely select one or more parts to be combined in different orders, thereby realizing operations of various functions.
  • the processing method is used in a neural network, and in the step of performing task segmentation on the divided hardware resources according to the task segmentation granularity, the task segmentation device selects at least one of the following five granularity task segmentation manners to perform task segmentation.
  • the first granularity task segmentation manner takes the whole task as one subtask; specifically, the calculation of all M samples is completed as a single subtask.
  • This task splitting method only generates one subtask, and there is no dependency between subtasks.
  • the second granularity task segmentation method will complete several sample calculations as a subtask.
  • the third granularity task segmentation manner can perform task segmentation on the neural network application according to the layer types of the neural network, with the calculation of layers of the same type taken as one subtask.
  • the layer types of the neural network include, but are not limited to, a convolutional layer, a fully connected layer, an LSTM layer, a pooling layer, an active layer, an LRN layer, and a BN layer. There are complex dependencies between subtasks in this task segmentation mode.
  • the fourth granularity task segmentation method can perform task segmentation on the neural network application according to the inter-layer structure of the neural network, and the calculation of adjacent layers is used as a sub-task.
  • the neural network application is divided into n subtasks.
  • the first subtask completes the calculation of the 1st to the N1-th layers of the neural network, N1 layers in total; the second subtask completes the calculation of the (N1+1)-th to the (N1+N2)-th layers, N2 layers in total; and so on, the i-th subtask completes the calculation of the (N1+...+N(i-1)+1)-th to the (N1+...+Ni)-th layers, Ni layers in total.
  • the i-th subtask is the predecessor task of the (i+1)-th subtask, and the (i+1)-th subtask is the successor task of the i-th subtask; the (i+1)-th subtask must wait for the i-th subtask to complete before it can begin execution.
  • the fifth granularity task segmentation manner performs task segmentation according to the intra-layer structure of the neural network, so that the calculation within a neural network layer can be further divided into subtasks.
  • the segmentation according to the calculations within the neural network layer includes, but is not limited to, task segmentation of the convolutional layer calculation, fully connected layer calculation, pooling layer calculation, or activation layer calculation of the neural network.
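  • To make the five granularities concrete, the Python sketch below splits a network, described as a list of layers with a "type" field, into subtasks; the helper names are assumptions for illustration, and the fifth (intra-layer) granularity is only noted in a comment.

```python
# Rough sketch of the task segmentation granularities (illustrative naming).
def split_whole(layers):
    # granularity 1: the whole task (all M samples) is one subtask
    return [layers]

def split_by_samples(num_samples, samples_per_task):
    # granularity 2: several sample calculations form one subtask
    return [list(range(i, min(i + samples_per_task, num_samples)))
            for i in range(0, num_samples, samples_per_task)]

def split_by_layer_type(layers):
    # granularity 3: layers of the same type form one subtask
    groups = {}
    for layer in layers:
        groups.setdefault(layer["type"], []).append(layer)
    return list(groups.values())

def split_by_adjacent_layers(layers, sizes):
    # granularity 4: each subtask covers a block of adjacent layers
    # (sizes = [N1, N2, ...]); subtask i+1 depends on subtask i
    tasks, start = [], 0
    for n in sizes:
        tasks.append(layers[start:start + n])
        start += n
    return tasks

# granularity 5 (intra-layer) would further tile the computation inside one
# layer, e.g. splitting a convolution by output feature maps; omitted here.
```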
  • the processing method further includes: allocating and scheduling the task after the task is divided.
  • the task scheduling method includes:
  • information of each task to be scheduled is obtained, including the execution time, the task dependency graph, and the distribution of the task resources among the cores; the neural network task is, for example, a subtask divided in the previous embodiment;
  • the working status of each core is obtained, for example, the utilization rate, the workload, the working frequency, the number of tasks in the core's private task queue, and the task completion time.
  • the task to be scheduled is selected from the unscheduled tasks, and the mapping relationship between the task to be scheduled and the target core is determined according to the information of the task to be scheduled and the working state of each core, and the task to be scheduled is allocated to the target core.
  • the task scheduling may schedule unscheduled tasks in the task queue every time T, and T is a real number greater than zero. If the unscheduled task t has a dependency relationship with other tasks and the predecessor task is not completed, the task t is not scheduled.
  • Scheduling the task to be scheduled to the target core may adopt at least one of the following scheduling modes: the first scheduling mode: counting the number of tasks in each core private task queue, and selecting the core with the least task in the private task queue as the target core. Assigning the task to be scheduled to the target core;
  • the second scheduling mode is: counting the time for each core to complete all tasks in the private task queue, selecting the core with the shortest task time as the target core, and assigning the task to be scheduled to the target core;
  • the third scheduling mode: counting the distribution, among all the cores, of the resources required by the task to be scheduled, selecting the core with the largest amount of such resources as the target core, and assigning the task to be scheduled to the target core;
  • the fourth scheduling mode: a heuristic algorithm is used to assign the task to be scheduled to the target core, and the heuristic algorithm includes but is not limited to a genetic algorithm, an ant colony algorithm, and a simulated annealing algorithm.
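  • The sketch below illustrates the first three scheduling modes in Python; the core state fields (queue_len, finish_time, resources) are assumed for illustration, and the heuristic fourth mode is not sketched.

```python
# Minimal sketch of picking a target core (illustrative field names).
def pick_target_core(cores, task, mode):
    if mode == 1:   # fewest tasks in the private task queue
        return min(cores, key=lambda c: c["queue_len"])
    if mode == 2:   # shortest time to finish all queued tasks
        return min(cores, key=lambda c: c["finish_time"])
    if mode == 3:   # most of the resource the task needs
        return max(cores, key=lambda c: c["resources"].get(task["needs"], 0))
    raise ValueError("mode 4 (genetic / ant colony / simulated annealing) not sketched")

cores = [
    {"id": 0, "queue_len": 3, "finish_time": 12.0, "resources": {"conv": 2}},
    {"id": 1, "queue_len": 1, "finish_time": 20.0, "resources": {"conv": 4}},
]
print(pick_target_core(cores, {"needs": "conv"}, mode=1)["id"])  # -> 1
```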
  • the distribution configuration module may also receive an external signal input directly, using direct control or parsed control.
  • the PE organization may be a three-dimensional organization, or even a multi-dimensional organization.
  • the grouping of PEs can also be organized by columns, and different grouping modes can also be switched during operation.
  • multiple grouped PEs may also perform different arithmetic operations corresponding to the same input.
  • the computing unit can be any computing element, from a simple computing element to a computing element that performs complex functions.
  • the processor and the processing method of the present disclosure can perform image processing, video processing calculation, and the like in addition to neural network calculation; the neural network is not limited to a convolutional neural network, and may be a fully connected neural network, an RBM neural network, or a recurrent neural network (RNN); and it is not limited to a convolutional layer, and may be a fully connected layer, a pooling layer, or the like.
  • a chip is also provided that includes the above-described neural network computing device or combined processing device.
  • a chip package structure is also provided that includes the above described chip.
  • a board is also provided that includes the chip package structure described above.
  • an electronic device is also provided that includes the above-described board.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearables, vehicles, household appliances, and/or medical equipment.
  • the vehicle includes an airplane, a ship, and/or a vehicle;
  • the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood;
  • the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound scanner, and/or an electrocardiograph.
  • an embodiment of the present disclosure provides an information processing apparatus, including: a storage module, configured to acquire information data, where the information data includes at least one key feature, and the storage module pre-stores the key feature corresponding to The actual confidence level; the operation circuit determines, according to the information data, a prediction confidence corresponding to the key feature, and determines whether the prediction confidence of the key feature exceeds a true confidence preset threshold range corresponding to the key feature; And a control circuit that controls the storage module to modify a key feature or issue a modification signal to the outside when the prediction confidence exceeds a true confidence preset threshold range.
  • the information data can thus be corrected and scored automatically instead of manually, which is more accurate and faster than manual scoring.
  • the types of information data have been described above by type, and the function classification thereof will be described below, and may specifically refer to a student's homework or test paper, or an action or expression data of a sports item, or a mode or step of operation of a puzzle item.
  • the assignment or test paper may be electronic text, or handwritten text and/or graphics including a handwritten combination of one or more language words and/or symbols, a handwritten two-dimensional map, or a handwritten two-dimensional perspective view.
  • the handwritten combination of one or more language words and/or symbols is a handwritten answer to a test of a language, mathematics, physics, or the like.
  • the handwritten two-dimensional map and/or two-dimensional perspective view is a handwritten answer to a test paper such as an art, a drawing, or the like.
  • the action or expression may be a recorded picture and/or video;
  • the operation mode or step of the puzzle item may be electronic data, a picture or a video embodying the operation mode or step.
  • a storage module can be used to store data and instructions, wherein the data can include input neurons (e.g., pre-processed data), output neurons (e.g., the prediction confidence corresponding to the key feature), weights, loss functions, gradients and scores in the neural network operations and output, and error mode judgment results.
  • the operation circuit may be configured to perform a corresponding operation on the data according to an instruction stored in the storage module; the operation circuit may perform a three-step operation: the first step is to multiply the input neurons and the weight data;
  • the second step is to perform an addition tree operation, which adds the results of the first step step by step through the addition tree to obtain a weighted sum, and may or may not apply a bias to the weighted sum;
  • the third step is to perform an activation function operation on the result obtained in the second step to obtain the output neurons.
  • the value of the output neuron is the predicted confidence of the key feature.
  • the activation function may be a sigmoid function, a tanh function, a ReLU function, or a softmax function.
  • the prediction confidence may be any number; for example, the greater the value of the confidence, the higher the credibility of the key feature. The confidence can also be normalized to a number within a certain range; for example, a confidence in [0, 1] indicates the probability that the key feature is included.
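  • As a hedged illustration of these two paragraphs, the Python sketch below produces a prediction confidence in [0, 1] with a logistic (sigmoid) output and checks it against a preset threshold; the names and the simple one-layer model are assumptions, not the chip's actual network.

```python
import math

def predict_confidence(features, weights, bias=0.0):
    # weighted sum followed by a sigmoid gives a confidence in [0, 1]
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def needs_modification(pred_conf, true_conf, threshold=0.2):
    # modify (or issue a modification signal) when the prediction confidence
    # deviates from the true confidence by more than the preset threshold
    return abs(pred_conf - true_conf) > threshold

conf = predict_confidence([0.4, 1.2, -0.3], [0.8, 0.5, 0.1])
print(conf, needs_modification(conf, true_conf=1.0))
```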
  • the memory module may include a direct memory access DMA, the direct memory access DMA being electrically connected to the operation circuit, configured to store the prediction confidence determined by the operation of the operation circuit, and to feed the true confidence and the prediction confidence into the operation circuit for comparison.
  • the storage module further includes a storage unit for acquiring information data from outside the information processing device and transmitting it to the direct memory access DMA for the operation circuit to call.
  • the storage module is further configured to store neural network specific instructions
  • the information processing apparatus further includes: an instruction cache for buffering the dedicated instruction from the storage module for the control circuit to call.
  • the storage module is further configured to store input neurons, output neurons, and weights in the neural network
  • the information processing apparatus further includes: an input neuron cache, configured to cache the input neurons from the storage module for the operation circuit to call; a weight buffer for caching the weights from the storage module for the operation circuit to call; and an output neuron cache for storing the output neurons obtained from the operation of the operation circuit.
  • the arithmetic circuit is further configured to score the information data according to the determination result of each key feature.
  • the scoring process may be a weighted sum of the output neurons corresponding to each key feature.
  • the information processing apparatus further includes a pre-processing module for pre-processing the external raw information data and transmitting the original information data to the storage module.
  • the input data can be more suitable for artificial neural network processing, removing noise and redundancy in the input data, improving classification, recognition accuracy, and the like, and reducing space occupation in subsequent storage modules.
  • the pre-processing comprises segmenting, Gaussian filtering, binarizing, regularizing and/or normalizing the original information data to obtain data in the neural network input data format; preferably, the neural network input data format includes, but is not limited to: image size, color mode, average brightness, and/or data size.
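  • A simplified preprocessing sketch follows, covering binarization, resizing to a fixed input size, and brightness normalization; the thresholds and sizes are assumptions, and Gaussian filtering and segmentation are omitted for brevity.

```python
import numpy as np

def preprocess(image, input_size=(32, 32), threshold=128):
    img = np.asarray(image, dtype=np.float32)
    # binarize: pixels above the threshold become 1, the rest 0
    binary = (img > threshold).astype(np.float32)
    # crude nearest-neighbour resize to the network's input data format
    rows = np.linspace(0, binary.shape[0] - 1, input_size[0]).astype(int)
    cols = np.linspace(0, binary.shape[1] - 1, input_size[1]).astype(int)
    resized = binary[np.ix_(rows, cols)]
    # normalize average brightness and scale
    return (resized - resized.mean()) / (resized.std() + 1e-6)

print(preprocess(np.random.randint(0, 256, (100, 80))).shape)  # (32, 32)
```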
  • the arithmetic circuit is further configured to perform adaptive training on the neural network.
  • the parameters in the network (such as weights, offsets, etc.) can be adaptively updated by comparing the calculated prediction confidence with the known true confidence, thereby improving the recognition and prediction accuracy of the device.
  • the adaptive training process described above is processed offline.
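  • A toy version of this adaptive training loop is sketched below: the predicted confidence is compared with the known true confidence and the parameters are nudged along the negative gradient of a squared-error loss. The single logistic unit is an assumption used only to show the idea, not the chip's actual training scheme.

```python
import math

def train_step(features, weights, bias, true_conf, lr=0.1):
    z = sum(f * w for f, w in zip(features, weights)) + bias
    pred = 1.0 / (1.0 + math.exp(-z))        # prediction confidence
    error = pred - true_conf                 # compare with the true confidence
    grad_z = error * pred * (1.0 - pred)     # gradient of 0.5 * error**2 w.r.t. z
    new_weights = [w - lr * grad_z * f for w, f in zip(weights, features)]
    return new_weights, bias - lr * grad_z, 0.5 * error ** 2

w, b = [0.1, -0.2, 0.05], 0.0
for _ in range(100):
    w, b, loss = train_step([0.4, 1.2, -0.3], w, b, true_conf=1.0)
print(round(loss, 4))
```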
  • the information processing apparatus of the present disclosure may be an integrated chip that integrates the units, modules, and circuits it contains, preferably an artificial neural network chip that can implement neural network operations.
  • an information processing apparatus including: an information acquiring device for acquiring external information data; and the information processing apparatus described in the above embodiment, for processing the information data to obtain the prediction confidence of the key feature, and, when the prediction confidence exceeds the true confidence preset threshold, modifying the key feature or issuing a modification signal.
  • in another embodiment, the apparatus further includes an interaction interface that receives the modified key feature or the modification signal and shows the modification content to the user.
  • the information acquisition device may be a device having only an acquisition function, such as a webcam, a camera, or a scanner; it may also be a terminal device (such as a mobile phone, a computer, or a wearable device) in which the information acquisition device and the interactive interface are integrated.
  • the interactive interface may include a display screen, a touch display screen, and/or a data output interface.
  • the interactive interface may receive the data (for example, including the modified key features), or receive the original information data of the information acquiring device and the modification signal, and modify the original information data (e.g., pictures) under the control of the controller (including but not limited to graffiti, adding modification tags, adding videos, adding partial images, adding text, and adding voices), and display the result by visual means.
  • the interaction device may further include pre-processing means for pre-processing the information data acquired by the information acquisition device and feeding the information to the information processing device.
  • the information processing device further includes a controller for controlling the information acquisition device, the information processing device, and/or the interactive interface.
  • the controller may control the information acquiring device to obtain original information data from the outside, control the information processing device to receive the information data and perform the processing, judgment, and rewriting (or issuing of a rewrite signal) operations, and control the interactive interface to display the rewritten content.
  • the interactive interface is also operative to modify the set threshold in response to a user's operation or command; for example, the user may adjust the preset threshold corresponding to the confidence of a specific key feature (for example, a specific piece of text, a certain voice, or a certain video) by operating the device through a touch screen, a mouse, a voice command, or a keyboard.
  • an information processing method including:
  • S301 Acquire information data by using a storage module, where the information data includes at least one key feature, and the storage module prestores a true confidence level corresponding to the key feature;
  • S302: the operation circuit determines, according to the information data, the prediction confidence corresponding to the key feature, and determines whether the prediction confidence of the key feature exceeds the true confidence preset threshold range corresponding to the key feature;
  • the processing method may correspond to the execution step of the processing device, and the specific implementation manner may refer to the description of the foregoing steps, and details are not described herein.
  • the third embodiment corresponds to a processing device that uses information data as a picture
  • the fourth embodiment corresponds to a processing device in which the information data is audio and/or video
  • the fifth embodiment corresponds to an information processing device.
  • Embodiment 3:
  • the storage unit of the information processing apparatus in this embodiment receives information data, which may include, but is not limited to, a set of pictures containing one or more key features; the apparatus calculates, from the information data, the confidence of each key feature included and gives a judgment result; the apparatus then scores the information data in the storage unit according to the judgment result.
  • the information data may be original information data, or may be a result obtained by preprocessing the original data.
  • the information processing apparatus herein can perform adaptive training, for example, the apparatus inputs a set of pictures including one or more key features, such as pictures including handwritten characters, pictures constituting a video, and the like. Each key feature corresponds to a confidence level with a confidence level of a natural number.
  • the confidence that each key feature is included is known, that is, the true confidence; the device uses these pictures as information data to calculate the confidence that each key feature is included, i.e., the prediction confidence.
  • the calculated prediction confidence is compared with the known true confidence, and the parameters (such as weights, offsets, etc.) in the network are adaptively updated, thereby improving the recognition and prediction accuracy of the device.
  • the confidence level can be any natural number - for example, the greater the value of the confidence, the higher the credibility of including the key feature. Confidence can also be normalized to a natural number within a certain range - for example, the confidence is between [0, 1], and the confidence indicates the confidence probability of including the key feature.
  • the true confidence value of the training set may take one of two values, for example {0, 1}, where 0 means that the input picture does not contain the key feature and 1 means that the feature is included; of course, this can be reversed: 1 means not included, 0 means included.
  • the information processing device herein may be an artificial neural network chip, including: a storage unit for storing data and instructions, wherein the data includes input neurons, output neurons, weights, scores, error mode judgment results, etc.; and an operation circuit for performing a corresponding operation on the data according to an instruction stored in the storage unit; the operation circuit mainly performs a three-step operation: the first step is to multiply the input neurons and the weight data; the second step is to perform an addition tree operation, adding the results of the first step step by step through the addition tree to obtain a weighted sum, with or without a bias applied as needed; and the third step is to perform an activation function operation on the result obtained in the second step to obtain the output neurons.
  • the information processing apparatus may further include DMA (Direct Memory Access) for performing data or instruction reading and writing in the storage unit, the instruction cache, the weight buffer, the input neuron buffer, and the output neuron cache;
  • control circuit is configured to read a dedicated instruction from the instruction cache, and decode the same into an arithmetic circuit instruction and input to the arithmetic circuit;
  • the instruction cache is used to store the dedicated instruction;
  • the weight buffer is used to cache weight data; the input neuron buffer is used to buffer the input neurons input to the mapping unit; the output neuron buffer is used to buffer the output neurons output by the operation circuit (corresponding to the confidence of each key feature);
  • the chip further includes a pre-processing module.
  • the module preprocesses the original information data, that is, one or more pictures containing handwritten characters or graphics, to obtain image data that matches the scale of the input layer at the lowest level of the artificial neural network used by the chip.
  • the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, normalization, and the like.
  • the method for obtaining the judgment result by the artificial neural network chip comprises: each output neuron of the final output layer of the neural network corresponds to a keyword, and the value of the output neuron is a confidence level of occurrence of the keyword.
  • the modification method includes splitting the standard answer into a collection of standard key features, which may be characters, words, or phrases (for text data input), or a part of the image (for image data input), pre-stored in the storage unit of the chip.
  • the individual output neurons of the final output layer of the neural network give confidence in the respective critical feature parts and the corresponding standard correct mode. (If an error mode occurs or its confidence level is greater than the preset threshold, the error mode is modified to the corresponding key feature in the standard answer.)
  • the result of the output neurons is stored in the DMA and is passed again to the operation circuit for comparison with the modification confidence threshold; if the confidence of a key feature is lower than the preset threshold, the key feature is modified according to the standard correct mode of that key feature.
  • Step 1: the information data is transmitted to the storage unit through the pre-processing module, or is transmitted directly to the storage unit;
  • Step 2 the DMA transfers it to the corresponding on-chip cache (ie, instruction cache, input neuron cache, weight buffer) in batches;
  • Step 3 the control circuit reads the instruction from the instruction cache, decodes it and sends it to the operation circuit;
  • Step 4 According to the instruction, the operation circuit performs a corresponding operation.
  • the operation is mainly divided into three steps: step 4.1, multiplying the corresponding input neurons and weights; step 4.2, performing the addition tree operation, that is, the results of step 4.1 are added step by step through the addition tree to obtain a weighted sum, and the weighted sum is biased or left unprocessed as needed; step 4.3, performing the activation function operation on the result obtained in step 4.2 to obtain the output neurons and passing them to the output neuron cache.
  • Step 5 Repeat steps 2 through 4 until all data is calculated, and the final result of the functional requirements is obtained.
  • the final result is obtained by the output neuron of the last layer of the neural network, outputted from the arithmetic circuit to the output neuron cache, and then temporarily stored in the DMA for the next operation.
  • Step 6: the scoring result stored in the neural network output neurons in the DMA, that is, the confidence of each key feature, is directly input into the operation circuit through the data path between the DMA and the operation circuit and compared with the preset threshold; if the confidence of a key feature is less than the preset threshold, the corresponding input key feature in the DMA is replaced with the standard correct mode for that key feature.
  • the information data is modified in the DMA.
  • Step 7: the modified information data in the DMA is stored in the storage unit and output as the final modified output data.
  • the value of the output neurons of the last layer of the neural network is the confidence of the keyword occurrence; if modification is required, step 7 is further performed, and the modified data in the storage unit after step 7 is the final modified data.
  • the structure can realize the function of scoring and/or modification: the scoring result is output after executing steps 1-5; the modified output is the final storage unit output after executing steps 1-7.
  • Embodiment 4:
  • the storage unit in the artificial neural network chip (corresponding to the information processing device) provided in this embodiment is used for pre-storing one or more key frame pictures (corresponding to key features); the storage unit acquires a video from the outside and passes it to the operation circuit, wherein the video includes a plurality of input pictures; the operation circuit calculates the similarity between each input picture and each key frame picture (in detail, if there are N input pictures and M key pictures, N*M similarities are obtained) and/or standardizes the video.
  • the video further includes audio, the audio is divided into multiple pieces of audio, and the multi-segment audio corresponds to the plurality of pictures.
  • the chip can compare the similarity of all pictures in the video with each key frame picture, and/or compare the similarity of each waveform and key waveform obtained by all audio decomposition in the video, and normalize the video.
  • the video is an action video of one or more testers.
  • the action video is dancing, martial arts, or between-class exercises, sports movements and/or postures, writing actions and/or gestures, typing actions and/or gestures, or reading actions and/or gestures.
  • the method for obtaining the similarity may be: each output neuron of the final output layer of the neural network corresponds to a similarity, and the value of the output neuron is the similarity value (consistent with the previous example, the layer has a total of N*M output neurons).
  • the method for obtaining the similarity may also be: each output neuron of the final output layer of the neural network corresponds to an input picture, and the value of the output neuron is the similarity between that input picture and the key frame picture most similar to it (consistent with the previous example, the layer has a total of N output neurons).
  • the method for obtaining the similarity may also be: each output neuron of the final output layer of the neural network corresponds to a key frame picture, and the value of the output neuron is the similarity between that key frame picture and the input picture most similar to it (consistent with the previous example, the layer has a total of M output neurons).
  • the method of scoring may be: adding a layer above the final output layer of the neural network as a new final output layer, with the output neurons of the previous final output layer serving as the input neurons of this layer and the new output neuron giving the score; the weights in this layer correspond to the importance, that is, the weight, of each similarity.
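  • One plausible reading of this similarity-and-scoring scheme is sketched below in Python: an N x M similarity matrix, reduced per input picture and weighted by importance. Cosine similarity and the importance vector are stand-ins used only for illustration.

```python
import numpy as np

def similarities(inputs, key_frames):
    # inputs: N flattened pictures, key_frames: M flattened pictures -> N x M matrix
    a = inputs / (np.linalg.norm(inputs, axis=1, keepdims=True) + 1e-9)
    b = key_frames / (np.linalg.norm(key_frames, axis=1, keepdims=True) + 1e-9)
    return a @ b.T

def score(sim_matrix, importance):
    # keep, for each input picture, the most similar key frame, then weight it
    per_input = sim_matrix.max(axis=1)
    return float(per_input @ importance[:len(per_input)])

inputs = np.random.rand(4, 64)    # N = 4 input pictures
keys = np.random.rand(3, 64)      # M = 3 key frame pictures
print(score(similarities(inputs, keys), np.array([0.4, 0.3, 0.2, 0.1])))
```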
  • the modification method may be: directly inputting the similarity calculation result obtained from the DMA into the operation circuit and comparing it with the preset threshold; if the similarity is lower than the preset threshold, it is determined that the key feature (here, a video key frame picture) does not meet the standardization criteria and needs to be modified; therefore, the corresponding input picture is replaced with the corresponding standard key frame picture, written back to the DMA, and finally output to the storage unit.
  • the above obtained similarity and scoring process are all completed in the artificial neural network chip, and may include the following steps:
  • Step 1: the information data is transmitted to the storage unit through the pre-processing module, or is transmitted directly to the storage unit;
  • Step 2 the DMA transfers it to the corresponding on-chip cache (ie, instruction cache, input neuron cache, weight buffer) in batches;
  • Step 3 the control circuit reads the instruction from the instruction cache, decodes it and sends it to the operation circuit;
  • Step 4: according to the instruction, the operation circuit performs the corresponding operation: in each layer of the neural network, the operation is mainly divided into three steps: step 4.1, multiplying the corresponding input neurons and weights; step 4.2, performing the addition tree operation, that is, the results of step 4.1 are added step by step through the addition tree to obtain a weighted sum, and the weighted sum is biased or left unprocessed as needed; step 4.3, performing the activation function operation on the result obtained in step 4.2 to obtain the output neurons and passing them to the output neuron cache.
  • Step 5 Repeat steps 2 through 4 until all data is calculated, and the final result of the functional requirements is obtained.
  • the final result is obtained by the output neuron of the last layer of the neural network, outputted from the arithmetic circuit to the output neuron cache, and then written to the DMA to prepare for the next operation.
  • Step 6: the similarity results stored in the neural network output neurons in the DMA, that is, the score of each key feature (key frame), are directly input into the operation circuit through the data path between the DMA and the operation circuit and compared with the preset threshold; if the confidence of a key feature is less than the preset threshold, the corresponding input key feature in the DMA is replaced with the corresponding standard key frame. When all the key features have been compared and replaced according to the above steps, the information data in the DMA has been modified.
  • Step 7: the modified information data in the DMA is stored in the storage unit and output as the final modified output data.
  • the value of the output neurons of the last layer of the neural network is the similarity (score) between each key frame and the standard key frame; if modification is required, the process proceeds to step 7, and the modified data in the storage unit afterwards is the final modified data.
  • Embodiment 5:
  • the device comprises an information acquisition device, an information processing device (for example, an artificial neural network chip, with the same structure as in Embodiment 3), an interaction interface, and a control circuit.
  • the information acquisition device (this device may be an extension of the pre-processing device, equivalent to the interface + pre-processing device) is used for receiving external information, including text, images, audio, video, and the like.
  • the raw data or the preprocessed data is transmitted as information data to the artificial neural network chip.
  • the interactive interface is used to interact with the user, that is, to receive the user's operation or command, and transmit it to the control circuit.
  • the interactive interface is also used to receive the output data of the artificial neural network chip and convert it into a suitable form of feedback information for display to the user.
  • the control circuit receives the user's operation or command and controls the operation of the entire device.
  • the interactive interface allows the user to freely modify the above preset thresholds to achieve different degrees of modification effects, and is therefore more user-friendly.
  • the interactive interface can also give the user feedback information, such as the alarm of sitting posture error and the modification correction of the pen holding mode.
  • the information acquisition device is an image acquisition device and a sound acquisition device.
  • the image acquisition device is a camera.
  • the sound acquisition device is a microphone.
  • the terminal is a character device, a mobile phone, a computer, a notebook, and a tablet.
  • the disclosed related devices, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the part or module is only a logical function division.
  • there may be other division manners; for example, multiple parts or modules may be combined or integrated into one system, or some features may be ignored or not executed.
  • Each functional part/unit/subunit/module/submodule/component in the present disclosure may be hardware, such as the hardware may be a circuit, including digital circuits, analog circuits, and the like.
  • Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like.
  • the computing modules in the computing device can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, and the like.
  • the "memory" described in the present disclosure may be integrated inside the processing means for performing the generation of the anti-network, or may be a separate device as an external memory for data transmission with a processing means for executing the generation of the anti-network.
  • a processing apparatus for executing a generative adversarial network, as shown in FIG. 42, includes:
  • the memory 110 is configured to receive input data, where the input data includes random noise and reference data, and store discriminator neural network parameters and generator neural network parameters;
  • the operator 120 is configured to input the random noise input data into the generator neural network to obtain a noise generation result, and is also used to input the noise generation result and the reference data into the discriminator neural network to perform the operation, thereby obtaining the discrimination result; And updating the discriminator neural network parameter and the generator neural network parameter according to the discriminating result.
  • the processing device of the embodiment of the present disclosure provides a reasonable operator and a matched memory hardware structure for the specific implementation of the generative adversarial network, thereby improving the calculation efficiency.
  • the memory 110 of the processing device for executing the generative adversarial network receives input data including random noise and reference data (including but not limited to real pictures, speech, or text).
  • the reference data includes, but is not limited to, a set of pictures containing one or more key features, a set of audio containing one or more key sample points, or a set containing one or more words or phrases with part-of-speech tags; the operator 120 is trained based on the input data to obtain a set of generation function parameters, and obtains the noise generation result (such as a creative image) according to the generation function parameters and the reference data (for example, a reference image).
  • the input data may be the original input data, or may be the result obtained by pre-processing the original data.
  • the memory is further configured to store a calculation instruction
  • the processing device further includes a controller 130, configured to extract the calculation instruction, parse it into operation instructions, and send the operation instructions to the operator.
  • the controller 130 is configured to extract a calculation instruction from the memory, analyze the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and input data to the operator.
  • the memory 110 includes: a discriminator parameter storage unit 112 for storing discriminator neural network parameters; a generator parameter storage unit 113 for storing generator neural network parameters; a discriminator instruction storage unit 114 for storing calculation instructions for performing discriminator neural network operations; a generator instruction storage unit 115 for storing calculation instructions for performing generator neural network operations; and a data storage unit 111 for storing data, where the data includes random noise, noise generation results (i.e., negative samples, such as pictures generated from random noise), and reference data (real pictures, speech, or text obtained from outside, etc.).
  • the structure here is mainly adapted to the structural characteristics of the generator and the discriminator of a GAN (generative adversarial network), so that the weight storage of the generator and that of the discriminator can be physically distinguished, the storage resources can be utilized more efficiently, and the I/O instructions can be modified to fit this storage structure so as to distinguish discriminator I/O instructions from generator I/O instructions.
  • the data storage unit 111 is configured to acquire and store data, and further includes acquiring and storing a network model (including a discriminator neural network and a generator neural network) and calculating instructions.
  • an input/output unit 150 is further included for acquiring external data and outputting the internal calculation result to an external device or other components.
  • a DMA 140 is further included, by which the generator neural network parameters are forwarded from the memory to the operator 120, and the random noise and the reference data are forwarded from the data storage unit 111 to the operator 120.
  • the memory may further include a storage medium, and the storage medium may be an off-chip memory.
  • the controller 130 includes an instruction cache unit 110, an instruction processing unit 111, and a storage queue unit 113.
  • the instruction cache unit 110 is configured to store the calculation instruction associated with the network model;
  • the instruction processing unit 111 is configured to parse the calculation instruction to obtain a plurality of operation instructions;
  • the storage queue unit 113 is configured to store the instruction queue, where
  • the instruction queue includes a plurality of operation instructions or calculation instructions to be executed in the order of the queue.
  • the calculation instructions may include: one or more operational domains and an opcode.
  • the calculation instructions can include neural network operational instructions. Taking the neural network operation instruction as an example, as shown in Table 1, the register number 0, the register number 1, the register number 2, the register number 3, and the register number 4 may be an operation domain. Wherein, each register number 0, register number 1, register number 2, register number 3, and register number 4 may be the number of one or more registers.
  • the CONFIG instruction configures various constants required for the current layer calculation before each layer of artificial neural network calculation begins; the COMPUTE instruction completes the arithmetic logic calculation of each layer of the artificial neural network; the IO instruction implements the input data required for calculation from the external address space and After the calculation is completed, the data is stored back to the external space; the NOP instruction is responsible for clearing the microinstructions currently loaded into all internal microinstruction buffer queues, and ensuring that all instructions before the NOP instruction are all completed.
  • the NOP instruction itself does not contain any operations;
  • the JUMP instruction is responsible for the jump of the next instruction address to be read by the controller from the instruction storage location, and is used to implement control flow jumps; the MOVE instruction is responsible for carrying data at one address of the device's internal address space to another address of the internal address space; this process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
  • the dependency processing unit is configured to determine, when there are a plurality of operation instructions, whether a first operation instruction has an association relationship with a zeroth operation instruction preceding the first operation instruction; if the first operation instruction has an association relationship with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction finishes executing, the first operation instruction is extracted from the instruction storage unit and transmitted to the operator;
  • Determining whether the first operation instruction is associated with the zeroth operation instruction before the first operation instruction comprises: extracting, according to the first operation instruction, a first storage address interval of the data (for example, a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval overlaps the zeroth storage address interval, the two instructions are determined to have an association relationship, and otherwise they are determined to have no association relationship.
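  • A small Python sketch of this address-interval overlap check follows; the half-open (start, end) intervals and field names are assumptions made for illustration.

```python
def intervals_overlap(a, b):
    # half-open intervals (start, end) overlap if each starts before the other ends
    return a[0] < b[1] and b[0] < a[1]

def has_dependency(first_instr, zeroth_instr):
    return intervals_overlap(first_instr["addr_interval"], zeroth_instr["addr_interval"])

zeroth = {"op": "COMPUTE", "addr_interval": (0x1000, 0x1400)}
first = {"op": "COMPUTE", "addr_interval": (0x1200, 0x1800)}
if has_dependency(first, zeroth):
    # the first instruction is cached and only issued after the zeroth completes
    print("cache the first instruction until the zeroth completes")
```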
  • S110 input random noise and reference data to a memory (eg, random noise and reference data are stored to the storage unit);
  • the generator neural network parameters can be forwarded from the memory to the operator 120 by DMA, and the random noise and the reference data are forwarded from the data storage unit 111 to the operator 120 by DMA;
  • S140 The operator updates the discriminator neural network parameter and the generator neural network parameter according to the discriminating result.
  • step S140 specifically includes: calculating, according to the discrimination result, the loss values of the generator neural network and of the discriminator neural network respectively; then adaptively updating the parameters in the discriminator neural network along the gradient direction in which its loss value decreases fastest, so as to further improve the discrimination precision of the discriminator; and adaptively updating the parameters in the generator neural network according to the gradient direction of the loss value discriminated by the discriminator.
  • the noise generation result calculated by the generator neural network is taken as the final creation result.
  • the processing means for executing the generative adversarial network in this embodiment is for authoring video and/or images.
  • a memory of the processing device for executing the generative adversarial network receives the input data, including but not limited to a set of pictures containing one or more key features; the operator trains based on the input data to obtain a set of generation function parameters, and generates an output creation image according to the generation function parameters and an input reference image.
  • the input data may be the original input data, or may be the result obtained by pre-processing the original data.
  • the processing device for executing the generative adversarial network is adaptively trained; for example, the device inputs a set of training pictures containing one or more key features, such as hand-drawn pictures, live pictures, video key-frame pictures, and the like.
  • the device mixes the input training pictures, as real pictures, with the fake pictures generated by the generator model from noise, inputs them into the discriminator to discriminate between real and fake, weights the separately calculated loss values of the generator and the discriminator according to the discrimination result, and then adaptively updates the parameters of the two networks according to the loss values.
  • the label of the discriminator's input picture takes one of two values, {0, 1}: 0 means the input picture is an input training picture, and 1 means the input picture is a fake picture generated by the generator from noise; of course, the convention can be reversed, with 1 meaning real and 0 meaning fake.
  • the adaptive training process described above is processed offline.
  • Specific video or image creation steps can include:
  • Step 1 Passing random noise input data into the memory through the preprocessing unit or directly into the memory;
  • Step 2 DMA (Direct Memory Access) transfers the data in batches into the instruction cache, the input neuron cache, and the weight cache;
  • Step 3 the controller reads the instruction from the instruction cache, decodes it and sends it to the operator;
  • Step 4 According to the instruction, the operator performs the corresponding operation: in each layer of the neural network, the operation is mainly divided into three steps: step 4.1, the corresponding input neurons and weights are multiplied in the multiplier; step 4.2, the addition tree operation is performed in the addition tree, that is, the results of step 4.1 are added step by step through the addition tree to obtain a weighted sum, and the weighted sum is biased or left unprocessed as needed; step 4.3, the result obtained in step 4.2 is passed through the activation function unit to obtain the output neurons, which are then passed to the output neuron cache.
  • Step 5 repeat steps 2 to 4 until all data is calculated, wherein the noise generation result of the generator can be obtained according to the final output layer of the neural network, and the result is stored in the generator output buffer by the DMA;
  • Step 6 mixing part of the input data with the generator's generation result as the input data of the discriminator model, and repeating steps 2 to 4 until all the data is calculated, wherein the discrimination result of the discriminator is obtained from the final output layer of the neural network, and the result is stored in the discriminator output buffer by the DMA;
  • step 7 the output of the discriminator is fed to the operator by the DMA; after the partial derivative operation, the optimization gradient of the generator and the optimization gradient of the discriminator are respectively obtained and respectively added to the neuron weights of the generator and the discriminator, and the corresponding results are then stored in the corresponding neuron caches;
  • Step 8 Repeat steps 5, 6, and 7 until the generator and discriminator loss functions are optimal;
  • Step 9 the input reference data is passed to the memory through the data pre-processing unit or directly into the memory;
  • Step 10 steps 2 to 4 are repeated, and the output result of the generator model neural network's output layer is the authoring result.
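The creation flow of steps 1 to 10 can be summarized in code. The following is a minimal, self-contained Python/NumPy sketch with tiny linear stand-ins for the generator and discriminator; the layer sizes, learning rate, loss, and update rule are illustrative assumptions and not taken from this document, although the label convention (0 for real training data, 1 for generated data) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
noise_dim, data_dim, lr = 8, 16, 0.01
G = rng.normal(scale=0.1, size=(noise_dim, data_dim))   # generator "weights" (assumed linear model)
D = rng.normal(scale=0.1, size=(data_dim, 1))           # discriminator "weights" (assumed linear model)
reference = rng.normal(size=(32, data_dim))             # reference data (real samples)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(1000):
    noise = rng.normal(size=(32, noise_dim))            # step 1: random noise input
    fake = noise @ G                                    # steps 2-5: generator noise-generation result
    batch = np.vstack([reference, fake])                # step 6: mix real data with generated data
    labels = np.vstack([np.zeros((32, 1)), np.ones((32, 1))])   # 0 = real, 1 = fake
    pred = sigmoid(batch @ D)                           # discriminator's discrimination result
    d_grad = batch.T @ (pred - labels) / len(batch)     # step 7: partial derivatives of the loss
    D -= lr * d_grad                                    # update discriminator weights
    g_err = (sigmoid(fake @ D) - 0.0) @ D.T             # generator pushes its fakes toward label 0
    G -= lr * (noise.T @ g_err) / len(fake)             # update generator weights
# steps 9-10: the generator's output for fresh noise/reference input is the creation result
creation = rng.normal(size=(1, noise_dim)) @ G
```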
  • the output creation picture size: the number of neurons in the final output layer of the artificial neural network
  • the training data: the input training features
  • the network parameter update mode: stochastic gradient descent
  • the processing device for executing the generative adversarial network in this embodiment is for authoring audio.
  • the input data may be the original input data, or may be the result obtained by pre-processing the original data.
  • the processing device for executing the generative adversarial network is adaptively trained; for example, the device inputs a set of audio data containing one or more key sampling points, such as voice segments, synthetically edited electronic audio, and the like. The input training audio, as real audio, is then mixed with the fake audio generated by the generator model from noise and input into the discriminator to discriminate between real and fake; the separately calculated loss values of the generator and the discriminator are weighted according to the discrimination result, and the parameters are then adaptively updated according to the loss values.
  • the label of the discriminator's input audio takes one of two values, for example {0, 1}: 0 means the input audio is input training audio, and 1 means the input audio is fake audio generated by the generator from noise; of course, the convention can be reversed, with 1 meaning real and 0 meaning fake.
  • the adaptive training process described above is processed offline.
  • the method by which the artificial neural network chip creates a picture (video key frame) is: performing matrix multiplication between the optimal generator weight parameters obtained by training and the input reference picture to obtain the final created picture (video key frame).
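As a concrete illustration of that single creation step, the sketch below applies a trained generator weight matrix to a flattened reference picture by one matrix multiplication; the picture size, weight shape, and flattening are assumptions made only for the example.

```python
import numpy as np

def create_picture(reference_picture, generator_weights):
    flat = reference_picture.reshape(1, -1)     # flatten the reference frame to a row vector
    return flat @ generator_weights             # matrix multiply with the optimal generator weights

frame = np.random.rand(8, 8)                    # hypothetical 8x8 reference key frame
w_opt = np.random.rand(64, 64)                  # hypothetical trained generator weight matrix
created = create_picture(frame, w_opt)          # row of output-layer neuron values (the created picture)
```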
  • Specific voice creation steps can include:
  • Step 1 Random noise (the generator's source is generated from random noise, from which meaningful audio can be generated according to the weights) input data is sent to the storage unit through the preprocessing unit or directly into the storage unit;
  • Step 2 DMA (Direct Memory Access) transfers the data in batches into the instruction cache, the input neuron cache, and the weight cache;
  • Step 3 the controller reads the instruction from the instruction cache, decodes it and sends it to the operator;
  • Step 4 According to the instruction, the operator performs the corresponding operation: in each layer of the neural network, the operation is mainly divided into three steps: step 4.1, the corresponding input neurons and weights are multiplied; step 4.2, the addition tree operation is performed, that is, the results of step 4.1 are added step by step through the addition tree to obtain a weighted sum, and the weighted sum is biased or left unprocessed as needed; step 4.3, the activation function is applied to the result obtained in step 4.2 to obtain the output neurons, which are passed to the output neuron cache.
  • Step 5 repeat steps 2 to 4 until all data is calculated, wherein the noise generation result of the generator can be obtained according to the final output layer of the neural network, and the result is stored in the generator output buffer by the DMA;
  • Step 6 mixing part of the input data with the generator's generation result as the input data of the discriminator model, and repeating steps 2 to 4 until all the data is calculated, wherein the discrimination result of the discriminator is obtained from the final output layer of the neural network, and the result is stored in the discriminator output buffer by the DMA;
  • step 7 the output of the discriminator is fed to the operator by the DMA; after the partial derivative operation, the optimization gradient of the generator and the optimization gradient of the discriminator are respectively obtained and respectively added to the neuron weights of the generator and the discriminator, and the corresponding results are then stored in the corresponding neuron caches;
  • Step 8 Repeat steps 5, 6, and 7 until the generator and discriminator loss functions are optimal;
  • Step 9 the input reference data is passed to the storage unit through the data pre-processing unit or directly to the storage unit;
  • Step 10 steps 2 to 4 are repeated, and the output result of the generator model neural network's output layer is the authoring result.
  • the processing device for executing the generative adversarial network in this embodiment is for authoring a text type.
  • a memory of the processing device for executing the generative adversarial network receives the input data, including but not limited to a set of one or more words or phrases (text type) having part-of-speech tags; the device trains based on the input data to obtain a set of generation function parameters, and generates an output creation text paragraph according to the generation function parameters and an input reference text paragraph.
  • the input data may be the original input data, or may be the result obtained by pre-processing the original data.
  • the output data can be an ordinary text paragraph, or text in a strict special format such as verse.
  • Adaptive training of the processing device for executing the generative adversarial network proceeds, for example, as follows:
  • the device inputs a set of words or phrases containing one or more part-of-speech tags, such as speech segments, synthetically edited electronic audio, and the like.
  • the device mixes the input training text paragraphs, as feature text paragraphs, with the substitute text paragraphs that the generator model selects from the same word group according to noise, inputs them into the discriminator to discriminate between real and fake, and weights the loss values of the generator and the discriminator according to the discrimination result; the parameters in the discriminator (such as weights, biases, etc.) are then adaptively updated according to the maximum gradient direction along which its loss value decreases, thereby improving the discriminating accuracy of the discriminator; and the generator adaptively updates its parameters (such as weights, biases, etc.) according to the maximum gradient direction along which the discriminator's discrimination loss increases, thereby increasing the generator's ability to generate, from noise, samples whose distribution is close to the feature sample distribution, and reducing the discriminator's accuracy.
  • when the optimal generator standard is reached, the generator's parameters can be used to generate, from random text and according to the reference text paragraph, a text with the reference style.
  • the label of the discriminator's input text paragraph takes one of two values, for example {0, 1}: 0 means the input word or phrase is a word or phrase contained in the input training paragraph, and 1 means the input is a random phrase generated by the generator from noise; of course, the convention can be reversed, with 1 meaning real and 0 meaning fake.
  • the adaptive training process described above is processed offline.
  • the processing device for executing the generative adversarial network is an artificial neural network chip.
  • Specific text type authoring steps can include:
  • Step 1 Passing random noise input data into the memory through the preprocessing unit or directly into the memory;
  • Step 2 DMA (Direct Memory Access) transfers the data in batches into the instruction cache, the input neuron cache, and the weight cache;
  • Step 3 the controller reads the instruction from the instruction cache, decodes it and sends it to the operator;
  • Step 4 According to the instruction, the operator performs the corresponding operation: in each layer of the neural network, the operation is mainly divided into three steps: step 4.1, the corresponding input neurons and weights are multiplied; step 4.2, the addition tree operation is performed, that is, the results of step 4.1 are added step by step through the addition tree to obtain a weighted sum, and the weighted sum is biased or left unprocessed as needed; step 4.3, the activation function is applied to the result obtained in step 4.2 to obtain the output neurons, which are passed to the output neuron cache.
  • Step 5 repeat steps 2 to 4 until all data is calculated, wherein the noise generation result of the generator can be obtained according to the final output layer of the neural network, and the result is stored in the generator output buffer by the DMA;
  • Step 6 mixing part of the input data with the generator's generation result as the input data of the discriminator model, and repeating steps 2 to 4 until all the data is calculated, wherein the discrimination result of the discriminator is obtained from the final output layer of the neural network, and the result is stored in the discriminator output buffer by the DMA;
  • step 7 the output of the discriminator is fed to the operator by the DMA; after the partial derivative operation, the optimization gradient of the generator and the optimization gradient of the discriminator are respectively obtained and respectively added to the neuron weights of the generator and the discriminator, and the corresponding results are then stored in the corresponding neuron caches;
  • Step 8 Repeat steps 5, 6, and 7 until the generator and discriminator loss functions are optimal;
  • Step 9 the input reference data is passed to the storage unit through the data pre-processing unit or directly to the storage unit;
  • Step 10 steps 2 to 4 are repeated, and the output result of the generator model neural network's output layer is the authoring result.
  • Embodiments of the present disclosure also provide an electronic device including the above-described processing apparatus for executing a generative adversarial network.
  • Electronic devices may include, but are not limited to, robots, computers, printers, scanners, tablets, smart terminals, cell phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
  • the vehicle may include an airplane, a ship, and/or a vehicle;
  • the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, a range hood;
  • the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound instrument and/or an electrocardiograph.
  • Each functional unit/module/submodule/subunit in the present disclosure may be hardware, such as the hardware may be a circuit, including digital circuits, analog circuits, and the like.
  • Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like.
  • the computing modules in the computing device can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

一种数据共享***,包括存储模块和至少两个处理模块,其中:至少两个处理模块共用存储模块;至少两个处理模块之间进行通信,以实现数据共享。以及一种数据共享***的数据共享方法。本公开可降低存储通信的开销,有效降低数据访问的延时。

Description

数据共享***及其数据共享方法 技术领域
本公开涉及一种共享***,尤其涉及一种数据共享***及其数据共享方法。
背景技术
随着人工智能技术的不断发展,机器学习技术和深度神经网络技术得到了广泛的应用,如可应用于语音识别、图像处理、数据分析、广告推荐***、汽车自动驾驶等等,可以说,机器学习和深度神经网络已经被应用在了生活的各个方面。这些技术能够取得如此广泛的应用,和其能够很好地处理大数据的优势是分不开的。但随着数据量的越来越大,其计算量也随之增加,因此如何有效的组织和存储数据,成为了设计片上***芯片(SoC芯片)时一个不得不面对的问题。
如图1所示,在现有的SoC芯片中,机器学习(可以做深度学习或其他)专用集成电路(ASIC模块)的数据时,通常都存在私有的静态随机存取存储器(SRAM)里,通过先进的可扩展接口(AXI)总线将数据放到片外动态随机存取存储器(DRAM)或片内的SRAM(类似缓存SRAM(Cache))里,再间接和其他模块交互。这使得***开销提高、数据读取延时增大、数据共享和交互的能耗增多。
发明内容
基于以上问题,本公开的主要目的在于提出一种数据共享***及其数据共享方法,用于解决以上技术问题的至少之一。
为了实现上述目的,作为本公开的一个方面,本公开提出了一种数据共享***,包括存储模块和至少两个处理模块,其中:
至少两个处理模块共用存储模块;
至少两个处理模块之间通过预设的规则进行通信,以实现数据共享。
在本公开的一些实施例中,上述预设的规则包括通信协议、传送协议、握手协议和/或总线协议。
在本公开的一些实施例中,上述通过预设的规则通信包括:至少两个处理模块包括第一处理模块和第二处理模块,第一处理模块向第二处理模块发送请求信号和相应的数据地址,第二处理模块根据请求信号和相应的数据地址,向第一处理模块回复有效信号和数据,以实现数据共享。
在本公开的一些实施例中,上述至少两个处理模块包括物理处理器。
在本公开的一些实施例中,上述物理处理器包括神经网络处理器。
在本公开的一些实施例中,上述神经网络处理器包括用于执行人工神经网络正向运算的装置。
在本公开的一些实施例中,上述用于执行人工神经网络正向运算的装置包括指令缓存单元和直接内存访问单元,其中:
指令缓存单元用于通过直接内存访问单元读入指令并缓存读入的指令。
在本公开的一些实施例中,上述用于执行人工神经网络正向运算的装置还包括:
控制器单元,用于从指令缓存单元读取指令,并将该指令译码成微指令。
在本公开的一些实施例中,上述用于执行人工神经网络正向运算的装置还包括H树模块,H树模块可以包括分支处理模块,其中,
主运算模块与分支处理模块连接,分支处理模块与多个从处理模块连接;
分支处理模块,用于执行转发主运算模块与从处理模块之间的数据或指令。
在本公开的一些实施例中,上述直接内存访问单元,还用于从外部地址空间向主运算模块和各从运算模块的相应数据缓存单元中写数据,或从所述数据缓存单元向外部地址空间读数据。
在本公开的一些实施例中,上述至少两个处理模块包括两个互异结构的处理器;该两个互异结构的处理器的其中之一为神经网络处理器。
在本公开的一些实施例中,上述至少两个处理模块包括处理器的至少两个处理器内核;该至少两个处理器内核为相同/互异结构的处理器内核。
在本公开的一些实施例中,上述至少两个处理模块包括处理器内核的至少两个运算单元;该至少两个运算单元为相同/互异结构的运算单元。
在本公开的一些实施例中,上述共享***还包括:
至少两个存储单元,分别连接至少两个运算单元的至少一个,至少两个运算单元中的任一个连接一个或多个存储单元;且至少两个存储单元共享所述存储模块。
在本公开的一些实施例中,上述至少两个运算单元共享同一个存储单元、或独享一个存储单元、或部分共享同一个存储单元,且部分独享一个存储单元。
在本公开的一些实施例中,上述至少两个处理模块包括处理器内核的三个运算单元,至少两个存储单元为两个,其中的两个运算单元同时连接其中的一个存储单元,其中的另外一个运算单元连接其中的另一个存储单元。
为了实现上述目的,作为本公开的另一个方面,本公开提出了一种数据共享方法,包括以下步骤:
至少两个处理模块之间通过预设的规则进行通信,以实现数据共享;
其中,两个处理模块共用存储模块。
在本公开的一些实施例中,上述预设的规则包括通信协议、传送协议、握手协议和/或总线协议。
在本公开的一些实施例中,上述通过预设的规则通信包括:至少两个处理模块包括第一处理模块和第二处理模块,第一处理模块向第二处理模块发送请求信号和相应的数据地址,第二处理模块根据请求信号和相应的数据地址,向第一处理模块回复有效信号和数据,以实现数据共享。
在本公开的一些实施例中,上述至少两个处理模块包括物理处理器。
在本公开的一些实施例中,上述物理处理器包括神经网络处理器。
在本公开的一些实施例中,上述神经网络处理器包括用于执行人工神经网络正向运算的装置。
在本公开的一些实施例中,上述用于执行人工神经网络正向运算的装置包括指令缓存单元和直接内存访问单元,其中:
指令缓存单元通过直接内存访问单元读入指令,并缓存读入指令。
在本公开的一些实施例中,上述用于执行人工神经网络正向运算的装置还包括控制器单元,该控制器单元从指令缓存单元读取指令,并译码该指令生成微指令。
在本公开的一些实施例中,上述用于执行人工神经网络正向运算的装置还包括H数模块、主运算模块、以及多个从运算模块,其中:
H树模块,在每层神经网络反向训练开始计算的阶段,主运算模块通过H树模块向所有的从运算模块传输本层的输入神经元向量,以及在从计算模块的计算过程完成后,H树模块逐级将各从计算模块的输出神经元值拼成中间结果向量;
主运算模块,利用中间结果向量完成后续计算。
在本公开的一些实施例中,上述直接内存访问单元,还从外部地址空间向主运算模块和各从运算模块的相应数据缓存单元中写数据,或从数据缓存单元向外部地址空间读数据。
在本公开的一些实施例中,上述至少两个处理模块包括两个互异结构的处理器;该两个互异结构的处理器的其中之一为神经网络处理器。
在本公开的一些实施例中,上述至少两个处理模块包括处理器的至少两个处理器内核;该至少两个处理器内核为相同/互异结构的处理器内核。
在本公开的一些实施例中,上述至少两个处理模块包括处理器内核的至少两个运算单元;该至少两个运算单元为相同/互异结构的运算单元。
在本公开的一些实施例中,上述数据共享方法还采用:
至少两个存储单元,分别连接至少两个运算单元的至少一个,至少两个运算单元中的任一个连接一个或多个存储单元;且至少两个存储单元共享所述存储模块。
在本公开的一些实施例中,上述至少两个运算单元共享同一个存储单元、或独享一个存储单元、或部分共享同一个存储单元,且部分独享一个存储单元。
在本公开的一些实施例中,上述至少两个处理模块包括处理器内核的三个运算单元,至少两个存储单元为两个,其中的两个运算单元同时连接其中的一个存储单元,其中的另外一个运算单元连接其中的另一个存储单元。
本公开一方面提供了一种信息处理装置,该装置包括存储模块和数据处理模块,其中,存储模块,用于接收并存储输入数据、指令和输出数据,其中输入数据包含一个或多个关键特征;数据处理模块,用于对输入数据包含的关键特征进行判断,并根据判断结果对存储模块中的输入数据进行评分。
上述方案中,所述输入数据为原始输入数据,或对原始输入数据进行预处理后的数据。
上述方案中,所述数据处理模块对输入数据包含的关键特征进行判断,包括:数据处理模块计算输入数据包含的关键特征的置信度,该置信度即为判断结果。
上述方案中,所述存储模块中存储有数据和指令,所述数据包括输入数据,输入神经元,权值,输出神经元,输出数据;输入数据传给人工神经网络中的各个输入神经元,从而参与后续运算;输出神经元的值即判断结果和评分,作为输出数据。
上述方案中,所述数据处理模块包括运算模块,用于根据所述存储模块中存储的指令对所述存储模块中存储的数据执行相应的计算,并将运算结果输出至存储模块。
上述方案中,所述运算模块用于根据所述存储模块中存储的指令对所述存储模块中存储的数据执行相应的计算,在神经网络的各个层中,运算模块执行运算包括:
第一部分为乘法器;
第二部分为一个或者多个加法器;
第三部分为激活函数单元;以及
第四部分为向量处理单元。
上述方案中,第二部分为多个加法器时,多个加法器组成加法树。
上述方案中,所述激活函数是sigmoid、tanh、relu、softmax。
上述方案中,第四部分为向量处理单元,该向量处理单元进行池化运算。
上述方案中,所述数据处理模块还包括指令缓存和神经网络数据缓存;指令缓存,用于缓存指令;神经网络数据缓存,用于缓存所述存储模块中的权值数据、输入神经元和输出神经元。
上述方案中,所述神经网络数据缓存包括权值缓存、输入神经元缓存和输出神经元缓存;权值缓存,用于缓存权值数据;输入神经元缓存,用于缓存输入神经元;输出神经元缓存,用于缓存并输出运算模块输出的运算结果,即判断结果和/或评分。
上述方案中,所述数据处理模块还包括直接内存存取,该直接内存存取起到沟通存储模块和各个缓存之间桥梁的作用,用于对存储模块中存储的数据和/或指令进行读写,将读写出的指令存储至指令缓存,将读出的权值存储至权值缓存,将读出的输入神经元,即输入数据存储至输入神经元缓存,并将接收自输出神经元缓存的输出神经元,即将判断结果和/或评分存储至存储模块。
上述方案中,所述数据处理模块还包括控制单元,该控制单元用于从所述指令缓存中读取指令,将其译码为运算模块能够执行的指令并输出至运算模块。
上述方案中,所述数据处理模块还包括评分单元,该评分单元用于:当信息处理装置中运行的人工神经网络得到判断结果,进而得到评分时,该评分单元不参与数据处理;当信息处理装置中运行的人工神经网络仅得到判断结果而不得到评分时,该评分单元用于根据判断结果得到评分。
上述方案中,所述判断结果,即信息处理装置中运行的人工神经网络的最终输出层的输出神经元的值,输出神经元的值即为关键特征出现的置信度,置信度为一定范围内的自然数;所述评分,为在信息处理装置中运行的人工神经网络的最终输出层后面再加一层作为新的最终输出层,该新的最终输出层的输入神经元值为各个关键特征出现的置信度;该层只有一个输出神经元,其值即为评分;该新的最终输出层运算中的权值对应各个关键特征的重要程度;或者该层有N+1个输出神经元,评分的取值范围为[0,N],若将该层输出神经元编号为0,1,2,...,N,则第i个输出神经元的值对应评分值取i的置信度P i,最终评分为置信度最大的评分值,即评分=i 0,
$i_0 = \arg\max_{0 \le i \le N} P_i$
上述方案中,所述评分或者为:在信息处理装置中运行的人工神经网络的最终输出层得到各关键特征出现的置信度后,将其作为评分单元的输入,评分单元据此得到评分。
上述方案中,该信息处理装置为人工神经网络芯片。
本公开另一方面提供了一种信息处理方法,采用所述的信息处理装置,包括:
存储模块接收并存储输入数据、指令和输出数据,其中输入数据包含一个或多个关键特征;
数据处理模块对输入数据包含的关键特征进行判断,并根据判断结果对存储模块中的输入数据进行评分。
上述方案中,所述输入数据采用原始输入数据,或采用对原始输入数据进行预处理后的数据。
上述方案中,所述数据处理模块对输入数据包含的关键特征进行判断,包括:数据处理模块计算输入数据包含的关键特征的置信度,该置信度即为判断结果。
上述方案中,所述存储模块存储有数据和指令,所述数据包括输入数据,输入神经元,权值,输出神经元,输出数据;输入数据传给人工神经网络中的各个输入神经元,从而参与后续运算;输出神经元的值即判断结果和评分,作为输出数据。
上述方案中,所述数据处理模块包括运算模块,该运算模块根据所述存储模块中存储的指令对所述存储模块中存储的数据执行相应的计算,并将运算结果输出至存储模块。
上述方案中,所述运算模块根据所述存储模块中存储的指令对所述存储模块中存储的数据执行相应的计算,在神经网络的各个层中,运算模块执行运算包括:
第一部分为乘法器;
第二部分为一个或者多个加法器;
第三部分为激活函数单元;以及
第四部分为向量处理单元。
上述方案中,第二部分为多个加法器时,多个加法器组成加法树。
上述方案中,所述激活函数采用sigmoid、tanh、relu、softmax。
上述方案中,第四部分为向量处理单元,该向量处理单元进行池化运算。
上述方案中,所述数据处理模块还包括指令缓存和神经网络数据缓存;使用指令缓存缓存指令;使用神经网络数据缓存缓存所述存储模块中的权值数据、输入神经元和输出神经元。
上述方案中,所述神经网络数据缓存包括权值缓存、输入神经元缓存和输出神经元缓存;使用权值缓存缓存权值数据;使用输入神经元缓存缓存输入神经元;使用输出神经元缓存缓存并输出运算模块输出的运算结果,即判断结果和/或评分。
上述方案中,所述数据处理模块还包括直接内存存取,该直接内存存取起到沟通存储模块和各个缓存之间桥梁的作用,并对存储模块中存储的数据和/或指令进行读写,将读写出的指令存储至指令缓存,将读出的权值存储至权值缓存,将读出的输入神经元,即输入数据存储至输入神经元缓存,并将接收自输出神经元缓存的输出神经元,即将判断结果和/或评分存储至存储模块。
上述方案中,所述数据处理模块还包括控制单元,该控制单元读取指令缓存中的指令,将其译码为运算模块能够执行的指令并输出至运算模块。
上述方案中,所述数据处理模块还包括评分单元,当信息处理装置中运行的人工神经网络得到判断结果,进而得到评分时,该评分单元不参与数据处理;当信息处理装置中运行的人工神经网络仅得到判断结果而不得到评分时,该评分单元根据判断结果得到评分。
上述方案中,所述判断结果,即信息处理装置中运行的人工神经网络的最终输出层的输出神经元的值,输出神经元的值即为关键特征出现的置信度,置信度为一定范围内的自然数;所述评分,为在信息处理装置中运行的人工神经网络的最终输出层后面再加一层作为新的最终输出层,该新的最终输出层的输入神经元值为各个关键特征出现的置信度;该层只有一个输出神经元,其值即为评分;该新的最终输出层运算中的权值对应各个关键特征的重要程度;或者该层有N+1个输出神经元,评分的取值范围为[0,N],若将该层输出神经元编号为0,1,2,...,N,则第i个输出神经元的值对应评分值取i的置信度P i,最终评分为置信度最大的评分值,即评分=i 0,
$i_0 = \arg\max_{0 \le i \le N} P_i$
上述方案中,所述评分或者为:在信息处理装置中运行的人工神经网络的最终输出层得到各关键特征出现的置信度后,将其作为评分单元的输入,评分单元据此得到评分。
上述方案中,该信息处理方法采用的信息处理装置为人工神经网络芯片。
本公开再一方面还提供了一种信息处理***,包括信息获取装置、所述的信息处理装置、交互界面和控制装置,其中:
信息获取装置,用于获取外部数据,并传递给信息处理装置;
信息处理装置,用于对接收自信息获取装置的外部数据进行运算处理,并将运算处理结果输出给交互界面;
交互界面,用于显示接收自信息处理装置的运算结果,以及将接收自外部的操作或命令传输给控制装置;
控制装置,用于根据接收自交互界面的操作或命令控制信息获取装置、信息处理装置和交互界面的运作。
上述方案中,所述信息获取装置包括字符识别装置、图像识别装置和语音识别装置;
所述字符识别装置,用于获取外部数据中的文字信息;
所述图像识别装置,用于获取外部数据中的图片或视频信息;
所述语音识别装置,用于获取外部数据中的音频信息。
上述方案中,交互界面为手机、电脑、笔记本或平板电脑的显示屏。
本公开再一方面还提供了一种信息处理方法,采用所述的信息处理***,包括:
信息获取装置获取外部数据,并将外部数据直接或经预处理后传递给信息处理装置;
信息处理装置对接收自信息获取装置的外部数据或经预处理后的外部数据进行运算处理,并将运算处理结果输出给交互界面;以及
交互界面显示接收自信息处理装置的运算结果。
上述方案中,所述信息获取装置包括字符识别装置、图像识别装置和语音识别装置,所述信息获取装置获取外部数据,包括:
信息获取装置采用字符识别装置获取外部数据中的文字信息;
信息获取装置采用图像识别装置获取外部数据中的图片或视频信息;
信息获取装置采用语音识别装置获取外部数据中的音频信息。
根据本公开的一个方面,提供了一种任务切分装置,包括:粒度任务切分单元,用于采用至少一种粒度对任务进行切分形成子任务;以及任务切分粒度选择单元,用于选择采用的粒度。
在一些实施例中,任务切分装置用于神经网络,粒度任务切分单元包括以下单元中的至少一个,第一粒度任务切分单元,用于将任务整体作为一子任务;第二粒度任务切分单元,用于将选取任务中部分样本计算作为子任务来切分任务;第三粒度任务切分单元,用于按照神经网络的层类型进行任务切分,相同类型层的计算作为一子任务;第四粒度任务切分单元,用于按照神经网络的层间结构进行任务切分,若干相邻层的计算作为一子任务;第五粒度任务切分单元,用于按照神经网络的层内结构进行任务切分,将神经网络层内的计算切分为子任务。
在一些实施例中,所述任务切分粒度选择单元基于神经网络需要处理的样本数量、神经网络的拓扑结构以及每一层的计算量中的至少一个选择第一至第五粒度任务切分单元中的至少一个来进行任务切分。
在一些实施例中,所述按照神经网络的层内结构进行任务切分包括:对神经网络的卷积层计算、全连接层计算、池化层计算或激活层计算进行任务切分。
在一些实施例中,所述对神经网络的卷积层计算进行切分包括:当所述神经网络的卷积层输入神经元是三维矩阵(Nfin,Nxin,Nyin),权值是四维矩阵(Nfout,Nfout,Kx,Ky),输出神经元是三维矩阵(Nfout,Nxout,Nyout)时,其中Nfin是输入特征图像数量,(Nxin,Nyin)是输入特征图像大小,Nfout是输出特征图像数量,(Kx,Ky)是卷积核大小,(Nxout,Nyout)是输出特征图像大小,Nfin,Nxin,Nyin,Kx,Ky,Nfout,Nxout,Nyout均为正整数,将输出神经元按照(Bfout,Bxout,Byout)的块大小进行切分,同时对权值按照(Bfout,Bfin,Bx,By)的块大小进行切分,其中,Bfout,Bxout,Byout,Bfout,Bfin,Bx,By均为正整数,且0<Bfout≤Nfout,0<Bxout≤Nxout,0<Byout≤Nyout,0<Bfin≤Nfin,0<Bx≤Kx,0<By≤Ky。
根据本公开的另一个方面,提供一种任务处理装置,包括:任务切分装置;以及任务调度装置,所述任务调度装置包括:任务队列单元,用于缓存未调度的任务,监测单元,用于实时监测多核处理器各核工作状态;任务调度单元,用于从未调度任务中选择待调度任务,并根据所述各核工作状态向目标核分配调度待调度任务。
在一些实施例中,所述任务调度单元采用以下方式中的至少一种来分配调度待调度任务至目标核:统计每一个核私有任务队列中任务数量,选择私有任务队列中任务最少的核作为目标核;统计每一个核完成私有任务队列中所有任务的时间,选择完成任务时间最短的核作为目标核;统计待调度任务所需资源在所有核的分布情况,选择拥有资源数量最多的核作为目标核;以及采用启发式算法将待调度任务分配到目标核。
在一些实施例中,所述启发式算法包括遗传算法,蚁群算法,模拟退火算法中的至少一个。
在一些实施例中,所述任务调度单元每隔时间T进行一次任务调度,待调度任务采用以下方式中的至少一种进行选择:随机选择未调度的任务;
选择预计执行时间最长的未调度的任务;选择预计执行时间最短的未调度的任务;选择占用资源最多的未调度的任务;选择占用资源最少的未调度的任务。
在一些实施例中,所述各核工作状态包括利用率,工作负载,工作频率,核内私有任务队列中的任务数量,核内任务完成时间中的至少一个。
根据本公开的另一个方面,提供一种多核处理器,包括:J个处理核,J为正整数;以及任务处理装置。
在一些实施例中,所述处理核之间的拓扑结构采用一维线性、二维mesh,二维星形、三维立方中的至少一种。
在一些实施例中,所述处理核包括神经网络处理核,所述神经网络处理核包括:存储单元,用于存储神经网络的神经元、权值以及指令;选数单元,用于接收输入神经元和非零权值位置信息,选出非零权值对应的神经元;运算单元,用于接收输入非零权值对应的神经元和对应的非零权值,完成神经网络训练运算;以及控制单元,用于接收神经网络的指令,经过译码后生成控制信息控制所述选数单元和运算单元。
在一些实施例中,所述指令包括控制指令,数据传输指令,运算指令和逻辑指令中的至少一个。
在一些实施例中,所述运算指令用于完成神经网络的算术运算,包括矩阵运算指令,向量运算指令,标量运算指令,卷积神经网络运算指令,全连接神经网络运算指令,池化神经网络运算指令,RBM神经网络运算指令,LRN神经网络运算指令,LCN神经网络运算指令,LSTM神经网络运算指令,RNN神经网络运算指令,RELU神经网络运算指令,PRELU神经网络运算指令,SIGMOID神经网络运算指令,TANH神经网络运算指令,MAXOUT神经网络运算指令中的至少一个。
根据本公开的再一个方面,提供一种任务切分方法,用于神经网络,选择以下任务切分方式中的至少一个来进行任务切分:将任务整体作为一子任务;将选取任务中部分样本计算作为子任务来切分任务;按照神经网络的层类型进行任务切分,相同类型层的计算作为一子任务;按照神经网络的层间结构进行任务切分,若干相邻层的计算作为一子任务;按照神经网络的层内结构进行任务切分,将神经网络层内的计算切分为子任务。
在一些实施例中,基于神经网络需要处理的样本数量、神经网络的拓扑结构以及每一层的计算量中的至少一个来选择所述任务切分装置中的至少一个来进行任务切分。
在一些实施例中,所述按照神经网络的层内结构进行任务切分包括:对神经网络的卷积层计算、全连接层计算、池化层计算或激活层计算进行任务切分。
在一些实施例中,所述对神经网络的卷积层计算进行切分包括:当所述神经网络的卷积层输入神经元是三维矩阵(Nfin,Nxin,Nyin),权值是四维矩阵(Nfout,Nfout,Kx,Ky),输出神经元是三维矩阵(Nfout,Nxout,Nyout)时,其中Nfin是输入特征图像数量,(Nxin,Nyin)是输入特征图像大小,Nfout是输出特征图像数量,(Kx,Ky)是卷积核大小,(Nxout,Nyout)是输出特征图像大小,Nfin,Nxin,Nyin,Kx,Ky,Nfout,Nxout,Nyout均为正整数,将输出神经元按照(Bfout,Bxout,Byout)的块大小进行切分,同时对权值按照(Bfout,Bfin,Bx,By)的块大小进行切分,其中,Bfout,Bxout,Byout,Bfout,Bfin,Bx,By均为正整数,且0<Bfout≤Nfout,0<Bxout≤Nxout,0<Byout≤Nyout,0<Bfin≤Nfin,0<Bx≤Kx,0<By≤Ky。
根据本公开的进一步的一个方面,提供一种任务处理方法,包括:务切分方法;以及任务调度方法,所述任务调度方法包括:缓存未调度的任务,所述任务包括权利要求中任一任务切分装置切分的子任务;实时监测多核处理器各核工作状态;以及从未调度任务中选择待调度任务并根据所述各核工作状态向目标核分配调度待调度任务。
在一些实施例中,所述向目标核分配调度所述待调度任务采用以下方式中的至少一种执行:统计每一个核私有任务队列中任务数量,选择私有任务队列中任务最少的核作为目标核;统计每一个核完成私有任务队列中所有任务的时间,选择完成任务时间最短的核作为目标核;统计待调度任务所需资源在所有核的分布情况,选择拥有资源数量最多的核作为目标核;以及采用启发式算法将待调度任务分配到目标核。
在一些实施例中,所述启发式算法包括遗传算法,蚁群算法,模拟退火算法中的至少一个。
在一些实施例中,每隔时间T进行一次任务调度,待调度任务采用以下方式中的至少一种进行选择:随机选择未调度的任务;选择预计执行时间最长的未调度的任务;选择预计执行时间最短的未调度的任务;选择占用资源最多的未调度的任务;选择占用资源最少的未调度的任务。
在一些实施例中,所述各核工作状态包括利用率,工作负载,工作频率,核内私有任务队列中的任务数量,核内任务完成时间中的至少一个。
根据本公开的一个方面,提供了一种处理器,包括:
任务切分装置,用于根据任务切分粒度进行任务切分;以及
硬件资源划分装置,用于根据任务切分结果对所述处理器的硬件资源进行划分。
在一些实施例中,所述处理器还包括多个计算单元,所述硬件资源划分装置用于根据任务切分结果对所述处理器的多个计算单元进行划分,即所述多个计算单元根据所述任务切分结果分成多个计算组,以分别计算batch中不同的正向和反向通路,或运行不同的服务的请求。
在一些实施例中,所述处理器在运行过程中,根据所述任务切分结果对所述多个计算单元的分组进行动态调整。
在一些实施例中,所述任务切分装置包括:
任务切分粒度选择单元,用于选择采用的粒度;以及
粒度任务切分单元,用于采用至少一种粒度对任务进行切分形成子任务。
在一些实施例中,所述粒度任务切分单元包括以下单元中的至少一个:
第一粒度任务切分单元,用于将任务整体作为一子任务;
第二粒度任务切分单元,用于将选取任务中部分样本计算作为子任务来切分任务;
第三粒度任务切分单元,用于按照神经网络的层类型进行任务切分,相同类型层的计算作为一子任务;
第四粒度任务切分单元,用于按照神经网络的层间结构进行任务切分,若干相邻层的计算作为一子任务;
第五粒度任务切分单元,用于按照神经网络的层内结构进行任务切分,将神经网络层内的计算切分为子任务。
在一些实施例中,所述任务切分粒度选择单元基于神经网络需要处理的样本数量、神经网络的拓扑结构以及每一层的计算量中的至少一个选择第一至第五粒度任务切分单元中的至少一个来进行任务切分。
在一些实施例中,所述的处理器还包括:任务调度装置;其中,所述处理器为多核处理器;所述任务调度装置包括:
任务队列单元,用于缓存未调度的任务;
监测单元,用于实时监测各核的工作状态;以及
任务调度单元,用于从未调度任务中选择待调度任务,并根据所述各核的工作状态向目标核分配调度待调度任务。
在一些实施例中,所述任务调度单元采用以下方式中的至少一种来分配调度待调度任务至目标核:
统计每一个核私有任务队列中任务数量,选择私有任务队列中任务最少的核作为目标核;
统计每一个核完成私有任务队列中所有任务的时间,选择完成任务时间最短的核作为目标核;
统计待调度任务所需资源在所有核的分布情况,选择拥有资源数量最多的核作为目标核;以及
采用启发式算法将待调度任务分配到目标核。
根据本公开的另一个方面,提供了一种组合处理装置,其中,所述组合处理装置包括所述的处理器,通用互联接口和其他处理装置进行交互,共同完成用户指定的计算操作。
根据本公开的另一个方面,提供了一种神经网络芯片,其中,所述神经网络芯片包括所述的处理器或所述的组合处理装置。
根据本公开的另一个方面,提供了一种电子设备,其中,所述电子设备包括所述的芯片。
根据本公开的另一个方面,提供了一种处理方法,包括:
任务切分装置根据任务切分粒度进行任务切分;以及
硬件资源划分装置根据任务切分结果对处理器的硬件资源进行划分。
在一些实施例中,在所述硬件资源划分装置根据任务切分结果对处理器的硬件资源进行划分的步骤中:
所述硬件资源划分装置根据任务切分结果对所述处理器的多个计算单元进行划分,即所述多个计算单元根据所述任务切分结果分成多个计算组,以分别计算batch中不同的正向和反向通路,或运行不同的服务的请求。
在一些实施例中,所述处理器在运行过程中,根据所述任务切分结果对所述多个计算单元的分组进行动态调整。
在一些实施例中,所述任务切分装置根据任务切分粒度进行任务切分的步骤包括:
任务切分粒度选择单元选择采用的任务切分粒度;以及
粒度任务切分单元采用至少一种所述粒度对划分后的各硬件资源的任务进行切分形成子任务。
在一些实施例中,所述任务切分粒度选择单元基于神经网络需要处理的样本数量、神经网络的拓扑结构以及每一层的计算量中的至少一个选择多个所述粒度任务切分单元中的至少一个来进行任务切分。
在一些实施例中,所述的处理方法还包括:在任务切分之后,对任务进行分配调度,其包括:
缓存未调度的任务;
实时监测所述处理器各核工作状态;以及
从未调度任务中选择待调度任务,并根据所述各核工作状态向目标核分配调度待调度任务。
在一些实施例中,采用以下方式中的至少一种来分配调度待调度任务至目标核:
统计每一个核私有任务队列中任务数量,选择私有任务队列中任务最少的核作为目标核;
统计每一个核完成私有任务队列中所有任务的时间,选择完成任务时间最短的核作为目标核;
统计待调度任务所需资源在所有核的分布情况,选择拥有资源数量最多的核作为目标核;以及
采用启发式算法将待调度任务分配到目标核。
根据本公开的一方面,提供一种信息处理装置,包括:存储模块,用于获取信息数据,所述信息数据包括至少一个关键特征,所述存储模块预存所述关键特征对应的真实置信度;运算电路,根据所述信息数据,确定所述关键特征对应的预测置信度,并判断所述关键特征的预测置信度是否超过关键特征对应的真 实置信度预设阈值范围;控制电路,当所述预测置信度超过真实置信度预设阈值,控制所述存储模块修改关键特征,或向外部发出修改信号。
在进一步的实施方案中,所述存储模块包括直接内存存取DMA,所述直接内存存取DMA与所述运算电路电性连接,用于存储所述运算电路运算确定的预测置信度,并将所述真实置信度和预测置信度送入所述运算电路以进行比较。
在进一步的实施方案中,所述存储模块还包括存储单元,所述存储单元用于从信息处理装置外部获取信息数据,并传入所述直接存储存取DMA,供运算电路调用。
在进一步的实施方案中,所述存储模块还用于存储神经网络专用指令、神经网络中的输入神经元、输出神经元和权值,所述信息处理装置还包括:
指令缓存,用于从所述存储模块缓存专用指令,供控制电路调用;输入神经元缓存,用于从所述存储模块缓存神经元,供运算电路调用;权值缓存,用于从所述存储模块缓存权值,供运算电路调用;输入神经元缓存,用于存储从所述运算电路运算获得的输出神经元。
在进一步的实施方案中,所述运算电路还用于根据各关键特征的判断结果对所述信息数据进行评分,或者所述运算电路还用于对所述神经网络进行自适应性训练。
在进一步的实施方案中,所述运算电路中,根据所述信息数据,确定所述关键特征对应的预测置信度包括:以所述信数据作为神经网络的输入,进行神经网络运算,所述预测置信度作为神经网络的输出。
在进一步的实施方案中,所述信息数据包括以下至少一种:图片、文本、音频、视频帧和视频。
在进一步的实施方案中,还包括预处理模块,用于对外部的原始信息数据进行预处理后传入所述存储模块;优选的,所述预处理包括对原始信息数据切分、高斯滤波、二值化、正则化和/或归一化,以获得符合神经网络输入格式的数据。
根据本公开的另一方面,提供一种信息处理设备,包括:信息获取装置,用于获取外部的信息数据;以上所述的信息处理装置,用于处理所述信息数据,获得关键特征的预测置信度,且当所述预测置信度超过真实置信度阈值时,修改所述关键特征,或发出修改信号。
根据本公开的再一方面,提供一种信息处理设备,包括:信息获取装置,用于获取外部的信息数据;以上所述的信息处理装置,用于处理所述信息数据,获得关键特征的预测置信度,且当所述预测置信度超过真实置信度预设阈值时,修改所述关键特征,或发出修改信号;交互界面,接收修改的关键特征或者修改信号,向用户示出修改内容。
在进一步的实施方案中,所述交互装置还包括预处理模块,用于对信息获取装置获取的信息数据进行预处理后送入信息处理装置。
在进一步的实施方案中,还包括控制器,用于控制所述信息获取装置、信息处理装置和/或交互界面。
在进一步的实施方案中,所述交互界面还用于响应用户的操作或命令,对预设阈值进行修改。
根据本公开的又一方面,提供一种信息处理方法,包括:通过存储模块获取信息数据,所述信息数据包括至少一个关键特征,所述存储模块预存所述关键特征对应的真实置信度;运算电路根据所述信息数据,确定所述关键特征对应的预测置信度,并判断所述关键特征的预测置信度是否超过关键特征对应的真实置信度预设阈值范围;当所述预测置信度超过真实置信度预设阈值范围,控制电路控制存储模块修改所述关键特征,或发出修改信号。
在进一步的实施方案中,所述存储模块包括直接内存存取DMA,所述方法还包括步骤:采用直接内存存取DMA存储运算电路所确定的预测置信度,并将所述真实置信度和预测置信度送入所述运算电路以进行比较。
在进一步的实施方案中,所述通过存储模块获取信息数据包括:使用存储单元从外部获取信息数据,并传入所述直接存储存取DMA,供运算电路调用。
在进一步的实施方案中,还包括步骤:使用存储模块存储神经网络专用指令;通过指令缓存从所述存储模块缓存专用指令,供控制电路调用;
采用存储模块存储神经网络中的输入神经元、输出神经元和权值;采用输入神经元缓存从所述存储模块缓存神经元,供运算电路调用;采用权值缓存从所述存储模块缓存权值,供运算电路调用;采用输入神经元缓存,存储从所述运算电路运算获得的输出神经元。
在进一步的实施方案中,还包括步骤:采用运算电路根据各关键特征的所述判断结果对所述信息数据进行评分,或者通过运算电路对所述神经网络进行自适应性训练。
在进一步的实施方案中,所述运算电路根据所述信息数据,确定所述关键特征对应的预测置信度包括:以所述信数据作为神经网络的输入,进行神经网络运算,所述预测置信度作为神经网络的输出。
在进一步的实施方案中,还包括步骤:通过预处理模块对外部的原始信息数据进行预处理后传入所述存储模块。
根据本公开的一方面,提供一种用于执行生成对抗网络的处理装置,包括:
存储器,用于接收输入数据,所述输入数据包括随机噪声和参考数据,以及存储判别器神经网络参数与生成器神经网络参数;
运算器,用于将随机噪声输入数据传入生成器神经网络进行运算,得到噪声生成结果;还用于将噪声生成结果和参考数据共同输入判别器神经网络进行运算,得到判别结果;还用于根据所述判别结果更新所述判别器神经网络参数与生成器神经网络参数。
根据本公开的另一方面,提供一种应用上述处理装置进行机器创作的方法,包括:
输入随机噪声和参考数据至存储器;
运算器将随机噪声输入数据传入生成器神经网络进行运算,得到噪声生成结果;
通过运算器将噪声生成结果和参考数据共同输入判别器神经网络进行运算,得到判别结果;
通过运算器,根据所述判别结果更新所述判别器神经网络参数与生成器神经网络参数。
根据本公开的再一方面,提供一种电子设备,包括权上述的处理装置。
本公开提出的数据共享***及其数据共享方法,具有以下有益效果:
本公开中的至少两个处理模块之间可通过预设的规则直接通信,实现数据共享;因此无需通过共享的存储模块,从而可降低存储通信的开销,有效降低数据访问的延时;本公开的至少两个处理模块可包括不同结构的处理器,及不同结构处理器中的内核,因此可维护相同或不同结构的处理器的外部存储模块和内核对应的核外部存储模块;本公开在不降低原有的存储效率和不增加原有的存储成本的情况下,每个存储单元可以允许一个或多个运算单元进行直接访问,其具体数量无需固定和同意,支持非对称的结构,允许根据需求进行配置和调整,从而减少了片内外访存的交互次数,降低了功耗;本公开对于运算单元独自享有的私有存储模块,允许其可以将数据传递给其他运算单元。即在保护数据私有性的同时,允许数据的快速交互,提高了数据利用率,避免了片上存储多份相同数据带来的资源浪费和反复读取相同数据的访存开销,进一步提高了访存速度,降低了访存功耗。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是现有技术中的数据处理***的结构示意图;
图2是本公开一实施例提出的数据共享***的结构示意图;
图3是图2***中处理器的结构示意图;
图4是图3中H树模块的结构示意图;
图5是图3中主运算模块的结构示意图;
图6是图3中从运算模块的结构示意图;
图7是本公开另一实施例提出的数据共享***的结构示意图;
图8是本公开另一实施例提出的数据共享***的结构示意图;
图9是本公开另一实施例提出的数据共享***的结构示意图;
图10是本公开实施例中的信息处理装置结构示意图;
图11是本公开实施例中的包括运算模块的信息处理装置结构示意图;
图12是本公开实施例中的包括指令缓存和神经网络数据缓存的信息处理装置结构示意图;
图13是本公开实施例中的神经网络数据缓存的结构示意图;
图14是本公开实施例中的包括直接内存存取和控制单元的信息处理装置结构示意图;
图15是本公开实施例中的信息处理装置的具体结构示意图;
图16是本公开实施例中的信息处理装置的信息处理方法流程图;
图17是本公开实施例中的信息处理***结构示意图;
图18是本公开实施例中的信息处理***的信息处理方法流程图;
图19是本公开一实施例任务切分装置的结构框图;
图20是本公开一实施例任务调度装置的结构框图;
图21是本公开再一实施例多核处理器的结构框图;
图22是本公开再一实施例中神经网络处理的每一个神经网络处理核的结构框图;
图23是本公开一实施例处理器的结构框图;
图24是本公开另一实施例处理器的结构框图;
图25是本公开另一实施例处理器的结构框图;
图26是本公开另一实施例处理器的结构框图;
图27是本公开另一实施例处理器的结构框图;
图28是本公开实施例任务切分装置的结构框图;
图29是本公开实施例任务调度装置的结构框图;
图30是本公开实施例多核处理器的结构框图;
图31是本公开实施例中神经网络处理的每一个神经网络处理核的结构框图;
图32是本公开实施例组合处理装置的结构框图;
图33是本公开实施例处理方法流程图;
图34是本公开一实施例计算单元划分后的结构示意图;
图35是本公开另一实施例计算单元划分后的结构示意图;
图36是本公开另一实施例计算单元划分后的结构示意图;
图37是本公开实施例信息处理装置的方框图;
图38是本公开另一实施例信息处理装置的方框图;
图39是本公开再一实施例信息处理装置的方框图;
图40是本公开实施例的信息处理设备的方框图;
图41是本公开实施例的信息处理方法流程图;
图42是本公开实施例的用于执行生成对抗网络的处理装置的基本方块图;
图43是本公开又一实施例的用于执行生成对抗网络的处理装置的基本方块图;
图44是本公开实施例的进行机器创作的方法的流程图。
具体实施方式
为使本公开的目的、技术方案和优点更加清楚明白,以下结合具体实施例,并参照附图,对本公开作进一步的详细说明。
本公开提出了机器学习ASIC运算单元可以直接访问SoC片内存储模块,与其他SoC内的其他模块实现快速的数据交互的方法。该方法能够有效提高数据交互效率,大大降低交互延迟。对于各层次公用的存储模块,可以由有权限的访问单元进行访问,对于私有的存储模块,访问单元间可以直接或者通过某种规则或者某种协议完成数据的交互和访问。
本公开提出了一种数据共享***,包括存储模块和至少两个处理模块,其中:
至少两个处理模块共用存储模块;
至少两个处理模块之间通过预设的规则通信,以实现数据共享。
本公开的数据共享***,支持异构的多处理器情况。处理器外部有外部存储模块,是多个处理器的公用存储模块,这些处理器可以为相同的处理器、可以为不同的处理器,亦或是部分相同的情况。
在本公开的一些实施例中,上述至少两个处理模块可包括相同/互异结构的处理器、相同/互异结构的处理器内核,及相同/互异结构处理器内核中相同/互异结构的运算单元。
在本公开的一些实施例中,上述预设的规则包括通信协议、传送协议、握手协议和/或总线协议。
在本公开的一些实施例中,上述通过预设的规则通信包括:至少两个处理模块包括第一处理模块和第二处理模块,第一处理模块向第二处理模块发送请求信号和相应的数据地址,第二处理模块根据所述请求信号和相应的数据地址,向第一处理模块回复有效信号和数据,以实现数据共享。需要说明的是,此处的至少两个处理模块并不以包括第一处理模块和第二处理模块为限,例如还可包括第三处理模块,则此三个模块中的任意两个均可采用上述预设的规则进行通信。
本公开还提出了一种数据共享方法,包括以下步骤:
至少两个处理模块之间通过预设的规则进行通信,以实现数据共享;
其中,该两个处理模块共用一存储模块。
如图2所示,在本公开的一些实施例中,至少两个处理模块包括两个处理器,例如可以为处理器1、处理器2,两个处理器之间的通信是指处理器内部的内部存储模块之间的通信。外部存储模块允许处理器1和处理器2直接进行访问,分别读取数据至内部存储模块1和内部存储模块2所需要的位置。通过某种一致性协议维护外部存储模块和处理器内部存储模块的数据的一致性问题。现有技术中,如当处理器1改变了自己内部存储模块中的数据时,采用“写穿透”的方式,改变内部存储模块1中的相应位置的数据,同时改变外部存储模块中该数据的相应位置;则外部存储模块同时给内部存储模块2中的相应数据发送一个失效信号。待处理器2使用该数据时,发现失效信号后,从外部存储模块读取新值,并写到内部存储模块2中的相应位置。在本实施例中,对于内部存储模块1中的数据,处理器2可以通过某种预设的规则,如先向处理器1发送请求信号和相应的数据地址,处理器1收到请求信号后,回复有效信号和数据来完成数据交互;因此对于具有多个处理器的结构,可维护同一个存储空间,且可通过某种定义好的规则实现多个处理器相互之间的直接通信,从而降低存储通信开销,降低数据访问延时。
其中,本实施例中涉及的处理器1、处理器2等可以为相同的处理器,也可以为不同的处理器。可以适用于新型的人工神经网络处理器和传统的通用处理器之间的合作。如可假定处理器1为通用处理器CPU,处理器2为人工神经网络处理器。
具体地,如图3所示,人工神经网络处理器可包括用于执行人工神经网络正向运算的结构,执行人工神经网络正向运算的结构包括指令缓存单元1、控制器单元2、直接内存访问单元3、H树模块4、主运算模块5和多个从运算模块6。其中,指令缓存单元1、控制器单元2、直接内存访问单元3、H树模块4、主运算模块5和从运算模块6均可以通过硬件电路(例如专用集成电路ASIC)实现。
指令缓存单元1通过直接内存访问单元3读入指令并缓存读入的指令;控制器单元2从指令缓存单元1中读取指令,将指令译成控制其他模块行为的微指令,其中的其他模块例如可以为直接内存访问单元3、主运算模块5和从运算模块6等;直接内存访问单元3能够访存外部地址空间,直接向处理器内部的各个缓存单元读写数据,完成数据的加载和存储。
如图4所示,H树模块可以包括分支处理模块103;其具体的连接结构如图4所示,其中,
主运算模块101与分支处理模块103连接,分支处理模块103与多个从处理模块102连接;
分支处理模块103,用于执行转发主运算模块101与从处理模块102之间的数据或指令。
在一种可选实施例中,以神经网络运算中的全连接运算为例,过程可以为:y=f(wx+b),其中,x为输入神经元矩阵,w为权值矩阵,b为偏置标量,f为激活函数,具体可以为:sigmoid函数,tanh、relu、softmax函数中的任意一个。这里假设为二叉树结构,具有8个从处理电路,其实现的方法可以为:
控制器单元从存储模块内获取输入神经元矩阵x,权值矩阵w以及全连接运算指令,将输入神经元矩阵x,权值矩阵w以及全连接运算指令传输给主运算模块;
主运算模块将输入神经元矩阵x拆分成8个子矩阵,然后将8个子矩阵通过树型模块分发给8个从处理模块,将权值矩阵w广播给8个从处理模块;
从处理模块并行执行8个子矩阵与权值矩阵w的乘法运算和累加运算得到8个中间结果,将8个中间结果发送给主运算模块;
主运算模块,用于将8个中间结果排序得到wx的运算结果,将该运算结果执行偏置b的运算后执行激活操作得到最终结果y,将最终结果y发送至控制器单元,控制器单元将该最终结果y输出或存储至存储模块。
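The full-connection flow described above (y = f(wx + b) computed by one master module and 8 slave modules) can be sketched as follows; splitting x column-wise into 8 sub-matrices and re-assembling the partial results by concatenation are illustrative assumptions about how the sub-matrices are formed and recombined.

```python
import numpy as np

def fully_connected_8slaves(x, w, b, f=np.tanh):
    sub_matrices = np.array_split(x, 8, axis=1)     # master module splits x into 8 sub-matrices
    partials = [w @ xi for xi in sub_matrices]      # each slave multiplies its sub-matrix by the broadcast w
    wx = np.hstack(partials)                        # master re-assembles the 8 intermediate results into wx
    return f(wx + b)                                # bias, then activation, gives the final result y

x = np.random.randn(64, 40)                         # input neuron matrix
w = np.random.randn(32, 64)                         # weight matrix
b = np.random.randn(32, 1)                          # bias
y = fully_connected_8slaves(x, w, b)
```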
如图5所示,为主运算模块5的结构示例框图,主运算模块5包括运算单元51、数据依赖关系判断单元52和神经元缓存单元53。神经元缓存单元53用于缓存主运算模块5在计算过程中用到的输入数据和输出数据,运算单元51完成主运算模块5的各种运算功能,数据依赖关系判断单元52是运算单元51读写神经元缓存单元53的端口,同时能够保证神经元缓存单元中数据的读写一致性。同时,数据依赖关系判断单元52也用于将读取数据通过H树模块4发送给从计算模块6,而从计算模块6的输出数据通过H树模块4直接发送给运算单元51。控制器单元2输出的指令发送给计算单元51和数据依赖关系判断单元52,来控制其行为。
如图6所示,为从运算模块6的结构示例框图,每个从运算模块6包括运算单元61、数据依赖关系判断单元62、神经元缓存单元63和权值缓存单元64。运算单元61用于接收控制器单元2发出的微指令并进行算数逻辑运算;数据依赖关系判断单元62用于计算过程中对神经元缓存单元63的读写操作。数据依赖关系判断单元62执行读写操作之前会首先保证指令之间所用的数据不存在读写一致性冲突,例如,所有发往数据依赖关系单元62的微指令都会被存入数据依赖关系单元62内部的指令队列里,在该队列中,读指令的读取数据的范围如果与队列位置靠前的写指令写数据的范围发生冲突,则该指令必须等到所依赖的写指令被执行后才能够执行;神经元缓存单元63缓存该从运算模块6的输入神经元向量数据和输出神 经元值数据。权值缓存单元64缓存该从运算模块6在计算过程中需要的权值数据。对于每一个从运算模块6,都只会存储全部输入神经元与部分输出神经元之间的权值。以全连接层为例,输出神经元按照从运算单元的个数N进行分段,每段的第n个输出神经元对应的权值存放在第n个从运算单元中。
从运算模块6实现每层人工神经网络正向运算过程中可以并行的算数逻辑运算。以人工神经网络全连接层(MLP)为例,过程为y=f(wx+b),其中权值矩阵w和输入神经元向量x的乘法可以划分为不相关的并行计算子任务,即由于out与in是列向量,每个从运算模块6只计算in中相应的部分标量元素与权值矩阵w对应的列的乘积,得到的每个输出向量都是最终结果的一个待累加的部分和,这些部分和在H树模块4中逐级两两相加得到最后的结果。所以计算过程变成了并行的计算部分和的过程和后面的累加的过程。每个从运算模块6计算出输出神经元值,所有的输出神经元值在H树模块4中拼成最后的中间结果向量。因此,每个从运算模块6只需要计算出中间结果向量y中与本模块对应的输出神经元的值即可。H树模块4对所有从运算模块6输出的神经元值求和,得到最终的中间结果向量y。主运算模块5基于中间结果向量y进行后续计算,比如加偏置、池化(例如最大值池化(MAXPOOLING)或平均值池化(AVGPOOLING)等)、做激活和做采样等。
在该结构中,包括一个CPU和人工神经网络处理器的公用存储模块,允许两个处理器直接进行访问,分别读取数据至CPU的缓存之中和人工神经网络处理器的缓存单元之中。当CPU将要改变缓存中的数据时,采用“写穿透”的方式,改变缓存中数据的相应位置的同时,改变外部存储模块中该数据的相应位置,同时给人工神经网络处理器中的相应数据发送一个失效信号。待人工神经网络处理器使用该数据时,发现失效信号后,从外部存储模块读取新值,并写到人工神经网络处理器中的缓存单元的相应位置。另外,对于CPU中的数据,人工神经网络处理器可以通过定义好的规则,即先向CPU发送请求信号和相应的数据地址,CPU收到请求信号后,回复有效信号和数据来完成数据交互。从而,对于异构的多处理器结构,本实施例提出的数据共享***通过维护同一个存储空间,可降低存储通信开销,降低数据访问延时。
每个处理器内有多个核,核内有核内部存储模块和核外部存储模块,核外部存储模块的数据可以由几个或者所有的核直接进行访问。在本公开的一些实施例中,如图7所示,提出一种数据共享***,其中至少两个处理模块为两个处理器内核,其之间的数据共享通过其内部的核内部存储模块来实现,存储模块则指核外部存储模块。在本实施例中,一个核1需要访问核2的核内部存储模块时,可通过通信协议进行访问。核外部存储模块允许核1和核2进行访问,那么,核1和核2分别读取所需要的数据至核内部存储模块1和核内部存储模块2的相应的位置。通过某种一致性协议维护核外部存储模块和核内部存储模块的数据的一致性问题。现有技术中,当核1改变了自己核内部存储模块中的数据,采用“写回”的方式,只改变核内部存储模块1中的相应位置的数据,同时核外部存储模块发送无效信号至核内部存储模块2。待核内部存储模块1中该部分数据被换出时,或者待核2使用该数据时,发现失效信号后,从核外部存储模块读取新值,并写到核内部存储模块2中的相应位置。但在本实施例中,对于核内部存储模块1中的数据,核2还可以通过某种定义好的规则,如先向核1发送请求信号和相应的数据地址,核1收到请求信号后,回复有效信号和数据来完成数据交互。其中,核与核的种类可以相同,如均为神经网络核,也可以不同,如神经网络核和CPU核。这样能够在对数据进行一定的保护的同时,允许相同或不同结构核对数据存储的访问,维护了数据的一致性。同时降低了访存开销,减少了访存延时。
每个神经网络核内包含多个神经网络运算单元,因此,如图8所示,在本公开的一些实施例中,提出一种数据共享***,其中的至少两个处理模块是指三个运算单元,该三个运算单元可以直接访问核内部存储模块,也可以以一定方向直接传递相关数据,以此,有利于通过数据在运算单元之间的传递,减少对存储模块的访问次数,从而降低功耗和访问延时。不妨假定在完成神经网络运算时,运算单元1计算输出值1,其结果用out1表示,对应的神经元为n=(n1,n2,……,nk),突触值为w=(w1,w2,……,wk), 那么,out1=n1*w1+n2*w2+……+nk*wk。类似的,运算单元2的输出结果为out2,对应的神经元为m=(m1,m2,……,mk),突触值为w=(w1,w2,……,wk),那么,out2=m1*w1+m2*w2+……+mk*wk。运算单元3的输出结果为out3,对应的神经元为q=(q1,q2,……,qk),突触值为w=(w1,w2,……,wk),那么,out3=q1*w1+q2*w2+……+qk*wk。具体的,首先运算单元1从核内部存储模块中读取出n和w,直接进行运算,得到out1;运算单元2从核内部存储模块中读取出m,并接收从运算单元1中传来的突触值w进行相应的运算,得到out2;运算单元3从核内部存储模块中读取出q,并接收从运算单元1中传来的突触值w进行相应的运算,得到out3。从而,减少了对核内部存储模块的访存次数,降低了延迟和功耗,提升了运算速度,节省了运算能耗。
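The weight-reuse pattern described above (unit 1 reads both the neurons and the synapse values from core-internal storage, while units 2 and 3 read only their own neurons and receive w directly from unit 1) can be sketched as follows; the storage layout is an assumption for illustration.

```python
import numpy as np

def run_three_units(storage):
    n, w = storage["n"], storage["w"]   # unit 1: reads n and w from core-internal storage
    out1 = np.dot(n, w)                 # out1 = n1*w1 + n2*w2 + ... + nk*wk
    m = storage["m"]                    # unit 2: reads only m; w is forwarded by unit 1
    out2 = np.dot(m, w)
    q = storage["q"]                    # unit 3: reads only q; w is forwarded by unit 1
    out3 = np.dot(q, w)
    return out1, out2, out3

store = {"n": np.ones(4), "m": np.arange(4.0), "q": np.full(4, 2.0), "w": np.array([1.0, 2.0, 3.0, 4.0])}
print(run_three_units(store))
```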
在本公开的一些实施例中,上一实施例中的数据共享***中,还可以在核内增设一层或多层存储单元,允许1个存储单元被几个运算单元共用或1个存储单元被1个运算单元私有。如图9所示,此处假定共享***包括两个存储单元,且存储单元1由运算单元1和运算单元2所共用,运算单元1和运算单元2可以直接访问存储单元1,运算单元3不能直接访问;存储单元2为运算单元3所私有,运算单元3可以直接访问,而运算单元1和运算单元2不能直接访问。这样,如果运算单元1想要访问运算单元3中的运算结果,可以直接通过运算单元3获取,无需经过存储单元1访问核内部存储模块,而后让存储单元2更新核内部存储模块后传入存储单元1,再允许运算单元1进行访问这样一个漫长的过程,从而在对数据进行有效保护作用的同时,即其他无权限的运算单元(如运算单元1)不能随意更改存储单元(如存储单元2)的同时,又可大大缩减访存次数,避免了片上存储多份相同数据对片上存储资源的浪费,从而,降低了延迟和功耗,进一步提升运算速度,节省运算能耗。
图10是本公开实施例中的信息处理装置结构示意图,该装置包括存储模块和数据处理模块;存储模块,用于接收并存储输入数据、指令和输出数据;其中,输入数据包含一个或多个关键特征,输入数据为原始输入数据,或对原始输入数据进行预处理后的数据;数据处理模块,用于对输入数据包含的关键特征进行判断,即数据处理模块计算输入数据包含的关键特征的置信度,置信度即判断结果,并根据判断结果对存储模块中的输入数据进行评分。
存储模块中存储有数据和指令,数据包括输入数据,输入神经元,权值,输出神经元,输出数据;输入数据传给人工神经网络中的各个输入神经元,从而参与后续运算;输出神经元的值即判断结果和/或评分,作为输出数据。
图11是本公开实施例中的包括运算模块的信息处理装置结构示意图,其中,数据处理模块包括运算模块,用于根据存储模块中存储的指令对存储模块中存储的数据执行相应的计算,并将运算结果输出至存储模块。
运算模块执行运算包括神经网络计算,运算模块包括但不仅限于:第一部分乘法器;第二部分一个或者多个加法器(更具体的,第二个部分的加法器组成加法树);第三部分为激活函数单元;和/或第四部分向量处理单元。更具体的,向量处理单元可以处理向量运算和/或池化运算。第一部分将输入数据1(in1)和输入数据2(in2)相乘得到相乘之后的输出(out),过程为:out=in1×in2;第二部分将输入数据in1通过加法器相加得到输出数据(out)。更具体的,第二部分为加法树时,将输入数据in1通过加法树逐级相加得到输出数据(out),其中in1是一个长度为N的向量,N大于1,过称为:out=in1[1]+in1[2]+...+in1[N],和/或将输入数据(in1)通过加法数累加之后和输入数据(in2)相加得到输出数据(out),过程为:out=in1[1]+in1[2]+...+in1[N]+in2,或者将输入数据(in1)和输入数据(in2)相加得到输出数据(out),过称为:out=in1+in 2;第三部分将输入数据(in)通过激活函数(active)运算得到激活输出数据(out),过程为:out=active(in),激活函数active可以是sigmoid、tanh、relu、softmax等,除了做激活操作,第三部分可以实现其他的非线性函数,可将将输入数据(in)通过运算(f)得到输出数据(out),过程为:out=f(in)。 向量处理单元将输入数据(in)通过池化运算得到池化操作之后的输出数据(out),过程为out=pool(in),其中pool为池化操作,池化操作包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。以上几个部分的运算可以自由选择一个多个部分进行不同顺序的组合,从而实现各种不同功能的运算。
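The four parts of the operation module listed above (multiplier, adder/adder tree, activation function unit, vector/pooling unit) map onto the following minimal sketch; the pooling kernel size and the choice of sigmoid as the activation are illustrative assumptions.

```python
import numpy as np

def multiply(in1, in2):                  # part 1: out = in1 × in2
    return in1 * in2

def adder_tree(in1, in2=None):           # part 2: out = in1[1] + ... + in1[N] (+ in2)
    out = np.sum(in1)
    return out + in2 if in2 is not None else out

def activate(x):                         # part 3: out = active(in), e.g. sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def pool(x, size=2, mode="max"):         # part 4: out = pool(in) over kernels of `size`
    blocks = x[: len(x) // size * size].reshape(-1, size)
    return blocks.max(axis=1) if mode == "max" else blocks.mean(axis=1)
```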
图12是本公开实施例中的包括指令缓存和神经网络数据缓存的信息处理装置结构示意图;如图12所示,其中,该信息处理装置的数据处理模块还包括指令缓存和神经网络数据缓存;指令缓存,用于缓存指令;神经网络数据缓存,用于缓存存储模块中的权值数据、输入神经元和输出神经元。
图13是本公开实施例中的神经网络数据缓存的结构示意图;如图13所示,神经网络数据缓存包括权值缓存、输入神经元缓存和输出神经元缓存;指令缓存,用于缓存指令;神经网络数据缓存,用于缓存所述存储模块中的权值数据、输入神经元和输出神经元。
图14是本公开实施例中的包括直接内存存取和控制单元的信息处理装置结构示意图;如图14所示,其中,该信息处理装置的数据处理模块还包括直接内存存取,起到沟通存储模块和各个缓存之间桥梁的作用,用于对存储模块中存储的数据和/或指令进行读写,将读写出的指令存储至指令缓存,将读出的权值存储至权值缓存,将读出的输入神经元,即输入数据存储至输入神经元缓存,并将接收自输出神经元缓存的输出神经元,即判断结果和/或评分存储至存储模块;指令缓存,用于存储直接内存存取缓存的指令;权值缓存,用于缓存直接内存存取缓存的权值数据;输入神经元缓存,用于缓存直接内存存取缓存的输入神经元。同样地,如图14所示,该信息处理装置的数据处理模块还包括控制单元,用于从指令缓存中读取指令,将其译码为运算模块能够执行的指令并输出至运算模块;输出神经元缓存,用于缓存运算模块输出的运算结果,即判断结果和/或评分,并输出给直接内存存取。
图15是本公开实施例中的信息处理装置的具体结构示意图;如图15所示,数据处理模块还可以包括评分单元,该单元用于:当信息处理装置中运行的人工神经网络得到判断结果,进而得到评分时,该单元不参与数据处理;当信息处理装置中运行的人工神经网络仅得到判断结果而不得到评分时,该单元用于根据判断结果得到评分。
其中,判断结果即信息处理装置中运行的人工神经网络的最终输出层的输出神经元的值,输出神经元的值即为关键特征出现的置信度,置信度为一定范围内的自然数,例如:置信度在【0,1】之间,表示关键特征出现的概率;置信度二值化{0,1},0表示未出现关键特征,1表示出现关键特征或1表示不出现关键特征,0表示出现关键特征。置信度的表示方式不仅限于以上两种。
其中,评分为在信息处理装置中运行的人工神经网络的最终输出层后面再加一层作为新的最终输出层,该新的最终输出层的输入神经元值为各个关键特征出现的置信度;该层只有一个输出神经元,其值即为评分,该新的最终输出层运算中的权值对应各个关键特征的重要程度;或者该层有N+1个输出神经元,评分的取值范围为[0,N],若将该层输出神经元编号为0,1,2,...,N,则第i个输出神经元的值对应评分值取i的置信度P i,最终评分为置信度最大的评分值,即评分=i 0,
$i_0 = \arg\max_{0 \le i \le N} P_i$
其中,评分还可以为:在信息处理装置中运行的人工神经网络的最终输出层得到各关键特征出现的置信度后,将其作为评分单元的输入,评分单元据此得到评分。评分单元得到评分的方法有很多种,可以是复杂的机器学***均,然后乘100得到百分制的评分。
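The simple scoring rule mentioned above (a weighted average of the key-feature confidences, multiplied by 100 to give a percentage score) can be sketched as follows; the example weights and confidences are illustrative.

```python
import numpy as np

def score(confidences, weights=None):
    confidences = np.asarray(confidences, dtype=float)
    weights = np.ones_like(confidences) if weights is None else np.asarray(weights, dtype=float)
    return float(np.average(confidences, weights=weights) * 100.0)   # weighted mean on a 100-point scale

print(score([0.9, 0.6, 0.8]))            # ≈ 76.7 for three key features with these confidences
```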
其中,如上所述的信息处理装置为人工神经网络芯片。
图16是本公开实施例中的信息处理装置的信息处理方法流程图;具体包括:
S101:存储模块接收并存储输入数据,输入数据包含一个或多个关键特征;
S102:数据处理模块对输入数据包含的关键特征进行判断,并根据判断结果对存储模块中的输入数据进行评分,其中评分即可以由信息处理装置中运行的人工神经网络得到,也可以由数据处理模块中的评分单元得到。
图17是本公开实施例中的信息处理***结构示意图,该信息处理***包括:
信息获取装置,用于获取外部数据,并传递给信息处理装置;
信息处理装置,用于对接收自信息获取装置的外部数据进行运算处理,并将运算处理结果输出给交互界面;
交互界面,用于显示接收自信息处理装置的运算结果,以及将接收自外部的操作或命令传输给控制装置;
控制装置,用于根据接收自交互界面的操作或命令控制信息获取装置、信息处理装置和交互界面的运作。
信息获取装置,用于获取外部数据,并将外部数据直接或经预处理后传递给信息处理装置;外部数据包括文字、图片、音频和/或视频;其中,信息获取装置至少包括字符识别装置、图像识别装置和语音识别装置,字符识别装置用于获取外部数据中的文字信息,文字信息为一种或多种语言文字和/或符号的组合,一种或多种语言文字和/或符号的组合至少为语文、数学、物理等科目的试卷答案;图像识别装置用于获取外部数据中的图片或视频信息,图像识别装置是摄像头;所述图片为二维图和/或二维透视图,二维图和/或二维透视图至少为美术、制图等科目的试卷答案;语音识别装置用于获取外部数据中的音频信息,语音识别装置是麦克风。预处理操作能使输入数据更适于人工神经网络处理,去除输入数据中的噪声和冗余,提高分类、识别精度等。
信息处理装置,用于对接收自信息获取装置的外部数据或经预处理后的外部数据进行运算处理,并将运算结果输出给交互界面;在公开的实施例中信息处理装置采用人工神经网络芯片实现。运算结果即判断结果或评分。
其中判断结果,即信息处理装置中运行的人工神经网络的最终输出层的输出神经元的值,输出神经元的值即为关键特征出现的置信度,置信度为一定范围内的自然数,例如:置信度在【0,1】之间,表示关键特征出现的概率;置信度二值化{0,1},0表示未出现关键特征,1表示出现关键特征或1表示不出现关键特征,0表示出现关键特征。置信度的表示方式不仅限于以上两种。评分为:在信息处理装置中运行的人工神经网络的最终输出层后面再加一层作为新的最终输出层,该新的最终输出层的输入神经元值为各个关键特征出现的置信度;该层只有一个输出神经元,其值即为评分,该新的最终输出层运算中的权值对应各个关键特征的重要程度;或者该层有N+1个输出神经元,评分的取值范围为[0,N],若将该层输出神经元编号为0,1,2,...,N,则第i个输出神经元的值对应评分值取i的置信度P i,最终评分为置信度最大的评分值,即评分=i 0,
$i_0 = \arg\max_{0 \le i \le N} P_i$
所述评分,还可以为:在信息处理装置中运行的人工神经网络的最终输出层得到各关键特征出现的置信度后,将其作为评分单元的输入,评分单元据此得到评分。评分单元得到评分的方法有很多种,可以是复杂的机器学***均,然后乘100得到百分制的评分。
本公开的具体实施例的信息处理装置,采用人工神经网络芯片。人工神经网络芯片能够自适应性训练,芯片积累用户的数据自我学习,会逐渐适应用户的譬如笔迹,习惯书写错误,体态特征,习惯动作,不断提高准确率和提高对用户的动作/姿势调整能力。人工神经网络芯片计算能力强大,支持离线运行神经网络,在没有云端服务器协助计算的情况下用户终端/前端离线即可实现
自动评分监控的工作;当芯片联网,获得云端服务器协助计算的时候,芯片计算能力更加强大。人工神经网络芯片的使用,对手写,文字,图片动作自动评分,代替了人工,而且相对人工评分更精确,快速;对主观题评价更客观,忽略了人的喜好影响和测试者书法水平的影响。
交互界面,用于显示接收自信息处理装置的输出结果,以及将接收自外部的操作或命令传输给控制装置。其中,用户交互界面为手机,电脑,笔记本,平板电脑等的显示屏。
控制装置,用于根据接收自交互界面的操作或命令控制信息获取装置、信息处理装置和交互界面的运作。
图18是本公开实施例中的信息处理***的信息处理方法流程图,如图所示,该信息处理方法,包括:
S201:信息获取装置获取外部数据,并将外部数据直接或经预处理后传递给信息处理装置;
S202:信息处理装置对接收自信息获取装置的外部数据或经预处理后的外部数据进行运算处理,并将运算结果输出给交互界面;
S203:交互界面,用于显示接收自信息处理装置的运算结果。
其中,信息获取装置获取外部数据,并将外部数据直接或经预处理后传递给信息处理装置,外部输入数据包括文字、图片、音频和/或视频,进行预处理,得到与信息处理装置相契合的数据,预处理包括切分、高斯滤波、二值化、正则化或归一化等;预处理能使输入数据更适于人工神经网络处理,去除输入数据中的噪声和冗余,提高分类、识别精度等。
人工神经网络芯片能够自适应性训练,芯片积累用户的数据自我学***的影响。
实施例一
本实施例的信息处理装置,用于对信息获取装置中识别字符装置获取的一组包含一个或多个关键特征的试卷进行评分,试卷中的关键特征包括关键词,通过人工神经网络芯片的运算,人工神经网络芯片的最终输出层的输出神经元输出判断结果,判断结果即试卷的关键特征出现的置信度,例如关键词出现的置信度,置信度在【0,1】之间,表示关键特征出现的概率;其中,置信度越高,该关键词出现的概率越大。置信度二值化{0,1},0表示未出现关键特征,1表示出现关键特征,或1表示不出现关键特征,0表示出现关键特征;置信度的表示方式不仅限于以上两种。在人工神经网络芯片的最终输出层上面再加一层作为新的最终输出层,该新的最终输出层的输入神经元值为各个关键特征出现的置信度。
其中,评分可以为:在信息处理装置中运行的人工神经网络的最终输出层后面再加一层作为新的最终输出层,该新的最终输出层的输入神经元值为各个关键特征出现的置信度。该层只有一个输出神经元,其值即为评分值,该新的最终输出层运算中的权值对应各个关键特征的重要程度;或者该层有N+1个输出神经元,评分的取值范围为[0,N],若将该层输出神经元编号为0,1,2,...,N,则第i个输出神经元的值对应评分值取i的置信度P i,最终评分为置信度最大的评分值,即评分=i 0,
$i_0 = \arg\max_{0 \le i \le N} P_i$
其中,评分还可以为:在信息处理装置中运行的人工神经网络的最终输出层得到各关键特征出现的置信度后,将其作为评分单元的输入,评分单元据此得到评分。评分单元得到评分的方法有很多种,可以是复杂的机器学***均,然后乘100得到百分制的评分。
通过获取试卷中关键词,经人工神经网络的运算给出关键词及出现的概率,进而通过增加新的一层最终输出层或将其作为评分单元的输入给出试卷的评分。该评分显示于手机,电脑,笔记本,平板电脑等的显示屏上。用户能够通过显示屏获得试卷的评分。
请参照图12,在人工神经网络芯片中对关键词的具体处理过程为:
步骤1,信息获取装置中的字符识别装置、图像识别装置、语音识别装置获取的外部数据经预处理或直接传入人工神经网络芯片的存储模块;外部数据经预处理,能使外部数据更适于人工神经网络处理,去除输入数据中的噪声和冗余,提高分类、识别精度等。
步骤2,直接内存存取(DMA)将存储模块中的数据分批传入相应的片上缓存(即指令缓存,输入神经元缓存,权值缓存)中;人工神经网络芯片中,采用专用的片上缓存(即指令缓存、输入神经元缓存、输出神经元缓存和权值缓存)和专用的人工神经网络运算、访存指令能有效提高运算、访存效率。
步骤3,控制单元从指令缓存中读取指令,将其译码后传入运算模块;
步骤4,运算模块根据指令执行相应的运算,在神经网络的各个层中,运算模块执行运算包括但不仅限于:第一部分乘法器;第二部分一个或者多个加法器(更具体的,第二个部分的加法器组成加法树);第三部分为激活函数单元;和/或第四部分向量处理单元。更具体的,向量处理单元可以处理向量运算和/或池化运算。第一部分将输入数据1(in1)和输入数据2(in2)相乘得到相乘之后的输出(out),过程为:out=in1×in2;第二部分将输入数据in1通过加法器相加得到输出数据(out)。更具体的,第二部分为加法树时,将输入数据in1通过加法树逐级相加得到输出数据(out),其中in1是一个长度为N的向量,N大于1,过称为:out=in1[1]+in1[2]+...+in1[N],和/或将输入数据(in1)通过加法树累加之后和输入数据(in2)相加得到输出数据(out),过程为:out=in1[1]+in1[2]+...+in1[N]+in2,或者将输入数据(in1)和输入数据(in2)相加得到输出数据(out),过称为:out=in1+in2;第三部分将输入数据(in)通过激活函数(active)运算得到激活输出数据(out),过程为:out=active(in),激活函数active可以是sigmoid、tanh、relu、softmax等,除了做激活操作,第三部分可以实现其他的非线性函数,可将输入数据(in)通过运算(f)得到输出数据(out),过程为:out=f(in)。向量处理单元将输入数据(in)通过池化运算得到池化操作之后的输出数据(out),过程为out=pool(in),其中pool为池化操作,池化操作包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。以上几个部分的运算可以自由选择一个多个部分进行不同顺序的组合,从而实现各种不同功能的运算。
人工神经网络芯片中,运算模块采用的加法树运算能对多组权值和输入神经元并行处理,能够提高运算效率。
步骤5,重复步骤2到步骤4,直到存储模块中所有的数据运算完毕,即得到功能需求的最终结果。其中所述最终结果由神经网络最后一层的输出神经元得到,从运算模块输出到输出神经元缓存中,然后经DMA返回存储模块。
根据所述功能需求:若要求得到判断结果,则上述神经网络最后一层输出神经元的值即为关键词出现的置信度;该层只有一个输出神经元,其值即为试卷评分,该新的最终输出层运算中的权值对应各个关键特征的重要程度;或者该层有N+1个输出神经元,评分的取值范围为[0,N],若将该层输出神经元编号为0,1,2,...,N,则第i个输出神经元的值对应评分值取i的置信度P i,最终评分为置信度最大的评分值,即评分=i 0,
$i_0 = \arg\max_{0 \le i \le N} P_i$
该最大的评分值即试卷评分。
评分还可以为:在信息处理装置中运行的人工神经网络的最终输出层得到各关键特征出现的置信度后,将其作为评分单元的输入,评分单元据此得到评分。评分单元得到评分的方法有很多种,可以是复杂的机器学***均,然后乘100得到百分制的评分。
实施例二:
本实施例的信息处理装置,用于对视频进行评分,视频即一组包含一个或多个关键特征的图片。人工神经网络芯片中的存储模块预存一个或多个关键图片;存储模块从外部获取视频,并将其传入至运算模块,通过人工神经网络芯片的运算,人工神经网络芯片的最终输出层的输出神经元输出判断结果,判断结果即各个输入图片与每个关键图片的相似度,详细来说,如果输入图片有N个,关键图片有M个,则得到NM个相似度。该实施例的相似度即置信度,置信度为一定范围内的自然数,置信度在【0,1】之间,表示关键特征出现的概率;置信度二值化{0,1},0表示未出现关键特征,1表示出现关键特征或1表示不出现关键特征,0表示出现关键特征;置信度的表示方式不仅限于以上两种。
在人工神经网络芯片的最终输出层上面再加一层作为新的最终输出层,该新的最终输出层的输入神经元值为各个关键特征出现的置信度,置信度即输入图片与每个关键图片的相似度,如果该层只有一个输出神经元,其值即为对视频的评分,该新的最终输出层运算中的权值对应各个相似度的重要程度。或者该层有N+1个输出神经元,评分的取值范围为[0,N],若将该层输出神经元编号为0,1,2,...,N,则第i个输出神经元的值对应评分值取i的置信度P i,最终评分为置信度最大的评分值,即评分=i 0,
$i_0 = \arg\max_{0 \le i \le N} P_i$
置信度最大的评分值即对视频的评分。
其中,评分还可以为:在信息处理装置中运行的人工神经网络的最终输出层得到各关键特征出现的置信度后,将其作为评分单元的输入,评分单元据此得到评分。评分单元得到评分的方法有很多种,可以是复杂的机器学***均,然后乘100得到百分制的评分。
该评分显示于手机,电脑,笔记本,平板电脑等的显示屏上。用户能够通过显示屏获得对视频的评分。
其中,视频还包括音频,音频分为多段音频,多段音频与多个图片对应。芯片可以比较视频中所有图片与各个关键图片的相似度,和/或比较视频中所有音频分解得到的各个波形和关键波形的相似度,对视频进行评分。
其中,获得相似度的另一种方法为:神经网络的最终输出层的每个输出神经元对应一个输入图片,输出神经元的值即为与该输入图片最相似的关键图片与该输入图片的相似度。如果和前面的例子保持一致,则该层共N个输出神经元。
其中,得到相似度的再一种方法为:神经网络的最终输出层的每个输出神经元对应一个关键图片,输出神经元的值即为与该关键图片最相似的输入图片与该关键图片的相似度。如果和前面的例子保持一致,则该层共M个输出神经元。
请参照图12,在人工神经网络芯片中对视频数据的具体处理过程为:步骤1,信息获取装置中的字符识别装置、图像识别装置、语音识别
装置获取的外部数据经预处理或直接传入人工神经网络芯片的存储模块;装置中的预处理模块,能使输入数据更适于人工神经网络处理,去除输入数据中的噪声和冗余,提高分类、识别精度等。
步骤2,直接内存存取(DMA)将存储模块中的数据分批传入相应的片上缓存中,即指令缓存,输入神经元缓存,权值缓存中;人工神经网络芯片中,采用专用的片上缓存(即指令缓存、输入神经元缓存、输出神经元缓存和权值缓存)和专用的人工神经网络运算、访存指令能有效提高运算、访存效率。
步骤3,控制单元从指令缓存中读取指令,将其译码后传入运算模块;
步骤4,运算模块根据指令执行相应的运算:在神经网络的各个层中,
运算模块执行运算包括神经网络计算。
运算模块包括但不仅限于:第一部分乘法器;第二部分一个或者多个加法器(更具体的,第二个部分的加法器组成加法树);第三部分为激活函数单元;和/或第四部分向量处理单元。更具体的,向量处理单 元可以处理向量运算和/或池化运算。第一部分将输入数据1(in1)和输入数据2(in2)相乘得到相乘之后的输出(out),过程为:out=in1×in2;第二部分将输入数据in1通过加法器相加得到输出数据(out)。更具体的,第二部分为加法树时,将输入数据in1通过加法树逐级相加得到输出数据(out),其中in1是一个长度为N的向量,N大于1,过称为:out=in1[1]+in1[2]+...+in1[N],和/或将输入数据(in1)通过加法数累加之后和输入数据(in2)相加得到输出数据(out),过程为:out=in1[1]+in1[2]+...+in1[N]+in2,或者将输入数据(in1)和输入数据(in2)相加得到输出数据(out),过称为:out=in1+in 2;第三部分将输入数据(in)通过激活函数(active)运算得到激活输出数据(out),过程为:out=active(in),激活函数active可以是sigmoid、tanh、relu、softmax等,除了做激活操作,第三部分可以实现其他的非线性函数,可将将输入数据(in)通过运算(f)得到输出数据(out),过程为:out=f(in)。向量处理单元将输入数据(in)通过池化运算得到池化操作之后的输出数据(out),过程为out=pool(in),其中pool为池化操作,池化操作包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。以上几个部分的运算可以自由选择一个多个部分进行不同顺序的组合,从而实现各种不同功能的运算。
人工神经网络芯片中,运算模块采用的加法树运算能对多组权值和输入神经元并行处理,有效提高运算效率。
步骤5,重复步骤2到步骤4,直到存储模块中所有数据运算完毕,即得到功能需求的最终结果。其中所述最终结果由神经网络最后一层的输出神经元得到,从运算模块输出到输出神经元缓存中,然后经DMA返回存储模块。
根据所述功能需求:若要求得到相似度,则上述神经网络最后一层输出神经元的值即为相似度值;若要求进行评分,在最后一层输出层上面再加一层作为新的最终输出层,该新的最终输出层的输入神经元值为相似度值;该新的最终输出层包括一个输出神经元,其值即为视频评分;该新的最终输出层运算中的权值对应各个相似度值的重要程度。或者该层有N+1个输出神经元,评分的取值范围为[0,N],若将该层输出神经元编号为0,1,2,...,N,则第i个输出神经元的值对应评分值取i的置信度P i,最终评分为置信度最大的评分值,即评分=i 0,
$i_0 = \arg\max_{0 \le i \le N} P_i$
该最大的评分值即视频评分。
评分还可以为:在信息处理装置中运行的人工神经网络的最终输出层得到各关键特征出现的置信度后,将其作为评分单元的输入,评分单元据此得到评分。评分单元得到评分的方法有很多种,可以是复杂的机器学***均,然后乘100得到百分制的评分。
人工神经网络芯片计算能力强大,支持离线运行神经网络,在没有云端服务器协助计算的情况下用户终端/前端离线即可实现自动评分监控的工作;当芯片联网,获得云端服务器协助计算的时候,芯片计算能力更加强大。人工神经网络芯片,对视频中的图片动作自动评分,代替了人工,而且相对人工评分更精确,快速;对主观题评价更客观,忽略了人的喜好影响。本实施例的装置和方法,即时监控用户的动作/姿势,自动即时发出提醒调整用户的动作/姿势,代替了人工的教练和监护工作,并且相对人工更准确,即时。
人工神经网络芯片的自适应性训练使得芯片积累用户的数据,自我学习,会逐渐适应用户的譬如笔迹,习惯书写错误,体态特征,习惯动作,不断提高准确率和提高对用户的动作/姿势调整能力。
本公开的实施例中的所有的模块都可以是硬件结构,硬件结构的物理实现包括但不局限于物理器件,物理器件包括但不局限于晶体管,忆阻器,DNA计算机。
本公开一实施例提供了任务切分装置,图19为本公开一实施例任务切分装置的结构框图,如图19所示,任务切分装置100包括粒度任务切分单元10和任务切分粒度选择单元20。粒度任务切分单元10采用至少一种粒度对任务进行切分形成子任务,为神经网络应用提供多粒度的任务切分选择,任务切分粒度选 择单元20选择任务划分采用的粒度,指导神经网络选择最合适的任务切分粒度,使得切分后的子任务能够满足***实时性。
在一实施例中,如图19所示,粒度任务切分单元10包括第一粒度任务切分单元11、第二粒度任务切分单元12、第三粒度任务切分单元13、第四粒度任务切分单元以及第五粒度任务切分单元15。
以下具体介绍该五个粒度任务切分单元假设神经网络应用需要完成M个样本计算,神经网络拓扑结构结构由N个层组成。其中M,N是大于0的正整数。
第一粒度任务切分单元11将任务整体作为一子任务,具体的,将完成M个样本计算作为一个子任务。这种任务切分方式只生成一个子任务,子任务之间不存在依赖关系。
第二粒度任务切分单元12将完成若干个样本计算作为一个子任务。神经网络被切分成为m个子任务,第i个任务完成Mi个样本的计算,其中m是大于1小于等于M的正整数,i=1,2,3,……m,Mi是大于0小于M的正整数,且满足M1+M2+…+Mm=M。这种任务切分方式的m个子任务之间不存在依赖关系。
第三粒度任务切分单元13可以按照神经网络的层类型对神经网络应用进行任务切分,相同类型层的计算作为一个任务。神经网络的层类型包括但不仅限于卷积层,全连接层,LSTM层,池化层,激活层,LRN层,BN层。这种任务切分方式的子任务之间存在复杂的依赖关系。
第四粒度任务切分单元14可以按照神经网络的层间结构对神经网络应用进行任务切分,相邻若干个层的计算作为一个子任务。神经网络应用被切分为n个子任务,第一个子任务完成神经网络第一层到第N1层,共计N1层计算,第二个子任务完成第N1+1层到第N1+N2层,共计N2层神经网络计算,第i个子任务完成第N1+…+Ni-1+1层到第N1+…+Ni层,共计Ni层计算。其中n是大于0小于等于N的正整数,i=1,2,3,……n,Ni是大于0小于等于N的正整数且满足N1+N2+…+Ni+…+Nn=N。这种任务切分方式的子任务之间存在链式的依赖关系,其中第i个子任务是第i+1个子任务的前驱任务,第i+1个任务是第i个任务的后继任务,第i+1个任务必须等待第i个任务完成才能开始执行。
第五粒度任务切分单元15可以按照神经网络的层内结构对神经网络应用进行任务切分,神经网络层内的计算可以进一步被切分为子任务。按神经网络层内的计算的切分包括但不限于对神经网络的一卷积层计算、全连接层计算、池化层计算或激活层计算进行任务切分。
对神经网络的一个卷积层计算进行任务切分,卷积层输入神经元是三维矩阵(Nfin,Nxin,Nyin),权值是四维矩阵(Nfout,Nfout,Kx,Ky),输出神经元是三维矩阵(Nfout,Nxout,Nyout),其中Nfin是输入特征图像数量,(Nxin,Nyin)是输入特征图像大小,Nfout是输出特征图像数量,(Kx,Ky)是卷积核大小,(Nxout,Nyout)是输出特征图像大小。完成一个输出神经元需要Nfin×Kx×Ky次乘加运算,输出神经元数量为Nfout×Nxout×Nyout,完成整个卷积层总共需要Nfout×Nxout×Nyout×Nfin×Kx×Ky次乘加运算。在进行任务切分时,将输出神经元按照(Bfout,Bxout,Byout)的块大小进行切分,同时对权值按照(Bfout,Bfin,Bx,By)的块大小进行切分,则每一个子任务用(Bfout,Bfin,Bx,By)权值计算Bfout×Bxout×Byout个输出神经元的中间结果,每个输出神经元中间结果进行Bfin×Bx×By次乘加运算,共需要完成Bfout×Bxout×Byout×Bfin×Bx×By次乘加运算。其中Bfout是大于0小于等于Nfout的正整数,Bxout是大于0小于等于Nxout的正整数,Byout是大于0小于等于Nyout的正整数,Bfin是大于0小于等于Nfin的正整数,Bx是大于0小于等于Kx的正整数,By是大于0小于等于Ky的正整数。这种任务切分方式的子任务之间不存在依赖关系。
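The block-wise splitting of a convolution layer described above can be enumerated as in the sketch below: the output neuron space (Nfout, Nxout, Nyout) is tiled with blocks of (Bfout, Bxout, Byout) and the weights with blocks of (Bfout, Bfin, Bx, By); the concrete sizes in the example are illustrative assumptions.

```python
import itertools

def conv_subtasks(Nfout, Nxout, Nyout, Nfin, Kx, Ky,
                  Bfout, Bxout, Byout, Bfin, Bx, By):
    def tiles(total, block):
        return [(s, min(s + block, total)) for s in range(0, total, block)]
    tasks = []
    for fo, xo, yo, fi, kx, ky in itertools.product(
            tiles(Nfout, Bfout), tiles(Nxout, Bxout), tiles(Nyout, Byout),
            tiles(Nfin, Bfin), tiles(Kx, Bx), tiles(Ky, By)):
        # each sub-task computes the partial sums of one output block with one weight block
        tasks.append({"out_block": (fo, xo, yo), "weight_block": (fo, fi, kx, ky)})
    return tasks

# e.g. a 64x16x16 output, 32 input maps, 3x3 kernels, split into 16x8x8 output / 16x16x3x3 weight blocks
print(len(conv_subtasks(64, 16, 16, 32, 3, 3, 16, 8, 8, 16, 3, 3)))   # number of independent sub-tasks
```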
对神经网络的一个全连接层计算进行任务切分,全连接层输入神经元是Nin,权值是二维矩阵(Nout,Nin),输出神经元Nout,其中Nin是输入神经元数量,Nout是输出神经元数量。完成一个输出神经元需要Nin次乘加运算,输出神经元数量为Nout,完成整个全连接层总共需要Nout×Nin次乘加运算。在进行任务切分时,将输出神经元按照Bout的块大小进行切分,同时对权值按照(Bout,Bin)的块大小进行切分,则每一个子任务用(Bout,Bin)的权值矩阵计算Bout个输出神经元的中间结果,每一个输出神经元的中间需要完成Bin 次乘加运算,共需要完成Bout×Bin次乘加运算。其中Bout是大于0小于等于Nout的正整数,Bin是大于0小于等于Nin的正整数。这种任务切分方法的子任务之间不存在依赖关系。
对神经网络的一个池化层计算进行任务切分,池化层输入神经元是Nin,输出神经元Nout,其中Nin,Nout是大于0的正整数,池化操作包括但不仅限于平均值池化,最大值池化,中值池化。在进行任务切分时,将输出神经元按照Bout的块大小进行切分,则每一个子任务完成Bout个输出神经元的计算。其中Bout是大于0小于等于Nout的正整数,Bin是大于0小于等于Nin的正整数。这种任务切分方式的子任务之间不存在依赖关系。
对神经网络的一个激活层计算进行任务切分,激励输入神经元是Nin,输出神经元Nout,其中Nin,Nout是大于0的正整数,激活函数包括但不仅限于sigmoid、tanh、relu、softmax。在进行任务切分时,将输出神经元按照Bout的块大小进行切分,则每一个子任务完成Bout个输出神经元的计算。其中Bout是大于0小于等于Nout的正整数。这种任务切分方式的子任务之间不存在依赖关系。
任务切分粒度选择单元20选择任务划分采用的粒度,并不限于仅选择上述的一种粒度,还可以是多种粒度的组合,例如一个神经网络应用可以组合第四粒度任务单元和第五粒度任务切分单元的切分方式。将神经网络应用首先按照第四粒度任务切分单元14的切分方法分为n个子任务,再将其中的p个子任务按照第五粒度任务切分单元1的切分方式进行切分。
在其他实施例中,粒度任务切分单元10可以包括第一至第五粒度任务切分单元中的至少一个,不一定包括全部第一至第五粒度任务切分单元。
在其他实施例中,粒度任务切分单元10还可以包括混合粒度任务切分单元,用于组合第一至第五粒度任务切分单元的切分方式,供任务切分粒度选择单元20选择。
本公开另一实施例提供一任务调度装置,图20为本公开一实施例任务调度装置的结构框图,如图20所示,任务调度装置300包括任务队列单元30、监测单元40以及任务调度单元50。神经网络任务调度装置300能够综合考虑任务之间的依赖关系,任务的局部性,任务切分粒度,核的运行频率及负载进行任务调度,提高服务质量,提高核的利用率,保证核之间的任务均衡,减少能耗。
任务队列单元30缓存所有未调度的神经网络任务,并且可选择性地存储每一个待调度任务的执行时间,任务依赖关系图,任务资源在核内处理分布情况,神经网络任务例如是上一实施例中切分的子任务。
监测单元40实时检测多核神经网络处理器的整体服务质量以及各核的工作状态,例如为每一个核的利用率,工作负载,工作频率,核内私有任务队列中的任务数量,任务完成时间。
任务调度单元50从未调度任务中选择待调度任务,根据待调度任务信息及所述各核工作状态,确定待调度任务和目标核之间的映射关系,将待调度任务分配到目标核中。
任务调度单元50可以每隔时间T对任务队列中未调度任务进行调度,T是大于0的实数。若未调度任务t的与其他任务存在依赖关系且前驱任务没有完成,则任务调度单元50不会调度任务t。
任务调度单元50选择从未调度任务中选择待调度任务方式可以采用如下至少一种方式:随机选择任务,选择预计执行时间最长的任务,选择预计执行时间最短的任务,选择占用资源最多的任务,选择占用资源最少的任务。
任务调度单元50可以采用以下调度方式中的至少一种将待调度任务分配调度至目标核。
第一种调度方式:统计每一个核私有任务队列中任务数量,选择私有任务队列中任务最少的核作为目标核,将待调度任务分给该目标核;
第二种调度方式:统计每一个核完成私有任务队列中所有任务的时间,选择完成任务时间最短的核作为目标核,将待调度任务分给该目标核;
第三种调度方式:统计待调度任务所需资源在所有核的分布情况,选择拥有资源数量最多的核作为目标核,将待调度任务分给该目标核;
第四种调度方式:采用启发式算法将待调度任务分配到目标核,启发式算法包括但不仅限于是遗传算法,蚁群算法,模拟退火算法。
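The first two target-core selection rules described above can be sketched directly; the core record fields (queue, speed) are assumptions made only for the example.

```python
def pick_core_fewest_tasks(cores):
    # rule 1: the core whose private task queue currently holds the fewest tasks
    return min(cores, key=lambda c: len(c["queue"]))

def pick_core_earliest_finish(cores):
    # rule 2: the core that would finish all tasks in its private queue soonest
    return min(cores, key=lambda c: sum(t["time"] for t in c["queue"]) / c["speed"])

cores = [{"id": 0, "queue": [{"time": 4}], "speed": 1.0},
         {"id": 1, "queue": [{"time": 1}, {"time": 1}], "speed": 2.0}]
print(pick_core_fewest_tasks(cores)["id"], pick_core_earliest_finish(cores)["id"])   # 0 1
```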
本公开再一实施例提供一种多核处理器,例如为多核神经网络处理器,图21为本公开再一实施例多核处理器的结构框图,如图21所示,多核神经网络处理器1000包括:J个处理核,J是大于1的正整数,前述实施例中的任务切分装置100以及任务调度装置300。
任务切分装置100切分输入的神经网络应用,使得切分后的子任务能够满足***实时性,任务调度装置300进行神经网络子任务调度,能够提高服务质量,提高处理核的利用率,保证处理核之间的任务均衡,减少能耗。神经网络处理核进行神经网络运算,完成神经网络子任务,J个神经网络处理核之间的拓扑结构包括但不仅限于是一维线性,二维mesh,二维星形,三维立方等。
图22为本公开再一实施例中神经网络处理的每一个神经网络处理核的结构框图,如图22所示,神经网络处理核500包括存储单元501,控制单元502、选数单元503和运算单元504。
存储单元501,用于存储神经网络的神经元、权值以及指令;当神经网络子任务处理稀疏神经网络时,存放的权值为非零权值以及非零权值的位置信息。
指令控制单元502,用于接收神经网络专用指令,经过译码后生成控制信息控制选数单元和运算单元;
所述神经网络专用指令,包括所有专用于完成人工神经网络运算的指令。神经网络专用指令包括但不仅限于控制指令,数据传输指令,运算指令和逻辑指令。其中控制指令控制神经网络执行过程。数据传输指令完成不同存储介质之间的数据传输,数据格式包括但不仅限于矩阵,向量和标量。运算指令完成神经网络的算术运算,包括但不仅限于矩阵运算指令,向量运算指令,标量运算指令,卷积神经网络运算指令,全连接神经网络运算指令,池化神经网络运算指令,RBM神经网络运算指令,LRN神经网络运算指令,LCN神经网络运算指令,LSTM神经网络运算指令,RNN神经网络运算指令,RELU神经网络运算指令,PRELU神经网络运算指令,SIGMOID神经网络运算指令,TANH神经网络运算指令,MAXOUT神经网络运算指令。逻辑指令完成神经网络的逻辑运算,包括但不仅限于向量逻辑运算指令和标量逻辑运算指令。
其中,RBM神经网络运算指令用于实现Restricted Boltzmann Machine(RBM)神经网络运算。
其中,LRN神经网络运算指令用于实现Local Response Normalization(LRN)神经网络运算。
其中,LSTM神经网络运算指令用于实现Long Short-Term Memory(LSTM)神经网络运算。
其中,RNN神经网络运算指令用于实现Recurrent Neural Networks(RNN)神经网络运算。
其中,RELU神经网络运算指令用于实现Rectified linear unit(RELU)神经网络运算。
其中,PRELU神经网络运算指令用于实现Parametric Rectified Linear Unit(PRELU)神经网络运算。
其中,SIGMOID神经网络运算指令用于实现S型生长曲线(SIGMOID)神经网络运算
其中,TANH神经网络运算指令用于实现双曲正切函数(TANH)神经网络运算。
其中,MAXOUT神经网络运算指令用于实现(MAXOUT)神经网络运算。
更具体的,它包括Cambricon指令集。
所述Cambricon指令集的特征在于,指令集中每一条指令长度为64bit,指令由操作码和操作数组成。指令集包含四种类型的指令,分别是控制指令(control instructions),数据传输指令(data transfer instructions),运算指令(computational instructions),逻辑指令(logical instructions)。
进一步的,控制指令用于控制执行过程。控制指令包括跳转(jump)指令和条件分支(conditional branch)指令。
进一步的,数据传输指令用于完成不同存储介质之间的数据传输。数据传输指令包括加载(load)指令,存储(store)指令,搬运(move)指令。load指令用于将数据从主存加载到缓存,store指令用于将数据从缓存存储到主存,move指令用于在缓存与缓存或者缓存与寄存器或者寄存器与寄存器之间搬运数据。数据传输指令支持三种不同的数据组织方式,包括矩阵,向量和标量。
进一步的,运算指令用于完成神经网络算术运算。运算指令包括矩阵运算指令,向量运算指令和标量运算指令。
更进一步的,矩阵运算指令完成神经网络中的矩阵运算,包括矩阵乘向量(matrix multiply vector),向量乘矩阵(vector multiply matrix),矩阵乘标量(matrix multiply scalar),外积(outer product),矩阵加矩阵(matrix add matrix),矩阵减矩阵(matrix subtract matrix)。
更进一步的,向量运算指令完成神经网络中的向量运算,包括向量基本运算(vector elementary arithmetics),向量超越函数运算(vector transcendental functions),内积(dot product),向量随机生成(random vector generator),向量中最大/最小值(maximum/minimum of a vector)。其中向量基本运算包括向量加,减,乘,除(add,subtract,multiply,divide),向量超越函数是指那些不满足任何以多项式作系数的多项式方程的函数,包括但不仅限于指数函数,对数函数,三角函数,反三角函数。
更进一步的,标量运算指令完成神经网络中的标量运算,包括标量基本运算(scalar elementary arithmetics)和标量超越函数运算(scalar transcendental functions)。其中标量基本运算包括标量加,减,乘,除(add,subtract,multiply,divide),标量超越函数是指那些不满足任何以多项式作系数的多项式方程的函数,包括但不仅限于指数函数,对数函数,三角函数,反三角函数。
进一步的,逻辑指令用于神经网络的逻辑运算。逻辑运算包括向量逻辑运算指令和标量逻辑运算指令。
更进一步的,向量逻辑运算指令包括向量比较(vector compare),向量逻辑运算(vector logical operations)和向量大于合并(vector greater than merge)。其中向量比较包括但大于,小于,等于,大于等于,小于等于和不等于。向量逻辑运算包括与,或,非。
更进一步的,标量逻辑运算包括标量比较(scalar compare),标量逻辑运算(scalar logical operations)。其中标量比较包括但大于,小于,等于,大于等于,小于等于和不等于。标量逻辑运算包括与,或,非。
选数单元503,用于接收输入神经元和非零权值位置信息,选出非零权值对应的神经元。也就是说:对于每个输出神经元数据,选数单元去除掉与该输出神经元数据没有对应的非零权值数据的输入神经元数据。
运算单元504,用于接收输入非零权值对应的神经元和对应的非零权值,完成神经网络训练运算并将输出神经元重新传输给存储部分。
具体地,运算单元504根据存储单元中存储的指令对所述数据执行相应运算。运算单元504包括但不仅限于三个部分,第一部分为乘法器,第二部分为一个或多个加法器,第三部分为激活函数单元。优选的,第二部分的一个或多个加法器组成加法树。第一部分将输入数据1(in1)和输入数据2(in2)相乘得到相乘之后的输出(out),过程为:out=in1×in2;第二部分将输入数据in1通过加法树逐级相加得到输出数据(out),其中in1是一个长度为N的向量,N大于1,过称为:out=in1[1]+in1[2]+...+in1[N],和/或将输入数据(in1)通过加法数累加之后和输入数据(in2)相加得到输出数据(out),过程为:out=in1[1]+in1[2]+...+in1[N]+in2,或者将输入数据(in1)和输入数据(in2)相加得到输出数据(out),过称为:out=in1+in2;第三部分将输入数据(in)通过激活函数(active)运算得到激活输出数据(out),过程为:out=active(in),激活函数active可以是sigmoid、tanh、relu、softmax等,除了做激活操作,第三部分可以实现其他的非线性函数,可将将输入数据(in)通过运算(f)得到输出数据(out),过程为:out=f(in)。
运算单元还可以包括池化单元,池化单元将输入数据(in)通过池化运算得到池化操作之后的输出数据(out),过程为out=pool(in),其中pool为池化操作,池化操作包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。
所述运算单元执行运算包括但不仅限于,第一部分是将所述输入数据1和输入数据2相乘,得到相乘之后的数据;第二部分执行加法树运算,用于将输入数据1通过加法树逐级相加,或者将所述输入数据1通过和输入数据2相加得到输出数据;第三部分执行激活函数运算,对输入数据通过激活函数(active)运算得到输出数据。以上几个部分的运算可以自由组合,从而实现各种不同功能的运算。
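为帮助理解上述运算单元三个部分(乘法器、加法树、激活函数单元)的组合方式,下面给出一个仅作示意的Python草图;其中的函数名、激活函数种类以及示例数据均为假设,并非本公开运算单元的实际硬件实现。

```python
import math

def multiply(in1, in2):
    # 第一部分:逐元素相乘,out = in1 × in2
    return [a * b for a, b in zip(in1, in2)]

def adder_tree(in1, in2=None):
    # 第二部分:通过加法树将向量 in1 逐级相加得到加权和,
    # 可选地再与 in2 相加,out = in1[1]+...+in1[N] (+ in2)
    total = sum(in1)
    return total + in2 if in2 is not None else total

def activate(x, func="sigmoid"):
    # 第三部分:激活函数运算,out = active(in)
    if func == "sigmoid":
        return 1.0 / (1.0 + math.exp(-x))
    if func == "relu":
        return max(0.0, x)
    if func == "tanh":
        return math.tanh(x)
    raise ValueError("未支持的激活函数")

# 三部分自由组合的一个例子:先相乘、再经加法树累加并加偏置、最后激活
neuron_in = [0.5, -1.0, 2.0]
weights = [0.1, 0.2, 0.3]
out = activate(adder_tree(multiply(neuron_in, weights), in2=0.05), "relu")
print(out)
```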
神经网络处理核500还可包括预处理模块505,如图4所示,该模块对原始数据进行预处理,包括切分、高斯滤波、二值化、正则化、归一化等等。
神经网络处理核500还可包括指令缓存506,非零权值缓存507,非零权值位置缓存508,输入神经元缓存509,输出神经元缓存510。指令缓存506,用于存储专用指令;非零权值缓存507,用于缓存非零权值数据;非零权值位置缓存508,用于缓存非零权值位置数据并根据非零权值位置数据将输入数据中每个权值一一对应到相应的输入神经元;输入神经元缓存509,用于缓存输入神经元;输出神经元缓存510,用于缓存运算单元输出的输出神经元。
非零权值位置数据表示每个输入神经元数据和每个输出神经元数据是否有对应的权值非零的权值数据。
一种情形下非零权值位置缓存一一对应的方法为采用1表示有连接,0表示无连接,每组输出神经元与所有输入神经元的连接状态组成一个0和1的字符串来表示该输出神经元的连接关系。另一种情形下非零权值位置缓存一一对应的方法为采用1表示有连接,0表示无连接,每组输入神经元与所有输出神经元的连接状态组成一个0和1的字符串来表示该输入神经元的连接关系。另一种情形下非零权值位置缓存一一对应的方法为将一组输出神经元第一个连接所在的输入神经元位置距离第一个输入神经元的距离、所述输出神经元第二组输入神经元距离上一个输入神经元的距离,所述输出神经元第三组输入神经元距离上一个输入神经元的距离,……,依次类推,直到穷举所述输出神经元的所有输入神经元,来表示所述输出神经元的连接关系。
上述的有连接关系为每个输入神经元数据和每个输出神经元数据有对应的非零的权值数据,无连接意思为输入神经元数据和输出神经元数据之间没有对应的非零的权值数据。
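下面用一个示意性的Python草图说明"0和1的字符串"与"距离"两种非零权值位置表示方式,以及选数单元据此筛选非零权值对应输入神经元的过程;示例中的函数名与数据均为假设,仅用于帮助理解,并非实际硬件实现。

```python
def to_bitmask(weights):
    # 用1表示有连接(权值非零),0表示无连接
    return [1 if w != 0 else 0 for w in weights]

def to_distances(weights):
    # 用距离表示:第一个非零位置距第一个输入神经元的距离,
    # 其后每个非零位置距上一个非零位置的距离
    dists, last = [], -1
    for i, w in enumerate(weights):
        if w != 0:
            dists.append(i if last < 0 else i - last)
            last = i
    return dists

def select_neurons(inputs, weights):
    # 选数单元:仅保留与非零权值对应的输入神经元及对应权值
    pairs = [(x, w) for x, w in zip(inputs, weights) if w != 0]
    return [x for x, _ in pairs], [w for _, w in pairs]

row_weights = [0, 0.8, 0, 0, 0.3]           # 某输出神经元对应的一行权值
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(to_bitmask(row_weights))              # [0, 1, 0, 0, 1]
print(to_distances(row_weights))            # [1, 3]
print(select_neurons(inputs, row_weights))  # ([2.0, 5.0], [0.8, 0.3])
```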
神经网络处理核500还可包括直接数据存取单元DMA 512(direct memory access)。
DMA用于在所述存储单元、指令缓存、非零权值缓存、非零权值位置缓存,输入神经元缓存和输出神经元缓存中进行数据或者指令读写。
在一些实施例里,公开了一种芯片,其包括了上述神经网络处理器。
在一些实施例里,公开了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,公开了一种板卡,其包括了上述芯片封装结构。
在一些实施例里,公开了一种电子装置,其包括了上述板卡。
电子装置包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
本公开又一实施例提供一种任务切分方法,用于神经网络,选择以下五种粒度任务切分方式中的至少一个来进行任务切分。
第一粒度任务切分方式将任务整体作为一子任务,具体的,将完成M个样本计算作为一个子任务。这种任务切分方式只生成一个子任务,子任务之间不存在依赖关系。
第二粒度任务切分方式将完成若干个样本计算作为一个子任务。神经网络被切分成为m个子任务,第i个任务完成Mi个样本的计算,其中m是大于1小于等于M的正整数,i=1,2,3,……m,Mi是大于0小于M的正整数,且满足M1+M2+…+Mm=M。这种任务切分方式的m个子任务之间不存在依赖关系。
第三粒度任务切分方式可以按照神经网络的层类型对神经网络应用进行任务切分,相同类型层的计算作为一个任务。神经网络的层类型包括但不仅限于卷积层,全连接层,LSTM层,池化层,激活层,LRN层,BN层。这种任务切分方式的子任务之间存在复杂的依赖关系。
第四粒度任务切分方式可以按照神经网络的层间结构对神经网络应用进行任务切分,相邻若干个层的计算作为一个子任务。神经网络应用被切分为n个子任务,第一个子任务完成神经网络第一层到第N1层,共计N1层计算,第二个子任务完成第N1+1层到第N1+N2层,共计N2层神经网络计算,第i个子任务完成第N1+…+Ni-1+1层到第N1+…+Ni层,共计Ni层计算。其中n是大于0小于等于N的正整数,,i=1,2,3,……n,Ni是大于0小于等于N的正整数且满足N1+N2+…+Ni+…+Nn=N。这种任务切分方式的子任务之间存在链式的依赖关系,其中第i个子任务是第i+1个子任务的前驱任务,第i+1个任务是第i个任务的后继任务,第i+1个任务必须等待第i个任务完成才能开始执行。
第五粒度任务切分方式按照神经网络的层内结构对神经网络应用进行任务切分,神经网络层内的计算可以进一步被切分为子任务。按神经网络层内的计算的切分包括但不限于对神经网络的一卷积层计算、全连接层计算、池化层计算或激活层计算进行任务切分。
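作为对上述切分粒度的一个示意,下面给出按第二种粒度(按样本切分)与第四种粒度(按相邻层切分)生成子任务描述的Python草图;其中的函数名与数据结构均为假设,仅用于说明切分结果的形式及依赖关系,并非本公开的实际实现。

```python
def split_by_samples(M, sizes):
    # 第二粒度:把M个样本划分为若干子任务,sizes之和须等于M
    assert sum(sizes) == M and all(s > 0 for s in sizes)
    tasks, start = [], 0
    for s in sizes:
        tasks.append({"samples": list(range(start, start + s))})
        start += s
    return tasks  # 子任务之间不存在依赖关系

def split_by_layers(N, sizes):
    # 第四粒度:把N层按相邻层划分为子任务,子任务之间形成链式依赖
    assert sum(sizes) == N and all(s > 0 for s in sizes)
    tasks, start = [], 0
    for i, s in enumerate(sizes):
        tasks.append({"layers": list(range(start, start + s)),
                      "depends_on": i - 1 if i > 0 else None})
        start += s
    return tasks

print(split_by_samples(8, [3, 3, 2]))
print(split_by_layers(6, [2, 2, 2]))
```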
本公开进一步一实施例提供一种任务调度方法,能够综合考虑任务之间的依赖关系,任务的局部性,任务切分粒度,核的运行频率及负载进行任务调度,提高服务质量,提高核的利用率,保证核之间的任务均衡,减少能耗。该任务调度方法包括以下步骤:
缓存所有未调度的神经网络任务;
具体地,可选择性地存储每一个待调度任务的执行时间,任务依赖关系图,任务资源在核内处理分布情况,神经网络任务例如是上一实施例中切分的子任务;
实时检测多核神经网络处理器的整体服务质量以及各核的工作状态;
具体地,各核的工作状态,例如为每一个核的利用率,工作负载,工作频率,核内私有任务队列中的任务数量,任务完成时间。
从未调度任务中选择待调度任务,根据待调度任务信息及所述各核工作状态,确定待调度任务和目标核之间的映射关系,将待调度任务分配到目标核中。
任务调度可以每隔时间T对任务队列中未调度任务进行调度,T是大于0的实数。若未调度任务t与其他任务存在依赖关系且前驱任务没有完成,则不调度任务t。
从未调度任务中选择待调度任务可以采用如下至少一种方式:随机选择任务,选择预计执行时间最长的任务,选择预计执行时间最短的任务,选择占用资源最多的任务,选择占用资源最少的任务。
将待调度任务分配调度至目标核可以采用以下调度方式中的至少一种:第一种调度方式:统计每一个核私有任务队列中任务数量,选择私有任务队列中任务最少的核作为目标核,将待调度任务分给该目标核;
第二种调度方式:统计每一个核完成私有任务队列中所有任务的时间,选择完成任务时间最短的核作为目标核,将待调度任务分给该目标核;
第三种调度方式:统计待调度任务所需资源在所有核的分布情况,选择拥有资源数量最多的核作为目标核,将待调度任务分给该目标核;
第四种调度方式:采用启发式算法将待调度任务分配到目标核,启发式算法包括但不仅限于是遗传算法,蚁群算法,模拟退火算法。
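以第一种调度方式(选择私有任务队列中任务最少的核)为例,并结合"前驱任务未完成则暂不调度"的约束,下面给出一个仅作说明的Python草图;其中核与任务的数据结构、调度周期等均为假设,并非本公开任务调度装置的实际实现。

```python
def schedulable(task, finished):
    # 若任务存在未完成的前驱任务,则本轮暂不调度
    return all(dep in finished for dep in task.get("deps", []))

def pick_target_core(core_queues):
    # 第一种调度方式:统计各核私有任务队列长度,取任务最少的核为目标核
    return min(core_queues, key=lambda cid: len(core_queues[cid]))

def schedule_once(pending, core_queues, finished):
    # 每隔时间T调用一次:确定待调度任务与目标核的映射关系
    for task in list(pending):
        if not schedulable(task, finished):
            continue
        target = pick_target_core(core_queues)
        core_queues[target].append(task)
        pending.remove(task)

pending = [{"id": "t1"}, {"id": "t2", "deps": ["t1"]}]
core_queues = {0: [], 1: [{"id": "t0"}]}
schedule_once(pending, core_queues, finished=set())
print(core_queues)  # t1被分配到队列最短的核0;t2因前驱未完成而暂不调度
```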
前面的附图中所描绘的进程或方法可通过包括硬件(例如,电路、专用逻辑等)、固件、软件(例如,被承载在非瞬态计算机可读介质上的软件),或两者的组合的处理逻辑来执行。虽然上文按照某些顺序操作描述了进程或方法,但是,应该理解,所描述的某些操作能以不同顺序来执行。此外,可并行地而非顺序地执行一些操作。
本公开提供了一种处理器,如图23所示,所述处理器,包括:
任务切分装置,用于根据任务切分粒度进行任务切分;以及
硬件资源划分装置,用于根据任务切分结果对所述处理器的硬件资源进行划分。
在一实施例中,如图24-25所示,所述硬件资源划分装置可包括分发配置模块,用于分发所述配置信息。所述配置信息可包括根据任务切分结果所确定的对硬件资源进行划分的配置信息(此时,根据任务切分结果确定相应的配置信息,根据配置信息对硬件资源进行划分)。
所述处理器还包括计算模块,该计算模块包括多个计算单元,所述硬件资源划分装置用于根据任务切分结果对所述处理器的多个计算单元进行划分,即所述多个计算单元根据所述任务切分结果分成多个计算组,以分别计算batch中不同的正向和反向通路,或运行不同的服务的请求。
在一实施例中,如图26所示,所述处理器还包括:外部存储模块,内部存储模块,以及控制模块。
外部存储模块,用于存储计算模块、内部存储模块、控制模块和分发配置模块的数据信息。以神经网络计算为例,该数据信息包括:权值数据、神经元数据(包括输入)、指令数据,配置信息等。
另外,所述外部存储模块,可提供对外部存储器的读写接口,并且可以配置相关寄存器以灵活实现对不同外部存储器的操作。
内部存储模块,用于存储供计算模块使用的数据,包括:权值、神经元(包括输入)、指令数据等。
内部存储模块,还提供和外部存储模块的读写接口,用以完成内部存储模块和外部存储模块的数据交换。
控制模块,提供和外部存储模块进行控制信号交换的接口,用以接受并解析外部控制信号,从而完成对其他模块的控制。
控制模块,还提供和计算模块的信号交换接口,用以配置和控制计算模块,从而完成不同的计算。
控制模块,还提供和硬件资源划分装置的分发配置模块的信号交换接口,用以发送配置信号到分发配置模块,从而控制分发配置所完成的功能。所述控制模块可包括存储单元,也可在其外部配置存储单元,用于存储不同的控制信息。
控制模块,还提供和任务切分装置的信号交换接口,用以控制任务切分装置进行任务切分。
分发配置模块,提供和计算模块的信号交换接口,从而分发配置信息,该配置信息用以配置计算模块中的功能和数据连接,从而支持计算模块完成batch和多服务请求。其中,所述功能主要是完成内积操作、外积操作、非线性函数操作、超越函数操作等计算功能;相应的,数据连接则是根据计算功能计算模块所需的连接状态,例如,具体将计算模块包括的多个计算单元分成多少个计算组。
其中,所述的分发配置模块可包括存储单元,也可在其外部配置存储单元,用于存储不同的配置信息。
任务切分装置,提供和计算模块的信号交换接口,从而在计算模块上进行任务进行划分。其中,任务切分装置可在计算模块的全部计算单元上对任务进行划分,也可选择性的在计算模块的部分计算单元上对任务进行划分。
所述的计算模块,包括多个计算单元(processing elements,简称PE)。
所述多个计算单元可以分成多个计算组,用以完成不同的操作。进一步的,所述多个计算单元可以是同样的计算单元,即同构模式;也可以是不同的计算单元,即异构模式。
具体的,所述的计算单元,其结构可以是完成简单运算的计算单元,如完成标量乘法、标量加法、标量乘加等简单操作;也可以是完成向量运算的计算单元,如完成向量乘法、向量加法、向量内积等操作;也可以是混合计算单元,如用于矩阵乘法加法等操作的矩阵计算单元、用于包含向量内积计算和非线性计算的混合计算单元、包含脉动阵列以进行卷积计算的混合计算单元。
在一实施例中,如图27所示,所述处理器,包括:外部存储模块、控制模块;还包括:权值缓存单元、输入神经元缓存单元、输出神经元缓存单元以及指令缓存单元。
其中,所述指令缓存单元,用于缓存指令;
所述权值缓存单元,用于缓存权值数据;
所述输入神经元缓存单元,用于缓存输入神经元数据;
所述输出神经元缓存单元,用于缓存计算模块输出的运算结果,并输出给外部存储模块。
进一步的,所述控制模块用于从指令缓存中读取指令,将其译码为计算模块能够执行的指令并输出至计算模块。本实施例中,其他模块及功能可与上一实施例相同,此处不再赘述。
上述实施例中,所述处理器的输入数据,包括图片、视频、音频、文字等。所述装置的输出数据包括数值数据,其结果表示含义包括但不限于分类结果、生成结果。
所述处理器的控制模块根据控制信号对计算模块、硬件资源划分装置及任务切分装置进行控制,其控制方式包括直接控制和解析控制,直接控制方式为直接将控制信号输入到其他模块中,而不需要经过控制模块解析;解析控制方式为控制信号需要在控制模块中完成解析,得到解析后的控制信号再输入到其他模块中用于配置和控制。
在一实施例中,如图28所示,任务切分装置包括粒度任务切分单元和任务切分粒度选择单元。粒度任务切分单元采用至少一种粒度对任务进行切分形成子任务,为神经网络应用提供多粒度的任务切分选择,任务切分粒度选择单元选择任务划分采用的粒度,指导神经网络选择最合适的任务切分粒度,使得切分后的子任务能够满足***实时性。
如图28所示,粒度任务切分单元可包括第一粒度任务切分单元、第二粒度任务切分单元、第三粒度任务切分单元、第四粒度任务切分单元以及第五粒度任务切分单元。
以下具体介绍该五个粒度任务切分单元。假设神经网络应用需要完成M个样本计算,神经网络拓扑结构由N个层组成。其中M,N是大于0的正整数。
第一粒度任务切分单元将任务整体作为一子任务,具体的,将完成M个样本计算作为一个子任务。这种任务切分方式只生成一个子任务,子任务之间不存在依赖关系。
第二粒度任务切分单元将完成若干个样本计算作为一个子任务。神经网络被切分成为m个子任务,第i个任务完成Mi个样本的计算,其中m是大于1小于等于M的正整数,i=1,2,3,……m,Mi是大于0小于M的正整数,且满足M1+M2+…+Mm=M。这种任务切分方式的m个子任务之间不存在依赖关系。
第三粒度任务切分单元可以按照神经网络的层类型对神经网络应用进行任务切分,相同类型层的计算作为一个任务。神经网络的层类型包括但不仅限于卷积层,全连接层,LSTM层,池化层,激活层,LRN层,BN层。这种任务切分方式的子任务之间存在复杂的依赖关系。
第四粒度任务切分单元可以按照神经网络的层间结构对神经网络应用进行任务切分,相邻若干个层的计算作为一个子任务。神经网络应用被切分为n个子任务,第一个子任务完成神经网络第一层到第N1层,共计N1层计算,第二个子任务完成第N1+1层到第N1+N2层,共计N2层神经网络计算,第i个子任务完成第N1+…+Ni-1+1层到第N1+…+Ni层,共计Ni层计算。其中n是大于0小于等于N的正整数, i=1,2,3,……n,Ni是大于0小于等于N的正整数且满足N1+N2+…+Ni+…+Nn=N。这种任务切分方式的子任务之间存在链式的依赖关系,其中第i个子任务是第i+1个子任务的前驱任务,第i+1个任务是第i个任务的后继任务,第i+1个任务必须等待第i个任务完成才能开始执行。
第五粒度任务切分单元可以按照神经网络的层内结构对神经网络应用进行任务切分,神经网络层内的计算可以进一步被切分为子任务。按神经网络层内的计算的切分包括但不限于对神经网络的一卷积层计算、全连接层计算、池化层计算或激活层计算进行任务切分。
上述提及的各任务切分功能可以采用独立的硬件单元分别来实现,例如采用第一粒度任务切分单元、第二粒度任务切分单元、第三粒度任务切分单元、第四粒度任务切分单元和第五粒度任务切分单元分别实现上述各功能,也可以采用同一个硬件单元实现上述这些功能。
对神经网络的一个卷积层计算进行任务切分,卷积层输入神经元是三维矩阵(Nfin,Nxin,Nyin),权值是四维矩阵(Nfout,Nfin,Kx,Ky),输出神经元是三维矩阵(Nfout,Nxout,Nyout),其中Nfin是输入特征图像数量,(Nxin,Nyin)是输入特征图像大小,Nfout是输出特征图像数量,(Kx,Ky)是卷积核大小,(Nxout,Nyout)是输出特征图像大小。完成一个输出神经元需要Nfin×Kx×Ky次乘加运算,输出神经元数量为Nfout×Nxout×Nyout,完成整个卷积层总共需要Nfout×Nxout×Nyout×Nfin×Kx×Ky次乘加运算。在进行任务切分时,将输出神经元按照(Bfout,Bxout,Byout)的块大小进行切分,同时对权值按照(Bfout,Bfin,Bx,By)的块大小进行切分,则每一个子任务用(Bfout,Bfin,Bx,By)权值计算Bfout×Bxout×Byout个输出神经元的中间结果,每个输出神经元中间结果进行Bfin×Bx×By次乘加运算,共需要完成Bfout×Bxout×Byout×Bfin×Bx×By次乘加运算。其中Bfout是大于0小于等于Nfout的正整数,Bxout是大于0小于等于Nxout的正整数,Byout是大于0小于等于Nyout的正整数,Bfin是大于0小于等于Nfin的正整数,Bx是大于0小于等于Kx的正整数,By是大于0小于等于Ky的正整数。这种任务切分方式的子任务之间不存在依赖关系。
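按上述块大小对卷积层输出神经元与权值进行切分时,子任务数量与每个子任务的乘加次数可用如下Python草图估算;变量命名沿用正文符号,"各维度按块大小向上取整"这一估算方式以及具体取值均为示例性假设。

```python
import math

def conv_split(Nfout, Nxout, Nyout, Nfin, Kx, Ky,
               Bfout, Bxout, Byout, Bfin, Bx, By):
    # 子任务数量:各维度按块大小向上取整后相乘
    blocks = (math.ceil(Nfout / Bfout) * math.ceil(Nxout / Bxout) *
              math.ceil(Nyout / Byout) * math.ceil(Nfin / Bfin) *
              math.ceil(Kx / Bx) * math.ceil(Ky / By))
    # 每个子任务(完整块)的乘加次数
    macs_per_task = Bfout * Bxout * Byout * Bfin * Bx * By
    # 整个卷积层的总乘加次数
    total_macs = Nfout * Nxout * Nyout * Nfin * Kx * Ky
    return blocks, macs_per_task, total_macs

print(conv_split(64, 28, 28, 32, 3, 3, 16, 14, 14, 32, 3, 3))
```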
对神经网络的一个全连接层计算进行任务切分,全连接层输入神经元是Nin,权值是二维矩阵(Nout,Nin),输出神经元Nout,其中Nin是输入神经元数量,Nout是输出神经元数量。完成一个输出神经元需要Nin次乘加运算,输出神经元数量为Nout,完成整个全连接层总共需要Nout×Nin次乘加运算。在进行任务切分时,将输出神经元按照Bout的块大小进行切分,同时对权值按照(Bout,Bin)的块大小进行切分,则每一个子任务用(Bout,Bin)的权值矩阵计算Bout个输出神经元的中间结果,每一个输出神经元的中间需要完成Bin次乘加运算,共需要完成Bout×Bin次乘加运算。其中Bout是大于0小于等于Nout的正整数,Bin是大于0小于等于Nin的正整数。这种任务切分方法的子任务之间不存在依赖关系。
对神经网络的一个池化层计算进行任务切分,池化层输入神经元是Nin,输出神经元Nout,其中Nin,Nout是大于0的正整数,池化操作包括但不仅限于平均值池化,最大值池化,中值池化。在进行任务切分时,将输出神经元按照Bout的块大小进行切分,则每一个子任务完成Bout个输出神经元的计算。其中Bout是大于0小于等于Nout的正整数,Bin是大于0小于等于Nin的正整数。这种任务切分方式的子任务之间不存在依赖关系。
对神经网络的一个激活层计算进行任务切分,激励输入神经元是Nin,输出神经元Nout,其中Nin,Nout是大于0的正整数,激活函数包括但不仅限于sigmoid、tanh、relu、softmax。在进行任务切分时,将输出神经元按照Bout的块大小进行切分,则每一个子任务完成Bout个输出神经元的计算。其中Bout是大于0小于等于Nout的正整数。这种任务切分方式的子任务之间不存在依赖关系。
任务切分粒度选择单元选择任务划分采用的粒度,并不限于仅选择上述的一种粒度,还可以是多种粒度的组合,例如一个神经网络应用可以组合第四粒度任务切分单元和第五粒度任务切分单元的切分方式。将神经网络应用首先按照第四粒度任务切分单元的切分方法分为n个子任务,再将其中的p个子任务按照第五粒度任务切分单元的切分方式进行切分。
在其他实施例中,粒度任务切分单元可以包括第一至第五粒度任务切分单元中的至少一个,不一定包括全部第一至第五粒度任务切分单元。
在其他实施例中,粒度任务切分单元还可以包括混合粒度任务切分单元,用于组合第一至第五粒度任务切分单元的切分方式,供任务切分粒度选择单元选择。
在一实施例中,所述处理器可以为多核处理器,其还包括任务调度装置,如图29所示,任务调度装置包括任务队列单元、监测单元以及任务调度单元。神经网络任务调度装置能够综合考虑任务之间的依赖关系,任务的局部性,任务切分粒度,核的运行频率及负载进行任务调度,提高服务质量,提高核的利用率,保证核之间的任务均衡,减少能耗。
其中,任务队列单元缓存所有未调度的神经网络任务,并且可选择性地存储每一个待调度任务的执行时间,任务依赖关系图,任务资源在核内处理分布情况,神经网络任务例如是上一实施例中切分的子任务。
监测单元实时检测多核神经网络处理器的整体服务质量以及各核的工作状态,例如为每一个核的利用率,工作负载,工作频率,核内私有任务队列中的任务数量,任务完成时间。
任务调度单元从未调度任务中选择待调度任务,根据待调度任务信息及所述各核工作状态,确定待调度任务和目标核之间的映射关系,将待调度任务分配到目标核中。
任务调度单元可以每隔时间T对任务队列中未调度任务进行调度,T是大于0的实数。若未调度任务t与其他任务存在依赖关系且前驱任务没有完成,则任务调度单元不会调度任务t。
任务调度单元从未调度任务中选择待调度任务可以采用如下至少一种方式:随机选择任务,选择预计执行时间最长的任务,选择预计执行时间最短的任务,选择占用资源最多的任务,选择占用资源最少的任务。
任务调度单元可以采用以下调度方式中的至少一种将待调度任务分配调度至目标核。
第一种调度方式:统计每一个核私有任务队列中任务数量,选择私有任务队列中任务最少的核作为目标核,将待调度任务分给该目标核;
第二种调度方式:统计每一个核完成私有任务队列中所有任务的时间,选择完成任务时间最短的核作为目标核,将待调度任务分给该目标核;
第三种调度方式:统计待调度任务所需资源在所有核的分布情况,选择拥有资源数量最多的核作为目标核,将待调度任务分给该目标核;
第四种调度方式:采用启发式算法将待调度任务分配到目标核,启发式算法包括但不仅限于是遗传算法,蚁群算法,模拟退火算法。
在一实施例中,所述处理器为一种多核处理器,例如为多核神经网络处理器,如图30所示,多核神经网络处理器包括:J个处理核,J是大于1的正整数,前述实施例中的任务切分装置以及任务调度装置。
任务切分装置切分输入的神经网络应用,使得切分后的子任务能够满足***实时性,任务调度装置进行神经网络子任务调度,能够提高服务质量,提高处理核的利用率,保证处理核之间的任务均衡,减少能耗。神经网络处理核进行神经网络运算,完成神经网络子任务,J个神经网络处理核之间的拓扑结构包括但不仅限于是一维线性,二维mesh,二维星形,三维立方等。
在一实施例中,如图31所示,神经网络处理核包括存储单元,控制单元、选数单元和运算单元。
存储单元,用于存储神经网络的神经元、权值以及指令;当神经网络子任务处理稀疏神经网络时,存放的权值为非零权值以及非零权值的位置信息。
指令控制单元,用于接收神经网络专用指令,经过译码后生成控制信息控制选数单元和运算单元;
所述神经网络专用指令,包括所有专用于完成人工神经网络运算的指令。神经网络专用指令包括但不仅限于控制指令,数据传输指令,运算指令和逻辑指令。其中控制指令控制神经网络执行过程。数据传输指令完成不同存储介质之间的数据传输,数据格式包括但不仅限于矩阵,向量和标量。运算指令完成神经网络的算术运算,包括但不仅限于矩阵运算指令,向量运算指令,标量运算指令,卷积神经网络运算指令,全连接神经网络运算指令,池化神经网络运算指令,RBM神经网络运算指令,LRN神经网络运算指令,LCN神经网络运算指令,LSTM神经网络运算指令,RNN神经网络运算指令,RELU神经网络运算指令,PRELU神经网络运算指令,SIGMOID神经网络运算指令,TANH神经网络运算指令,MAXOUT神经网络运算指令。逻辑指令完成神经网络的逻辑运算,包括但不仅限于向量逻辑运算指令和标量逻辑运算指令。
其中,RBM神经网络运算指令用于实现Restricted Boltzmann Machine(RBM)神经网络运算。
其中,LRN神经网络运算指令用于实现Local Response Normalization(LRN)神经网络运算。
其中,LSTM神经网络运算指令用于实现Long Short-Term Memory(LSTM)神经网络运算。
其中,RNN神经网络运算指令用于实现Recurrent Neural Networks(RNN)神经网络运算。
其中,RELU神经网络运算指令用于实现Rectified linear unit(RELU)神经网络运算。
其中,PRELU神经网络运算指令用于实现Parametric Rectified Linear Unit(PRELU)神经网络运算。
其中,SIGMOID神经网络运算指令用于实现S型生长曲线(SIGMOID)神经网络运算。
其中,TANH神经网络运算指令用于实现双曲正切函数(TANH)神经网络运算。
其中,MAXOUT神经网络运算指令用于实现(MAXOUT)神经网络运算。
更具体的,它包括Cambricon指令集。
所述Cambricon指令集中每一条指令长度为64bit,指令由操作码和操作数组成。指令集包含四种类型的指令,分别是控制指令(control instructions),数据传输指令(data transfer instructions),运算指令(computational instructions),逻辑指令(logical instructions)。
进一步的,控制指令用于控制执行过程。控制指令包括跳转(jump)指令和条件分支(conditional branch)指令。
进一步的,数据传输指令用于完成不同存储介质之间的数据传输。数据传输指令包括加载(load)指令,存储(store)指令,搬运(move)指令。load指令用于将数据从主存加载到缓存,store指令用于将数据从缓存存储到主存,move指令用于在缓存与缓存或者缓存与寄存器或者寄存器与寄存器之间搬运数据。数据传输指令支持三种不同的数据组织方式,包括矩阵,向量和标量。
进一步的,运算指令用于完成神经网络算术运算。运算指令包括矩阵运算指令,向量运算指令和标量运算指令。
更进一步的,矩阵运算指令完成神经网络中的矩阵运算,包括矩阵乘向量(matrix multiply vector),向量乘矩阵(vector multiply matrix),矩阵乘标量(matrix multiply scalar),外积(outer product),矩阵加矩阵(matrix add matrix),矩阵减矩阵(matrix subtract matrix)。
更进一步的,向量运算指令完成神经网络中的向量运算,包括向量基本运算(vector elementary arithmetics),向量超越函数运算(vector transcendental functions),内积(dot product),向量随机生成(random vector generator),向量中最大/最小值(maximum/minimum of a vector)。其中向量基本运算包括向量加,减,乘,除(add,subtract,multiply,divide),向量超越函数是指那些不满足任何以多项式作系数的多项式方程的函数,包括但不仅限于指数函数,对数函数,三角函数,反三角函数。
更进一步的,标量运算指令完成神经网络中的标量运算,包括标量基本运算(scalar elementary arithmetics)和标量超越函数运算(scalar transcendental functions)。其中标量基本运算包括标量加,减,乘, 除(add,subtract,multiply,divide),标量超越函数是指那些不满足任何以多项式作系数的多项式方程的函数,包括但不仅限于指数函数,对数函数,三角函数,反三角函数。
进一步的,逻辑指令用于神经网络的逻辑运算。逻辑运算包括向量逻辑运算指令和标量逻辑运算指令。
更进一步的,向量逻辑运算指令包括向量比较(vector compare),向量逻辑运算(vector logical operations)和向量大于合并(vector greater than merge)。其中向量比较包括但不限于大于,小于,等于,大于等于,小于等于和不等于。向量逻辑运算包括与,或,非。
更进一步的,标量逻辑运算包括标量比较(scalar compare),标量逻辑运算(scalar logical operations)。其中标量比较包括但不限于大于,小于,等于,大于等于,小于等于和不等于。标量逻辑运算包括与,或,非。
选数单元,用于接收输入神经元和非零权值位置信息,选出非零权值对应的神经元。也就是说:对于每个输出神经元数据,选数单元去除掉与该输出神经元数据没有对应的非零权值数据的输入神经元数据。
运算单元,用于接收输入非零权值对应的神经元和对应的非零权值,完成神经网络训练运算并将输出神经元重新传输给存储部分。
具体地,运算单元根据存储单元中存储的指令对所述数据执行相应运算。运算单元包括但不仅限于三个部分,第一部分为乘法器,第二部分为一个或多个加法器,第三部分为激活函数单元。优选的,第二部分的一个或多个加法器组成加法树。第一部分将输入数据1(in1)和输入数据2(in2)相乘得到相乘之后的输出(out),过程为:out=in1×in2;第二部分将输入数据in1通过加法树逐级相加得到输出数据(out),其中in1是一个长度为N的向量,N大于1,过程为:out=in1[1]+in1[2]+...+in1[N],和/或将输入数据(in1)通过加法树累加之后和输入数据(in2)相加得到输出数据(out),过程为:out=in1[1]+in1[2]+...+in1[N]+in2,或者将输入数据(in1)和输入数据(in2)相加得到输出数据(out),过程为:out=in1+in2;第三部分将输入数据(in)通过激活函数(active)运算得到激活输出数据(out),过程为:out=active(in),激活函数active可以是sigmoid、tanh、relu、softmax等,除了做激活操作,第三部分可以实现其他的非线性函数,可将输入数据(in)通过运算(f)得到输出数据(out),过程为:out=f(in)。
运算单元还可以包括池化单元,池化单元将输入数据(in)通过池化运算得到池化操作之后的输出数据(out),过程为out=pool(in),其中pool为池化操作,池化操作包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。
所述运算单元执行运算包括但不仅限于,第一部分是将所述输入数据1和输入数据2相乘,得到相乘之后的数据;第二部分执行加法树运算,用于将输入数据1通过加法树逐级相加,或者将所述输入数据1通过和输入数据2相加得到输出数据;第三部分执行激活函数运算,对输入数据通过激活函数(active)运算得到输出数据。以上几个部分的运算可以自由组合,从而实现各种不同功能的运算。
神经网络处理核还可包括预处理模块,如图31所示,该模块对原始数据进行预处理,包括切分、高斯滤波、二值化、正则化、归一化等等。
神经网络处理核还可包括指令缓存,非零权值缓存,非零权值位置缓存,输入神经元缓存,输出神经元缓存。指令缓存,用于存储专用指令;非零权值缓存,用于缓存非零权值数据;非零权值位置缓存,用于缓存非零权值位置数据并根据非零权值位置数据将输入数据中每个权值一一对应到相应的输入神经元;输入神经元缓存,用于缓存输入神经元;输出神经元缓存,用于缓存运算单元输出的输出神经元。
非零权值位置数据表示每个输入神经元数据和每个输出神经元数据是否有对应的权值非零的权值数据。
一种情形下非零权值位置缓存一一对应的方法为采用1表示有连接,0表示无连接,每组输出神经元与所有输入神经元的连接状态组成一个0和1的字符串来表示该输出神经元的连接关系。另一种情形下非零权值位置缓存一一对应的方法为采用1表示有连接,0表示无连接,每组输入神经元与所有输出神经元 的连接状态组成一个0和1的字符串来表示该输入神经元的连接关系。另一种情形下非零权值位置缓存一一对应的方法为将一组输出神经元第一个连接所在的输入神经元位置距离第一个输入神经元的距离、所述输出神经元第二组输入神经元距离上一个输入神经元的距离,所述输出神经元第三组输入神经元距离上一个输入神经元的距离,……,依次类推,直到穷举所述输出神经元的所有输入神经元,来表示所述输出神经元的连接关系。
上述的有连接关系为每个输入神经元数据和每个输出神经元数据有对应的非零的权值数据,无连接意思为输入神经元数据和输出神经元数据之间没有对应的非零的权值数据。
神经网络处理核还可包括直接数据存取单元DMA(direct memory access)。
DMA用于在所述存储单元、指令缓存、非零权值缓存、非零权值位置缓存,输入神经元缓存和输出神经元缓存中进行数据或者指令读写。
在一实施例中,本公开还提供了一种组合处理装置,如图32所示,所述组合处理装置包括所述的处理器,所述处理器通过通用互联接口和其他处理装置进行交互,共同完成用户指定的计算操作。
所述其他处理装置,包括中央处理器CPU、图形处理器GPU、神经网络处理器等通用/专用处理器中的一种或以上的处理器类型。其他处理装置所包括的处理器数量不做限制。其他处理装置作为神经网络运算装置与外部数据和控制的接口,包括数据搬运,完成对本神经网络运算装置的开启、停止等基本控制;其他处理装置也可以和神经网络运算装置协作共同完成运算任务。
通用互联接口,用于在所述神经网络运算装置与其他处理装置间传输数据和控制指令。该神经网络运算装置从其他处理装置中获取所需的输入数据,写入神经网络运算装置片上的存储装置;可以从其他处理装置中获取控制指令,写入神经网络运算装置片上的控制缓存;也可以读取神经网络运算装置的存储模块中的数据并传输给其他处理装置。
该组合处理装置可以作为手机、机器人、无人机、视频监控设备等设备的SOC片上***,有效降低控制部分的核心面积,提高处理速度,降低整体功耗。此情况时,该组合处理装置的通用互联接口与设备的某些部件相连接。某些部件譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口。
在一实施例中,本公开还提供了一种处理方法,如图33所示,所述处理方法包括:
S1、任务切分装置根据任务切分粒度进行任务切分;以及
S2、硬件资源划分装置根据任务切分结果对处理器的硬件资源进行划分。
在一实施例中,在所述硬件资源划分装置根据任务切分结果对处理器的硬件资源进行划分的步骤中:
输入数据和控制信号序列被存储至外部存储模块以供使用;
数据和控制信号被载入至内部存储器;
控制模块解析控制信号,分发配置模块解析分发配置信号;例如,在执行过程中,由任务切分结果确定相应的配置信息之后,控制模块解析的控制信号包括指令及配置信息(配置信息也可以指令的方式给出),若控制模块确定是配置信息,则将配置信息发送给分发配置模块,由分发配置模块进一步将配置信息发送给计算模块;处理器根据不同的信号含义调度各个模块完成相应的操作;例如,在执行多batch操作时,调度分发配置模块分发配置信息,调度计算模块分组并进行计算,调度存储模块发送或接收数据等。另外,配置信息除了由外部存储模块经由控制模块发送至分发配置模块之外,也可以在控制模块的控制下由外部存储模块直接发送至分发配置模块;
相应的计算结果从计算模块输出至内部存储模块,再传输至外部存储模块,以供后续或其他使用。
采用本公开处理器,在执行batch计算神经网络时,包括训练过程和测试过程,可以并行执行batch中的每个正向通路,其中并行执行的每个正向通路计算是独立的(特别的,权值可以共享也可以不共享),此时装置根据配置将计算单元划分成N个独立的计算组以独立计算batch中不同的正向通路。如若是测试 过程,则该装置可以离线计算最优配置并配置完成,其中所述最优配置可以是计算组的个数配置,例如针对一具体的计算场景,将计算模块中的多个计算单元分成多少个计算组可达到最优的计算效果;也可在执行过程中动态调整配置以达到最优的过程,其中,所述动态调整配置例如可以是在执行卷积层的时,配置成多个独立的计算组分别计算不同的输出图像,而在计算全连接层时,配置成1个计算组,也即全部的计算单元用来计算同样的层。另外,在训练过程中,相较于测试过程,需要反向计算梯度并更新网络中的权值,此时可以将装置划分成多个组完成batch中不同输入样本对应的梯度,在线将装置配置成一个组从而快速的进行权值的更新计算(特别的,也可以在线配置成一个组完成batch中对应不同的输入样本的对应的梯度计算)。
采用本公开处理器,在执行多服务计算过程中,包括训练过程和测试过程,不同服务所需要的输入和权值可能是不同的,也可能是相同的。此时装置需要配置成不同的独立的组以运行不同的服务所对应的请求。这里由于不同服务所对应的计算负载可能截然不同,对应所需要的计算资源需求也不相同。装置在运行过程中对于计算单元的分组动态的进行调整,以满足多服务中对于服务质量的要求。
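下面给出一个把计算单元(PE)划分为若干独立计算组的示意性Python草图,用于说明按配置将多个PE分组后,各组分别处理batch中不同正向通路或不同服务请求的思路;分组策略、函数名与数据结构均为假设,并非本公开硬件资源划分装置的实际实现。

```python
def partition_pes(num_pes, num_groups):
    # 将 num_pes 个计算单元尽量均匀地划分为 num_groups 个计算组
    base, extra = divmod(num_pes, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

def assign_batch(groups, samples):
    # 把batch中的样本(或不同服务请求)轮流分配给各计算组
    return {g: samples[g::len(groups)] for g in range(len(groups))}

groups = partition_pes(num_pes=16, num_groups=4)
print(groups)                                  # 4个计算组,每组4个PE
print(assign_batch(groups, list(range(8))))    # 每组负责2个正向通路
```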
在一实施例中,如图34所示,所述处理器的计算模块中,PE按照一维阵列组织,多个PE可以配置成为不同的组,不同的组可以用来计算不同的输入。
下面以卷积神经网络中卷积层正向计算为例,详细说明本实施例处理器和相应PE配置如何计算卷积神经网络的batch。
1)神经网络的不同输入通过外部存储经内部存储模块输入到不同的计算组,而权值则通过外部存储经内部存储模块广播至不同的组,也即不同的组采用同样的权值数据。
2)不同的组开始计算属于各自的样本,直到该组的样本的正向过程完成。
3)不同的组将其计算结果写回内部存储,该结果或被写回外部存储,或被暂存在内部存储以便后续计算。
4)处理器载入新的一批输入,分配至不同的组继续进行计算。
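结合上述步骤1)至4),下面的Python草图示意"不同计算组处理不同输入、权值广播共享"的batch计算数据流;此处以简化的矩阵乘代替真实卷积,数据规模与分组数均为假设,仅用于说明数据流,并非实际实现。

```python
import numpy as np

def batch_forward(inputs, shared_weights, num_groups):
    # 步骤1):不同输入分配到不同计算组,权值广播至所有组
    group_inputs = [inputs[g::num_groups] for g in range(num_groups)]
    results = []
    for samples in group_inputs:
        # 步骤2):各组独立完成属于自己的样本的正向计算
        group_out = [x @ shared_weights for x in samples]
        # 步骤3):各组把结果写回(此处用列表模拟内部存储)
        results.extend(group_out)
    return results

inputs = [np.random.rand(1, 8) for _ in range(4)]   # 一个batch的4个输入
shared_weights = np.random.rand(8, 3)               # 广播给各组的同一份权值
outs = batch_forward(inputs, shared_weights, num_groups=2)
print(len(outs), outs[0].shape)                      # 4 (1, 3)
```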
在一实施例中,如图35所示,所述PE按照二维阵列组织,多个相邻的PE可以配置成不同的组,不同的组可以用来计算不同的输入。
在一实施例中,如图36所示,所述PE按照二维阵列组织,多个相邻的PE可以配置成不同的组,不同的组可以用来计算不同的输入。
其中,所述计算单元执行运算包括神经网络计算。
具体的,所述计算模块包括:乘法器,用于将输入其中的数据相乘得到相乘之后的输出;和/或一个或多个加法器,用于将输入其中的数据相加得到输出数据。其中,所述多个加法器可构成加法树,用于进行加法树运算,即将输入其中的数据逐级相加得到输出数据。
更具体而言,计算模块包括但不仅限于:第一部分乘法器,第二部分加法树,第三部分为激活函数单元,和/或第四部分池化单元。第一部分将输入数据1(in1)和输入数据2(in2)相乘得到相乘之后的输出(out),过程为:out=in1*in2;第二部分将输入数据in1通过加法树逐级相加得到输出数据(out),其中in1是一个长度为N的向量,N大于1,过程为:out=in1[1]+in1[2]+...+in1[N],和/或将输入数据(in1)通过加法树累加之后和输入数据(in2)相加得到输出数据(out),过程为:out=in1[1]+in1[2]+...+in1[N]+in2,或者将输入数据(in1)和输入数据(in2)相加得到输出数据(out),过程为:out=in1+in2;第三部分将输入数据(in)通过激活函数(active)运算得到激活输出数据(out),过程为:out=active(in),激活函数active可以是sigmoid、tanh、relu、softmax等,除了做激活操作,第三部分可以实现其他的非线性函数,可将输入数据(in)通过运算(f)得到输出数据(out),过程为:out=f(in)。池化单元将输入数据(in)通过池化运算得到池化操作之后的输出数据(out),过程为out=pool(in),其中pool为池化操作,池化操作包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。
相应的,所述计算模块执行运算包括第一部分是将所述输入数据1和输入数据2相乘,得到相乘之后的数据;和/或第二部分执行加法树运算,用于将输入数据1通过加法树逐级相加,或者将所述输入数据1通过和输入数据2相加得到输出数据;和/或第三部分执行激活函数运算,对输入数据通过激活函数(active)运算得到输出数据;和/或第四部分执行池化运算,out=pool(in),其中pool为池化操作,池化操作包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。以上几个部分的运算可以自由选择一个多个部分进行不同顺序的组合,从而实现各种不同功能的运算。
以上几个部分的运算元件可以自由选择一个或多个部分进行不同顺序的组合,从而实现各种不同功能的运算。
在一实施例中,所述处理方法用于神经网络,所述任务切分装置根据任务切分粒度在划分后的各硬件资源上进行任务切分的步骤中,选择以下五种粒度任务切分方式中的至少一个来进行任务切分。
第一粒度任务切分方式将任务整体作为一子任务,具体的,将完成M个样本计算作为一个子任务。这种任务切分方式只生成一个子任务,子任务之间不存在依赖关系。
第二粒度任务切分方式将完成若干个样本计算作为一个子任务。神经网络被切分成为m个子任务,第i个任务完成Mi个样本的计算,其中m是大于1小于等于M的正整数,i=1,2,3,……m,Mi是大于0小于M的正整数,且满足M1+M2+…+Mm=M。这种任务切分方式的m个子任务之间不存在依赖关系。
第三粒度任务切分方式可以按照神经网络的层类型对神经网络应用进行任务切分,相同类型层的计算作为一个任务。神经网络的层类型包括但不仅限于卷积层,全连接层,LSTM层,池化层,激活层,LRN层,BN层。这种任务切分方式的子任务之间存在复杂的依赖关系。
第四粒度任务切分方式可以按照神经网络的层间结构对神经网络应用进行任务切分,相邻若干个层的计算作为一个子任务。神经网络应用被切分为n个子任务,第一个子任务完成神经网络第一层到第N1层,共计N1层计算,第二个子任务完成第N1+1层到第N1+N2层,共计N2层神经网络计算,第i个子任务完成第N1+…+Ni-1+1层到第N1+…+Ni层,共计Ni层计算。其中n是大于0小于等于N的正整数,i=1,2,3,……n,Ni是大于0小于等于N的正整数且满足N1+N2+…+Ni+…+Nn=N。这种任务切分方式的子任务之间存在链式的依赖关系,其中第i个子任务是第i+1个子任务的前驱任务,第i+1个任务是第i个任务的后继任务,第i+1个任务必须等待第i个任务完成才能开始执行。
第五粒度任务切分方式按照神经网络的层内结构对神经网络应用进行任务切分,神经网络层内的计算可以进一步被切分为子任务。按神经网络层内的计算的切分包括但不限于对神经网络的一卷积层计算、全连接层计算、池化层计算或激活层计算进行任务切分。
在一实施例中,为了综合考虑任务之间的依赖关系,任务的局部性,任务切分粒度,核的运行频率及负载进行任务调度,提高服务质量,提高核的利用率,保证核之间的任务均衡,减少能耗,所述处理方法还包括:在任务切分之后,对任务进行分配调度。具体而言,任务调度方法包括:
缓存所有未调度的神经网络任务;
具体地,可选择性地存储每一个待调度任务的执行时间,任务依赖关系图,任务资源在核内处理分布情况,神经网络任务例如是上一实施例中切分的子任务;
实时检测多核神经网络处理器的整体服务质量以及各核的工作状态;
具体的,各核的工作状态,例如为每一个核的利用率,工作负载,工作频率,核内私有任务队列中的任务数量,任务完成时间。
从未调度任务中选择待调度任务,根据待调度任务信息及所述各核工作状态,确定待调度任务和目标核之间的映射关系,将待调度任务分配到目标核中。
任务调度可以每隔时间T对任务队列中未调度任务进行调度,T是大于0的实数。若未调度任务t与其他任务存在依赖关系且前驱任务没有完成,则不调度任务t。
从未调度任务中选择待调度任务可以采用如下至少一种方式:随机选择任务,选择预计执行时间最长的任务,选择预计执行时间最短的任务,选择占用资源最多的任务,选择占用资源最少的任务。
将待调度任务分配调度至目标核可以采用以下调度方式中的至少一种:第一种调度方式:统计每一个核私有任务队列中任务数量,选择私有任务队列中任务最少的核作为目标核,将待调度任务分给该目标核;
第二种调度方式:统计每一个核完成私有任务队列中所有任务的时间,选择完成任务时间最短的核作为目标核,将待调度任务分给该目标核;
第三种调度方式:统计待调度任务所需资源在所有核的分布情况,选择拥有资源数量最多的核作为目标核,将待调度任务分给该目标核;
第四种调度方式:采用启发式算法将待调度任务分配到目标核,启发式算法包括但不仅限于是遗传算法,蚁群算法,模拟退火算法。
前面的附图中所描绘的进程或方法可通过包括硬件(例如,电路、专用逻辑等)、固件、软件(例如,被承载在非瞬态计算机可读介质上的软件),或两者的组合的处理逻辑来执行。虽然上文按照某些顺序操作描述了进程或方法,但是,应该理解,所描述的某些操作能以不同顺序来执行。此外,可并行地而非顺序地执行一些操作。
另外,在一些实施例中,分发配置模块的信号输入也可直接由外部信号输入,采用直接控制或解析控制的方式。
在一些实施例中,PE组织可以为三维组织,甚至于多维组织。
在一些实施例中,PE的分组也可以按照列来组织,不同的分组方式也可以在运行过程中进行切换。
在一些实施例中,多个分组后的PE也可以执行同一个输入对应的不同运算操作。
在一些实施例中,计算单元可以是任意的计算元件,从简单的计算元件到完成复杂功能的计算元件。
本领域技术人员应当理解的是,本公开处理器及处理方法除了进行神经网络计算之外,还可进行图像处理、视频处理计算等;且神经网络也不限于卷积神经网络,还可以是全连接神经网络、RBM神经网络、及循环神经网络(RNN,Recurrent Neural Networks)等;且不限于卷积层,还可以是全连接层、pooling层等。
在一些实施例中,还提供了一种芯片,其包括了上述神经网络运算装置或组合处理装置。
在一些实施例中,还提供了一种芯片封装结构,其包括了上述芯片。
在一些实施例中,还提供了一种板卡,其包括了上述芯片封装结构。
在一些实施例中,还提供了一种电子设备,其包括了上述板卡。
电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
参见图37所示,本公开实施例一方面提供一种信息处理装置,包括:存储模块,用于获取信息数据,所述信息数据包括至少一个关键特征,所述存储模块预存所述关键特征对应的真实置信度;运算电路,根据所述信息数据,确定所述关键特征对应的预测置信度,并判断所述关键特征的预测置信度是否超过关键特征对应的真实置信度预设阈值范围;以及控制电路,当所述预测置信度超过真实置信度预设阈值范围,控制所述存储模块修改关键特征,或向外部发出修改信号。通过上述信息处理装置,可以对信息数据自动批改修正,代替了人工,而且相对人工评分更精确,快速。
以上已经按照类型描述了信息数据的种类,下面将介绍其功能分类,具体的可以涉及学生的作业或试卷,或者运动项目的动作或表情数据,或者益智类项目操作方式或步骤。例如作业或试卷,可以是电子文本、手写文字和/或图形,所述手写文字和/或图形包括手写的一种或多种语言文字和/或符号的组合,手写的二维图,手写的二维透视图。更进一步的,所述的手写的一种或多种语言文字和/或符号的组合为语文,数学,物理等科目的试卷手写答案。更进一步的,所述的手写的二维图和/或二维透视图为美术,制图等科目的试卷手写答案。例如动作或表情,可以是摄录的图片和/或视频;例如益智类项目操作方式或步骤,可以是体现操作方式或步骤的电子数据、图片或视频。通过对上述种类的信息数据进行及时自动化修改,可以提高教练或教师的效率,使学员及时准确的对错误进行调整。
本公开中,存储模块可以用于存储数据和指令,其中所述数据可以包括输入神经元(例如经预处理后的数据),输出神经元(例如对应所述关键特征的预测置信度),权值,在神经网络运算和输出过程中的损失函数、梯度和评分,以及错误模式判断结果。
本公开中,运算电路可以用于根据所述存储模块中存储的指令对所述数据执行相应的运算;所述运算电路可以执行三步运算,第一步是将输入神经元和权值数据相乘;第二步执行加法树运算,用于将第一步的结果通过加法树逐级相加,得到加权和,根据需要可以对加权和加偏置或不做处理;第三步对第二步得到的结果执行激活函数运算,得到输出神经元。该输出神经元的值为该关键特征的预测置信度。所述激活函数可以是sigmoid函数、tanh函数、ReLU函数或softmax函数等。
本公开实施例中,预测置信度可为任意自然数——例如置信度的值越大,包含该关键特征的可信度越高。置信度还可以归一化为一定范围内的数值——例如置信度在[0,1]之间,置信度表示包含该关键特征的置信概率。
在一些实施例中,存储模块可以包括直接内存存取DMA,所述直接内存存取DMA与所述运算电路电性连接,用于存储所述运算电路运算确定的预测置信度,并将所述真实置信度和预测置信度送入所述运算电路以进行比较。
如图38所示,所述存储模块还包括存储单元,存储单元用于从信息处理装置外部获取信息数据,并传入直接存储存取DMA,供运算电路调用。
在一些实施例中,如图38所示,存储模块还用于存储神经网络专用指令,信息处理装置还包括:指令缓存,用于从所述存储模块缓存专用指令,供控制电路调用。
在一些实施例中,存储模块还用于存储神经网络中的输入神经元、输出神经元和权值,信息处理装置还包括:输入神经元缓存,用于从所述存储模块缓存神经元,供运算电路调用;权值缓存,用于从所述存储模块缓存权值,供运算电路调用;输出神经元缓存,用于存储从所述运算电路运算获得的输出神经元。
在一些实施例中,运算电路还用于根据各关键特征的判断结果对所述信息数据进行评分。该评分过程可以是对各关键特征对应的输出神经元进行加权后综合评分。
在一些实施例中,所述运算电路中,根据所述信息数据,确定所述关键特征对应的预测置信度包括:以所述信息数据作为神经网络的输入,进行神经网络运算,所述预测置信度作为神经网络的输出。
参见图39所示,在一些实施例中,信息处理装置还包括预处理模块,用于对外部的原始信息数据进行预处理后传入所述存储模块。通过设置预处理模块,一方面能使输入数据更适于人工神经网络处理,去除输入数据中的噪声和冗余,提高分类、识别精度等等,另一方面减少后续存储模块中的空间占用。优选 的,所述预处理包括对原始信息数据切分、高斯滤波、二值化、正则化和/或归一化,以获得神经网络输入数据格式的数据;优选的,神经网络输入数据格式包括但不限于:图像的大小、色彩模式、平均亮度和/或数据规模。
在一些实施例中,所述运算电路还用于对所述神经网络进行自适应性训练。可以通过计算出的预测置信度和已知的真实置信度对比,自适应地更新网络中的参数(如权值、偏置等等),进而提高装置的识别、预测精度。优选的,上述自适应性训练过程是离线处理的。
在一些实施例中,本公开的信息处理装置可以是集成其所包含的各单元、模块和电路的集成芯片,优选的为可以实现神经网络运算的人工神经网络芯片。
参见图40所示,根据本公开的再一方面,提供一种信息处理设备,包括:信息获取装置,用于获取外部的信息数据;以上实施例所述的信息处理装置,用于处理所述信息数据,获得关键特征的预测置信度,且当所述预测置信度超过真实置信度预设阈值时,修改所述关键特征,或发出修改信号。
参见图40所示,根据本公开的又一方面,提供一种信息处理设备,包括:信息获取装置,用于获取外部的信息数据;以上实施例所述的信息处理装置,用于处理所述信息数据,获得关键特征的预测置信度,且当所述预测置信度超过真实置信度预设阈值时,修改所述关键特征,或发出修改信号;交互界面,接收修改的关键特征或者修改信号,向用户示出修改内容。
上述信息处理设备的实施例中,上述信息获取装置可以为仅具有摄像功能的相机、摄影机、扫描仪等。也可以是信息获取装置与交互界面装配为一体的终端性设备(例如手机、电脑或者可穿戴设备)。
本实施例中,交互界面可以包括显示屏、触摸式显示屏和/或数据输出接口。交互界面可以接收信息获取装置的数据(例如包含修改后关键特征),或者接收信息获取装置的原始信息数据以及修改信号,在控制器的控制下对原始信息数据(例如图片)进行修改(包括但不限于涂鸦、添加修改标记、添加视频、添加局部图片、添加文字、添加语音),并通过可视听方式显示。
在一些实施例中,交互装置还可以包括预处理装置,用于对信息获取装置获取的信息数据进行预处理后送入信息处理装置。该处的预处理装置所实现功能与上述的预处理模块类似,可以参照上述实施例,在此不予赘述。
在一些实施例中,信息处理设备还包括控制器,用于控制所述信息获取装置、信息处理装置和/或交互界面。具体的,可以控制信息获取装置从外部获取原始的信息数据,控制信息处理装置接收信息数据后进行处理,以及进行判断、改写或者发出改写信号操作,控制交互界面显示改写内容等。
在一些实施例中,交互界面还用于响应用户的操作或命令,对设定阈值进行修改。例如,当用户对特定关键特征(例如具体的某段文字、某段语音或某段视频)的预定置信度对应的阈值进行调整时,可以通过触摸屏、鼠标、语音命令或者键盘等方式进行该信息获取设备的操作。
如图41所示,本公开实施例的另一方面,还提供一种信息处理方法,包括:
S301:通过存储模块获取信息数据,所述信息数据包括至少一个关键特征,所述存储模块预存所述关键特征对应的真实置信度;
S302:运算电路根据所述信息数据,确定所述关键特征对应的预测置信度,并判断所述关键特征的预测置信度是否超过关键特征对应的真实置信度设定阈值范围;
S303:当所述预测置信度超过真实置信度阈值范围,控制电路控制存储模块修改所述关键特征,或发出修改信号。
该处理方法可对应于上述处理装置的执行步骤,具体执行方式可参照上述步骤的描述,在此不予赘述。
为进一步说明本公开,以下例举具体的实施例进行详细阐述。在下面的详细描述中,为便于解释,阐述了许多具体的细节以提供对本公开实施例的全面理解。然而明显地,一个或多个实施例在没有这些具体 细节的情况下也可以被实施。在其他情况下,公知的结构和装置以图示的方式体现以简化附图。应当理解,以下的详细说明不对本公开构成限制,相反,它们提供本领域内技术人员理解由所附权利要求书的范围描述的实施例涵盖的替代形式、等效物、和修正例的基础。
其中,实施例三对应于对信息数据为图片的处理装置,实施例四对应于信息数据为音频和/或视频的处理装置,实施例五对应于一种信息处理设备。
实施例三:
本实施例中信息处理装置的存储单元接收信息数据,信息数据可包括但不仅限于一组包含一个或多个关键特征的图片;装置计算出信息数据包含各个关键特征的置信度,给出一个判断结果;装置根据判断结果,对存储单元中的信息数据进行评分。其中信息数据可以是原始信息数据,也可以是对原始数据进行预处理后得到的结果。
这里的信息处理装置可以进行自适应性训练,例如:该装置输入一组包含一个或多个关键特征的图片,如包括手写文字的图片,组成视频的图片等等。每个关键特征对应一个置信度,置信度为一个自然数。对用于自适应训练的输入图片来说,其包含各个关键特征的置信度都是已知的,即真实置信度;装置以这些图片作为信息数据,计算出含有各个关键特征的置信度,即预测置信度。计算出的预测置信度和已知的真实置信度对比,自适应地更新网络中的参数(如权值、偏置等等),进而提高装置的识别、预测精度。
其中,置信度可为任意自然数——例如置信度的值越大,包含该关键特征的可信度越高。置信度还可以归一化为一定范围内的数值——例如置信度在[0,1]之间,置信度表示包含该关键特征的置信概率。
训练集的真实置信度取值二选一——例如{0,1},0表示输入图片不包含该关键特征,为1表示包含该特征;当然也可以反过来,1表示不包含,0表示包含。
其中,上述自适应性训练过程可以是离线处理的。这里的信息处理装置可以为人工神经网络芯片,包括:存储单元,用于存储数据和指令,其中所述数据包括输入神经元,输出神经元,权值,评分,错误模式判断结果等等;运算电路,用于根据所述存储单元中存储的指令对所述数据执行相应的运算;所述运算电路主要执行三步运算,第一步是将输入神经元和权值数据相乘;第二步执行加法树运算,用于将第一步的结果通过加法树逐级相加,得到加权和,根据需要对加权和加偏置或不做处理;第三步对第二步得到的结果执行激活函数运算,得到输出神经元。
信息处理装置还可以包括DMA(Direct Memory Access,直接内存存取),用于在所述存储单元、指令缓存、权值缓存、输入神经元缓存和输出神经元缓存中进行数据或者指令读写;
信息处理装置中,控制电路,用于从所述指令缓存中读取专用指令,并将其译码成运算电路指令并输入至运算电路;指令缓存,用于存储专用指令;权值缓存,用于缓存权值数据;输入神经元缓存,用于缓存输入到映射单元的输入神经元;输出神经元缓存,用于缓存运算电路输出的输出神经元(对应各关键特征的置信度);
DMA(直接内存存取)与运算电路之间的直接数据通路,用于直接对DMA存储数据进行运算并返回。
作为优选,芯片还包括预处理模块。该模块对原始信息数据,即一个或多个包含手写文字或图形的图片,进行预处理,得到与芯片所使用的人工神经网络的位于最底层的输入层规模相契合的图像数据。其中预处理包括切分、高斯滤波、二值化、正则化、归一化等等。
作为优选,人工神经网络芯片得到判断结果的方法包括:神经网络的最终输出层的每个输出神经元对应一个关键词,输出神经元的值为该关键词出现的置信度。
修改的方法包括:将标准答案拆分成许多标准关键特征的集合,这些关键特征可以是字、词、短语(文本数据输入)或者图片的一部分(图像数据输入),芯片的存储单元中预先存储有每个关键特征的标准正确模式。神经网络最终输出层的各个输出神经元给出各个关键特征部分与相应标准正确模式的置信度。(若某错误模式出现或其出现的置信度大于预设的阈值,则将该错误模式修改为标准答案中对应的关键特征)输出神经元的结果存入DMA中,并再次传入运算电路进行置信度阈值比较,如果该关键特征置信度低于预设阈值,则根据该关键特征的标准正确模式对该关键特征进行修改。
上述得到判断结果、评分及修改的过程均在人工神经网络芯片中完成:
步骤1,信息数据经预处理模块或直接传入存储单元;
步骤2,DMA将其分批传入相应的片上缓存(即指令缓存,输入神经元缓存,权值缓存)中;
步骤3,控制电路从指令缓存中读取指令,将其译码后传入运算电路;
步骤4,根据指令,运算电路执行相应的运算,在神经网络的各个层中,运算主要分为三步:步骤4.1,将对应的输入神经元和权值相乘;步骤4.2,执行加法树运算,即将步骤4.1的结果通过加法树逐级相加,得到加权和,根据需要对加权和加偏置或不做处理;步骤4.3,对步骤4.2得到的结果执行激活函数运算,得到输出神经元,并将其传入输出神经元缓存中。
步骤5,重复步骤2到步骤4,直到所有数据运算完毕,即得到功能需求的最终结果。其中所述最终结果由神经网络最后一层的输出神经元得到,从运算电路输出到输出神经元缓存中,然后暂存入DMA等待下一步运算。
步骤6,将DMA中的神经网络输出神经元中存储的评分结果,即各关键特征置信度通过DMA与运算器之间的数据通路直接输入到运算器中与预设阈值进行比较,如果关键特征置信度比预设阈值小,则将DMA中的输入关键特征置换为相应关键特征的标准正确模式。当所有关键特征均按照上述步骤进行比较并替换后,DMA中完成了信息数据的修改工作。
步骤7,将修改过后的DMA中信息数据存回存储单元中,并作为最终修改后的输出数据输出。
根据所述功能需求:若要求得到判断结果,则上述神经网络最后一层输出神经元的值即为关键词出现的置信度;若要求进行修改,则最终经过步骤7后的存储单元中的修改数据,即为最终修改后的数据。
根据功能要求,本结构可实现评分和/或修改的功能,评分的结果输出为步骤1-5执行完毕后的输出;修改的输出为完整执行步骤1-7的最终存储单元输出。
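作为对上述阈值比较与修改流程(步骤6)的示意,下面给出一个Python草图:对每个关键特征,将预测置信度与预设阈值比较,低于阈值时用该关键特征的标准正确模式替换;示例中的关键特征内容、置信度与阈值均为假设数据,仅用于说明逻辑。

```python
def revise(features, confidences, standards, threshold=0.6):
    # 步骤6的示意:置信度低于预设阈值的关键特征被替换为标准正确模式
    revised = []
    for feat, conf in zip(features, confidences):
        if conf < threshold:
            revised.append(standards[feat["key"]])  # 用标准正确模式置换
        else:
            revised.append(feat)                    # 保留原关键特征
    return revised

features = [{"key": "公式1", "content": "a+b=3"},
            {"key": "公式2", "content": "a*b=2"}]
confidences = [0.9, 0.3]                            # 由神经网络输出层给出
standards = {"公式1": {"key": "公式1", "content": "a+b=3"},
             "公式2": {"key": "公式2", "content": "a×b=2"}}
print(revise(features, confidences, standards))
```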
实施例四:
本实施例提供的人工神经网络芯片(对应于信息处理装置)中的存储单元用于预存一个或多个关键帧图片(对应关键特征);存储单元从外部获取视频,并将其传入至运算电路,其中视频包括多个输入图片;运算电路计算出各个输入图片与每个关键帧图片的相似度(详细来说,如果输入图片有N个,关键帧图片有M个,则得到N×M个相似度)和/或对视频进行规范化修改。
更进一步的,所述视频还包括音频,音频分为多段音频,所述多段音频与所述多个图片对应。芯片可以比较视频中所有图片与各个关键帧图片的相似度,和/或比较视频中所有音频分解得到的各个波形和关键波形的相似度,对视频进行规范化修改。
更进一步的,所述视频为一个或多个测试者的动作视频。更进一步的,所述动作视频为跳舞,武术,或者课间操表演,体育运动的动作和/或姿势,写字动作和/或姿势,打字动作和/或姿势,看书动作和/或姿势。
其中得到相似度的方法可以是:神经网络的最终输出层的每个输出神经元对应一个相似度,输出神经元的值即为相似度值。(如果和前面的例子保持一致,则该层共N×M个输出神经元)
其中得到相似度的方法也可以是:神经网络的最终输出层的每个输出神经元对应一个输入图片,输出神经元的值即为与该输入图片最相似的关键帧图片与该输入图片的相似度。(如果和前面的例子保持一致,则该层共N个输出神经元)
其中得到相似度的方法还可以是:神经网络的最终输出层的每个输出神经元对应一个关键图片,输出神经元的值即为与该关键帧图片最相似的输入图片与该关键帧图片的相似度。(如果和前面的例子保持一致,则该层共M个输出神经元)
其中评分的方法可以是:在神经网络中上述最终输出层的上面再加一层作为新的最终输出层,以前的最终输出层中的输出神经元作为该层的输入神经元;该层只有一个输出神经元,其值即为评分;该层中的权值对应各个相似度的重要程度,即权重。
其中修改的方法可以是:将上述得到的相似度计算结果从DMA中直接输入到运算电路中,并与预设阈值进行比较,如果相似度低于预设阈值,则判定该关键特征(在此可表述为视频关键帧图片)不符合规范化标准,需要进行修改。从而将相应输入图片用相应标准关键帧图片进行置换,并写回DMA,最终输出到存储单元中准备输出。
对视频和音频等连续数据输入,将其按时间分解为多个关键帧,并将关键帧图片与标准关键帧图片进行相似度计算,相似度低于预设阈值则利用标准图片对输入进行修改。
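下面用一个简化的Python草图示意"N个输入帧与M个标准关键帧两两计算相似度(共N×M个),低于阈值的输入帧用最相似的标准关键帧置换"的过程;此处以余弦相似度代替神经网络输出的相似度,帧数据、阈值等均为假设,仅作说明。

```python
import numpy as np

def cosine(a, b):
    # 以余弦相似度代替神经网络给出的相似度(示意用)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize_video(frames, key_frames, threshold=0.8):
    # N个输入帧 × M个标准关键帧,共 N×M 个相似度
    revised = []
    for f in frames:
        sims = [cosine(f, k) for k in key_frames]
        best = int(np.argmax(sims))
        # 与最相似的标准关键帧相比仍低于阈值时,用该标准关键帧置换
        revised.append(key_frames[best] if sims[best] < threshold else f)
    return revised

frames = [np.random.rand(16) for _ in range(3)]      # N=3 个输入帧(已展平)
key_frames = [np.random.rand(16) for _ in range(2)]  # M=2 个标准关键帧
print(len(normalize_video(frames, key_frames)))      # 3
```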
上述得到相似度和评分过程均在人工神经网络芯片中完成,可以包括如下步骤:
步骤1,信息数据经预处理模块或直接传入存储单元;
步骤2,DMA将其分批传入相应的片上缓存(即指令缓存,输入神经元缓存,权值缓存)中;
步骤3,控制电路从指令缓存中读取指令,将其译码后传入运算电路;
步骤4,根据指令,运算电路执行相应的运算:在神经网络的各个层中,运算主要分为三步:步骤4.1,将对应的输入神经元和权值相乘;步骤4.2,执行加法树运算,即将步骤4.1的结果通过加法树逐级相加,得到加权和,根据需要对加权和加偏置或不做处理;步骤4.3,对步骤4.2得到的结果执行激活函数运算,得到输出神经元,并将其传入输出神经元缓存中。
步骤5,重复步骤2到步骤4,直到所有数据运算完毕,即得到功能需求的最终结果。其中所述最终结果由神经网络最后一层的输出神经元得到,从运算电路输出到输出神经元缓存中,然后写入DMA中准备下一步操作。
步骤6,将DMA中的神经网络输出神经元中存储的相似度结果,即各关键特征(关键帧)评分通过DMA与运算器之间的数据通路直接输入到运算器中与预设阈值进行比较,如果关键特征置信度比预设阈值小,则将DMA中的输入关键特征置换为相应标准关键帧。当所有关键特征均按照上述步骤进行比较并替换后,DMA中完成了信息数据的修改工作。
步骤7,将修改过后的DMA中信息数据存回存储单元中,并作为最终修改后的输出数据输出。
根据所述功能需求:若要求得到判断结果,则上述神经网络最后一层输出神经元的值即为各关键帧与标准关键帧的相似度(评分);若要求进行修改,则最终经过步骤7后的存储单元中的修改数据,即为最终修改后的数据。
实施例五:
装置包括信息获取装置,信息处理装置(例如人工神经网络芯片)(结构同实施例三),交互界面和控制电路。
其中信息获取装置(这个装置可以是预处理装置的扩展,相当于接口+预处理装置)用于接收外部信息,信息包括文字、图像、音频、视频等等,并将原始数据或经预处理后的数据作为信息数据传递给人工神经网络芯片。
其中交互界面用于和用户进行交互,即接收用户的操作或命令,并将其传给控制电路。交互界面还用于接收人工神经网络芯片的输出数据,并将其转化为合适形式的反馈信息显示给用户。其中控制电路接收用户的操作或命令,并控制整个装置的运作。
交互界面可以让用户自由修改上述预设阈值,以达到不同程度效果的修改结果,更加友好。同时交互界面还可以给用户反馈信息,如坐姿错误时的报警以及握笔方式的修改矫正等。
更进一步的,信息获取装置为图像获取装置,声音获取装置。图像获取装置为摄像头。声音获取装置为麦克风。更进一步的,所述终端为识别字符装置,手机,电脑,笔记本,平板电脑。
本公开所提供的实施例中,应理解到,所揭露的相关设备、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如所述部分或模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个部分或模块可以结合或者可以集成到一个***,或一些特征可以忽略或者不执行。
本公开中各功能部分/单元/子单元/模块/子模块/部件都可以是硬件,比如该硬件可以是电路,包括数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于物理器件,物理器件包括但不局限于晶体管,忆阻器等等。所述计算装置中的计算模块可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。所述存储单元可以是任何适当的磁存储介质或者磁光存储介质,比如RRAM,DRAM,SRAM,EDRAM,HBM,HMC等等。
在本公开中所述的“存储器”可以集成在用于执行生成对抗网络的处理装置的内部,也可以是一个单独的器件,作为外部存储器与用于执行生成对抗网络的处理装置进行数据传输。
根据本公开的基本构思,提供一种用于执行生成对抗网络的处理装置,如图42所示,包括:
存储器110,用于接收输入数据,所述输入数据包括随机噪声和参考数据,以及存储判别器神经网络参数与生成器神经网络参数;
运算器120,用于将随机噪声输入数据传入生成器神经网络进行运算,得到噪声生成结果;还用于将噪声生成结果和参考数据共同输入判别器神经网络进行运算,得到判别结果;还用于根据所述判别结果更新所述判别器神经网络参数与生成器神经网络参数。
本公开实施例的处理装置,针对对抗网络的具体实现方式规划出合理的运算器以及存储器相配合的硬件结构,提高了计算效率。用于执行生成对抗网络的处理装置的存储器110接收输入数据,输入数据包括随机噪声和参考数据(包括但不限于真实图片、语音或文字)。参考数据包括但不仅限于一组包含一个或多个关键特征的图片,一组包含一个或多个关键采样点的音频,一组包含一个或多个具有词性标签的词组或短语;运算器120根据输入数据进行训练,得到一组生成函数参数,根据该生成函数参数和参考数据(例如参考图像)得到噪声生成结果(如创作图像)。其中输入数据可以是原始输入数据,也可以是对原始数据进行预处理后得到的结果。
在一些实施例中,所述存储器还用于存储计算指令,所述处理装置还包括控制器130,该控制器130用于提取所述计算指令并将其解析为运算指令,并发送至所述运算器。具体的,控制器130用于从所述存储器提取计算指令,解析该计算指令得到多个运算指令,将该多个运算指令以及输入数据发送给所述运算器。
如图43所示,所述存储器110包括:判别器参数存储单元112,用于存储判别器神经网络参数;生成器参数存储单元113,用于存储生成器神经网络参数;判别器指令存储单元114,用于存储进行判别器神经网络运算的计算指令;生成器指令存储单元115,用于存储进行生成器神经网络运算的计算指令;以及数据存储单元111,用于存储数据,这里的数据存储单元包括随机噪声、噪声生成结果(即负样本,例如随机噪声生成的图片)以及参考数据(从外部获得的真实图片、语音或文字等)。此处结构主要是为了适应GAN(生成对抗网络)具有的生成器与判别器的结构特点,故而可以将生成器与判别器的权值存储进行物理区分,更加高效的利用存储资源,同时为了适应这种存储结构可以对I/O指令进行修改,以区分判别器I/O指令与生成器I/O指令。
其中,数据存储单元111,用于获取并存储数据,进一步的还可以包括获取并存储网络模型(包括判别器神经网络和生成器神经网络)以及计算指令。
可选的,还包括输入/输出单元150,用于获取外部数据以及将内部计算结果输出至外部设备或其他部件。
可选的,还包括DMA140,把生成器神经网络参数从存储器转发给运算器120,通过DMA把随机噪声和参考数据从数据存储单元111转发给运算器120。
可选的,存储器还可以包括存储介质,存储介质可以为片外存储器,当然在实际应用中,也可以为片内存储器,用于存储数据块,该数据块具体可以为n维数据,n为大于等于1的整数,例如,n=1时,为1维数据,即向量,如n=2时,为2维数据,即矩阵,如n=3或3以上时,为多维张量。
在一些实施例中,上述控制器130包括:指令缓存单元110、指令处理单元111和存储队列单元113。指令缓存单元110,用于存储所述网络模型关联的计算指令;所述指令处理单元111,用于对所述计算指令解析得到多个运算指令;存储队列单元113,用于存储指令队列,该指令队列包括:按该队列的前后顺序待执行的多个运算指令或计算指令。
如下表1所示,该计算指令可以包括:一个或多个操作域以及一个操作码。该计算指令可以包括神经网络运算指令。以神经网络运算指令为例,如表1所示,其中,寄存器号0、寄存器号1、寄存器号2、寄存器号3、寄存器号4可以为操作域。其中,每个寄存器号0、寄存器号1、寄存器号2、寄存器号3、寄存器号4可以是一个或者多个寄存器的号码。
CONFIG指令在每层人工神经网络计算开始前配置当前层计算需要的各种常数;COMPUTE指令完成每层人工神经网络的算术逻辑计算;IO指令实现从外部地址空间读入计算需要的输入数据以及在计算完成后将数据存回至外部空间;NOP指令负责清空当前装置内部所有微指令缓存队列中的微指令,保证NOP指令之前的所有指令全部执行完毕。NOP指令本身不包含任何操作;JUMP指令负责控制器将要从指令存储单元读取的下一条指令地址的跳转,用来实现控制流的跳转;MOVE指令负责将装置内部地址空间某一地址的数据搬运至装置内部地址空间的另一地址,该过程独立于运算单元,在执行过程中不占用运算单元的资源。
表1
所述依赖关系处理单元,用于在具有多个运算指令时,确定第一运算指令与所述第一运算指令之前的第零运算指令是否存在关联关系,如所述第一运算指令与所述第零运算指令存在关联关系,则将所述第一运算指令缓存在所述指令存储单元内,在所述第零运算指令执行完毕后,从所述指令存储单元提取所述第一运算指令传输至所述运算器;
所述确定该第一运算指令与第一运算指令之前的第零运算指令是否存在关联关系包括:
依据所述第一运算指令提取所述第一运算指令中所需数据(例如矩阵)的第一存储地址区间,依据所述第零运算指令提取所述第零运算指令中所需矩阵的第零存储地址区间,如所述第一存储地址区间与所述第零存储地址区间具有重叠的区域,则确定所述第一运算指令与所述第零运算指令具有关联关系,如所述第一存储地址区间与所述第零存储地址区间不具有重叠的区域,则确定所述第一运算指令与所述第零运算指令不具有关联关系。
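上述依赖关系判断可以抽象为两个存储地址区间是否存在重叠的比较,下面是一个仅作说明的Python草图;地址区间用闭区间元组表示,示例数据均为假设。

```python
def overlaps(range0, range1):
    # 区间 [start, end] 存在交集,则两条运算指令具有关联关系
    s0, e0 = range0
    s1, e1 = range1
    return s0 <= e1 and s1 <= e0

def has_dependency(first_instr, zeroth_instr):
    # 第一运算指令所需数据的地址区间与第零运算指令的地址区间重叠时,
    # 需先缓存第一运算指令,待第零运算指令执行完毕后再提取执行
    return overlaps(first_instr["addr_range"], zeroth_instr["addr_range"])

instr0 = {"addr_range": (0x1000, 0x1FFF)}
instr1 = {"addr_range": (0x1800, 0x23FF)}
print(has_dependency(instr1, instr0))   # True:区间重叠,存在关联关系
```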
根据本公开实施例的另一方面,还提供一种应用以上所述的处理装置进行机器创作的方法,如图44所示,包括:
S110:输入随机噪声和参考数据至存储器(例如随机噪声和参考数据存储至存储单元);
随后,可以通过DMA把生成器神经网络参数从存储器转发给运算器120,通过DMA把随机噪声和参考数据从数据存储单元111转发给运算器120;
S120:运算器将随机噪声输入数据和生成器神经网络参数进行生成器神经网络运算,得到噪声生成结果;
S130:运算器将噪声生成结果和参考数据进行判别器神经网络运算,得到判别结果;
S140:运算器根据所述判别结果更新所述判别器神经网络参数与生成器神经网络参数。
在一些实施例中,对于步骤S140,其具体包括:根据判别结果分别计算生成器神经网络与判别器神经网络的损失值;然后根据损失值减小的最大梯度方向,自适应地更新判别器神经网络中的参数,进而提高判别器的判别精度;根据判别器判别的损失值增大的最大梯度方向,同时自适应地更新生成器神经网络中的参数。
通过重复进行步骤S110-S140,也就是进行训练,直至当判别器神经网络的判别精度在设定范围内变化时,输出生成器神经网络进行运算所得到噪声生成结果作为最终创作结果。
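结合步骤S110-S140,下面给出一个最简化的生成对抗网络训练循环的Python(NumPy)草图,用于说明"生成—判别—根据判别结果更新双方参数"的迭代关系;其中网络均被简化为单层线性变换加sigmoid,损失形式、学习率等超参数均为假设,并非本公开处理装置上的实际程序。

```python
import numpy as np

rng = np.random.default_rng(0)
g_w = rng.normal(size=(4, 8))   # 生成器参数(随机噪声4维 -> 样本8维)
d_w = rng.normal(size=(8, 1))   # 判别器参数(样本8维 -> 判别结果1维)
lr = 0.01

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(100):
    noise = rng.normal(size=(16, 4))
    real = rng.normal(loc=1.0, size=(16, 8))    # 参考数据(此处用合成数据代替)
    fake = noise @ g_w                           # S120:噪声生成结果
    d_real, d_fake = sigmoid(real @ d_w), sigmoid(fake @ d_w)  # S130:判别结果
    # S140:按使判别器损失减小的方向更新判别器参数
    grad_d = real.T @ (d_real - 1) + fake.T @ d_fake
    d_w -= lr * grad_d / 16
    # 同时按使判别器损失增大(即生成结果更"真")的方向更新生成器参数
    grad_g = noise.T @ ((d_fake - 1) @ d_w.T)
    g_w -= lr * grad_g / 16
```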
以下将结合具体实施例,对本公开的处理装置和利用装置的创作方法进行具体说明,但本领域技术人员应当知晓的是,以下具体的细节仅用于理解本公开,而并不应理解为对本公开的限定。
实施例六:
该实施例中的用于执行生成对抗网络的处理装置用于进行视频和/或图像的创作。
用于执行生成对抗网络的处理装置的存储器接收输入数据,输入数据包括但不仅限于一组包含一个或多个关键特征的图片;运算器根据输入数据进行训练,得到一组生成函数参数,根据该生成函数参数和输入参考图像生成输出创作图像。其中输入数据可以是原始输入数据,也可以是对原始数据进行预处理后得到的结果。
用于执行生成对抗网络的处理装置进行自适应性训练,例如:该装置输入一组包含一个或多个关键特征的训练图片,如包括手绘图片,实景照片,视频关键帧图片等等。装置将输入的训练图片作为真实图片与生成模型根据噪声生成的虚假图片一起混合输入到判别器中判别真假,并根据判别结果加权分别计算的生成器与判别器的损失值,然后根据损失值减小的最大梯度方向,自适应地更新判别器中的参数(如权值、偏置等等),进而提高判别器的判别精度;同时生成器根据判别器判别的损失值增大的最大梯度方向,自适应地更新生成器中的参数(如权值、偏置等等),进而提高生成器的生成能力,使得其根据噪声生成的图像更加接近真实图像,降低判别器的判别精度。最终,当判别器的判别精度在设定范围内变化时,达到最优生成器标准,以这个生成器的参数根据参考真实图片就可以将随机噪声生成创作图片。
判别器的输入图片真假取值二选一——例如{0,1},0表示输入图片为输入训练图片,为1表示输入图片为生成器根据噪声生成的虚假图片;当然也可以反过来,1表示真,0表示假。优选的,上述自适应性训练过程是离线处理的。
具体的视频或图像创造步骤可以包括:
步骤1,将随机噪声输入数据经预处理单元传入存储器或直接传入存储器;
步骤2,DMA(Direct Memory Access,直接内存存取)将其分批传入指令缓存,输入神经元缓存,权值缓存中;
步骤3,控制器从指令缓存中读取指令,将其译码后传入运算器;
步骤4,根据指令,运算器执行相应的运算:在神经网络的各个层中,运算主要分为三步:步骤4.1,在乘法器中将对应的输入神经元和权值相乘;步骤4.2,在加法树中执行加法树运算,即将步骤4.1的结果通过加法树逐级相加,得到加权和,根据需要对加权和加偏置或不做处理;步骤4.3,在激活函数运算单元中对步骤4.2得到的结果执行激活函数运算,得到输出神经元,并将其传入输出神经元缓存中。
步骤5,重复步骤2到步骤4,直到所有数据运算完毕,其中所述生成器的噪声生成结果可以根据神经网络最终输出层得到,结果由DMA存入生成器输出缓存;
步骤6,将部分输入数据与生成器生成结果混合作为判别器模型的输入数据,重复步骤2到步骤4,直到所有数据运算完毕,其中所述判别器的判别结果可以根据神经网络最终输出层的结果得到,结果由DMA存入判别器输出缓存;
步骤7,由DMA将判别器输出结果传入运算器,做偏导运算后分别得到生成器的优化梯度和判别器的优化梯度,分别将其与生成器、判别器的神经元权值相加后,将相应结果存入相应神经元缓存;
步骤8,重复步骤5,6,7直到生成器和判别器损失函数达到最优;
步骤9,输入参考数据经过数据预处理单元后传入存储器或直接传入存储器;
步骤10,重复步骤2到步骤4,生成器模型神经网络输出层输出结果即为创作结果。
根据所述功能需求:需要在自适应训练阶段预设输出创作图片大小(也是人工神经网络最终输出层的神经元个数)、与训练数据(输入训练特征)和网络参数更新方式(随机梯度下降、Adam算法等)。
实施例七:
该实施例中的用于执行生成对抗网络的处理装置用于进行音频的创作。
用于执行生成对抗网络的处理装置的存储器接收输入数据,输入数据包括但不仅限于一组包含一个或多个关键采样点的音频;运算器根据输入数据进行训练,得到一组生成函数参数,根据该生成函数参数和输入参考音频生成输出生成音频。其中输入数据可以是原始输入数据,也可以是对原始数据进行预处理后得到的结果。
用于执行生成对抗网络的处理装置进行自适应性训练,例如:该装置输入一组包含一个或多个关键采样点的音频数据,如包括语音片段,合成编辑电子音效音频等等。然后将输入的训练音频作为真实音频与生成模型根据噪声生成的虚假音频一起混合输入到判别器中判别真假,并根据判别结果加权分别计算的生成器与判别器的损失值,然后根据损失值减小的最大梯度方向,自适应地更新判别器中的参数(如权值、偏置等等),进而提高判别器的判别精度;同时生成器根据判别器判别的损失值增大的最大梯度方向,自适应地更新生成器中的参数(如权值、偏置等等),进而提高生成器的生成能力,使得其根据噪声生成的音频采样点分布更加接近特征采样点分布,降低判别器的判别精度。最终,当判别器的判别精度不再变化时,达到最优生成器标准,以这个生成器的参数根据参考音频就可以将随机噪声生成具有参考风格的音频。
判别器的输入音频真假取值二选一——例如{0,1},0表示输入音频为输入训练音频,为1表示输入音频为生成器根据噪声生成的虚假音频;当然也可以反过来,1表示真,0表示假。优选的,上述自适应性训练过程是离线处理的。
人工神经网络芯片得到创造图片(视频关键帧)的方法为:根据训练得到的最优生成器权值参数,与输入参考图片进行矩阵乘,得出最终的创作图片(视频关键帧)。
具体的语音创作步骤可以包括:
步骤1,将随机噪声(生成器模型的生成源是随机噪声,根据权值不断生成才能够生成有意义的音频)输入数据经预处理单元传入存储单元或直接传入存储单元;
步骤2,DMA(Direct Memory Access,直接内存存取)将其分批传入指令缓存,输入神经元缓存,权值缓存中;
步骤3,控制器从指令缓存中读取指令,将其译码后传入运算器;
步骤4,根据指令,运算器执行相应的运算:在神经网络的各个层中,运算主要分为三步:步骤4.1,将对应的输入神经元和权值相乘;步骤4.2,执行加法树运算,即将步骤4.1的结果通过加法树逐级相加,得到加权和,根据需要对加权和加偏置或不做处理;步骤4.3,对步骤4.2得到的结果执行激活函数运算,得到输出神经元,并将其传入输出神经元缓存中。
步骤5,重复步骤2到步骤4,直到所有数据运算完毕,其中所述生成器的噪声生成结果可以根据神经网络最终输出层得到,结果由DMA存入生成器输出缓存;
步骤6,将部分输入数据与生成器生成结果混合作为判别器模型的输入数据,重复步骤2到步骤4,直到所有数据运算完毕,其中所述判别器的判别结果可以根据神经网络最终输出层的结果得到,结果由DMA存入判别器输出缓存;
步骤7,由DMA将判别器输出结果传入运算器,做偏导运算后分别得到生成器的优化梯度和判别器的优化梯度,分别将其与生成器、判别器的神经元权值相加后,将相应结果存入相应神经元缓存;
步骤8,重复步骤5,6,7直到生成器和判别器损失函数达到最优;
步骤9,输入参考数据经过数据预处理单元后传入存储单元或直接传入存储单元;
步骤10,重复步骤2到步骤4,生成器模型神经网络输出层输出结果即为创作结果。
根据所述功能需求:需要在自适应训练阶段预设输出创作音频采样点个数及音频时间长短(也是人工神经网络最终输出层的神经元个数)、与训练数据(输入训练特征)和网络参数更新方式(随机梯度下降、Adam算法等)。
实施例八:
该实施例中的用于执行生成对抗网络的处理装置用于进行文字类型的创作。
用于执行生成对抗网络的处理装置的存储器接收输入数据,输入数据包括但不仅限于一组包含一个或多个具有词性标签的词组或短语(文字类型);装置根据输入数据进行训练,得到一组生成函数参数,根据该生成函数参数和输入参考文字段落生成输出创作文字段落。其中输入数据可以是原始输入数据,也可以是对原始数据进行预处理后得到的结果。输出数据可以是文字段落,也可以是诗句等严格格式的特殊格式。
用于执行生成对抗网络的处理装置进行自适应性训练,例如:
该装置输入一组包含一个或多个具有词性标签的词组或短语。装置将输入的训练文字段落作为特征文字段落与生成模型根据噪声在同词性词语组中选择的创造文字段落一起混合输入到判别器中判别真假,并根据判别结果分别计算生成器与判别器的加权损失值,然后根据损失值减小的最大梯度方向,自适应地更新判别器中的参数(如权值、偏置等等),进而提高判别器的判别精度;同时生成器根据判别器判别的损失值增大的最大梯度方向,自适应地更新生成器中的参数(如权值、偏置等等),进而提高生成器的生成能力,使得其根据噪声生成的词语分布更加接近特征文字段落的词语分布,降低判别器的判别精度。最终,当判别器的判别精度不再变化时,达到最优生成器标准,以这个生成器的参数根据参考文字段落就可以将随机噪声生成具有参考风格的文字创作。
判别器的输入文字段落真假取值二选一——例如{0,1},0表示输入词组或短语为输入训练段落包含的词组或短语,为1表示输入词组或短语为生成器根据噪声生成的随机短语;当然也可以反过来,1表示真,0表示假。
优选的,上述自适应性训练过程是离线处理的。优选的,用于执行生成对抗网络的处理装置为人工神经网络芯片。
具体的文字类型创作步骤可以包括:
步骤1,将随机噪声输入数据经预处理单元传入存储器或直接传入存储器;
步骤2,DMA(Direct Memory Access,直接内存存取)将其分批传入指令缓存,输入神经元缓存,权值缓存中;
步骤3,控制器从指令缓存中读取指令,将其译码后传入运算器;
步骤4,根据指令,运算器执行相应的运算:在神经网络的各个层中,运算主要分为三步:步骤4.1,将对应的输入神经元和权值相乘;步骤4.2,执行加法树运算,即将步骤4.1的结果通过加法树逐级相加,得到加权和,根据需要对加权和加偏置或不做处理;步骤4.3,对步骤4.2得到的结果执行激活函数运算,得到输出神经元,并将其传入输出神经元缓存中。
步骤5,重复步骤2到步骤4,直到所有数据运算完毕,其中所述生成器的噪声生成结果可以根据神经网络最终输出层得到,结果由DMA存入生成器输出缓存;
步骤6,将部分输入数据与生成器生成结果混合作为判别器模型的输入数据,重复步骤2到步骤4,直到所有数据运算完毕,其中所述判别器的判别结果可以根据神经网络最终输出层的结果得到,结果由DMA存入判别器输出缓存;
步骤7,由DMA将判别器输出结果传入运算器,做偏导运算后分别得到生成器的优化梯度和判别器的优化梯度,分别将其与生成器、判别器的神经元权值相加后,将相应结果存入相应神经元缓存;
步骤8,重复步骤5,6,7直到生成器和判别器损失函数达到最优;
步骤9,输入参考数据经过数据预处理单元后传入存储单元或直接传入存储单元;
步骤10,重复步骤2到步骤4,生成器模型神经网络输出层输出结果即为创作结果。
根据所述功能需求:需要在自适应训练阶段预设输出创作文字段落的长度(也是人工神经网络最终输出层的神经元个数)、训练数据(输入训练特征)和网络参数更新方式(随机梯度下降、Adam算法等)。
本公开实施例还提供了一种电子设备,其包括了上述用于执行生成对抗网络的处理装置。
电子设备可包括但不限于机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具可包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
本公开中各功能单元/模块/子模块/子单元都可以是硬件,比如该硬件可以是电路,包括数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于物理器件,物理器件包括但不局限于晶体管,忆阻器等等。所述计算装置中的计算模块可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。所述存储单元可以是任何适当的磁存储介质或者磁光存储介质,比如RRAM,DRAM,SRAM,EDRAM,HBM,HMC等等。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。
以上所述的具体实施例,对本公开的目的、技术方案和有益效果进行了进一步详细说明,应理解的是,以上所述仅为本公开的具体实施例而已,并不用于限制本公开,凡在本公开的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本公开的保护范围之内。

Claims (10)

  1. 一种数据共享***,包括存储模块和至少两个处理模块,其中:
    所述至少两个处理模块共用所述存储模块;
    所述至少两个处理模块之间通过预设的规则进行通信,以实现数据共享。
  2. 如权利要求1所述的数据共享***,其中,所述预设的规则包括通信协议、传送协议、握手协议和/或总线协议。
  3. 如权利要求1至2中任一项所述的数据共享***,其中,所述通过预设的规则通信包括:至少两个处理模块包括第一处理模块和第二处理模块,第一处理模块向第二处理模块发送请求信号和相应的数据地址,所述第二处理模块根据所述请求信号和相应的数据地址,向第一处理模块回复有效信号和数据,以实现数据共享。
  4. 如权利要求1至3中任一项所述的数据共享***,其中,所述至少两个处理模块包括物理处理器。
  5. 如权利要求4所述的数据共享***,其中,所述物理处理器包括神经网络处理器。
  6. 如权利要求5所述的数据共享***,其中,所述神经网络处理器包括用于执行人工神经网络正向运算的装置。
  7. 如权利要求6所述的数据共享***,其中,所述用于执行人工神经网络正向运算的装置包括指令缓存单元和直接内存访问单元,其中:
    所述指令缓存单元用于通过直接内存访问单元读入指令并缓存读入的指令。
  8. 如权利要求7所述的数据共享***,其中,所述用于执行人工神经网络正向运算的装置还包括:
    控制器单元,用于从指令缓存单元读取指令,并将该指令译码成微指令。
  9. 如权利要求7至8中任一项所述的数据共享***,其中,所述用于执行人工神经网络正向运算的装置还包括H树模块、主运算模块、以及多个从运算模块,其中:
    所述H树模块,用于在每层神经网络反向训练开始计算的阶段,主运算模块通过H树模块向所有的从运算模块传输本层的输入神经元向量,以及在从计算模块的计算过程完成后,H树模块用于逐级将各从计算模块的输出神经元值拼成中间结果向量;
    主运算模块,用于利用中间结果向量完成后续计算。
  10. 如权利要求9所述的数据共享***,其中,所述直接内存访问单元,还用于从外部地址空间向主运算模块和各从运算模块的相应数据缓存单元中写数据,或从所述数据缓存单元向外部地址空间读数据。
PCT/CN2018/092829 2017-06-26 2018-06-26 数据共享***及其数据共享方法 WO2019001418A1 (zh)

Priority Applications (7)

Application Number Priority Date Filing Date Title
EP18824582.3A EP3637272A4 (en) 2017-06-26 2018-06-26 DATA-SHARING SYSTEM AND RELATED DATA-SHARING PROCESS
US16/694,124 US20200089535A1 (en) 2018-05-16 2019-11-25 Data sharing system and data sharing method therefor
US16/694,056 US11687467B2 (en) 2018-04-28 2019-11-25 Data sharing system and data sharing method therefor
US16/693,956 US11537843B2 (en) 2017-06-29 2019-11-25 Data sharing system and data sharing method therefor
US16/693,918 US10901815B2 (en) 2017-06-26 2019-11-25 Data sharing system and data sharing method therefor
US16/694,176 US11726844B2 (en) 2017-06-26 2019-11-25 Data sharing system and data sharing method therefor
US16/693,999 US11656910B2 (en) 2017-08-21 2019-11-25 Data sharing system and data sharing method therefor

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
CN201710497394.X 2017-06-26
CN201710497394.XA CN109117415B (zh) 2017-06-26 2017-06-26 数据共享***及其数据共享方法
CN201710515517.8A CN109214616B (zh) 2017-06-29 2017-06-29 一种信息处理装置、***和方法
CN201710515517.8 2017-06-29
CN201710721049.XA CN109426553A (zh) 2017-08-21 2017-08-21 任务切分装置及方法、任务处理装置及方法、多核处理器
CN201710721049.X 2017-08-21
CN201810407185.6 2018-04-28
CN201810407185.6A CN110413551B (zh) 2018-04-28 2018-04-28 信息处理装置、方法及设备
CN201810467383.1 2018-05-16
CN201810467383.1A CN110502330A (zh) 2018-05-16 2018-05-16 处理器及处理方法
CN201810641721.9 2018-06-20
CN201810641721.9A CN110619390A (zh) 2018-06-20 2018-06-20 用于执行生成对抗网络的处理装置及应用其进行机器创作的方法

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/693,918 Continuation-In-Part US10901815B2 (en) 2017-06-26 2019-11-25 Data sharing system and data sharing method therefor

Publications (1)

Publication Number Publication Date
WO2019001418A1 true WO2019001418A1 (zh) 2019-01-03

Family

ID=64741150

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/092829 WO2019001418A1 (zh) 2017-06-26 2018-06-26 数据共享***及其数据共享方法

Country Status (3)

Country Link
US (2) US11726844B2 (zh)
EP (1) EP3637272A4 (zh)
WO (1) WO2019001418A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796619A (zh) * 2019-10-28 2020-02-14 腾讯科技(深圳)有限公司 一种图像处理模型训练方法、装置、电子设备及存储介质
WO2020187041A1 (zh) * 2019-03-18 2020-09-24 北京灵汐科技有限公司 一种基于众核处理器的神经网络的映射方法及计算设备
WO2021050590A1 (en) * 2019-09-09 2021-03-18 Qualcomm Incorporated Systems and methods for modifying neural networks for binary processing applications
CN113168589A (zh) * 2019-01-10 2021-07-23 株式会社日立制作所 数据生成装置、预测器学习装置、数据生成方法和学习方法
US11775811B2 (en) * 2019-01-08 2023-10-03 Apple Inc. Scheduling heterogeneous execution on heterogeneous hardware

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019008519A1 (en) 2017-07-03 2019-01-10 Artomatix Ltd. SYSTEMS AND METHODS FOR PROVIDING SYNTHESIS OF NON-PARAMETRIC TEXTURE OF ARBITRARY SHAPE AND / OR MATERIAL DATA IN A UNIFIED FRAMEWORK
US11507429B2 (en) * 2017-09-14 2022-11-22 Electronics And Telecommunications Research Institute Neural network accelerator including bidirectional processing element array
US20190164037A1 (en) * 2017-11-29 2019-05-30 Electronics And Telecommunications Research Institute Apparatus for processing convolutional neural network using systolic array and method thereof
CN108776612A (zh) * 2018-04-11 2018-11-09 深圳大学 一种云计算任务分配方法、装置、设备及存储介质
EP3844749B1 (en) * 2018-08-30 2023-12-27 Dolby International AB Method and apparatus for controlling enhancement of low-bitrate coded audio
WO2020200246A1 (zh) * 2019-04-04 2020-10-08 中科寒武纪科技股份有限公司 数据处理装置及相关产品
JP7283191B2 (ja) * 2019-04-05 2023-05-30 富士フイルムビジネスイノベーション株式会社 情報処理システム
US20200356836A1 (en) * 2019-05-07 2020-11-12 Apple Inc. Fast deep learning fully-connected column-major implementation
EP3800088B1 (en) * 2019-10-01 2022-12-21 Foviatech GmbH Smart vehicle seat
CN112631955B (zh) * 2020-12-18 2024-01-19 北京地平线机器人技术研发有限公司 数据处理方法、装置、电子设备以及介质
CN112784977B (zh) * 2021-01-15 2023-09-08 北方工业大学 一种目标检测卷积神经网络加速器
US11799643B2 (en) 2021-01-19 2023-10-24 Bank Of America Corporation Collaborative architecture for secure data sharing
CN116458894B (zh) * 2023-04-21 2024-01-26 山东省人工智能研究院 基于复合型生成对抗网络的心电信号增强与分类方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0039412A2 (en) * 1980-05-05 1981-11-11 International Business Machines Corporation High density memory system
CN101013410A (zh) * 2006-02-02 2007-08-08 松下电器产业株式会社 直接存储器存取传送装置
CN105357306A (zh) * 2015-11-17 2016-02-24 贵阳朗玛信息技术股份有限公司 多平台数据共享***及其数据共享方法
CN105512723A (zh) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 一种用于稀疏连接的人工神经网络计算装置和方法
CN105844330A (zh) * 2016-03-22 2016-08-10 华为技术有限公司 神经网络处理器的数据处理方法及神经网络处理器

Family Cites Families (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5191649A (en) * 1990-12-21 1993-03-02 Intel Corporation Multiprocessor computer system with data bus and ordered and out-of-order split data transactions
US5408629A (en) * 1992-08-13 1995-04-18 Unisys Corporation Apparatus and method for controlling exclusive access to portions of addressable memory in a multiprocessor system
US5909681A (en) * 1996-03-25 1999-06-01 Torrent Systems, Inc. Computer system and computerized method for partitioning data for parallel processing
US5857025A (en) * 1996-09-09 1999-01-05 Intelligent Security Systems, Inc. Electronic encryption device and method
KR100230454B1 (ko) 1997-05-28 1999-11-15 윤종용 다중처리 시스템의 캐시메모리 검사방법
US6484240B1 (en) * 1999-07-30 2002-11-19 Sun Microsystems, Inc. Mechanism for reordering transactions in computer systems with snoop-based cache consistency protocols
US6654845B1 (en) * 2000-07-06 2003-11-25 Intel Corporation System and method implementing a secondary bus to avoid read data latency
EP1405184A2 (en) 2001-06-29 2004-04-07 Koninklijke Philips Electronics N.V. Data processing apparatus
US7234144B2 (en) * 2002-01-04 2007-06-19 Microsoft Corporation Methods and system for managing computational resources of a coprocessor in a computing system
JP2007504576A (ja) * 2003-01-17 2007-03-01 アヤラ,フランシスコ,ジェイ 人工知能を開発するためのシステム及び方法
US6956579B1 (en) * 2003-08-18 2005-10-18 Nvidia Corporation Private addressing in a multi-processor graphics processing system
US20060041715A1 (en) 2004-05-28 2006-02-23 Chrysos George Z Multiprocessor chip having bidirectional ring interconnect
CN1305002C (zh) 2004-07-15 2007-03-14 清华大学 多注册指纹融合方法
JP5040136B2 (ja) 2006-03-27 2012-10-03 富士通セミコンダクター株式会社 チューニング支援装置、チューニング支援プログラム、チューニング支援プログラムを記録したコンピュータ読み取り可能な記録媒体およびチューニング支援方法
JP4609521B2 (ja) * 2008-04-21 2011-01-12 ソニー株式会社 情報処理装置、および情報処理方法、並びにコンピュータ・プログラム
CN101739867B (zh) 2008-11-19 2012-03-28 中国科学院自动化研究所 运用计算机对口语翻译质量进行评分的方法
US8677075B2 (en) 2010-05-18 2014-03-18 Lsi Corporation Memory manager for a network communications processor architecture
CN102741828B (zh) 2009-10-30 2015-12-09 英特尔公司 对计算机平台的异构处理器的双向通信支持
US8635412B1 (en) 2010-09-09 2014-01-21 Western Digital Technologies, Inc. Inter-processor communication
CN101980149B (zh) 2010-10-15 2013-09-18 无锡中星微电子有限公司 主处理器与协处理器通信***及通信方法
CN102184157B (zh) 2011-05-19 2012-10-10 华东师范大学 一种基于双处理器协作的信息显示装置
CN102831011B (zh) 2012-08-10 2015-11-18 上海交通大学 一种基于众核***的任务调度方法及装置
CN102866912A (zh) 2012-10-16 2013-01-09 首都师范大学 一种单指令集异构多核***静态任务调度方法
CN102930866B (zh) 2012-11-05 2014-05-21 广州市神骥营销策划有限公司 一种用于口语练习的学生朗读作业的评判方法
CN103019656B (zh) 2012-12-04 2016-04-27 中国科学院半导体研究所 可动态重构的多级并行单指令多数据阵列处理***
CN103177733B (zh) 2013-03-11 2015-09-09 哈尔滨师范大学 汉语普通话儿化音发音质量评测方法与***
CN103347037A (zh) 2013-05-29 2013-10-09 成都瑞科电气有限公司 一种基于wcf实现的通信前置机***及通讯方法
CN103530600B (zh) 2013-06-06 2016-08-24 东软集团股份有限公司 复杂光照下的车牌识别方法及***
US20150012711A1 (en) 2013-07-04 2015-01-08 Vakul Garg System and method for atomically updating shared memory in multiprocessor system
CN105051689A (zh) 2013-09-29 2015-11-11 华为技术有限公司 一种多核***中资源池的调度方法、装置和***
WO2015099730A1 (en) 2013-12-26 2015-07-02 Intel Corporation Sharing memory and i/o services between nodes
CN104978971B (zh) 2014-04-08 2019-04-05 科大讯飞股份有限公司 一种口语评测方法及***
CN103928023B (zh) 2014-04-29 2017-04-05 广东外语外贸大学 一种语音评分方法及***
CN104021042A (zh) 2014-06-18 2014-09-03 哈尔滨工业大学 基于arm、dsp及fpga的异构多核处理器及任务调度方法
CN110992935B (zh) 2014-09-12 2023-08-11 微软技术许可有限责任公司 用于训练神经网络的计算***
CN104268603B (zh) 2014-09-16 2017-04-12 科大讯飞股份有限公司 用于文字性客观题的智能阅卷方法及***
US9971397B2 (en) 2014-10-08 2018-05-15 Apple Inc. Methods and apparatus for managing power with an inter-processor communication link between independently operable processors
CN104463101B (zh) 2014-11-06 2017-08-25 科大讯飞股份有限公司 用于文字性试题的答案识别方法及***
CN104464423A (zh) 2014-12-19 2015-03-25 科大讯飞股份有限公司 一种口语考试评测的校标优化方法及***
EP3035204B1 (en) * 2014-12-19 2018-08-15 Intel Corporation Storage device and method for performing convolution operations
KR20160091786A (ko) 2015-01-26 2016-08-03 삼성전자주식회사 사용자 관리 방법 및 사용자 관리 장치
US10972371B2 (en) * 2015-03-27 2021-04-06 Intel Corporation Technologies for GPU assisted network traffic monitoring and analysis
US20160342887A1 (en) * 2015-05-21 2016-11-24 minds.ai inc. Scalable neural network system
EP3317823A4 (en) * 2015-06-30 2019-03-13 Arizona Board of Regents on behalf of Arizona State University METHOD AND DEVICE FOR MACHINERY LEARNING IN A LARGE SCALE
CN106407145A (zh) 2015-08-03 2017-02-15 联想(北京)有限公司 接口访问方法、***及存储卡
CN105159762B (zh) 2015-08-03 2018-09-07 冷子阳 基于贪心策略的启发式云计算任务调度方法
CN105488565A (zh) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 加速深度神经网络算法的加速芯片的运算装置及方法
US9792896B2 (en) 2015-12-15 2017-10-17 Facebook, Inc. Providing intelligent transcriptions of sound messages in a messaging application
CN105678253B (zh) 2016-01-04 2019-01-18 东南大学 半监督人脸年龄估计装置及半监督人脸年龄估计方法
CN106056212B (zh) 2016-05-25 2018-11-23 清华大学 一种人工神经网络计算核
US10740107B2 (en) * 2016-06-01 2020-08-11 International Business Machines Corporation Operation of a multi-slice processor implementing load-hit-store handling
US10192281B2 (en) * 2016-07-07 2019-01-29 Intel Corporation Graphics command parsing mechanism
CN106502806B (zh) 2016-10-31 2020-02-14 华为技术有限公司 一种总线协议命令处理装置及相关方法
US10157045B2 (en) * 2016-11-17 2018-12-18 The Mathworks, Inc. Systems and methods for automatically generating code for deep learning systems
CN106781784A (zh) 2017-01-04 2017-05-31 王骁乾 一种智能批改***
CN106897248A (zh) 2017-01-08 2017-06-27 广东工业大学 基于异构多处理器阵列的低功耗重构技术
CN106682702A (zh) 2017-01-12 2017-05-17 张亮 深度学习方法和***
US10621586B2 (en) * 2017-01-31 2020-04-14 Paypal, Inc. Fraud prediction based on partial usage data
CN106909971A (zh) 2017-02-10 2017-06-30 华南理工大学 一种面向多核计算环境的bp神经网络并行化方法
US11003995B2 (en) * 2017-05-19 2021-05-11 Huawei Technologies Co., Ltd. Semi-supervised regression with generative adversarial networks
CN107844322B (zh) 2017-07-20 2020-08-04 上海寒武纪信息科技有限公司 用于执行人工神经网络正向运算的装置和方法
CN107590531A (zh) 2017-08-14 2018-01-16 华南理工大学 一种基于文本生成的wgan方法
CN107730474B (zh) * 2017-11-09 2022-02-22 京东方科技集团股份有限公司 图像处理方法、处理装置和处理设备
CN107832768A (zh) 2017-11-23 2018-03-23 盐城线尚天使科技企业孵化器有限公司 基于深度学习的高效阅卷方法和阅卷***
US10678508B2 (en) * 2018-03-23 2020-06-09 Amazon Technologies, Inc. Accelerated quantized multiply-and-add operations

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0039412A2 (en) * 1980-05-05 1981-11-11 International Business Machines Corporation High density memory system
CN101013410A (zh) * 2006-02-02 2007-08-08 松下电器产业株式会社 直接存储器存取传送装置
CN105357306A (zh) * 2015-11-17 2016-02-24 贵阳朗玛信息技术股份有限公司 多平台数据共享***及其数据共享方法
CN105512723A (zh) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 一种用于稀疏连接的人工神经网络计算装置和方法
CN105844330A (zh) * 2016-03-22 2016-08-10 华为技术有限公司 神经网络处理器的数据处理方法及神经网络处理器

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11775811B2 (en) * 2019-01-08 2023-10-03 Apple Inc. Scheduling heterogeneous execution on heterogeneous hardware
CN113168589A (zh) * 2019-01-10 2021-07-23 株式会社日立制作所 数据生成装置、预测器学习装置、数据生成方法和学习方法
CN113168589B (zh) * 2019-01-10 2024-06-04 株式会社日立制作所 数据生成装置、预测器学习装置、数据生成方法和学习方法
WO2020187041A1 (zh) * 2019-03-18 2020-09-24 北京灵汐科技有限公司 一种基于众核处理器的神经网络的映射方法及计算设备
CN111723900A (zh) * 2019-03-18 2020-09-29 北京灵汐科技有限公司 一种基于众核处理器的神经网络的映射方法及计算设备
CN111723900B (zh) * 2019-03-18 2023-10-20 北京灵汐科技有限公司 一种基于众核处理器的神经网络的映射方法及计算设备
WO2021050590A1 (en) * 2019-09-09 2021-03-18 Qualcomm Incorporated Systems and methods for modifying neural networks for binary processing applications
US11790241B2 (en) 2019-09-09 2023-10-17 Qualcomm Incorporated Systems and methods for modifying neural networks for binary processing applications
CN110796619A (zh) * 2019-10-28 2020-02-14 腾讯科技(深圳)有限公司 一种图像处理模型训练方法、装置、电子设备及存储介质
CN110796619B (zh) * 2019-10-28 2022-08-30 腾讯科技(深圳)有限公司 一种图像处理模型训练方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
US20200117519A1 (en) 2020-04-16
US11726844B2 (en) 2023-08-15
EP3637272A1 (en) 2020-04-15
US20200118004A1 (en) 2020-04-16
US10901815B2 (en) 2021-01-26
EP3637272A4 (en) 2020-09-02

Similar Documents

Publication Publication Date Title
WO2019001418A1 (zh) 数据共享***及其数据共享方法
CN109284823B (zh) 一种运算装置及相关产品
US11307865B2 (en) Data processing apparatus and method
US11106976B2 (en) Neural network output layer for machine learning
US20200089535A1 (en) Data sharing system and data sharing method therefor
US20200104167A1 (en) Data processing apparatus and method
JP2020537784A (ja) ニューラルネットワークアクセラレーションのための機械学習ランタイムライブラリ
CN109726806A (zh) 信息处理方法及终端设备
US11620521B2 (en) Smoothing regularization for a generative neural network
CN115495614A (zh) 使用一个或更多个神经网络的视频上采样
WO2019015541A1 (zh) 一种计算方法及相关产品
US20210295168A1 (en) Gradient compression for distributed training
CN113469355B (zh) 分布式***中的多模型训练管道
US20210011849A1 (en) Processor cluster address generation
US10997102B2 (en) Multidimensional address generation for direct memory access
CN113496271A (zh) 神经网络控制变量
CN116070557A (zh) 使用强化学习的数据路径电路设计
CN113762461A (zh) 使用可逆增强算子采用有限数据训练神经网络
US11521007B2 (en) Accelerator resource utilization by neural networks
CN116206042A (zh) 空间哈希一致采样
US20220391781A1 (en) Architecture-agnostic federated learning system
US20200192797A1 (en) Caching data in artificial neural network computations
US20230206113A1 (en) Feature management for machine learning system
US11605001B2 (en) Weight demodulation for a generative neural network
US11307866B2 (en) Data processing apparatus and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18824582

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2018824582

Country of ref document: EP

Effective date: 20191211

NENP Non-entry into the national phase

Ref country code: DE