CN109886416A - System-on-chip integrating an AI module, and machine learning method - Google Patents

System-on-chip integrating an AI module, and machine learning method

Info

Publication number
CN109886416A
CN109886416A
Authority
CN
China
Prior art keywords
module
memory
weight coefficient
bus
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910104560.4A
Other languages
Chinese (zh)
Inventor
连荣椿
王海力
马明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jing Wei Qi Li (Beijing) Technology Co Ltd
Original Assignee
Jing Wei Qi Li (Beijing) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jing Wei Qi Li (Beijing) Technology Co Ltd
Priority to CN201910104560.4A priority Critical patent/CN109886416A/en
Publication of CN109886416A publication Critical patent/CN109886416A/en
Legal status: Pending (current)

Landscapes

  • Logic Circuits (AREA)

Abstract

A system-on-chip integrating an artificial intelligence (AI) module, and a machine learning method. In an embodiment, the system-on-chip includes: a processor configured with a bus; an AI module connected to the bus through a bus interface module, the AI module including processing units with a first memory; an FPGA module connected to the bus through a bus interface module, providing the data for machine learning to the AI module; and a nonvolatile memory connected to the bus through a bus interface module, for storing weight coefficients. By saving the weight coefficients of the machine learning process in the nonvolatile memory, the machine learning process can be accelerated, and loss of learning progress due to unexpected interruptions can be avoided.

Description

System-on-chip integrating an AI module, and machine learning method
Technical field
The present invention relates to the technical field of integrated circuits, and more particularly to a system-on-chip integrating an AI module, and a machine learning method.
Background technique
In recent years, artificial intelligence has experienced a wave of rapid development. Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning). It encompasses both the principles by which computers realize intelligence and the construction of computers that behave similarly to the human brain, enabling computers to support higher-level applications.
Currently, an artificial intelligence module is accessed and controlled by a processor over a bus, but a bus has limited bandwidth. Such an architecture has difficulty meeting the high bandwidth demands of AI modules.
Summary of the invention
According to a first aspect, a system-on-chip is provided, comprising: a processor configured with a bus; at least one AI module connected to the bus through a bus interface module, each AI module of the at least one AI module including processing units with a first memory; at least one FPGA module connected to the bus through a bus interface module, providing the data for machine learning to the AI module; and a nonvolatile memory connected to the bus through a bus interface module, for storing weight coefficients. Under processor control, the AI module reads the weight coefficients from the nonvolatile memory, writes them into the first memory, and carries out machine learning using the weight coefficients in the first memory; the processor also controls writing the weight coefficients updated by the machine learning back into the nonvolatile memory.
Preferably, the system-on-chip includes a second memory used as a buffer, through which the weight coefficients can be read from the nonvolatile memory and written into the first memory, or read from the first memory and written into the nonvolatile memory.
Preferably, one module of the at least one FPGA module provides registers used as a buffer, through which the weight coefficients can be read from the nonvolatile memory and written into the first memory, or read from the first memory and written into the nonvolatile memory.
Preferably, the FPGA module provides the data for machine learning to the AI module.
According to a second aspect, a machine learning method is provided, implemented by a system-on-chip comprising: a processor configured with a bus; an AI module connected to the bus through a bus interface module, the AI module including processing units with a first memory; an FPGA module connected to the bus through a bus interface module; and a nonvolatile memory for storing weight coefficients. The method includes: reading the weight coefficients from the nonvolatile memory; writing the weight coefficients into the first memory of each processing unit of the AI module; and starting the AI module to carry out machine learning using the weight coefficients in the first memory.
Preferably, the method includes, after reading the weight coefficients from the nonvolatile memory, buffering the read data using a memory of the processor or registers of the FPGA.
According to a third aspect, a machine learning method is provided, implemented by a system-on-chip comprising: a processor configured with a bus; an AI module connected to the bus through a bus interface module, the AI module including processing units with a first memory that stores the weight coefficients for machine learning; an FPGA module connected to the bus through a bus interface module; and a nonvolatile memory. The method includes: the AI module carrying out machine learning using the weight coefficients stored in the memory of each of its processing units; updating the weight coefficients in the memory according to the machine learning; and writing the weight coefficients into the nonvolatile memory according to a control instruction of the processor.
Preferably, the method includes, after reading the weight coefficients from the nonvolatile memory, buffering the read data using a memory of the processor or registers of the FPGA module.
According to a fourth aspect, a machine learning method is provided, implemented by a system-on-chip comprising: a processor configured with a bus; an AI module connected to the bus through a bus interface module, the AI module including processing units with a first memory; an FPGA module connected to the bus through a bus interface module; and a nonvolatile memory for storing weight coefficients. The method includes: reading the weight coefficients from the nonvolatile memory; writing the weight coefficients into the first memory of each processing unit of the AI module; starting the AI module to carry out machine learning using the weight coefficients, the AI module carrying out the machine learning using the weight coefficients stored in the first memory of each of its processing units; updating the weight coefficients in the first memory according to the machine learning results; and periodically writing the weight coefficients into the nonvolatile memory.
By saving the weight coefficients of the machine learning process in the nonvolatile memory, the machine learning process can be accelerated, and loss of learning progress due to unexpected interruptions can be avoided.
Detailed description of the invention
Fig. 1 is a schematic structural diagram of a system-on-chip integrating an AI module according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an FPGA circuit;
Fig. 3 is a schematic structural diagram of an artificial intelligence module;
Fig. 4 is a schematic diagram of a processing unit;
Fig. 5 is a schematic diagram of the memory MEM in the processing unit of Fig. 4 implemented with word-wise access;
Fig. 6 is a schematic diagram of the memory MEM in the processing unit of Fig. 4 implemented with bit-wise access;
Fig. 7 is a schematic diagram of writing weights to and reading weights from each processing unit of the AI module.
Specific embodiment
To make the technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described in further detail below with reference to the drawings and embodiments.
In the description of the present application, terms indicating orientation or positional relationships, such as "center", "upper", "lower", "front", "rear", "left", "right", "east", "south", "west", "north", "vertical", "horizontal", "top", "bottom", "inner", and "outer", are based on the orientations or positional relationships shown in the drawings and are used only for convenience and simplicity of description. They do not indicate or imply that the devices or elements referred to must have a particular orientation or be constructed and operated in a particular orientation, and therefore should not be construed as limiting the present application.
Fig. 1 is a schematic structural diagram of a system-on-chip integrating an AI module according to an embodiment of the present invention. As shown in Fig. 1, at least one FPGA module and at least one artificial intelligence module are integrated on the system-on-chip.
Each FPGA module can implement various functions such as logic, computation, and control. An FPGA uses small look-up tables (for example, 16×1 RAM) to implement combinational logic. Each look-up table is connected to the input of a D flip-flop, and the flip-flop in turn drives other logic circuits or I/O; this forms a basic logic unit that can implement both combinational and sequential logic functions. These unit modules are interconnected with one another, or connected to I/O modules, by metal routing lines. The logic of an FPGA is realized by loading programming data into internal static memory cells: the values stored in the memory cells determine the logic function of each logic unit and the connections between modules or between modules and I/O, and ultimately determine the functions the FPGA implements. An FPGA module can be configured with configurable inputs/outputs (C.IO).
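To make this basic logic unit concrete, here is a minimal Python sketch of a 16×1 look-up table feeding a D flip-flop. The class and method names are illustrative assumptions, not part of the patent:

```python
class LogicElement:
    """Minimal model of an FPGA basic logic unit: a 4-input LUT
    (16x1 RAM) whose output feeds a D flip-flop."""

    def __init__(self, truth_table):
        assert len(truth_table) == 16   # one stored bit per input combination
        self.truth_table = truth_table  # the "programming data" loaded at configuration
        self.q = 0                      # D flip-flop state

    def lut(self, a, b, c, d):
        # Combinational path: the four inputs address one of the 16 stored bits.
        return self.truth_table[(a << 3) | (b << 2) | (c << 1) | d]

    def clock(self, a, b, c, d):
        # Sequential path: on a clock edge the LUT output is latched.
        self.q = self.lut(a, b, c, d)
        return self.q

# Configure the LUT as a 4-input AND gate by loading the matching truth table.
and4 = LogicElement([1 if i == 0b1111 else 0 for i in range(16)])
assert and4.clock(1, 1, 1, 1) == 1
```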
Each artificial intelligence module can implement or accelerate a preselected specific AI function, covering artificial intelligence (AI), deep learning (DL), machine learning (ML), and the like, including specific steps of various algorithms or accelerated algorithms (such as convolution and matrix/tensor operations). An artificial intelligence (AI) module may include an array composed of multiple functional modules (FU); each functional module may include an ALU or functional unit such as a multiply-accumulator (MAC), registers, multiplexers (MUX), and so on. The artificial intelligence module is configured with fixed inputs/outputs (F.IO), and may of course also include configurable inputs/outputs (Configurable IO).
The sizes of the FPGA modules and the artificial intelligence modules are not limited; they are determined at design time by the practical application.
In terms of chip layout, an FPGA module is usually arranged adjacent to an artificial intelligence module. The FPGA module and the AI module can be placed side by side, in which case the FPGA module can transfer data to, and provide control for, the AI module. The AI module can also be embedded inside the FPGA module; for example, when the FPGA module is large and the artificial intelligence module is small, a window can be hollowed out in a large FPGA module and the artificial intelligence module embedded in it. In that case, the AI module needs to reuse the routing structure of the FPGA module, sending and receiving data through the reused FPGA routing.
A processor is also integrated on the system-on-chip, using, for example, an ARM+8051, ARM+RISC_V, or RISC_V+8051 architecture. The processor has a bus (BUS) and can access other devices through the bus. The processor may have on-chip local memory.
Each FPGA module is connected to the BUS through its own bus interface module (BIM). Likewise, each AI module is connected to the BUS through its own BIM.
In one example, the system-on-chip further provides an interface corresponding to the AI module, and the FPGA module and the AI module are connected through this interface module. The interface module may be a crossbar (XBAR) module, composed for example of multiple selectors (multiplexers) and selection bits. The interface module may also be a FIFO (first-in first-out), or a synchronizer, for example two flip-flops (Flip-Flop, FF) connected in series. The FPGA module can transfer data to, and provide control for, the AI module. The interface module may be an additional circuit module, an interface module built into the FPGA, or both at once.
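Of the interface options above, the synchronizer is the simplest to picture: two flip-flops in series. A minimal Python sketch of its behavior (class and method names are illustrative assumptions, not from the patent):

```python
class Synchronizer:
    """Model of the two-flip-flop synchronizer interface module: an input
    signal reaches the output only after passing both stages, giving a
    potentially metastable first stage time to settle in real hardware."""

    def __init__(self):
        self.ff1 = 0  # first flip-flop stage
        self.ff2 = 0  # second flip-flop stage, drives the receiving side

    def clock(self, d):
        # On each clock edge the signal advances one stage down the chain.
        self.ff2, self.ff1 = self.ff1, d
        return self.ff2

sync = Synchronizer()
outputs = [sync.clock(d) for d in [1, 1, 1, 0]]
assert outputs == [0, 1, 1, 1]  # each input emerges one clock edge later
```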
In an embodiment, when applied to machine learning, the weight parameters stored in the memories MEM in the AI module are continuously updated according to the operation results, so as to achieve the learning objective.
At the start, the processor sets the memory MEM of each processing unit PE in the AI module to the initial learning state by writing the weights into the MEM module of each PE.
During normal learning, the system processes large amounts of data, observes the output results, and decides how to progressively update the weights, gradually optimizing the output. In general, learning uses large amounts of data and the process is time-consuming and long-running.
During learning, the process can be paused periodically (for example at set checkpoints), and the contents of the MEM in each PE of the AI module (the weight data) are saved to NVM (nonvolatile memory, Non-Volatile Memory), such as flash (FLASH), magnetoresistive memory (MRAM), or resistive memory (ReRAM).
After the learning/training, the weight coefficients are determined. The AI module can then carry out the relevant computing applications based on the weight coefficients optimized by the machine learning.
In general, the system needs to restart after problems such as power loss or interruption during the learning process. Since the parameters have been stored in the NVM in advance, on restart there is no need to begin learning from scratch: the latest stored learning results only need to be fetched from the NVM and placed back into the MEM of each PE of the AI module, restoring the previous learning state, and the remaining learning process can then continue.
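The checkpoint-and-resume flow described above can be summarized in the following Python sketch. All names (nvm, ai_module, checkpoint_interval) are hypothetical placeholders for the hardware operations the patent describes, not an API it defines:

```python
def train_with_checkpoints(ai_module, nvm, data_batches, checkpoint_interval=1000):
    """Training loop that periodically saves the PE weights to nonvolatile
    memory, so an interrupted run can resume instead of restarting."""
    # Resume path: if a checkpoint exists in NVM, restore it; otherwise initialize.
    if nvm.has_checkpoint():
        weights = nvm.load_weights()           # fetch the latest saved learning state
    else:
        weights = ai_module.initial_weights()  # fresh initial learning state
    ai_module.write_pe_memories(weights)       # load the MEM of every processing unit

    for step, batch in enumerate(data_batches):
        ai_module.learn(batch)                 # weights update from operation results
        if step % checkpoint_interval == 0:    # a set checkpoint is reached
            ai_module.pause()                  # pause learning (enable signal EN)
            nvm.store_weights(ai_module.read_pe_memories())
            ai_module.resume()
```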
In one embodiment, an SRAM or a register file (Register File) can be used as a buffer when actually reading and writing the NVM. In one example, an EMB in the FPGA can serve as the buffer.
In one case, under processor control, the weight parameters are read from the NVM and written to the AI module through the bus and the BIM. Along the opposite path, the updated weights can be read out of the AI module and written into the NVM.
In another case, under processor control, the weight parameters are read from the NVM over the bus, written into the SRAM, and then written to the AI module. Along the opposite path, the updated weights can be read out of the AI module and written into the NVM.
In yet another case, the processor first configures part of the EMB in the FPGA module as a buffer; under processor control, the weight parameters are read from the NVM over the bus, written into the EMB of the FPGA, and then written to the AI module. The updated weights can likewise be read out of the AI module and written into the NVM along the opposite path.
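The three transfer paths just described (direct, via SRAM, via an FPGA EMB) differ only in the buffering stage. A minimal sketch under assumed read/write helpers; none of these names come from the patent:

```python
def transfer_weights(src, dst, buffer=None, chunk_size=256):
    """Move weight coefficients from src to dst, optionally staging each
    chunk through a buffer (an SRAM, a register file, or an FPGA EMB)."""
    for offset in range(0, src.size, chunk_size):
        chunk = src.read(offset, chunk_size)
        if buffer is not None:
            chunk_len = len(chunk)
            buffer.write(0, chunk)            # stage the chunk in the buffer
            chunk = buffer.read(0, chunk_len)
        dst.write(offset, chunk)

# Direct path:      transfer_weights(nvm, ai_module_mem)
# Via SRAM buffer:  transfer_weights(nvm, ai_module_mem, buffer=sram)
# Via FPGA EMB:     transfer_weights(nvm, ai_module_mem, buffer=emb)
# The opposite path writes updated weights back: transfer_weights(ai_module_mem, nvm)
```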
Fig. 2 is a schematic structural diagram of an FPGA circuit. As shown in Fig. 2, the FPGA circuit may include modules such as multiple programmable logic blocks (LOGIC/PLB), embedded memory blocks (EMB), multiply-accumulators (MAC), and the corresponding routing (XBAR). The FPGA circuit is of course also provided with related resources such as clock/configuration modules (spine/branch routing). When an EMB or MAC module is needed, since its area is much larger than that of a PLB, several PLB modules are replaced by that EMB/MAC module.
A LOGIC module may include, for example, eight 6-input look-up tables and 18 registers.
An EMB includes several small memory blocks that can be cascaded into a large block of storage, for example 36 Kb, with a choice of various width/depth configurations.
A MAC module can be, for example, a 25×18 multiplier or two 18×18 multipliers, and can also be paired with, for example, a 48-bit accumulator.
In the FPGA array, the proportions of LOGIC, MAC, and EMB modules are not restricted, and the size of the array is likewise determined at design time, as needed, by the practical application.
The routing resources (XBAR) are the junction points interconnecting the modules and are evenly distributed in the FPGA module. All resources in the FPGA module, i.e. the PLB, EMB, MAC, and IO, are routed to one another through an identical interface, the XBAR routing unit. From the routing point of view, the entire array is uniform: the regularly arranged XBAR units form a grid that connects all the modules in the FPGA.
Fig. 3 is a schematic structural diagram of the artificial intelligence module. As shown in Fig. 3, the artificial intelligence (AI) module is a two-dimensional array, for example of 4×4 execution units (EU). The AI module can be divided along two mutually perpendicular dimensions, a first dimension and a second dimension. Take a first execution unit, a second execution unit, and a third execution unit as an example. The first execution unit and the second execution unit are adjacent along the first dimension in a first direction; the first output of the first execution unit along the first direction is coupled to the first input of the second execution unit facing the opposite direction. The first execution unit and the third execution unit are adjacent along the second dimension in a second direction; the second output of the first execution unit along the second direction is coupled to the second input of the third execution unit facing the opposite direction.
One-dimensional data a can be input in parallel, on the same clock, along the first dimension into the processing units that share the same second-dimension coordinate. In each processing unit, the data is multiplied by the other-dimension data (a coefficient) W stored in that unit; the products are passed from processing unit to processing unit along the second dimension in the second direction and accumulated. For convenience of understanding, hereafter the horizontal dimension is taken as the first dimension with left-to-right as the first direction, and the vertical dimension as the second dimension with top-to-bottom as the second direction.
After receiving data, an execution unit performs various operations on it, such as addition, subtraction, multiplication, division, or logic operations, and outputs the operation result along the first direction of the first dimension or the second direction of the second dimension.
Of course, under suitable control, the same data, or data derived from it, can flow through all PE units on different clocks.
Note that each data line in Fig. 3 may represent either a single-bit signal or an 8-bit (or 16-bit, 32-bit) signal.
In one example, the artificial intelligence module can implement matrix multiplication. In another example, the two-dimensional array can implement a convolution operation.
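As an illustration of this dataflow, the following Python sketch simulates one weight-stationary pass through a Fig. 3-style array: inputs stream along the first dimension, and partial sums accumulate along the second. It is an assumption about how the array computes a matrix-vector product, not code from the patent:

```python
import numpy as np

def systolic_pass(a, W):
    """One pass of a Fig. 3-style array for an input vector `a`.
    PE(i, j) holds the stationary weight W[i, j]; data a[i] flows along
    row i (first dimension, left to right) and partial sums accumulate
    down column j (second dimension, top to bottom)."""
    n, m = W.shape
    partial = np.zeros(m)            # the sums flowing down each column
    for i in range(n):               # step along the second dimension
        for j in range(m):           # data a[i] passes each PE in row i
            # PE(i, j): multiply incoming data by the stored weight and add
            # the partial sum arriving from the PE above (port PI -> PO).
            partial[j] += a[i] * W[i, j]
    return partial                   # equals a @ W

a = np.array([1.0, 2.0, 3.0])
W = np.arange(12.0).reshape(3, 4)
assert np.allclose(systolic_pass(a, W), a @ W)
```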
Although Fig. 3 illustrates an AI module with one-way data flow, the embodiments of the present invention are applicable to other types of AI modules, such as AI modules with bidirectional data transfer.
Fig. 4 is a schematic diagram of a processing unit. As shown in Fig. 4, the processing unit includes a multiplier MUL and an adder ADD. Data is input from the first data input port DI along the first direction of the first dimension and, at the multiplier MUL, is multiplied by the weight coefficient W stored in the coefficient memory MEM; the product is then added, at the adder ADD, to the data P from the second data input port PI, and the resulting sum is stored in register REG1. On the next clock, the sum S is output through the second output port PO along the second direction of the second dimension. After being output through the second output port PO, the sum S can be input through the input port PI of another PE located below.
Of course, the data a can also be stored in register REG2 and, under clock control, output through the first output port DO along the first direction of the first dimension to the processing unit PE on the right.
The clock CK controls the synchronous operation of the processing units.
The enable signal EN is used to start or pause the processing of a processing unit.
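A minimal Python model of the Fig. 4 processing unit, under the assumption that REG1 and REG2 latch on every enabled clock edge (port and signal names follow the figure; the class itself is illustrative):

```python
class ProcessingUnit:
    """Model of the Fig. 4 PE: MUL, ADD, coefficient memory MEM, and
    output registers REG1 (sum, port PO) and REG2 (data, port DO)."""

    def __init__(self, weight):
        self.mem = weight   # stored weight coefficient W
        self.reg1 = 0       # sum register REG1 -> second output port PO
        self.reg2 = 0       # data register REG2 -> first output port DO

    def clock(self, di, pi, en=True):
        """One clock CK: di arrives on port DI, pi on port PI."""
        if not en:                        # EN pauses the unit
            return self.reg1, self.reg2
        s = di * self.mem + pi            # MUL, then ADD with the sum from above
        self.reg1, self.reg2 = s, di      # latch the sum; forward the data
        return self.reg1, self.reg2       # (PO, DO) for the next cycle

pe = ProcessingUnit(weight=3)
po, do = pe.clock(di=2, pi=10)            # 2*3 + 10 = 16 on PO; 2 passes to DO
assert (po, do) == (16, 2)
```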
Fig. 5 is a schematic diagram of the memory MEM in the processing unit of Fig. 4 implemented with word-wise access. The memory can be accessed through a dedicated path. As shown in Fig. 5, the memory includes multiple D flip-flops cascaded with one another, i.e. the output of each D flip-flop is connected in series to the input of the next; the coefficient data is input bit by bit at the D input of the first flip-flop, and the outputs Q of the flip-flops are Q0-Q7. Q0-Q7 can be provided (northward in the figure) as the coefficient data. The clock CK controls the synchronous operation of the D flip-flops. The enable signal EN determines whether a D flip-flop runs or pauses. Note that the enable signals EN of different components are distinct, so different components do not necessarily start or pause together.
Fig. 6 is a schematic diagram of the memory MEM in the processing unit of Fig. 4 implemented with bit-wise access. The only difference from Fig. 5 is that the memory MEM is accessed bit by bit.
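The serially loaded weight memory of Figs. 5 and 6 behaves like an 8-bit shift register. A minimal sketch, assuming an 8-flip-flop chain as drawn in Fig. 5 (class and method names are illustrative):

```python
class ShiftRegisterMEM:
    """Model of the Fig. 5 MEM: eight cascaded D flip-flops. Bits are
    shifted in serially at D; Q0..Q7 can be read out in parallel."""

    def __init__(self):
        self.q = [0] * 8                  # flip-flop outputs Q0..Q7

    def clock(self, d, en=True):
        if en:                            # EN gates the shift
            self.q = [d] + self.q[:-1]    # each FF latches its predecessor

    def word(self):
        # Word-wise read: Q0..Q7 provided in parallel as the coefficient.
        return sum(bit << i for i, bit in enumerate(self.q))

mem = ShiftRegisterMEM()
for bit in [1, 0, 1, 0, 0, 0, 0, 0]:      # shift a weight in serially
    mem.clock(bit)
assert mem.word() == 0b10100000           # the first-shifted bit lands in the MSB
```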
Fig. 7 is a schematic diagram of writing weights to and reading weights from each processing unit of the AI module. When weights need to be written, the write-weight module on the left can read the weights from the NVM and then write them into the AI module. When weights need to be read out of the AI module, the weight data in the AI module can be written sequentially into the read-weight module and then written into the NVM. The write-weight module and the read-weight module may be implemented as separate circuit modules, or by sub-modules of the FPGA module.
The specific embodiments described above further detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (9)

1. A system-on-chip, comprising:
a processor, configured with a bus;
at least one artificial intelligence (AI) module, connected to the bus through a bus interface module, each AI module of the at least one AI module comprising processing units with a first memory;
at least one FPGA module, connected to the bus through a bus interface module and providing data for machine learning to the AI module; and
a nonvolatile memory, connected to the bus through a bus interface module, for storing weight coefficients;
wherein, under control of the processor, the AI module reads the weight coefficients from the nonvolatile memory, writes them into the first memory, and carries out machine learning using the weight coefficients in the first memory; and the processor also controls writing the weight coefficients updated by the machine learning into the nonvolatile memory.
2. The system-on-chip according to claim 1, comprising a second memory used as a buffer; wherein the weight coefficients can be read from the nonvolatile memory and written into the first memory through the second memory, or read from the first memory and written into the nonvolatile memory through the second memory.
3. The system-on-chip according to claim 1, wherein one module of the at least one FPGA module provides registers used as a buffer; wherein the weight coefficients can be read from the nonvolatile memory and written into the first memory through the registers, or read from the first memory and written into the nonvolatile memory through the registers.
4. The system-on-chip according to claim 1, wherein the FPGA module provides the data for machine learning to the AI module.
5. A machine learning method, implemented by a system-on-chip, the system-on-chip comprising: a processor configured with a bus; an AI module connected to the bus through a bus interface module, the AI module comprising processing units with a first memory; an FPGA module connected to the bus through a bus interface module; and a nonvolatile memory for storing weight coefficients; the method comprising: reading the weight coefficients from the nonvolatile memory; writing the weight coefficients into the first memory of each processing unit of the AI module; and starting the AI module to carry out machine learning using the weight coefficients in the first memory.
6. The machine learning method according to claim 5, comprising, after reading the weight coefficients from the nonvolatile memory, buffering the read data using a memory of the processor or registers of the FPGA.
7. A machine learning method, implemented by a system-on-chip, the system-on-chip comprising: a processor configured with a bus; an AI module connected to the bus through a bus interface module, the AI module comprising processing units with a first memory, the first memory storing weight coefficients for machine learning; an FPGA module connected to the bus through a bus interface module; and a nonvolatile memory; the method comprising: the AI module carrying out machine learning using the weight coefficients stored in the memory of each of its processing units; updating the weight coefficients in the memory according to the machine learning; and writing the weight coefficients into the nonvolatile memory according to a control instruction of the processor.
8. The machine learning method according to claim 7, comprising, after reading the weight coefficients from the nonvolatile memory, buffering the read data using a memory of the processor or registers of the FPGA module.
9. A machine learning method, implemented by a system-on-chip, the system-on-chip comprising: a processor configured with a bus; an AI module connected to the bus through a bus interface module, the AI module comprising processing units with a first memory; an FPGA module connected to the bus through a bus interface module; and a nonvolatile memory for storing weight coefficients; the method comprising:
reading the weight coefficients from the nonvolatile memory; writing the weight coefficients into the first memory of each processing unit of the AI module; starting the AI module to carry out machine learning using the weight coefficients, the AI module carrying out the machine learning using the weight coefficients stored in the first memory of each of its processing units; updating the weight coefficients in the first memory according to the machine learning results; and periodically writing the weight coefficients into the nonvolatile memory.
CN201910104560.4A 2019-02-01 2019-02-01 System-on-chip integrating an AI module, and machine learning method Pending CN109886416A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910104560.4A CN109886416A (en) 2019-02-01 2019-02-01 System-on-chip integrating an AI module, and machine learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910104560.4A CN109886416A (en) 2019-02-01 2019-02-01 System-on-chip integrating an AI module, and machine learning method

Publications (1)

Publication Number Publication Date
CN109886416A true CN109886416A (en) 2019-06-14

Family

ID=66927772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910104560.4A Pending CN109886416A (en) 2019-02-01 2019-02-01 System-on-chip integrating an AI module, and machine learning method

Country Status (1)

Country Link
CN (1) CN109886416A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102971754A (en) * 2010-07-07 2013-03-13 高通股份有限公司 Methods and systems for replaceable synaptic weight storage in neuro-processors
US9542643B2 (en) * 2013-05-21 2017-01-10 Qualcomm Incorporated Efficient hardware implementation of spiking networks
CN106462803A (en) * 2014-10-16 2017-02-22 谷歌公司 Augmenting neural networks with external memory
CN107548492A * 2015-04-30 2018-01-05 密克罗奇普技术公司 CPU with enhanced instruction set
CN107797935A * 2016-09-06 2018-03-13 三星电子株式会社 Storage device and access method for nonvolatile memory device
CN106529670A (en) * 2016-10-27 2017-03-22 中国科学院计算技术研究所 Neural network processor based on weight compression, design method, and chip
CN106447034A * 2016-10-27 2017-02-22 中国科学院计算技术研究所 Neural network processor based on data compression, design method and chip
CN108108810A * 2016-11-24 2018-06-01 三星电子株式会社 Storage device including a nonvolatile memory device, and access method
CN107368886A * 2017-02-23 2017-11-21 奥瞳系统科技有限公司 Neural network system based on reuse of small-scale convolutional neural network modules
CN107102962A (en) * 2017-04-27 2017-08-29 科大讯飞股份有限公司 Board circuit and computer equipment based on PLD
CN107292342A (en) * 2017-06-21 2017-10-24 广东欧珀移动通信有限公司 Data processing method and related product
CN108415331A * 2018-03-13 2018-08-17 算丰科技(北京)有限公司 AI deep learning board and power supply method thereof
CN109032987A * 2018-07-05 2018-12-18 山东超越数控电子股份有限公司 FPGA-based computer system and method for accelerating a domestic processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
唐翔 et al.: "《微机原理与接口技术》" (Microcomputer Principles and Interface Technology), 31 July 2011, 中国铁道出版社 (China Railway Publishing House) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114341826A (en) * 2019-07-09 2022-04-12 麦姆瑞克斯公司 Non-volatile memory based processor and data flow techniques
CN110321319A (en) * 2019-09-02 2019-10-11 广东高云半导体科技股份有限公司 System on chip
CN110321319B (en) * 2019-09-02 2020-09-29 广东高云半导体科技股份有限公司 System on chip

Similar Documents

Publication Publication Date Title
CN110383237A (en) Reconfigurable matrix multiplier system and method
CN107992943 Addressing for convolutional neural networks
US5617512A (en) Triangular scalable neural array processor
CN104571949B Memristor-based processor fusing computation with storage, and operating method thereof
CN109992743A (en) Matrix multiplier
CN110033080A (en) Monoplane filtering
CN102541809B (en) Dynamic reconfigurable processor
CN107392309A FPGA-based general-purpose fixed-point neural network convolution accelerator hardware architecture
CN104969215A (en) Vector processing engines having programmable data path configurations for providing multi-mode radix-2x butterfly vector processing circuits, and related vector processors, systems, and methods
CN105027109A (en) Vector processing engines having programmable data path configurations for providing multi-mode vector processing, and related vector processors, systems, and methods
CN104145281A (en) Neural network computing apparatus and system, and method therefor
CN109886416A (en) The System on Chip/SoC and machine learning method of integrated AI's module
CN100580621C (en) Data processing device, control method, automatic control device, terminal and generation method
CN103793483A (en) Clock tree generating method and system based on clock node clusters
CN104679670A (en) Shared data caching structure and management method for FFT (fast Fourier transform) and FIR (finite impulse response) algorithms
JPH0792790B2 (en) Vector parallel computer
CN109902063A System-on-chip integrating a two-dimensional convolution array
CN115803811A (en) Inter-layer communication techniques for memory processing unit architectures
CN109902040A System-on-chip integrating an FPGA and an artificial intelligence module
CN109902835A Artificial intelligence module whose processing units are provided with general-purpose algorithm units, and system-on-chip
CN109857024A Unit performance test method for an artificial intelligence module, and system-on-chip
CN115167815A (en) Multiplier-adder circuit, chip and electronic equipment
CN109933369B System-on-chip integrating an artificial intelligence module with a single-instruction multiple-data-stream architecture
CN109902836A Fault tolerance method for an artificial intelligence module, and system-on-chip
CN101236576B (en) Interconnecting model suitable for heterogeneous reconfigurable processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190614