CN116822600A - Neural network search chip based on RISC-V architecture - Google Patents

Neural network search chip based on RISC-V architecture

Info

Publication number
CN116822600A
CN116822600A
Authority
CN
China
Prior art keywords
neural network
data
chip
risc
memory
Prior art date
Legal status
Pending
Application number
CN202311105683.2A
Other languages
Chinese (zh)
Inventor
Liu Bin (刘斌)
Yuan Zihan (袁梓涵)
Pan Biao (潘彪)
Kang Wang (康旺)
Current Assignee
Beijing Jinghanyu Electronic Engineering Technology Co ltd
Original Assignee
Beijing Jinghanyu Electronic Engineering Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jinghanyu Electronic Engineering Technology Co., Ltd.
Priority to CN202311105683.2A
Publication of CN116822600A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a neural network search chip based on the RISC-V architecture, belonging to the technical field of neural network chips. The chip comprises: a CPU controller module, communicatively connected to a Memory unit, which processes input data and model parameters for the ResNet neural-architecture search task issued by the host computer to obtain the searched neural network structure; a DMA unit, communicatively connected to the CPU controller module, the Memory unit, and the on-chip storage module; and a digital compute-in-memory (CIM) core module, which trains the searched neural network structure according to data instructions. By completing neural network training and search with the digital CIM core and a RISC-V CPU, the invention improves the scheduling efficiency of data transmitted over the bus; at the same time, the digital CIM core under this architecture can fully utilize its compute resources for network inference, overcoming the limitations of conventional memory-bound compute and achieving faster, more accurate neural network search.

Description

Neural network search chip based on RISC-V architecture
Technical Field
The invention relates to the technical field of neural network chips, in particular to a neural network search chip based on a RISC-V architecture.
Background
In recent years, with the development of society, artificial intelligence (AI) has been widely deployed in cloud data centers, intelligent terminal devices, and edge devices. Neural networks have shown strong capabilities across AI tasks, including but not limited to image recognition, face recognition, keyword detection, and text processing. As the performance of artificial neural networks keeps improving, deep neural networks containing large numbers of parameters and operations are widely used. During computation, operations such as convolution, pooling, and activation generate large amounts of intermediate data and require extensive data movement, which places higher demands on the performance and power consumption of AI chips.
Training neural networks on the traditional von Neumann architecture, where computation and storage are separated, faces a number of limitations: the bus bandwidth for moving data between the processor and memory is limited; in a hierarchical memory structure, the memory levels far from the processor have large capacity but low bandwidth and high latency; and neural networks involve large data volumes and high parallelism, requiring massive data movement. A great share of training time and energy is spent on data transport and memory reads and writes, so raising compute performance requires high-bandwidth memory and novel computing architectures. Moreover, as circuit technology advances, the performance gap between processor and memory keeps widening, and system performance is increasingly bounded by memory performance. In addition, neural network algorithms on the von Neumann architecture incur frequent data movement whose power consumption far exceeds that of computation itself, and the resulting traffic occupies CPU resources and bandwidth, lowering CPU compute efficiency and raising power consumption.
Compute-in-memory (CIM) chips offer a new approach to neural network acceleration: a CIM chip embeds compute operations inside the memory units, eliminating the boundary between memory and processing unit. Under the conventional von Neumann architecture, which processes data serially, the ever-growing performance gap between processors and off-chip memory means that memory-access performance largely determines system performance. A CIM chip, in contrast, processes data in parallel, reduces the transport of intermediate data during computation, and requires fewer off-chip memory accesses, so the compute efficiency of the CIM architecture far exceeds that of the von Neumann architecture, most notably in convolution and big-data processing scenarios.
At present, neural network models emerge endlessly, and various models play important roles in deep learning tasks such as image processing and machine translation. The network structure within a model is a special kind of hyperparameter: structures such as Transformer and ResNet must be designed precisely to realize the performance of deep learning models. Neural architecture search (NAS) turns the traditional manual design of network structures into an automated design. Google and others have previously performed neural architecture search through reinforcement learning, surpassing existing networks on image classification and language modeling tasks.
However, existing neural networks come in many varieties, and relying on hand-designed empirical models often fails to reach ideal results.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a neural network search chip based on a RISC-V architecture.
In order to achieve the above object, the present invention provides the following solutions:
A neural network search chip based on a RISC-V architecture, comprising:
a CPU controller module, communicatively connected to the Memory unit, for processing input data and model parameters for the ResNet neural-architecture search task issued by the host computer to obtain the searched neural network structure, wherein the CPU controller module uses the RISC-V reduced instruction set architecture to control data flow and data buffering;
a DMA unit, communicatively connected to the CPU controller module, the Memory unit, and the on-chip storage module, respectively; and
a digital compute-in-memory (CIM) core module, communicatively connected to the on-chip storage module, for training the searched neural network structure according to data instructions.
Preferably, training the searched neural network structure according to data instructions includes:
using the task scheduler to access the data instructions directly through the DMA unit and store them in a buffer of the on-chip storage module; using the CPU controller module to decode specific instruction bits from machine code, identify the instruction requirements, and issue control signals; and converting the control signals into groups of 8-bit data that are transmitted to the digital CIM core module, configuring it to train the searched neural network structure.
Preferably, during the neural network search process, the neural network structure is search-trained in a pipelined manner.
Preferably, the digital CIM core module comprises an ALU unit.
Preferably, the ALU unit comprises an adder, a multiplier, and a shift accumulator; during the operation flow, the multiplier first multiplies the data and addresses, the adder then adds the multiplied results, and finally the shift accumulator processes the summed results to obtain the operation result.
Preferably, the on-chip storage module contains multiple layers of data buffers, each buffered by a FIFO.
Preferably, the digital CIM core module is communicatively connected to the on-chip storage module through an on-chip bus interface.
Preferably, the Memory unit is an SRAM memory.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a neural network search chip based on RISC-V architecture, comprising: the CPU controller module is in communication connection with the Memory storage unit and is used for processing input data and model parameters based on a Resnet neural network searching task of the upper computer to obtain a neural network structure after searching; wherein the CPU controller module controls data flow and data buffering using a RISC-V reduced instruction set architecture; the DMA unit is respectively in communication connection with the CPU controller module, the Memory storage unit and the on-chip storage module; and the digital memory and calculation integrated core module is in communication connection with the on-chip storage module and is used for training the searched neural network structure according to the data instruction. According to the invention, the training and searching of the neural network are completed by utilizing the digital memory integrated core and the CPU of the RISC-V architecture, so that the dispatching efficiency of bus transmission data can be improved, meanwhile, the digital memory integrated core under the architecture can fully utilize the computational power resource to realize the network reasoning task, the limitation of the traditional memory computational power is overcome, and the neural network searching with higher speed and higher precision is realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of the neural architecture search framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the neural network search hardware channel design according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the RISC-V based digital CIM processor logic architecture according to an embodiment of the present invention;
FIG. 4 is a diagram of the overall structure of the RISC-V controlled digital CIM neural network search according to an embodiment of the present invention;
FIG. 5 is a flowchart of the digital CIM operation flow provided in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the digital CIM unit structure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the hardware pipeline structure in the search space according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a neural network search chip based on a RISC-V architecture, comprising:
a CPU controller module, communicatively connected to the Memory unit, for processing input data and model parameters for the ResNet neural-architecture search task issued by the host computer to obtain the searched neural network structure, wherein the CPU controller module uses the RISC-V reduced instruction set architecture to control data flow and data buffering. During the neural network search process, the invention search-trains the neural network structure in a pipelined manner.
A DMA unit, communicatively connected to the CPU controller module, the Memory unit, and the on-chip storage module, respectively;
and a digital CIM core module, communicatively connected to the on-chip storage module, for training the searched neural network structure according to data instructions. It should be noted that the on-chip storage module contains multiple layers of data buffers, each buffered by a FIFO, and that the digital CIM core module is communicatively connected to the on-chip storage module through an on-chip bus interface.
Specifically, training the searched neural network structure according to data instructions includes:
using the task scheduler to access the data instructions directly through the DMA unit and store them in a buffer of the on-chip storage module; using the CPU controller module to decode specific instruction bits from machine code, identify the instruction requirements, and issue control signals; and converting the control signals into groups of 8-bit data that are transmitted to the digital CIM core module, configuring it to train the searched neural network structure.
Further, the digital CIM core module comprises an ALU unit; the ALU unit comprises an adder, a multiplier, and a shift accumulator. During the operation flow, the multiplier first multiplies the data and addresses, the adder then adds the multiplied results, and finally the shift accumulator processes the summed results to obtain the operation result.
By completing neural network training and search with the digital CIM core and a RISC-V CPU, the invention improves the scheduling efficiency of data transmitted over the bus; at the same time, the digital CIM core under this architecture can fully utilize its compute resources for network inference, overcoming the limitations of conventional memory-bound compute and achieving faster, more accurate neural network search.
In order that the above-recited objects, features, and advantages of the present invention may become more readily apparent, a more particular description of the invention is given below with reference to the accompanying drawings and the detailed embodiments.
Referring to FIG. 1, RISC-V is an open-source architecture based on a reduced instruction set. It is simple, easy to port, and modular: on top of its roughly 40 base instructions, custom instruction-set extensions can be added, so it can be applied flexibly in low-power, fast-turnaround scenarios. A RISC-V processor consists of a processor core, physical memory, I/O units, and fixed-function accelerators.
The invention divides the construction of a neural network into two parts: neural network search and neural network training. Neural network search generates a network structure using genetic algorithms, reinforcement learning, and the like; this is mostly serial computation and suited to execution on a CPU, so the invention designs a CPU based on the RISC-V instruction set. Neural network training trains the searched network structure; this is mostly parallel computation and suited to execution on a compute-in-memory processor, so the invention designs a digital CIM coprocessor. Finally, the two parts are combined into a neural-architecture-search compute-in-memory chip system supporting the RISC-V architecture.
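To make this division of labor concrete, the following minimal sketch (Python, with hypothetical names; the patent does not publish its search code) shows a serial evolutionary search loop of the kind that would run on the RISC-V CPU, with the training and scoring of each candidate delegated to the accelerator:

```python
import random

# Hypothetical search space for ResNet-style candidates; the depths,
# widths, and 3x3-only kernels are illustrative assumptions.
SEARCH_SPACE = {"depth": [8, 14, 20], "width": [16, 32, 64], "kernel": [3]}

def sample_structure():
    # Serial, control-heavy work of the kind suited to the RISC-V CPU.
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(structure):
    # Genetic-algorithm-style mutation of one hyperparameter.
    child = dict(structure)
    key = random.choice(list(child))
    child[key] = random.choice(SEARCH_SPACE[key])
    return child

def train_and_score(structure):
    # Stand-in for the parallel training offloaded to the digital CIM
    # core; a real system would return validation accuracy.
    return random.random()

def evolutionary_search(generations=10, population=8):
    pool = [sample_structure() for _ in range(population)]
    for _ in range(generations):
        ranked = sorted(pool, key=train_and_score, reverse=True)
        parents = ranked[: population // 2]
        pool = parents + [mutate(random.choice(parents)) for _ in parents]
    return max(pool, key=train_and_score)

print(evolutionary_search())
```

In the real chip, `train_and_score` would dispatch data instructions to the digital CIM core over the bus and read back a score, while the loop itself stays on the CPU.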
Compared with the traditional von Neumann architecture, a compute-in-memory architecture integrates memory and computation, overcoming the memory-wall and power-wall problems; it reduces the cost of moving data between compute and memory units and greatly raises compute energy efficiency. At present, analog-array-based CIM AI chips are the more common design, but the analog computation process limits precision and the fixed array structure limits flexibility, so analog CIM is used mainly in scenarios with low demands on functional flexibility and computational precision. It is therefore ill-suited to cloud-side AI scenarios demanding high precision, high flexibility, low power, and high compute, as required for advanced neural network training; within the CIM family, the low precision and fixed array structure of analog CIM chips constrain complex neural network applications. Digital CIM, by contrast, builds the compute unit and the memory together and realizes the multiply-add computation of AI algorithms well. Neural networks consist largely of multiply-add operations, which need to be implemented as in-memory computation on memories such as SRAM. In practical network training, digital CIM chips suit larger-capacity, high-performance, high-precision products such as intelligent driving and data centers.
The invention fuses a RISC-V coprocessor with the compute-in-memory architecture, making full use of the excellent low-power behavior and compute energy efficiency of RISC-V extended instructions. The resulting digital CIM hardware accelerator is open and efficient: compute capability is embedded within the storage resources, reducing the latency and energy overhead of frequently moving data. The digital CIM unit employed by the invention is based on logical and multiply-add operations, is widely applicable to high-precision, high-compute scenarios, has good development prospects, and is favored by academia and industry.
The invention mainly uses ResNet as the working model of the neural network; this model solves the degradation problem in deep networks by means of residual blocks. Residual learning must constantly store the feature-map data of each layer, which increases hardware resource consumption and storage, so the network structure is generated from 3×3 convolution kernels to support training and hardware inference.
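For reference, the residual computation that those 3×3 blocks implement, y = ReLU(x + F(x)), can be sketched as follows; this is an illustrative single-channel NumPy model, not the patent's hardware mapping:

```python
import numpy as np

def conv3x3(x, w):
    # Naive same-padded 3x3 convolution over one channel; on the chip
    # this work is mapped onto the PE array instead.
    h, width = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(width):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w1, w2):
    # y = ReLU(x + F(x)): the skip connection that counters degradation.
    y = np.maximum(conv3x3(x, w1), 0)   # first 3x3 conv + ReLU
    y = conv3x3(y, w2)                  # second 3x3 conv
    return np.maximum(x + y, 0)         # add the identity, then ReLU

x = np.random.randn(8, 8)
w1, w2 = np.random.randn(3, 3), np.random.randn(3, 3)
print(residual_block(x, w1, w2).shape)  # (8, 8)
```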
FIG. 2 is a schematic diagram of the neural network search hardware channel design. During the neural network search, floating-point numbers are converted into fixed-point numbers by the digital CIM core, and under the control of the controller each PE unit trains the neural network structure through the nonlinear function, pooling layer, batch normalization layer, and activation function layer. The model precision of the neural network is maintained during training, and repeated training of the neural network is made convenient.
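The float-to-fixed conversion step can be sketched as below; the patent does not specify the fixed-point format, so the signed 8-bit Q0.7 layout here is an assumption:

```python
def float_to_fixed(x, frac_bits=7, total_bits=8):
    # Quantize a float to a signed fixed-point integer (Q0.7 in int8);
    # values outside the representable range are saturated.
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale)))

def fixed_to_float(q, frac_bits=7):
    # Recover the approximate real value for verification.
    return q / (1 << frac_bits)

q = float_to_fixed(0.3125)
print(q, fixed_to_float(q))  # 40 0.3125
```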
The neural network architecture separates tasks through a task-blocking strategy: data reach the controller and processor, which apply a layered training strategy. The efficient NAS strategy shortens prediction time during the architecture search and improves the per-epoch training accuracy, while the reward function is updated in real time; during inference, task data-set results are provided after each NN sampling, and the sampled network is continuously updated and trained.
The high-speed pipeline architecture flexibly stores the convolution results and training data of the neural network in on-chip memory. It contains multiple layers of data buffers, using FIFO hardware resources to buffer each layer's data, with wide input/output bit widths and ample buffer sizes. The multi-level storage structure is arbitrated by the task scheduler, which eases the management of resource data and increases hardware throughput. Data are trained through multiple processing-element arrays to accelerate the network model, and the multi-stage pipelined cache architecture reduces the latency and power of data transport.
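A minimal model of that per-layer FIFO buffering follows; the depth and the stall-on-full policy are illustrative assumptions, since the patent fixes neither:

```python
from collections import deque

class LayerFIFO:
    # One data buffer of one pipeline layer, arbitrated by the scheduler.
    def __init__(self, depth=64):
        self.q = deque()
        self.depth = depth

    def push(self, word):
        if len(self.q) >= self.depth:
            # Back-pressure: the task scheduler would stall the producer.
            raise OverflowError("FIFO full")
        self.q.append(word)

    def pop(self):
        return self.q.popleft() if self.q else None

# One FIFO per convolution layer of the pipeline.
pipeline = [LayerFIFO() for _ in range(4)]
pipeline[0].push(0x3A)
print(pipeline[0].pop())  # 58
```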
The digital CIM core based on the RISC-V architecture fuses the advantages of the RISC-V reduced instruction set: multiple instructions are realized in the RISC-V architecture while the CIM core realizes the multiply-add operations, innovating on the AI-chip computing paradigm in which the compute unit frequently accesses memory data. The host computer realizes neural-architecture search training through the AI algorithm and connects to the digital CIM cores through the bus; each of the multiple digital CIM cores operates independently, and after parallel pipelined arbitration and scheduling the system achieves unified bandwidth, low power, and combined training and inference.
FIG. 3 shows the logical architecture of the RISC-V based digital CIM processor. The task scheduler is responsible for scheduling the multiple digital CIM cores, which handle the MAC computation; the RISC-V CPU, using the RISC-V reduced instruction set architecture, acts as the controller for data flow and data buffering; the main memory and buffer area cache instructions; and the bus links and controls the RISC-V interface, the CIM core processor, and the host-computer interface. The invention integrates the multiple digital CIM cores, DDR, and buffers into a hierarchical storage structure, in which the cache unit serves the digital CIM cores and the RISC-V CPU at higher bandwidth and higher speed.
FIG. 4 shows the overall structure of the RISC-V controlled digital CIM neural network search, mainly comprising an on-chip storage module, an off-chip SRAM memory module, a digital CIM core module, a RISC-V based CPU controller module, a bus interface, and a data FIFO (First In First Out) module.
The working flow of the neural network search chip is as follows: first, after the neural network search task is set up on the host computer, the data stream is compressed, and input data and model parameters are processed for the ResNet neural-architecture search task. The data can be written directly into memory over the bus, the 64-bit data, keep, and last signals are read into the cache, and search training of the neural network structure is performed.
Inference on the data is deployed in the CIM processor. The task scheduler can access data instructions directly through DMA, preprocess the instructions read into the cache, and store them in the on-chip buffer while configuring the digital CIM core. The DMA is also connected to the controller: the RISC-V controller decodes specific instruction bits from machine code to identify the instruction requirements and issue control signals, keeping and processing data in the on-chip cache stream. The data are converted into groups of 8-bit data for transmission, and finally the digital CIM coprocessor is configured to execute the task, reading weights and input data, including the channel count and the data's length, width, and height. After computation and inference complete under instruction control, a signal is sent to the RISC-V coprocessor, and the task scheduler checks and advances to the next neural network inference pass. When all AI operations are completed, all stored input data and the output score reports are emitted.
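The decode-and-repack step can be sketched as below; the bit-field layout and widths are hypothetical, since the patent states only that control signals are converted into groups of 8-bit data for the CIM core:

```python
def decode_fields(word):
    # Hypothetical control-word fields: opcode, channel count, length.
    return {"opcode": word & 0xF,
            "channels": (word >> 4) & 0xFF,
            "length": (word >> 12) & 0xFFFFF}

def to_config_bytes(word, width=32):
    # Split the control word into a group of 8-bit values, LSB first,
    # for transmission to the digital CIM core.
    return [(word >> s) & 0xFF for s in range(0, width, 8)]

word = (224 << 12) | (16 << 4) | 0x3   # length=224, channels=16, op=3
print(decode_fields(word))
print(to_config_bytes(word))           # [3, 1, 14, 0]
```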
The 64-bit RISC-V architecture is responsible for task scheduling and control; it can partition the neural network inference and connects the digital CIM chip cores with the RISC-V CPU. Each Line Buffer performs blocked data-stream processing on the PE input data, separating data and weights within the Line Buffer, whose bit-width information is stored in the digital CIM processor.
The ALU unit in the digital CIM processor includes an adder and a multiplier; FIG. 5 shows the digital CIM operation flow. When the digital CIM processor executes the operation flow, operations on data and addresses are performed: multiplication first, then addition, and finally the result is obtained through the shift accumulator. Meanwhile, the data stream is mapped onto the multi-layer convolution structure of the neural network and trained by propagating each layer's convolution results; chip deployment accounts for the bandwidth utilization of the data stream under multi-layer convolution training and increases the parallelism of the convolution compute units. Data movement distance is shortened, reducing data-movement overhead.
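A common digital-CIM realization of that multiply, add, shift-accumulate ordering is bit-serial accumulation over weight bit-planes; the sketch below assumes that reading, since the patent names the three units but not their schedule:

```python
def bit_serial_mac(activations, weights, w_bits=8):
    # Process one weight bit-plane at a time: multiply (AND with the
    # bit), add (sum the partial products), then shift-accumulate.
    acc = 0
    for b in reversed(range(w_bits)):                 # MSB first
        partial = sum(a * ((w >> b) & 1)
                      for a, w in zip(activations, weights))
        acc = (acc << 1) + partial                    # shift accumulator
    return acc

print(bit_serial_mac([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```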
The RISC-V instruction set used in the invention, as an open-source reduced instruction set, contains numerous base and custom instructions to control the processor, and performs task and data scheduling for the coprocessor cores well.
Because the chip's compute and storage resources are limited, the invention uses an effective block tiling strategy to improve parallelism and raise the utilization of the limited on-chip cache: a large matrix is reasonably divided into multiple sub-matrices, which can be computed sequentially in the on-chip cache to obtain the result.
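A minimal sketch of that tiling strategy (the tile edge is illustrative; in practice it would be chosen so that each pair of sub-matrices fits the on-chip cache):

```python
import numpy as np

TILE = 4  # illustrative tile edge

def tiled_matmul(a, b):
    # Split one large matrix product into sub-matrix products computed
    # sequentially, accumulating partial results tile by tile.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n))
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):
                out[i:i + TILE, j:j + TILE] += (
                    a[i:i + TILE, p:p + TILE] @ b[p:p + TILE, j:j + TILE])
    return out

a, b = np.random.randn(8, 12), np.random.randn(12, 8)
print(np.allclose(tiled_matmul(a, b), a @ b))  # True
```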
The compute part mainly consists of the compute units contained in the digital CIM core; FIG. 6 shows the digital CIM unit structure. Preprocessing is performed by a row decoder, column decoder, read decoder, and the like. From a computational perspective, most operations in a neural network are matrix multiply-add operations, which the digital CIM design suits well; the matrix compute unit, vector compute unit, and scalar unit in the compute architecture are well adapted to digital in-memory compute operations.
The storage system mainly comprises the on-chip storage units of the CIM core and the corresponding data paths.
The overall control unit is implemented by the RISC-V coprocessor, which controls the completion of tasks across the whole set of processor cores.
The RISC-V CPU can be flexibly connected to other modules over the bus, has a complete toolchain and good code portability, and can offer users a reconfigurable compute-in-memory architecture.
Embedding compute capability in the memory and completing matrix multiply-add computation within the CIM architecture reduces the resource consumption and latency of data movement: the memory unit performs logic computation directly, increasing the core operand throughput of large-scale in-memory computation and the compute power of logic operations. This improves the efficiency of neural network computation, lowers power consumption and cost, and breaks through the memory-wall barrier.
The matrix compute unit and the accumulator complete the matrix computation of the corresponding convolutions: the 32×32×3 input convolution in the neural network search can be completed with the int8 input type, and the program's control branch adds the current matrix result to the previous result, together with the bias operation of the convolution.
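An illustrative reading of that step follows; the 32×32×3 int8 input shape and the int32 accumulation of int8 products are assumptions (int32 being the common convention):

```python
import numpy as np

def int8_conv_accumulate(x, w, bias):
    # Valid 3x3 convolution over a 32x32x3 int8 input: per-channel
    # matrix results are added onto the previous partial sum (the
    # control branch described above), and the bias is applied last.
    h, width, c = x.shape
    out = np.zeros((h - 2, width - 2), dtype=np.int32)
    for ch in range(c):                       # accumulate channel results
        for i in range(h - 2):
            for j in range(width - 2):
                out[i, j] += np.sum(
                    x[i:i + 3, j:j + 3, ch].astype(np.int32) * w[:, :, ch])
    return out + bias                         # bias operation of the conv

x = np.random.randint(-128, 128, (32, 32, 3)).astype(np.int8)
w = np.random.randint(-128, 128, (3, 3, 3)).astype(np.int8)
print(int8_conv_accumulate(x, w, bias=7).shape)  # (30, 30)
```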
FIG. 7 shows the hardware pipeline structure in the search space. The improved search-space storage unit performs the mapped logic computation, reducing latency and the energy-consumption ratio. The digital operation achieves high-precision computation with no precision loss, and pipelining the adders raises the digital CIM efficiency. By searching the resources utilized by each PE layer, efficient and flexible parallel operation of every PE layer of the neural network can be realized.
The beneficial effects of the invention are as follows:
(1) A RISC-V instruction-set parallel data transmission mode realizes the read-write function. A conventional data transfer sends the access request and, after the read completes, transfers data alternately, reading it from the FIFO; that process is simple but inefficient. The data FIFOs processed by the parallel pipeline guarantee parallel reads and writes of different lengths and let the system dispatch to different banks running at independent times; the banks' parallel pipelined design greatly improves the data transmission rate.
(2) The design applies to data-intensive neural network search with frequent accesses. It features high parallelization and high utilization of hardware mapping resources, covering large matrix-multiplication units with spatial locality, large-scale vector convolution data and weights, and the deep learning domain of complex network models.
(3) The CIM core array structure infers architecture data well through the neural network. Multiple hardware resources complete hierarchically scheduled tasks over the on-chip bus; bus resources access the memory, the RISC-V controller interface mounts the bus, and the corresponding data widths must be configured, enriching the design. The storage interface is mounted in the network-on-chip layout; the interface input side needs a data-layer interface and a status interface, supplemented by the bus-resource design for data control.
(4) To address poor precision and insufficient resource utilization, the design adopts quantization to build the neural network while guaranteeing precision. By quantizing the network's parameters and activation bit widths on demand, networks with different precision and resource footprints are obtained to fit different requirements, balancing precision, hardware resource occupation, and power consumption (see the sketch after this list). A customized pipeline mode is also adopted, overcoming the NN network's structural disadvantage for parallel processing.
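A minimal sketch of the bit-width trade-off described in (4), using uniform symmetric quantization; the scheme and the bit widths tried are illustrative assumptions:

```python
def quantize(values, bits):
    # Map floats onto signed integers of the chosen bit width; smaller
    # widths save hardware resources at the cost of precision.
    qmax = (1 << (bits - 1)) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values], scale

weights = [0.81, -0.34, 0.05, -0.77]
for bits in (8, 4):  # two precision/resource operating points
    q, s = quantize(weights, bits)
    print(bits, q, [round(v * s, 3) for v in q])
```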
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The principles and embodiments of the present invention have been described herein with reference to specific examples, which are intended only to assist in understanding the method of the present invention and its core ideas; at the same time, modifications made by those of ordinary skill in the art in light of these teachings fall within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (8)

1. A neural network search chip based on a RISC-V architecture, comprising:
a CPU controller module, communicatively connected to the Memory unit, for processing input data and model parameters for the ResNet neural-architecture search task issued by the host computer to obtain the searched neural network structure, wherein the CPU controller module uses the RISC-V reduced instruction set architecture to control data flow and data buffering;
a DMA unit, communicatively connected to the CPU controller module, the Memory unit, and the on-chip storage module, respectively; and
a digital compute-in-memory (CIM) core module, communicatively connected to the on-chip storage module, for training the searched neural network structure according to data instructions.
2. The RISC-V architecture based neural network search chip of claim 1, wherein training the searched neural network structure according to data instructions comprises:
using the task scheduler to access the data instructions directly through the DMA unit and store them in a buffer of the on-chip storage module; using the CPU controller module to decode specific instruction bits from machine code, identify the instruction requirements, and issue control signals; and converting the control signals into groups of 8-bit data that are transmitted to the digital CIM core module, configuring it to train the searched neural network structure.
3. The RISC-V architecture based neural network search chip of claim 1, wherein, during the neural network search process, the neural network structure is search-trained in a pipelined manner.
4. The RISC-V architecture based neural network search chip of claim 1, wherein the digital CIM core module includes an ALU unit.
5. The RISC-V architecture based neural network search chip of claim 4, wherein the ALU unit includes an adder, a multiplier, and a shift accumulator; during the operation flow, the multiplier first multiplies the data and addresses, the adder then adds the multiplied results, and finally the shift accumulator processes the summed results to obtain the operation result.
6. The RISC-V architecture based neural network search chip of claim 1, wherein the on-chip storage module contains multiple layers of data buffers, each buffered by a FIFO.
7. The RISC-V architecture based neural network search chip of claim 1, wherein the digital CIM core module is communicatively connected to the on-chip storage module through an on-chip bus interface.
8. The RISC-V architecture based neural network search chip of claim 1, wherein the Memory unit is an SRAM memory.
CN202311105683.2A 2023-08-30 2023-08-30 Neural network search chip based on RISC-V architecture Pending CN116822600A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311105683.2A CN116822600A (en) 2023-08-30 2023-08-30 Neural network search chip based on RISC-V architecture

Publications (1)

Publication Number Publication Date
CN116822600A 2023-09-29

Family

ID=88127793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311105683.2A Pending CN116822600A (en) 2023-08-30 2023-08-30 Neural network search chip based on RISC-V architecture

Country Status (1)

Country Link
CN (1) CN116822600A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290279A (en) * 2023-11-24 2023-12-26 深存科技(无锡)有限公司 Shared tight coupling based general computing accelerator
CN117290279B (en) * 2023-11-24 2024-01-26 深存科技(无锡)有限公司 Shared tight coupling based general computing accelerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination