WO2019127838A1 - Convolutional neural network implementation method and apparatus, terminal, and storage medium - Google Patents

Convolutional neural network implementation method and apparatus, terminal, and storage medium

Info

Publication number
WO2019127838A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
input
neural network
module
control module
Prior art date
Application number
PCT/CN2018/074999
Other languages
English (en)
French (fr)
Inventor
万文涛
梁洁
罗聪
Original Assignee
国民技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 国民技术股份有限公司
Publication of WO2019127838A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present invention relate to the field of FPGAs (Field-Programmable Gate Arrays), and in particular to an FPGA-based convolutional neural network implementation method and apparatus, a terminal, and a storage medium.
  • With the explosive growth of artificial intelligence, deep learning has become an effective means of extracting valuable information from large amounts of data, and convolutional neural networks have attracted attention due to the reusability of their weights.
  • At present, convolutional neural networks are mostly implemented in software: the data volume is large, the demands on hardware computing power are high, they rely on the high computing power of the cloud, and power consumption is high.
  • Embodiments of the present invention provide an FPGA-based convolutional neural network implementation method and apparatus, a terminal, and a storage medium, to solve the problem that existing convolutional neural network technology relies on software implementation.
  • To solve the above technical problem, embodiments of the present invention adopt the following technical solutions:
  • An FPGA-based convolutional neural network implementation method, comprising:
  • initializing the programmable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module;
  • loading the weight data of each network layer of the convolutional neural network model to be implemented into the memory of the FPGA, and setting the correspondence between the status registers of the FPGA and the network layers;
  • storing the data to be processed into the memory through the memory controller of the FPGA;
  • the run control module determining the network layer currently to be run according to the correspondence between the status registers and the network layers, and controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete that layer's processing of the data, until all network layers of the convolutional neural network model to be implemented have run in sequence, and outputting the processing result corresponding to the data to be processed.
  • Further, the network layers comprise, in order: a convolution computation level, a pooling operation level, a connection operation level, a reorganization operation level, and a classification operation level.
  • Further, when the network layer currently to be run is the convolution computation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
  • controlling the data reading module to read, through the memory controller, the weight data and input data corresponding to the convolution computation level stored in the memory, and to store them into the input buffer module;
  • controlling the input control module to input the weight data and input data stored in the input buffer module into the neural network processing unit;
  • controlling the neural network processing unit to compute on the input data using the weight data and to output the calculation result;
  • controlling the output control module to store the calculation result into the output buffer module;
  • controlling the memory controller to read the calculation result in the output buffer module and to store it into the memory.
  • Further, when the network layer to be run is the pooling operation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
  • controlling the data reading module to read, through the memory controller, the input data corresponding to the pooling operation level stored in the memory, and to store it into the input buffer module;
  • controlling the input control module to divide the input data stored in the input buffer module into multiple pooling windows and to feed the pooling windows into the neural network processing unit in order;
  • controlling the neural network processing unit to perform max-pooling comparison on the input data and to output the comparison result;
  • controlling the output control module to store the comparison result into the output buffer module;
  • controlling the memory controller to read the comparison result in the output buffer module and to store it into the memory.
  • Further, when the network layer to be run is the connection operation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
  • determining the output data of the other network layers corresponding to the input data of the current network layer, the other network layers including at least one network layer other than the current network layer;
  • configuring the storage addresses of the output data of the other network layers in the memory as the input addresses of the input data of the current network layer;
  • controlling the data reading module to read, according to the input addresses, the data corresponding to the input addresses from the memory and to store it into the input buffer module.
  • Further, when the network layer to be run is the reorganization operation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
  • controlling the data reading module to read, through the memory controller, the input data corresponding to the reorganization operation level stored in the memory, and to store it into the input buffer module;
  • controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit;
  • controlling the neural network processing unit to perform the reorganization operation on the input data and to output the reorganization result;
  • controlling the output control module to store the reorganization result into the output buffer module;
  • controlling the memory controller to read the reorganization result in the output buffer module and to store it into the memory;
  • establishing a mapping between the storage address of the input data in the memory and the storage address of the reorganization result in the memory.
  • Further, when the network layer to be run is the classification operation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
  • controlling the data reading module to read, through the memory controller, the input data corresponding to the classification operation level stored in the memory, and to store it into the input buffer module;
  • controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit as an input feature vector;
  • controlling the neural network processing unit to perform the classification calculation on the input data and to output the detection result;
  • controlling the output control module to store the detection result into the output buffer module;
  • controlling the memory controller to read the detection result in the output buffer module and to output it.
  • Further, the data to be processed is source data used by the terminal to implement face recognition.
  • An FPGA-based convolutional neural network implementation apparatus, comprising:
  • an initialization module, configured to initialize the programmable resources of the FPGA and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module, wherein the run control module is configured to read the parameters of the status registers, determine the network layer to be run, and control the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data, until all network layers of the convolutional neural network model to be implemented have run in sequence, and to output the processing result corresponding to the data to be processed;
  • a loading module, configured to load the weight data of each network layer of the convolutional neural network model to be implemented into the memory of the FPGA, associate the status registers of the FPGA with the network layers, and store the data to be processed into the memory through the memory controller of the FPGA.
  • Further, the neural network processing unit includes a plurality of processing elements that process data in parallel.
  • Further, the input buffer module includes two input storage units for buffering the input data and/or weight data of the neural network processing unit in a ping-pong double-buffered manner; and/or the output buffer module includes two output storage units for buffering the output data of the neural network processing unit in a ping-pong double-buffered manner.
  • A terminal, comprising: a source data input module for inputting data to be processed, a processor, and the convolutional neural network implementation apparatus described above;
  • the convolutional neural network implementation apparatus initializes the programmable resources of the FPGA and generates an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module in the terminal; it loads the weight data of the network layers of the convolutional neural network model to be implemented into the memory of the FPGA and associates the status registers of the FPGA with the network layers, the model including at least six network layers;
  • the convolutional neural network implementation apparatus preprocesses the data to be processed and stores it into the memory through the memory controller of the FPGA, the data to be processed being data containing a face image;
  • the run control module reads the status instruction, determines the network layer to be run, and controls the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit, and the data reading module to execute the data processing steps under that layer, processing the data to be processed until all network layers of the convolutional neural network model have finished running, and outputs the processing result corresponding to the data to be processed to the processor;
  • the processor recognizes the face image information from the data according to the processing result.
  • A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the FPGA-based convolutional neural network implementation method described above.
  • Embodiments of the present invention provide an FPGA-based convolutional neural network implementation method and apparatus, a terminal, and a storage medium.
  • The method first initializes the programmable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module; it then loads the weight data of the network layers of the convolutional neural network model to be implemented into the memory of the FPGA, associates the status registers of the FPGA with the network layers, and stores the data to be processed into the memory through the memory controller of the FPGA; finally, the run control module reads the parameters of the status registers, determines the network layer to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete that layer's processing of the data, until all network layers of the convolutional neural network model have run in sequence, and outputs the processing result corresponding to the data to be processed. Throughout the whole process, the convolutional neural network is implemented by the FPGA hardware rather than relying on software, solving the problem that existing convolutional neural network technology depends on software implementation.
  • Similarly, the terminal provided by the embodiments of the present invention uses the FPGA to design the hardware corresponding to the convolutional neural network calculation to implement face recognition and localization; therefore, the terminal does not need to rely on the cloud and can run locally, solving the problem that large, complex deep convolutional neural networks cannot run on hardware terminals.
  • FIG. 1 is a flowchart of the convolutional neural network implementation method provided by Embodiment 1 of the present invention;
  • FIG. 2 is a schematic structural diagram of the convolutional neural network implementation apparatus provided by Embodiment 2 of the present invention;
  • FIG. 3 is a schematic structural diagram of the convolutional neural network implementation apparatus provided by Embodiment 3 of the present invention;
  • FIG. 4 is a schematic diagram of the convolutional neural network implementation method provided by Embodiment 3 of the present invention;
  • FIG. 5 is a schematic diagram of the ping-pong double buffer provided by an embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of the neural network processing unit provided by an embodiment of the present invention;
  • FIG. 7 is a schematic diagram of the max pooling operation provided by an embodiment of the present invention;
  • FIG. 8 is a schematic diagram of the logic control provided by an embodiment of the present invention;
  • FIG. 9 is a schematic diagram of the data reorganization provided by an embodiment of the present invention;
  • FIG. 10 is a schematic structural diagram of the terminal provided by Embodiment 5 of the present invention.
  • The embodiments of the present invention are applicable to all terminal devices provided with an FPGA chip, including PCs, mobile phones, PADs, deposit machines, and the like.
  • The embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
  • Embodiment 1:
  • FIG. 1 is a flowchart of the FPGA-based convolutional neural network implementation method provided by Embodiment 1 of the present invention. Referring to FIG. 1, the method provided by this embodiment includes the following steps:
  • S101: Initialize the programmable resources of the FPGA, and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module.
  • The programmable resources of the FPGA can be configured into arbitrary functional modules as required. In the embodiment of the present invention, at device initialization these resources are configured into the functional modules necessary for implementing the convolutional neural network model, and the convolutional neural network functions are then implemented on this basis to process the data.
  • In the embodiment of the present invention, the input buffer module and the output buffer module both use a ping-pong double-buffering mechanism to cache data, and the neural network processing unit includes multiple PEs (Processing Elements) that process data in parallel; these are explained in Embodiment 3.
  • In the embodiment of the present invention, the neural network processing unit is time-multiplexed; that is, it plays different roles in different network layers.
  • S102: Load the weight data of the network layers of the convolutional neural network model to be implemented into the memory of the FPGA, and associate the status registers of the FPGA with the network layers.
  • In the embodiment of the present invention, the network layers comprise, in order: a convolution computation level, a pooling operation level, a connection operation level, a reorganization operation level, and a classification operation level.
  • The convolutional neural network model generally includes multiple network layers. For example, the convolutional neural network model of Embodiment 3 of the present invention has twenty-two convolutional layers, five max-pooling layers, two connection layers, one reorganization layer, one classification layer, and one preprocessing layer, thirty-two network layers in total, which perform real-time processing of the input picture data and output the detection results.
  • To identify the network layers, the embodiment of the present invention further includes setting at least one status register for each network layer and configuring a corresponding status instruction; that is, the status registers are associated with the network layers by means of status instructions. The association may be implemented by setting multiple status registers, each corresponding to the status instruction of one network layer, or by setting only one status register configured with multiple status instructions corresponding to different network layers, with the current status instruction of that register updated in real time as the run proceeds.
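  • As a minimal sketch of the single-register association scheme just described, the following Python model is illustrative only; the instruction encodings (CONV, POOL, and so on) are hypothetical, since the text does not specify concrete values.

```python
# Hypothetical status-instruction encodings; the text does not define concrete values.
CONV, POOL, CONNECT, REORG, CLASSIFY = range(5)

class StatusRegister:
    """One status register configured with one status instruction per network layer."""
    def __init__(self, layer_instructions):
        self.layer_instructions = list(layer_instructions)  # one entry per layer
        self.current = 0                                    # index of the layer being run

    def read(self):
        """Return the status instruction of the layer currently to be run."""
        return self.layer_instructions[self.current]

    def advance(self):
        """Update the current status instruction in real time as the run proceeds."""
        self.current += 1

# A model would be configured with one instruction per layer, e.g.:
reg = StatusRegister([CONV, POOL, CONV, CONNECT, REORG, CLASSIFY])
```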
  • S103: Store the data to be processed into the memory through the memory controller of the FPGA.
  • The data to be processed refers to external data that needs convolutional neural network processing, such as image data or source data used by a terminal to implement face recognition.
  • S104: The run control module reads the parameters of the status registers, determines the network layer to be run, processes the data according to the data processing steps corresponding to that layer, and outputs the processing result.
  • In this step, specifically, the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module are controlled to complete the current layer's processing of the data, until all network layers of the convolutional neural network model to be implemented have run in sequence, and the processing result corresponding to the data to be processed is output.
  • In this step, the network layer currently being executed is determined by the run control module reading the status instruction configured in the status register.
  • Further, the run control module reading the parameters of the status registers, determining the network layer to be run, and controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to execute the data processing steps under that layer and process the data to be processed includes:
  • the run control module sequentially reading the status instructions of the status registers of the network layers;
  • determining the corresponding data processing steps according to the status instructions;
  • processing the data in turn according to the data processing steps.
  • In this step, the run control module mainly controls each functional module according to the parameters of the status registers to process the input data and complete the convolutional neural network processing; the specific implementation differs for different kinds of network layers.
  • When the network layer to be run is the convolution computation level, step S104 includes:
  • controlling the data reading module to read, through the memory controller, the weight data and input data corresponding to the convolution computation level stored in the memory, and to store them into the input buffer module;
  • controlling the input control module to input the weight data and input data stored in the input buffer module into the neural network processing unit;
  • controlling the neural network processing unit to compute on the input data using the weight data and to output the calculation result;
  • controlling the output control module to store the calculation result into the output buffer module;
  • controlling the memory controller to read the calculation result in the output buffer module and to store it into the memory.
  • When the network layer to be run is the pooling operation level, step S104 includes:
  • controlling the data reading module to read, through the memory controller, the input data corresponding to the pooling operation level stored in the memory, and to store it into the input buffer module;
  • controlling the input control module to divide the input data stored in the input buffer module into multiple pooling windows and to feed the pooling windows into the neural network processing unit in order;
  • controlling the neural network processing unit to perform max-pooling comparison on the input data and to output the comparison result;
  • controlling the output control module to store the comparison result into the output buffer module;
  • controlling the memory controller to read the comparison result in the output buffer module and to store it into the memory.
  • When the network layer to be run is the connection operation level, step S104 includes:
  • determining the output data of the other network layers corresponding to the input data of the current network layer, the other network layers including at least one network layer other than the current network layer;
  • configuring the storage addresses of that output data in the memory as the input addresses of the input data of the current network layer;
  • controlling the data reading module to read, according to the input addresses, the corresponding data from the memory and to store it into the input buffer module.
  • When the network layer to be run is the reorganization operation level, step S104 includes:
  • controlling the data reading module to read, through the memory controller, the input data corresponding to the reorganization operation level stored in the memory, and to store it into the input buffer module;
  • controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit;
  • controlling the neural network processing unit to perform the reorganization operation on the input data and to output the reorganization result;
  • controlling the output control module to store the reorganization result into the output buffer module;
  • controlling the memory controller to read the reorganization result in the output buffer module and to store it into the memory;
  • establishing a mapping between the storage address of the input data in the memory and the storage address of the reorganization result in the memory.
  • When the network layer to be run is the classification operation level, step S104 includes:
  • controlling the data reading module to read, through the memory controller, the input data corresponding to the classification operation level stored in the memory, and to store it into the input buffer module;
  • controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit as an input feature vector;
  • controlling the neural network processing unit to perform the classification calculation on the input data and to output the detection result;
  • controlling the output control module to store the detection result into the output buffer module;
  • controlling the memory controller to read the detection result in the output buffer module and to output it.
  • This embodiment provides an FPGA-based convolutional neural network implementation method. The method first initializes the programmable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module; it then loads the weight data of each network layer of the convolutional neural network model to be implemented into the memory of the FPGA, associates the status registers of the FPGA with the network layers, and stores the data to be processed into the memory through the memory controller of the FPGA; finally, the run control module reads the parameters of the status registers, determines the network layer to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete that layer's processing of the data, until all network layers of the convolutional neural network model have run in sequence, and outputs the processing result corresponding to the data to be processed. Throughout the whole process, the convolutional neural network is implemented by the FPGA hardware and no longer depends on software, solving the problem that existing convolutional neural network technology relies on software implementation.
  • Embodiment 2:
  • FIG. 2 is a schematic structural diagram of the convolutional neural network implementation apparatus provided by Embodiment 2 of the present invention. Referring to FIG. 2, the convolutional neural network implementation apparatus 2 provided by this embodiment includes:
  • an initialization module 21, configured to initialize the programmable resources of the FPGA and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module;
  • the run control module, configured to read the parameters of the status registers, determine the network layer to be run, and control the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to execute the data processing steps under that layer and process the data to be processed, until all network layers of the convolutional neural network model to be implemented have run in sequence, and to output the processing result corresponding to the data to be processed;
  • a loading module 22, configured to load the weight data of each network layer of the convolutional neural network model to be implemented into the memory of the FPGA, associate the status registers of the FPGA with the network layers, and store the data to be processed into the memory through the memory controller of the FPGA.
  • Embodiment 3:
  • This embodiment takes picture data as the input data for illustration.
  • In this embodiment, the deep-learning convolutional neural network model is implemented in hardware; the implementation platform is the HAPS-DX7 FPGA development board (a product model of Synopsys, Inc.). Specifically, the trained convolutional neural network model weight parameters are first loaded into the DDR (Double Data Rate SDRAM, i.e., the memory referred to above) of the FPGA development board; the input data is then preprocessed in the preprocessing module, and the preprocessed data is transferred to the DDR of the FPGA development board; the DMA (Direct Memory Access) unit then continuously fetches the weight parameters and input data of the current network layer from the DDR and delivers them to the NPU (neural network processing unit) for parallel computation; the output data of the completed computation serves as the input of the next layer and is stored back into the DDR through the output buffer module. Finally, the feature vector of the data that has completed all convolution operations is passed to the classification module to complete the feature classification calculation.
  • Specifically, as shown in FIG. 3, the apparatus provided by this embodiment includes an input terminal A, an output terminal B, a preprocessing unit 301, a DDR controller 302 (i.e., the memory controller referred to above), a DDR memory 303 (i.e., the memory referred to above), a DMA unit 304 for reading and writing weight data, a buffer unit 305 for buffering weight data, a buffer unit 306 for buffering input data, a DMA unit 307 for reading and writing input data, an input control module 308 for input control, an NPU unit 309, an output control module 310 for output control, an output buffer module 311 for buffering output data, a run control module 312, and a classification calculation unit 313.
  • The buffer unit 305 and the buffer unit 306 together constitute the input buffer module referred to above.
  • The run control module 312 includes an instruction unit 3121, a decoder 3122, and a control logic unit 3123: the instruction unit 3121 is configured to receive instruction data, the decoder 3122 is configured to decode the instruction data, and the control logic unit 3123 is configured to output the corresponding control instructions according to the decoding result.
  • The DDR controller 302 controls the connection and data transfer between the DDR memory and the other external modules, including storage control of the input data, read control for DMA reads of DDR data, storage control of the output data of completed hardware operations, and read control of the finally output feature vector data.
  • The input buffer module and the output buffer module all adopt the ping-pong double-buffering scheme. As shown in FIG. 5, each buffer module includes a first buffer unit 51, a second buffer unit 52, and a selection control unit 53. The selection control unit 53 selects which buffer unit the incoming data (inputs) is cached into, controls which buffer unit drives the outputs, and outputs a flag signal indicating the current input/output state of the two buffer areas.
  • Specifically, the input buffer consists of a weight data buffer and an input data buffer. The weight data buffer caches the current layer's weight data, bias values, and regularization parameters in ping-pong fashion: while the weight data in one buffer area participates in the computation, the DMA unit 304 loads data into the other buffer area, which reduces the waiting time for data loading. Correspondingly, the input data buffer caches the current layer's input data in the same ping-pong fashion: while one buffer area participates in the computation, the DMA unit 307 loads data into the other. The output data buffer likewise uses ping-pong double buffering: while the feature map data computed by the NPU is being cached in one buffer area, the other buffer area, which already holds data, writes its data to the DDR memory.
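  • The selection logic of FIG. 5 can be modeled in software as follows. This is a minimal sketch assuming a simple alternating scheme; the actual flag encoding and handshaking on the FPGA are not specified in the text.

```python
class PingPongBuffer:
    """Two buffer areas: one participates in computation while DMA fills the other."""
    def __init__(self):
        self.areas = [[], []]   # the first and second buffer units (51 and 52)
        self.flag = 0           # which area currently drives the outputs

    def load(self, data):
        """DMA side: fill the area that is NOT being read by the compute units."""
        self.areas[1 - self.flag] = data

    def read(self):
        """Compute side: read the currently selected area."""
        return self.areas[self.flag]

    def swap(self):
        """Selection control unit 53: exchange the roles of the two areas."""
        self.flag = 1 - self.flag

weights = PingPongBuffer()
weights.load([0.1, 0.2])   # e.g. DMA unit 304 preloads one area
weights.swap()             # that area now feeds the NPU while the other is refilled
```

  • The design point is that loading and computing overlap: the swap makes newly loaded data visible in one step, so the compute side rarely waits on the DMA.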
  • As shown in FIG. 6, the NPU unit 309 includes P*P parallel processing elements PE (PE0 to PEn) that compute the multiply/add/subtract operations of the convolution process in parallel.
  • The intermediate result of one channel's computation with a convolution kernel is stored in a temporary register; the result of the next channel's computation with the kernel is added to this intermediate result and stored in the temporary register again. This computation is repeated until all channels and convolution kernels have been processed; the BN (Batch Normalization) operation is then performed on the accumulated data to obtain the final calculation result, and the resulting new feature map data is stored into the DDR memory through the output buffer module.
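  • The per-channel accumulation into a temporary register can be sketched as follows; plain Python loops stand in for the parallel P*P PE array, and the BN parameters (gamma, beta) are placeholders, not values from the text.

```python
def convolve_point(input_channels, kernel_channels, gamma=1.0, beta=0.0):
    """Accumulate one output point across all channels, then apply BN.

    input_channels:  list of flattened k*k input patches, one per channel
    kernel_channels: list of flattened k*k kernel weights, one per channel
    """
    temp = 0.0  # temporary register holding the intermediate result
    for patch, kernel in zip(input_channels, kernel_channels):
        # one channel's multiply-accumulate with the convolution kernel
        partial = sum(x * w for x, w in zip(patch, kernel))
        temp += partial  # add to the intermediate result and store it again
    # batch-normalization operation on the accumulated data (placeholder parameters)
    return gamma * temp + beta
```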
  • The input control module 308 and the output control module 310 adapt the data transfer between the modules they connect. Specifically, the input control module 308 rearranges the data of the data buffer area according to the input-data interface of the NPU unit and transfers the data correctly to the corresponding input interface; the output control module 310 rearranges the output data of the NPU according to the input interface of the output buffer area and transfers the data correctly to the corresponding interface.
  • The run control module 312 controls the logic state of the entire system: by reading the current status register, it determines which level of the deep convolutional neural network is being computed in the current state, executes the logic control instructions of the corresponding state, and controls the operation on the data.
  • As shown in FIG. 4, the method provided by this embodiment includes the following steps:
  • S401: Acquire the weight data of the model and load it into the DDR memory.
  • In this embodiment, the weight data of the trained deep convolutional neural network is obtained from the cloud; specifically, the weight data of the face detection model trained on a GPU (Graphics Processing Unit) with the yolo deep learning algorithm is obtained, and the weight data is loaded into the DDR memory of the FPGA development board over USB (Universal Serial Bus).
  • S402: Preprocess the input data and store it in the DDR memory.
  • This step includes: normalizing the input data so that it meets the computation requirements; performing bilinear interpolation on the input data so that the image size meets the computation requirements; and storing the preprocessed input data in the DDR memory.
  • In this embodiment, the input image data is normalized by dividing the gray values by 255 to map them into the range 0-1, and the image is rearranged to a size of 416*416 using bilinear interpolation so as to satisfy the input picture size requirement of the yolo convolutional neural network; the preprocessed picture data is then stored in the DDR memory.
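  • A software sketch of this preprocessing step (divide gray values by 255, bilinear-resize to 416*416) might look as follows; it assumes a single-channel NumPy array, since the hardware implementation of the preprocessing module is not detailed in the text.

```python
import numpy as np

def preprocess(image, size=416):
    """Normalize gray values to 0-1 and bilinearly resize a 2-D image to size*size."""
    img = image.astype(np.float32) / 255.0          # normalize to the range 0-1
    h, w = img.shape
    ys = np.linspace(0, h - 1, size)                # sample positions in the source
    xs = np.linspace(0, w - 1, size)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = ys - y0, xs - x0                       # fractional offsets
    # interpolate along x for the two neighboring rows, then along y
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy)[:, None] + bot * wy[:, None]
```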
  • The current status register is then read to determine which level of the deep convolutional neural network is to be computed in the current state, and the logic control instructions of the corresponding state are executed to control the operation on the data.
  • There are n status registers R0, R1, ..., R(n-1); each register stores the status data corresponding to one layer, indicating that the entire deep convolutional neural network needs to run these n network layers in total.
  • The control logic reads the registers in order, performs the corresponding per-level logic control functions, controls the flow of the entire hardware data path, and completes the computation of the deep convolutional neural network.
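  • This order-of-registers control flow can be mirrored in software as a simple dispatch loop. The sketch is illustrative: the handler mapping is a placeholder for the per-level logic control functions, not the patent's actual control logic.

```python
def run_network(status_registers, handlers, data):
    """Read status registers R0..R(n-1) in order and run each layer's logic.

    status_registers: per-layer status data (which level to compute)
    handlers:         mapping from status value to that level's control function
    """
    for reg in status_registers:        # the control logic reads the registers in order
        level = reg                     # determine the level for the current state
        data = handlers[level](data)    # execute the corresponding logic control
    return data                         # result after the last layer (classification)
```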
  • The convolution calculation operation is performed.
  • This step includes: loading the convolution layer weight parameters and input data into the parallel convolution processing elements PE, where the weight parameters form a k*k floating-point (32-bit) / fixed-point (16-bit) matrix, the input data forms an a*a floating-point (32-bit) / fixed-point (16-bit) matrix, the sliding stride is 1, and the number of parallel convolution processing elements PE is P*P, so that the convolution sums of P*P input data with the weights can be computed simultaneously. The convolution layer calculation includes the multiply-accumulate of the weights with the input data, the batch-normalization (BN) calculation, the bias addition, and the leaky activation function. After the input data of one convolution kernel and its multiple input channels have been computed, the resulting feature map is stored in memory, and the next convolution kernel and input data are then computed, until the calculation of this deep convolutional neural network layer is completed.
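  • As a scalar sketch of the per-output pipeline just described (multiply-accumulate result, then BN, bias, and leaky activation); the BN statistics and the leaky slope of 0.1 are common yolo defaults and are assumptions here, not values from the text.

```python
def conv_output_point(acc, mean, var, gamma, beta, bias, eps=1e-5, slope=0.1):
    """Post-process one accumulated convolution sum.

    acc: multiply-accumulate result of the weights with the input data
    """
    x = gamma * (acc - mean) / (var + eps) ** 0.5   # batch-normalization (BN)
    x = x + bias                                    # bias addition
    return x if x > 0 else slope * x                # leaky activation
```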
  • The max pooling operation is performed.
  • This step includes: with the input data an A*A floating-point/fixed-point matrix, the sliding stride s, and the number of parallel convolution processing elements P*P, the max pooling operation divides the input data into (A/s)*(A/s) pooling windows; the P*P input data at the corresponding positions are loaded in order from the (A/s)*(A/s) pooling windows, and after s*s cycles the P*P max-pooled results can be output.
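  • A behavioral sketch of this windowing scheme: with an A*A input, stride s, and the pooling window equal to the stride (as the (A/s)*(A/s) window count implies), each window's maximum is reduced over s*s comparison cycles. NumPy is used for brevity.

```python
import numpy as np

def max_pool(x, s):
    """Divide an A*A matrix into (A/s)*(A/s) pooling windows and take each maximum."""
    A = x.shape[0]
    out = np.full((A // s, A // s), -np.inf, dtype=x.dtype)
    for dy in range(s):                 # one comparison per cycle: s*s cycles total
        for dx in range(s):
            out = np.maximum(out, x[dy::s, dx::s][: A // s, : A // s])
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(x, 2))   # [[ 5.  7.] [13. 15.]]
```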
  • The connection layer operation is performed. The connection layer operation takes the output data of one or more previously computed layers as the input data of the current layer; therefore, it is only necessary to configure the address of that previous layer's output data in the DDR memory as the input address of the current layer's input data to complete the connection layer operation.
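  • Because the connection layer only redirects addresses rather than computing, it can be sketched as a lookup that points the current layer's input at a previous layer's output region in DDR; the address table below is a hypothetical software stand-in for illustration.

```python
# Hypothetical DDR address table: layer index -> base address of its output data.
output_addr = {12: 0x0040_0000, 20: 0x0080_0000}

def connect_layer(source_layers):
    """Configure the current layer's input addresses as earlier layers' output addresses.

    No data is moved; the DMA unit simply reads from these addresses next.
    """
    return [output_addr[layer] for layer in source_layers]

input_addrs = connect_layer([12, 20])   # route the outputs of layers 12 and 20
```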
  • The reorganization layer operation is performed. The reorganization layer operation splits and reorganizes the current layer's data: with original input data of size 2h*2w*2c and a stride of 2, the output of the reorganization layer is a feature map of size h*w*8c, and an address mapping unit must be added to map the original addresses to the new storage addresses so that the reorganized data can serve as the input data of the next layer.
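  • The 2h*2w*2c to h*w*8c split-and-reorganize can be expressed as an index remapping. This NumPy sketch assumes the common yolo reorg convention of gathering each 2*2 spatial block into the channel dimension, which matches the stated shapes but is not spelled out in the text.

```python
import numpy as np

def reorg(x):
    """Reorganize a (2h, 2w, 2c) feature map into (h, w, 8c) with stride 2."""
    H, W, C = x.shape                       # H = 2h, W = 2w, C = 2c
    # split each 2*2 spatial block and stack its 4 positions along the channels
    blocks = [x[dy::2, dx::2, :] for dy in (0, 1) for dx in (0, 1)]
    return np.concatenate(blocks, axis=2)   # (h, w, 4*C) = (h, w, 8c)

x = np.zeros((8, 8, 2))
print(reorg(x).shape)   # (4, 4, 8)
```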
  • S405: Determine whether the network layer calculations are finished; if yes, execute step S406, otherwise return to step S403.
  • Specifically, when the current status register corresponds to the classification layer calculation, it is determined that the other network layer calculations are completed and step S406 is executed; when the current status register does not correspond to the classification layer calculation, it is determined that the network layer calculations are not finished and step S403 is executed.
  • S406: Perform the classification layer calculation operation, and output the result.
  • The classification layer calculation takes as its input feature vector the data produced by the operations of the convolution layers, pooling layers, connection layers, reorganization layer, and so on, obtains the detection result through the classification calculation, and outputs it.
  • In this embodiment, a complex deep convolutional neural network is implemented with FPGA hardware, so that a deep convolutional neural network model highly dependent on the powerful computing power of the cloud can be run on a local terminal, processing data in real time without relying on the network; this solves the problem that large, complex deep convolutional neural networks cannot run on hardware terminals.
  • The embodiment of the present invention can process deep convolutional neural networks with more complex structures and more network layers, can adapt to current deep learning algorithms, and can process the convolution layer, the pooling layer, the connection layer, and the reorganization layer.
  • The convolution layer of the embodiment of the present invention can process the batch-normalization (BN) operation and the leaky activation function, and adds implementations of the connection layer and the reorganization layer, which is at the leading edge.
  • This embodiment can process input picture data and weight data in floating point (32-bit) or fixed point (16-bit): by replacing the internal multiplication, addition, and subtraction units with floating-point or fixed-point operation units, it can handle deep learning algorithm models of different data types with high flexibility. Moreover, by converting a deep learning algorithm from the floating-point data type to the hardware fixed-point data type, the amount of weight and intermediate-result data is reduced while the calculation accuracy changes little.
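  • A minimal sketch of converting 32-bit floating-point weights to a 16-bit fixed-point representation follows; the Q8.8 split chosen here is an assumption for illustration, as the text only specifies a 16-bit fixed-point word.

```python
FRAC_BITS = 8                      # assumed Q8.8 split of the 16-bit word

def to_fixed(x):
    """Quantize a float to a signed 16-bit fixed-point integer with saturation."""
    v = int(round(x * (1 << FRAC_BITS)))
    return max(-32768, min(32767, v))

def to_float(v):
    """Recover the approximate float value from the fixed-point integer."""
    return v / (1 << FRAC_BITS)

w = 0.3712
q = to_fixed(w)
print(q, to_float(q))   # 95 0.37109375 -- small loss in calculation accuracy
```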
  • In this embodiment, the input picture data or video frame data is processed directly by the preprocessing module; the processed data is then fed into the convolutional neural network for the per-layer operations, the data computed by the network layers is used as the feature vector input of the classification layer for the final classification detection calculation, and the data is finally output, completing real-time detection of the faces in the picture or video frame. The whole process is implemented in local FPGA hardware without networking; compared with traditional CPU and GPU solutions, it greatly reduces power consumption and adapts more flexibly to current deep learning algorithms.
  • In this embodiment, the yolo convolutional neural network model has twenty-two convolutional layers, five max-pooling layers, two connection layers, one reorganization layer, and one classification layer, plus a preprocessing module, to achieve real-time arithmetic processing of the input image data and output of the detection results.
  • The input picture size after preprocessing is 416*416, the convolution kernel sizes are 3*3 and 1*1, and the pooling layer stride is 2*2.
  • In the max pooling operation, the input data is an A*A floating-point/fixed-point matrix, the sliding stride is s, and the number of parallel convolution processing elements PE is P*P. The input data is divided into (A/s)*(A/s) pooling windows; each time, the P*P input data at the corresponding positions are loaded in order from the (A/s)*(A/s) pooling windows, and after s*s cycles the P*P max-pooled results can be output. This is the max pooling operation process.
  • The logic control states of the run control module include:
  • conv, the convolution operation control logic: the convolution operation is completed through the logic control states data preparation (idel), data initialization (init), data operation (datamode), batch-normalization operation (BN), activation function (Active), and data output (output).
  • pool, the max pooling control logic: the max pooling operation is completed through the logic control states data preparation (idel), data initialization (init), maximum comparison (MAX), write temporary value (write), and data output (output).
  • connect, the connection layer control logic: after the data preparation (idel) and address loading (Load) states, the address (addr) of a previous layer's output data in the DDR memory is used as the input address of the current layer's input data, completing the connection layer operation.
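  • These per-level control sequences amount to small finite state machines. The sketch below steps through the conv and pool sequences named above; the per-state actions are placeholders, since the text only names the states.

```python
# Per-level logic control states, using the state names from the text.
CONV_STATES = ["idel", "init", "datamode", "BN", "Active", "output"]
POOL_STATES = ["idel", "init", "MAX", "write", "output"]

def run_states(states, actions):
    """Step through one layer's logic control states in order."""
    for state in states:
        actions.get(state, lambda: None)()   # placeholder per-state action

run_states(CONV_STATES, {"datamode": lambda: print("multiply-accumulate")})
```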
  • Embodiment 4:
  • The terminal herein can be understood as a computer, a tablet, a mobile phone, or another mobile terminal.
  • The specific steps include:
  • Step 1: Initialize the programmable resources of the FPGA, and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module in the terminal.
  • This step specifically provides, through the FPGA-based development module on the terminal, the various logic units used to implement the convolutional neural network calculation.
  • As for the FPGA design here, if the CPU in the terminal is itself built on an FPGA chip, the input buffer module, output buffer module, and the other modules for implementing this step can be generated directly by re-programming the CPU on the terminal to add the new functions.
  • Step 2: The terminal obtains the data to be processed, preprocesses it, and stores it into the memory through the memory controller of the FPGA, where the data to be processed is picture data containing a face image.
  • Before this, the method further includes loading the weight data of each network layer of the convolutional neural network model to be implemented into the memory of the FPGA, and setting the correspondence between the status registers of the FPGA and the network layers.
  • In this step, the data to be processed may be acquired directly by a data acquisition unit on the terminal, such as an image acquisition unit like a camera.
  • Step 3: The run control module reads the status instruction, determines the network layer to be run, and controls the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit, and the data reading module to complete that layer's processing of the data, until all network layers of the convolutional neural network model to be implemented have run in sequence, and the processing result corresponding to the data to be processed is output.
  • Step 4: Recognize the face image information from the picture data according to the processing result.
  • The data to be processed obtained here may also be video data: the video information to be recognized is captured by a camera, input to the memory after preprocessing, and finally processed accordingly by the detection network layers.
  • The convolutional neural network implementation method implemented on the terminal realizes the existing convolutional neural network face recognition approach by directly providing, via the FPGA in the terminal, a set of hardware capable of performing the convolutional neural network recognition calculation. Throughout the whole process, the convolutional neural network is realized by the FPGA hardware on the terminal and no longer depends on software or cloud network computation, solving the problem that existing convolutional neural network technology relies on software or equipment outside the terminal.
  • Embodiment 5:
  • As shown in FIG. 10, the terminal 3 provided by this embodiment includes: a source data input module 31 for inputting the data to be processed, a processor 32, and a convolutional neural network implementation device 33.
  • The source data input module may, for example, obtain the corresponding data from a website or a database through the terminal's network connection as the data to be processed.
  • The convolutional neural network implementation device 33 initializes the programmable resources of the FPGA, and generates an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and the run control module in the terminal 3; it loads the weight data of each network layer of the convolutional neural network model to be implemented into the memory of the FPGA, and sets the correspondence between the status registers of the FPGA and the network layers, where the model includes at least five network layers;
  • the convolutional neural network implementation device 33 preprocesses the data to be processed and stores it into the memory through the memory controller of the FPGA, where the data to be processed is data containing a face image;
  • the run control module reads the status instruction, determines the network layer to be run, and controls the input buffer module, the output buffer module, the input control module, the output control module, the neural network processing unit, and the data reading module to execute the data processing steps under that layer, processing the data to be processed until all network layers of the convolutional neural network model to be implemented have finished running, and outputs the processing result corresponding to the data to be processed to the processor;
  • the processor 32 identifies face image information from the data based on the processing result.
  • The convolutional neural network implementation device 33 further sets at least one status register for each network layer and configures a corresponding status instruction;
  • the run control module in the convolutional neural network implementation device 33 reads the status instruction configured in the status register to determine the network layer to be run.
  • The network layers include: a convolution computation level, a pooling operation level, a connection operation level, a reorganization operation level, and a classification operation level;
  • the run control module in the convolutional neural network implementation device sequentially reads the status instructions of the status registers of the network layers;
  • determines the corresponding data processing steps according to the status instructions, and processes the data in turn according to the data processing steps.
  • In other embodiments, the convolutional neural network implementation device 33 and the processor 32 in the terminal 3 may be implemented as a single module, for example when the CPU is built on an FPGA chip.
  • In that case, the functions of the convolutional neural network implementation device 33 can be developed on the CPU, specifically by designing program code that implements all of the modules or step functions of the convolutional neural network implementation device 33, so that the terminal can achieve the functions of the device 33 in this embodiment simply by having the CPU execute that program code.
  • The embodiment of the present invention further provides a computer-readable storage medium storing one or more programs, the one or more programs being executable to implement the steps of the methods provided by all of the embodiments of the present invention.
  • Embodiments of the present invention provide an FPGA-based convolutional neural network implementation method and apparatus, a terminal, and a storage medium. The method first initializes the programmable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module; it then loads the weight data of the network layers of the convolutional neural network model to be implemented into the memory of the FPGA, associates the status registers of the FPGA with the network layers, and stores the data to be processed into the memory through the memory controller of the FPGA; finally, the run control module reads the parameters of the status registers, determines the network layer to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete that layer's processing of the data, until all network layers of the convolutional neural network model have run in sequence, and outputs the processing result corresponding to the data to be processed. Throughout the whole process, the convolutional neural network is implemented by the FPGA hardware and no longer relies on software, solving the problem that existing convolutional neural network technology depends on software implementation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a convolutional neural network implementation method and apparatus, a terminal, and a storage medium. The programmable resources of an FPGA are initialized to generate the functional modules needed to implement the model; the weight data of each network layer of the convolutional neural network model to be implemented and the data to be processed are then loaded through the FPGA's memory; finally, the parameters of the status registers are read to determine the network layer to be run, and that layer's processing of the data is completed, until all network layers have run in sequence and the processing result is output. Throughout the whole process, the convolutional neural network is implemented by the FPGA hardware, solving the problem that existing convolutional neural network technology relies on software implementation.

Description

Convolutional neural network implementation method and apparatus, terminal, and storage medium
Technical Field
Embodiments of the present invention relate to the field of FPGAs (Field-Programmable Gate Arrays), and in particular to an FPGA-based convolutional neural network implementation method and apparatus, a terminal, and a storage medium.
Background
With the explosive growth of artificial intelligence, deep learning has become an effective means of extracting valuable information from large amounts of data, and convolutional neural networks have attracted attention due to the reusability of their weights. At present, convolutional neural networks are mostly implemented in software: the data volume is large, the demands on hardware computing power are high, they rely on the high computing power of the cloud, and power consumption is high.
Technical Problem
Embodiments of the present invention provide an FPGA-based convolutional neural network implementation method and apparatus, a terminal, and a storage medium, to solve the problem that existing convolutional neural network technology relies on software implementation.
Technical Solution
To solve the above technical problem, embodiments of the present invention adopt the following technical solutions:
An FPGA-based convolutional neural network implementation method, comprising:
initializing the programmable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module;
loading the weight data of each network layer of the convolutional neural network model to be implemented into the memory of the FPGA, and setting the correspondence between the status registers of the FPGA and the network layers;
storing the data to be processed into the memory through the memory controller of the FPGA;
the run control module determining, according to the correspondence between the status registers and the network layers, the network layer currently to be run, and controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete that layer's processing of the data, until all network layers of the convolutional neural network model to be implemented have run in sequence, and outputting the processing result corresponding to the data to be processed.
Further, the network layers comprise, in order: a convolution computation level, a pooling operation level, a connection operation level, a reorganization operation level, and a classification operation level.
Further, when the processing level currently to be run is the convolution computation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
controlling the data reading module to read, through the memory controller, the weight data and input data corresponding to the convolution computation level stored in the memory, and to store them into the input buffer module;
controlling the input control module to input the weight data and input data stored in the input buffer module into the neural network processing unit;
controlling the neural network processing unit to compute on the input data using the weight data and to output the calculation result;
controlling the output control module to store the calculation result into the output buffer module;
controlling the memory controller to read the calculation result in the output buffer module and to store it into the memory.
Further, when the network layer to be run is the pooling operation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
controlling the data reading module to read, through the memory controller, the input data corresponding to the pooling operation level stored in the memory, and to store it into the input buffer module;
controlling the input control module to divide the input data stored in the input buffer module into multiple pooling windows and to feed the pooling windows into the neural network processing unit in order;
controlling the neural network processing unit to perform max-pooling comparison on the input data and to output the comparison result;
controlling the output control module to store the comparison result into the output buffer module;
controlling the memory controller to read the comparison result in the output buffer module and to store it into the memory.
Further, when the network layer to be run is the connection operation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
determining the output data of the other network layers corresponding to the input data of the current network layer, the other network layers including at least one network layer other than the current network layer;
configuring the storage addresses of the output data of the other network layers in the memory as the input addresses of the input data of the current network layer;
controlling the data reading module to read, according to the input addresses, the data corresponding to the input addresses from the memory and to store it into the input buffer module.
Further, when the network layer to be run is the reorganization operation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
controlling the data reading module to read, through the memory controller, the input data corresponding to the reorganization operation level stored in the memory, and to store it into the input buffer module;
controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit;
controlling the neural network processing unit to perform the reorganization operation on the input data and to output the reorganization result;
controlling the output control module to store the reorganization result into the output buffer module;
controlling the memory controller to read the reorganization result in the output buffer module and to store it into the memory;
establishing a mapping between the storage address of the input data in the memory and the storage address of the reorganization result in the memory.
Further, when the network layer to be run is the classification operation level, the run control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data includes:
controlling the data reading module to read, through the memory controller, the input data corresponding to the classification operation level stored in the memory, and to store it into the input buffer module;
controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit as an input feature vector;
controlling the neural network processing unit to perform the classification calculation on the input data and to output the detection result;
controlling the output control module to store the detection result into the output buffer module;
controlling the memory controller to read the detection result in the output buffer module and to output it.
Further, the data to be processed is source data used by the terminal to implement face recognition.
An FPGA-based convolutional neural network implementation apparatus, comprising:
an initialization module, configured to initialize the programmable resources of the FPGA and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module, wherein the run control module is configured to read the parameters of the status registers, determine the network layer to be run, and control the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete the current layer's processing of the data, until all network layers of the convolutional neural network model to be implemented have run in sequence, and to output the processing result corresponding to the data to be processed;
a loading module, configured to load the weight data of each network layer of the convolutional neural network model to be implemented into the memory of the FPGA, associate the status registers of the FPGA with the network layers, and store the data to be processed into the memory through the memory controller of the FPGA.
Further, the neural network processing unit includes a plurality of processing elements, and the plurality of processing elements process data in parallel.
Further, the input buffer module includes two input storage units for buffering the input data and/or weight data of the neural network processing unit in a ping-pong double-buffered manner; and/or the output buffer module includes two output storage units for buffering the output data of the neural network processing unit in a ping-pong double-buffered manner.
A terminal, comprising: a source data input module for inputting data to be processed, a processor, and the convolutional neural network implementation apparatus described above;
the convolutional neural network implementation apparatus initializes the programmable resources of the FPGA and generates an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module in the terminal; it loads the weight data of the network layers of the convolutional neural network model to be implemented into the memory of the FPGA and associates the status registers of the FPGA with the network layers, the model including at least six network layers;
the data to be processed is acquired through the source data input module and input into the convolutional neural network implementation apparatus;
the convolutional neural network implementation apparatus preprocesses the data to be processed and stores it into the memory through the memory controller of the FPGA, the data to be processed being data containing a face image;
the run control module reads the status instruction, determines the network layer to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to execute the data processing steps under that layer, processing the data to be processed until all network layers of the convolutional neural network model to be implemented have finished running, and outputs the processing result corresponding to the data to be processed to the processor;
the processor recognizes the face image information from the data according to the processing result.
A storage medium, the computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the FPGA-based convolutional neural network implementation method described above.
Beneficial Effects
Embodiments of the present invention provide an FPGA-based convolutional neural network implementation method and apparatus, a terminal, and a storage medium. The method first initializes the programmable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module; it then loads the weight data of the network layers of the convolutional neural network model to be implemented into the memory of the FPGA, associates the status registers of the FPGA with the network layers, and stores the data to be processed into the memory through the memory controller of the FPGA; finally, the run control module reads the parameters of the status registers, determines the network layer to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit, and data reading module to complete that layer's processing of the data, until all network layers of the convolutional neural network model to be implemented have run in sequence, and outputs the processing result corresponding to the data to be processed. Throughout the whole process, the convolutional neural network is implemented by the FPGA hardware and no longer relies on software, solving the problem that existing convolutional neural network technology depends on software implementation.
Similarly, the terminal provided in the embodiments of the present invention likewise uses the FPGA to design the hardware corresponding to the convolutional neural network calculation to implement face recognition and localization; therefore, the terminal does not need to rely on the cloud and can run locally, solving the problem that large, complex deep convolutional neural networks cannot run on hardware terminals.
Brief Description of the Drawings
FIG. 1 is a flowchart of the convolutional neural network implementation method provided by Embodiment 1 of the present invention;
FIG. 2 is a schematic structural diagram of the convolutional neural network implementation apparatus provided by Embodiment 2 of the present invention;
FIG. 3 is a schematic structural diagram of the convolutional neural network implementation apparatus provided by Embodiment 3 of the present invention;
FIG. 4 is a schematic diagram of the convolutional neural network implementation method provided by Embodiment 3 of the present invention;
FIG. 5 is a schematic diagram of the ping-pong double buffer provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of the neural network processing unit provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of the max pooling operation provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of the logic control provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram of the data reorganization provided by an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of the terminal provided by Embodiment 5 of the present invention.
Embodiments of the Invention
The embodiments of the present invention are applicable to all terminal devices provided with an FPGA chip, including PCs, mobile phones, PADs, deposit machines, and the like. The embodiments of the present invention are described in further detail below through specific implementations with reference to the accompanying drawings.
Embodiment 1:
FIG. 1 is a flowchart of the FPGA-based convolutional neural network implementation method provided by Embodiment 1 of the present invention. Referring to FIG. 1, the FPGA-based convolutional neural network implementation method provided by this embodiment includes the following steps:
S101: Initialize the programmable resources of the FPGA, and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module, and a run control module.
The programmable resources of the FPGA can be configured into arbitrary functional modules as required. In the embodiment of the present invention, at device initialization these resources are configured into the functional modules necessary for implementing the convolutional neural network model, and the convolutional neural network functions are then implemented on this basis to process the data.
In the embodiment of the present invention, the input buffer module and the output buffer module both use a ping-pong double-buffering mechanism to cache data, and the neural network processing unit includes multiple PEs (Processing Elements) that process data in parallel; these are explained in Embodiment 3.
In the embodiment of the present invention, the neural network processing unit is time-multiplexed; that is, it plays different roles in different network layers.
S102: Load the weight data of the network layers of the convolutional neural network model to be implemented into the memory of the FPGA, and associate the status registers of the FPGA with the network layers.
In the embodiment of the present invention, the network layers comprise, in order: a convolution computation level, a pooling operation level, a connection operation level, a reorganization operation level, and a classification operation level.
In the embodiment of the present invention, the convolutional neural network model generally includes multiple network layers. For example, the convolutional neural network model of Embodiment 3 of the present invention has twenty-two convolutional layers, five max-pooling layers, two connection layers, one reorganization layer, one classification layer, and one preprocessing layer, thirty-two network layers in total, which perform real-time processing of the input picture data and output the detection results.
To identify the network layers, the embodiment of the present invention further includes setting at least one status register for each network layer and configuring a corresponding status instruction; that is, the status registers are associated with the network layers by means of status instructions. The association may be implemented by setting multiple status registers, each corresponding to the status instruction of one network layer, or by setting only one status register configured with multiple status instructions corresponding to different network layers, with the current status instruction of that register updated in real time as the run proceeds.
S103: Store the data to be processed into the memory through the memory controller of the FPGA.
The data to be processed refers to external data that needs convolutional neural network processing, such as image data or source data used by a terminal to implement face recognition.
Because different convolutional neural network models are limited in the data they can process, before the data to be processed is stored into the memory through the memory controller of the FPGA, it must be judged whether the data meets the computation requirements of the convolutional neural network model to be implemented; if not, normalization and/or bilinear interpolation are performed on the data until the requirements are met, and the processed data is then stored into the memory.
S104: The operation control module reads the parameters of the status registers, determines the network layer to be run, processes the data according to the data processing steps of that network layer, and outputs the processing result.
Specifically, in this step the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module are controlled to complete the processing of the data by the current network layer to be run, until all the network layers of the convolutional neural network model to be implemented have been run in sequence, and the processing result corresponding to the data to be processed is output.
In this step, the network layer currently being executed is determined by the operation control module reading the state instruction configured in the status register.
Further, the operation control module reading the parameters of the status registers, determining the network layer to be run, and controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to execute the data processing steps of the network layer to be run and process the data to be processed includes:
the operation control module sequentially reading the state instructions of the status registers of the network layers;
determining the corresponding data processing steps according to the state instructions;
processing the data in turn according to those data processing steps.
In this step, the operation control module controls the functional modules to process the input data according to the parameters of the status registers, completing the convolutional neural network processing; the concrete implementation differs for the different kinds of network layer.
When the network layer to be run is the convolution computation layer, step S104 includes the following (a software sketch of this data flow is given after the steps):
controlling the data reading module to read, through the memory controller, the weight data and input data corresponding to the convolution computation layer stored in the memory storage, and to store them into the input buffer module;
controlling the input control module to input the weight data and input data stored in the input buffer module into the neural network processing unit;
controlling the neural network processing unit to compute on the input data using the weight data and to output the computation result;
controlling the output control module to store the computation result into the output buffer module;
controlling the memory controller to read the computation result in the output buffer module and to store the computation result into the memory storage.
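By way of illustration only, the per-layer data movement described in the steps above can be modeled in software. The following Python sketch is a simplified model under assumed names (DDR, npu_compute and run_layer are illustrative stand-ins, not part of the claimed hardware or of any actual FPGA API):

```python
# Minimal software model of the five control steps of a convolution layer:
# DMA read -> input buffer -> NPU -> output buffer -> DMA write-back.

class DDR:
    """Illustrative stand-in for the DDR memory behind the memory controller."""
    def __init__(self):
        self.mem = {}
    def read(self, addr):
        return self.mem[addr]
    def write(self, addr, data):
        self.mem[addr] = data

def npu_compute(weights, inputs):
    # Placeholder for the P*P parallel multiply/accumulate array.
    return [w * x for w, x in zip(weights, inputs)]

def run_layer(ddr, weight_addr, input_addr, output_addr):
    # Step 1: the data reading module fetches weights and inputs from DDR.
    input_buffer = (ddr.read(weight_addr), ddr.read(input_addr))
    # Steps 2-3: the input control module feeds the NPU, which computes.
    result = npu_compute(*input_buffer)
    # Step 4: the output control module stages the result in the output buffer.
    output_buffer = result
    # Step 5: the memory controller writes the buffered result back to DDR.
    ddr.write(output_addr, output_buffer)

ddr = DDR()
ddr.write(0x0, [1.0, 2.0])    # weights
ddr.write(0x100, [3.0, 4.0])  # input data
run_layer(ddr, 0x0, 0x100, 0x200)
```

The same skeleton applies to the pooling, routing, reorganization and classification layers below, with only the NPU operation changing.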
When the network layer to be run is the pooling operation layer, step S104 includes:
controlling the data reading module to read, through the memory controller, the input data corresponding to the pooling operation layer stored in the memory storage, and to store it into the input buffer module;
controlling the input control module to divide the input data stored in the input buffer module into multiple pooling windows, and to input it from the pooling windows into the neural network processing unit in order;
controlling the neural network processing unit to perform the max-pooling comparison on the input data and to output the comparison result;
controlling the output control module to store the comparison result into the output buffer module;
controlling the memory controller to read the comparison result in the output buffer module and to store the comparison result into the memory storage.
When the network layer to be run is the routing operation layer, step S104 includes:
determining the output data of the other network layers corresponding to the input data of the current network layer, the other network layers including at least one network layer other than the current network layer;
configuring the storage addresses, in the memory storage, of the output data of those other network layers as the input addresses of the input data of the current network layer;
controlling the data reading module to read, from the memory storage according to the input addresses, the data corresponding to the input addresses, and to store it into the input buffer module.
When the network layer to be run is the reorganization operation layer, step S104 includes:
controlling the data reading module to read, through the memory controller, the input data corresponding to the reorganization operation layer stored in the memory storage, and to store it into the input buffer module;
controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit;
controlling the neural network processing unit to perform the reorganization operation on the input data and to output the reorganization result;
controlling the output control module to store the reorganization result into the output buffer module;
controlling the memory controller to read the reorganization result in the output buffer module and to store the reorganization result into the memory storage;
establishing a mapping between the storage address of the input data in the memory storage and the storage address of the reorganization result in the memory storage.
When the network layer to be run is the classification operation layer, step S104 includes:
controlling the data reading module to read, through the memory controller, the input data corresponding to the classification operation layer stored in the memory storage, and to store it into the input buffer module;
controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit as an input feature vector;
controlling the neural network processing unit to perform the classification computation on the input data and to output the detection result;
controlling the output control module to store the detection result into the output buffer module;
controlling the memory controller to read the detection result in the output buffer module and to output the detection result.
This embodiment provides an FPGA-based convolutional neural network implementation method. The method first initializes the editable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; it then loads the weight data of each network layer of the convolutional neural network model to be implemented into the memory storage of the FPGA, associates the status registers of the FPGA with the network layers, and stores the data to be processed into the memory storage through the memory controller of the FPGA; finally, the operation control module reads the parameters of the status registers, determines the network layer to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the network layer to be run, until all the network layers of the convolutional neural network model to be implemented have been run in sequence, and outputs the processing result corresponding to the data to be processed. Throughout this process the convolutional neural network is implemented entirely by the FPGA hardware and no longer depends on software, which solves the problem that existing convolutional neural network techniques rely on software implementation.
Embodiment 2:
Figure 2 is a schematic structural diagram of the convolutional neural network implementation apparatus provided by Embodiment 2 of the present invention. Referring to Figure 2, the convolutional neural network implementation apparatus 2 provided by this embodiment includes:
an initialization module 21, configured to initialize the editable resources of the FPGA and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; the operation control module is configured to read the parameters of the status registers, determine the network layer to be run, and control the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to execute the data processing steps of that network layer and process the data to be processed, until all the network layers of the convolutional neural network model to be implemented have been run in sequence, and output the processing result corresponding to the data to be processed;
a loading module 22, configured to load the weight data of each network layer of the convolutional neural network model to be implemented into the memory storage of the FPGA, associate the status registers of the FPGA with the network layers, and store the data to be processed into the memory storage through the memory controller of the FPGA.
Embodiment 3:
This embodiment is described by taking picture data as the input data.
In this embodiment, the deep-learning convolutional neural network is deployed on FPGA hardware; the platform used is the HAPS-DX7 FPGA development board (a product model of Synopsys, Inc.). Specifically, the trained weight parameters of the convolutional neural network model are first loaded into the DDR (Double Data Rate synchronous dynamic random-access memory, i.e. the memory storage above) of the FPGA development board; the preprocessing module then preprocesses the input data and transfers the preprocessed data into the DDR of the FPGA development board; a DMA (Direct Memory Access) unit then continuously fetches the weight parameters and input data of the current network layer from the DDR and delivers them to the NPU (network process units, i.e. the neural network processing unit) for parallel computation; the output data of each completed computation serves as the input of the next layer and is written back to the DDR through the output buffer module. Finally, the data feature vector obtained once all convolution computations are finished is delivered to the classification module, which completes the feature classification computation.
Specifically, as shown in Figure 3, the apparatus provided by this embodiment includes an input terminal A, an output terminal B, a preprocessing unit 301, a DDR controller 302 (i.e. the memory controller above), a DDR memory 303 (i.e. the memory storage above), a DMA unit 304 for reading and writing weight data, a buffer unit 305 for buffering weight data, a buffer unit 306 for buffering input data, a DMA unit 307 for reading and writing input data, an input control module 308 for input control, an NPU unit 309, an output control module 310 for output control, an output buffer module 311 for buffering output data, an operation control module 312 and a classification computation unit 313. The buffer unit 305 and the buffer unit 306 form the input buffer module above. The operation control module 312 includes an instruction unit 3121, a decoder 3122 and a control logic unit 3123; the instruction unit 3121 receives instruction data, the decoder 3122 decodes the instruction data, and the control logic unit 3123 outputs the corresponding control instructions according to the decoding result.
The DDR controller 302 controls the connection and data transfer between the DDR memory and the other external modules, including storage control of the input data, read control of the DMA reads from the DDR, storage control of the output data once the hardware computation is finished, and read control of the finally output feature vector data.
The input buffer module and the output buffer module all use the ping-pong double-buffering scheme. As shown in Figure 5, a buffer module includes a first buffer unit 51, a second buffer unit 52 and a selection control unit 53. The selection control unit 53 selects into which buffer unit the incoming data (inputs) is buffered, controls which buffer unit emits the buffered data (outputs), and outputs a flag signal indicating the current input/output buffering state, i.e. the data state of the two buffer areas. Specifically, the input buffering comprises weight data buffering and input data buffering. The weight data buffering uses ping-pong double buffering to hold the weight data, bias values and normalization parameters of the current layer: while the weight data of one buffer area participates in the computation, the DMA unit 304 loads data into the other, which reduces the waiting time for loading data. Correspondingly, the input data buffering likewise uses ping-pong double buffering for the input data of the current layer: while one buffer area participates in the computation, the DMA unit 307 loads data into the other. The output data buffering also uses ping-pong double buffering: while the feature map data computed by the NPU is being buffered into one buffer area, the other buffer area, already filled with data, writes its data into the DDR memory.
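For illustration, the ping-pong behavior of Figure 5 can be modeled as follows; this is a minimal software sketch assuming that loading and computing alternate between two banks (the class and method names are illustrative, not part of the hardware design):

```python
# Software model of ping-pong double buffering: while the active bank feeds
# the computation, the DMA fills the idle bank; a flag tracks the roles.

class PingPongBuffer:
    def __init__(self):
        self.banks = [None, None]
        self.active = 0  # index of the bank currently feeding the computation

    def load_idle(self, data):
        # The DMA loads into the idle bank, hiding load latency behind compute.
        self.banks[1 - self.active] = data

    def read_active(self):
        return self.banks[self.active]

    def swap(self):
        # The "flag" signal of Figure 5: the two banks exchange roles.
        self.active = 1 - self.active

buf = PingPongBuffer()
buf.banks[0] = [0.1, 0.2]   # first tile preloaded into bank 0
buf.load_idle([0.3, 0.4])   # bank 1 fills while bank 0 is consumed
tile = buf.read_active()    # the computation reads bank 0
buf.swap()                  # the next iteration computes on bank 1
```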
As shown in Figure 6, the NPU unit 309 includes P*P parallel processing elements PE (PE0 to PEn) for computing the multiply/add/subtract operations of the convolution process in parallel. The intermediate result of one channel computed against the convolution kernel is held in a temporary register; the result of the next channel computed against the kernel is added to the intermediate result and stored in the temporary register again, and this is repeated until all channels have been computed against the kernel, after which a BN (Batch Normalization) operation is applied to the resulting data.
The batch normalization (BN) operation is given by the expression:
y_i = γ * ((x_i − μ) / √(σ² + ε)) + β, where μ = (1/m) * Σ_{i=1..m} x_i and σ² = (1/m) * Σ_{i=1..m} (x_i − μ)²; γ is the weight, β is the correction value, and ε is a constant that guarantees numerical stability; the three parameters γ, β and ε are obtained from training in the cloud.
After the batch normalization BN operation, the data is activated; the leaky activation function is y = (x > 0) ? x : 0.1*x, i.e. y = x when x is greater than 0 and y = 0.1x otherwise, where x is the input of the NPU unit and y is the output of the NPU unit.
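These two per-element operations can be written directly from the expressions above; the following Python sketch is for illustration only (the parameter values in the example call are arbitrary):

```python
import math

def batch_norm(x, mu, var, gamma, beta, eps):
    # y_i = gamma * ((x_i - mu) / sqrt(sigma^2 + eps)) + beta
    return gamma * (x - mu) / math.sqrt(var + eps) + beta

def leaky(x):
    # y = x when x > 0, otherwise y = 0.1 * x
    return x if x > 0 else 0.1 * x

y = leaky(batch_norm(0.8, mu=0.5, var=0.04, gamma=1.2, beta=0.1, eps=1e-5))
```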
Finally, the computation result, a new feature map, is stored into the DDR memory through the output buffer module.
The input control module 308 and the output control module 310 route the data transfer between two modules. Specifically, the input control module 308 rearranges the data in the data buffer area according to the input-data interface of the NPU unit and delivers the data correctly to the corresponding input ports, while the output control module 310 rearranges the output data of the NPU according to the input interface of the output buffer area and delivers the data correctly to the corresponding input ports.
The operation control module 312 controls the logic state of the whole system: by reading the current status register it determines which layer of the deep convolutional neural network computation the system is currently in, and accordingly executes the logic control instructions of that state to control the computation on the data.
As shown in Figure 4, the method provided by this embodiment includes the following steps:
S401: Obtain the weight data of the model and load it into the DDR memory.
The weight data of the trained deep convolutional neural network is obtained from the cloud and loaded into the DDR memory of the FPGA development board via USB.
Specifically, the yolo (a deep learning algorithm) convolutional neural network is trained in the cloud with GPU (Graphics Processing Unit) acceleration to obtain the trained face detection model weight data, and the weight data is loaded into the DDR memory of the FPGA development board via USB (Universal Serial Bus).
S402: Preprocess the input data and store it into the DDR memory.
This step includes: normalizing the input data so that it meets the computation requirements; applying bilinear interpolation to the input data so that its picture size meets the computation requirements; and storing the preprocessed input data into the DDR memory.
Specifically, the input picture data is normalized by dividing the gray values by 255 so that they lie between 0 and 1, and bilinear interpolation is used to resize the input picture data to 416*416 to meet the input picture size requirement of the yolo convolutional neural network; the input picture data is then stored into the DDR memory.
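For illustration, the two preprocessing steps can be sketched in pure Python as follows (a real implementation would run in the preprocessing unit; the function names are illustrative):

```python
# Normalize 8-bit gray values to [0, 1], then bilinearly resize to 416x416.

def normalize(image):
    return [[p / 255.0 for p in row] for row in image]

def bilinear_resize(image, out_h=416, out_w=416):
    in_h, in_w = len(image), len(image[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Map each output pixel back into the source image.
            y = i * (in_h - 1) / (out_h - 1)
            x = j * (in_w - 1) / (out_w - 1)
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            dy, dx = y - y0, x - x0
            # Weighted average of the four neighboring pixels.
            out[i][j] = (image[y0][x0] * (1 - dy) * (1 - dx)
                         + image[y0][x1] * (1 - dy) * dx
                         + image[y1][x0] * dy * (1 - dx)
                         + image[y1][x1] * dy * dx)
    return out

resized = bilinear_resize(normalize([[0, 128], [128, 255]]))
```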
S403: Read the current status register and determine the corresponding network layer.
The current status register is read to determine which sub-layer of the deep convolutional neural network computation is active, and the logic control instructions of that state are executed to control the computation on the data. n status registers R0, R1, ..., R(n-1) are defined, each holding the state data designating its layer, indicating that the whole deep convolutional neural network must run the n network layers R0 to R(n-1); the control logic reads the registers in order, executes the logic control function of the corresponding layer, controls the data flow of the whole hardware, and completes the computation of the deep convolutional neural network.
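For illustration, the sequential scan of the status registers can be modeled as a simple dispatch loop; the handler functions below are empty placeholders assumed for this sketch, not the actual control logic:

```python
# The operation control module reads R0..R(n-1) in order and dispatches the
# logic control function of the layer named by each state instruction.

def conv_layer(ctx): pass
def pool_layer(ctx): pass
def route_layer(ctx): pass
def reorg_layer(ctx): pass
def classify_layer(ctx): return "detection result"

HANDLERS = {
    "conv": conv_layer,
    "pool": pool_layer,
    "route": route_layer,
    "reorg": reorg_layer,
    "classify": classify_layer,
}

def run_network(state_registers, ctx=None):
    result = None
    for instruction in state_registers:  # R0, R1, ..., R(n-1) in order
        result = HANDLERS[instruction](ctx)
    return result

print(run_network(["conv", "pool", "conv", "route", "reorg", "classify"]))
```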
S404: Invoke the corresponding network layer to process the data.
When the current status register corresponds to the convolution computation, the convolution computation operation is executed, and this step includes: loading the convolution layer weight parameters and the input data into the parallel convolution processing elements PE. Let the weight parameters be a k*k floating-point (32-bit)/fixed-point (16-bit) matrix, the input data an a*a floating-point (32-bit)/fixed-point (16-bit) matrix, the sliding stride 1, and the number of parallel convolution processing elements PE P*P; then the convolution sums of P*P input positions with the weights can be computed simultaneously. The convolution layer computation includes the multiply-accumulate of the weights with the input data, the batch normalization BN computation, the addition of the bias, and activation with the leaky activation function. After one convolution kernel has been computed against the input data of all input channels, one feature map is obtained and stored to memory, and the computation of the next convolution kernel with the input data proceeds, until one layer of the deep convolutional neural network is complete.
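For illustration, the multiply-accumulate performed for one output position is sketched below; in the hardware, P*P such positions are computed in parallel by the PE array, with BN, bias and leaky activation applied afterwards. This is a sequential, single-channel, pure-Python sketch, not the claimed circuit:

```python
# k*k convolution at stride 1 with no padding, one output pixel per call.

def conv_at(inputs, kernel, row, col):
    k = len(kernel)
    acc = 0.0
    for i in range(k):
        for j in range(k):
            # Multiply-accumulate of the k*k window against the kernel.
            acc += inputs[row + i][col + j] * kernel[i][j]
    return acc

def conv_layer(inputs, kernel):
    a, k = len(inputs), len(kernel)
    out = a - k + 1  # output size for stride 1, no padding
    return [[conv_at(inputs, kernel, r, c) for c in range(out)]
            for r in range(out)]

fmap = conv_layer([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
```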
When the current status register indicates the pooling operation, the max-pooling operation is executed, and this step includes: let the input data be an A*A floating-point/fixed-point matrix, the sliding stride s, and the number of parallel convolution processing elements P*P; the max-pooling operation divides the input data into (A/s)*(A/s) pooling windows, P*P input values at corresponding positions are loaded from the (A/s)*(A/s) pooling windows in order each time, and after s*s cycles P*P max-pooled results are output.
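For illustration, the window-wise maximum described above can be sketched as follows (sequential Python; the hardware loads P*P window positions per cycle):

```python
# Divide the A*A input into (A/s)*(A/s) windows of size s*s and emit the
# maximum of each window.

def max_pool(inputs, s):
    a = len(inputs)
    out = a // s
    pooled = [[0.0] * out for _ in range(out)]
    for i in range(out):
        for j in range(out):
            window = [inputs[i * s + di][j * s + dj]
                      for di in range(s) for dj in range(s)]
            pooled[i][j] = max(window)
    return pooled

print(max_pool([[1, 3, 2, 4], [5, 0, 1, 2], [7, 8, 9, 6], [3, 2, 1, 0]], 2))
# -> [[5, 4], [8, 9]]
```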
When the current status register indicates the routing operation, the route layer operation is executed. The route layer operation takes the output data of one or two previously computed layers as the input data of the current layer; it therefore suffices to reload the addresses, in the DDR memory, of the output data of those earlier layers as the input addresses of the current layer's input data to complete the route layer operation.
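For illustration, the route layer moves no data; it only redirects addresses. A minimal sketch follows, in which the address table and layer indices are hypothetical:

```python
# DDR addresses where earlier layers stored their outputs (hypothetical).
layer_output_addr = {13: 0x4000, 20: 0x9000}

def route(source_layers):
    # The current layer's input addresses simply point at the stored
    # outputs of the named earlier layers; the DMA then reads from them.
    return [layer_output_addr[layer] for layer in source_layers]

input_addrs = route([13, 20])  # take layers 13 and 20 as this layer's input
```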
When the current status register indicates the reorganization layer operation, the reorganization layer operation is executed. The reorganization layer operation splits and reorganizes the current layer: given original input data of size 2h*2w*2c and a stride of 2, the reorganization produces an output feature map of size h*w*8c; an address mapping unit must be added to map the original addresses to the new addresses at which the data is stored, serving as the input data of the next layer.
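For illustration, one possible address mapping of this kind is sketched below, assuming flat row-major addressing; the exact ordering used by the address mapping unit may differ, so this is a sketch of the principle rather than the claimed circuit:

```python
# Map a (2h, 2w, 2c) tensor to (h, w, 8c) by folding the 2x2 spatial phase
# of each stride block into the channel dimension.

def reorg_address_map(H, W, Cin, stride=2):
    h, w, Cout = H // stride, W // stride, Cin * stride * stride
    mapping = {}
    for y in range(H):
        for x in range(W):
            for ch in range(Cin):
                old = (y * W + x) * Cin + ch
                phase = (y % stride) * stride + (x % stride)
                new = (((y // stride) * w + (x // stride)) * Cout
                       + phase * Cin + ch)
                mapping[old] = new
    return mapping

m = reorg_address_map(H=4, W=4, Cin=2)  # 4*4*2 -> 2*2*8, a bijection
```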
S405: Judge whether the network layer computation has finished; if so, execute step S406, otherwise return to step S403.
When the current status register indicates the classification layer computation, the network layer computation is judged to have finished and step S406 is executed; when the current status register does not indicate the classification layer computation, the network layer computation is judged not to have finished and step S403 is executed.
S406: Execute the classification layer computation and output the result.
The classification layer computation takes the results produced by the preceding convolution, pooling, route and reorganization layers as the input feature vector of this layer, obtains the detection result through the classification computation, and outputs it.
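For illustration only, the classification computation can be pictured as scoring the feature vector against per-class weights; the description does not fix this computation, so the following is a heavily simplified, assumed sketch:

```python
# Score a feature vector against each class weight vector and pick the best.

def classify(features, class_weights):
    scores = [sum(f * w for f, w in zip(features, weights))
              for weights in class_weights]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores

label, scores = classify([0.2, 0.7, 0.1],
                         [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5]])
```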
This embodiment uses FPGA hardware to implement a complex deep convolutional neural network, so that a deep convolutional neural network model highly dependent on powerful cloud computing can run on the local terminal, without relying on the network for real-time data processing, solving the problem that large, complex deep convolutional neural networks cannot run on hardware terminals. Meanwhile, the embodiments of the present invention can handle deep convolutional neural networks with more complex structures and more network layers, can adapt to current deep learning algorithms, and can process convolution layers, pooling layers, route layers and reorganization layers. Compared with previous methods, the convolution layer of the embodiments of the present invention can handle the batch normalization BN operation and the leaky activation function, and the route layer and the reorganization layer have been added, which is state of the art.
Further, this embodiment is suitable for processing floating-point (32-bit) or fixed-point (16-bit) input graphic data and weight data; by replacing the internal multiplication, addition and subtraction units with floating-point or fixed-point arithmetic units, deep learning algorithm models of different data types can be processed, giving high flexibility. Moreover, by converting a floating-point deep learning algorithm into a hardware fixed-point implementation, the data volume of the weights and intermediate results is reduced while the computation accuracy changes little.
Further, in this embodiment the input picture data or video frame data is processed directly by the preprocessing module, the processed data is fed into the convolutional neural network for the layer-by-layer computation, the data produced by the network layers serves as the feature vector input of the classification layer for the final classification detection computation, and the data is finally output, completing the real-time detection of faces in pictures or video frames. The whole process is implemented locally on the FPGA hardware without a network connection, greatly reduces power consumption compared with traditional CPU and GPU solutions, and can be configured more flexibly to adapt to current deep learning algorithms.
For example, a yolo convolutional neural network model has twenty-two convolution layers, five max-pooling layers, two route layers, one reorganization layer and one classification layer, plus one preprocessing module, and performs real-time computation on the input picture data and outputs the detection result. The input picture size becomes 416*416 after preprocessing, the convolution kernel sizes are 3*3 and 1*1, and the pooling layer stride size is 2*2. The yolo convolutional neural network is trained in the cloud with GPU acceleration to obtain the face detection model, with picture data as input.
For the max-pooling operation, as shown in Figure 7, let the input data be an A*A floating-point/fixed-point matrix, the sliding stride s, and the number of parallel convolution processing elements PE P*P; the max-pooling operation divides the input data into (A/s)*(A/s) pooling windows, P*P input values at corresponding positions are loaded from the (A/s)*(A/s) pooling windows in order each time, and after s*s cycles P*P max-pooled results are output. This is the max-pooling operation process.
As shown in Figure 8, the logic control states of the operation control module include the following (a small software model of the conv state sequence follows the list):
(1) read_reg, the status register logic: the whole deep convolutional neural network must run the n network layers R0, R1, ..., R(n-1); the state data of all network layers is held in the status registers R0 to R(n-1), and reading the value of the current status register in sequence runs the whole deep convolutional neural network.
(2) conv, the convolution control logic: the convolution operation is completed through the logic control states of data preparation (idle), data initialization (init), data computation (datamode), batch normalization (BN), activation function (Active) and data output (output).
(3) pool, the max-pooling control logic: the max-pooling operation is completed through the logic control states of data preparation (idle), data initialization (init), maximum comparison (MAX), write of the temporary value (write) and data output (output).
(4) route, the route layer control logic: through the address data preparation (idle) and loading (Load addr) states, the addresses, in the DDR memory, of the output data of an earlier layer are taken as the input addresses of the current layer's input data, completing the route layer operation.
(5) reorg, the reorganization layer control logic: after the address data preparation (idle), address computation (Count addr) and mapping (Reorganize) states, the data is rearranged; as shown in Figure 9, the 2h*2w*2c input data is mapped to new h*w*8c data that serves as the input data of the next layer.
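For illustration, the state sequencing of item (2) can be modeled as a trivial state machine; only the ordering is modeled and the per-state work is elided (the state names follow Figure 8):

```python
# conv control logic: idle -> init -> datamode -> BN -> Active -> output.

CONV_STATES = ["idle", "init", "datamode", "BN", "Active", "output"]

def conv_control():
    for state in CONV_STATES:
        # Each state would assert the corresponding control signals here.
        yield state

for s in conv_control():
    print("conv state:", s)
```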
Embodiment 4:
The convolutional neural network implementation method is described in further detail below in connection with a concrete application scenario, namely applying the convolutional neural network implementation method provided by the embodiments of the present invention to implementing face detection on a terminal, where the terminal can be understood as a computer, tablet, mobile phone or similar mobile terminal. The specific steps include:
Step 1: Initialize the editable resources of the FPGA, and generate in the terminal an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module.
In practice, this step is carried out by providing on the terminal a development module designed with an FPGA, and controlling that development module to generate the various logic units used to implement the convolutional neural network computation.
Of course, as to the FPGA design here, if the CPU in the terminal is itself developed from an FPGA chip, the input buffer module, output buffer module, input control module, output control module, neural network processing unit, data reading module and operation control module can be generated simply by re-editing the terminal's CPU to add the functionality implementing this step.
Step 2: Acquire the data to be processed through the terminal, preprocess the data to be processed and store it into the memory storage through the memory controller of the FPGA, the data to be processed being picture data containing a face image.
This step also includes loading the weight data of each network layer of the convolutional neural network model to be implemented into the memory storage of the FPGA, and setting the correspondence between the status registers of the FPGA and the network layers.
In practice, when acquiring the data to be processed, this step can be carried out directly by a data acquisition unit on the terminal, for example an image acquisition unit such as the terminal's camera.
Step 3: The operation control module reads the state instructions, determines the network layer to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the current network layer to be run, until all the network layers of the convolutional neural network model to be implemented have been run in sequence, and outputs the processing result corresponding to the data to be processed.
Step 4: Identify the face image information from the picture data according to the processing result.
In the embodiments of the present invention, the data to be processed acquired here may also be video data: after the video information requiring face recognition is captured by the camera, it is preprocessed and input into the memory storage, and the video information is then processed according to the detected network layers.
The terminal-based convolutional neural network implementation method provided by the embodiments of the present invention directly provides, by installing an FPGA on the terminal, a set of hardware capable of the convolutional neural network recognition computation to implement the existing convolutional neural network face recognition method, so that throughout the whole process the convolutional neural network is implemented by the FPGA hardware on the terminal, no longer depending on software or on cloud computing, which solves the problem that existing convolutional neural network techniques rely on software or devices outside the terminal.
Embodiment 5:
Figure 10 is a schematic structural diagram of the terminal provided by an embodiment of the present invention. Referring to Figure 10, the terminal 3 provided by this embodiment includes: a source data input module 31 for inputting the data to be processed, a processor 32, and a convolutional neural network implementation apparatus 33, the convolutional neural network implementation apparatus 33 being the apparatus provided by Embodiment 2 above. The source data input module 31 may specifically be a camera module, an Internet module or the like on the terminal; when it is an Internet module, the corresponding data is obtained as the data to be processed from a website or a database through the terminal's network connection.
The convolutional neural network implementation apparatus 33 initializes the editable resources of the FPGA, generates in the terminal 3 an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module, loads the weight data of each network layer of the convolutional neural network model to be implemented into the memory storage of the FPGA, and sets the correspondence between the status registers of the FPGA and the network layers, the network layers including at least five network layers;
the data to be processed is acquired through the source data input module 31 and input into the convolutional neural network implementation apparatus 33;
the convolutional neural network implementation apparatus 33 preprocesses the data to be processed and stores it into the memory storage through the memory controller of the FPGA, the data to be processed being data containing a face image;
the operation control module reads the state instructions, determines the network layer to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to execute the data processing steps of the network layer to be run and process the data to be processed, until all the network layers of the convolutional neural network model to be implemented have finished running, and outputs the processing result corresponding to the data to be processed to the processor;
the processor 32 identifies the face image information from the data according to the processing result.
In the embodiments of the present invention, the convolutional neural network implementation apparatus 33 further sets at least one status register for each network layer and configures the corresponding state instruction;
the operation control module in the convolutional neural network implementation apparatus 33 reads the state instruction configured in the status register and determines the network layer to be run.
In the embodiments of the present invention, the network layers include, in order: a convolution computation layer, a pooling operation layer, a routing operation layer, a reorganization operation layer and a classification operation layer;
the operation control module in the convolutional neural network implementation apparatus sequentially reads the state instructions of the status registers of the network layers;
determines the corresponding data processing steps according to the state instructions;
and processes the data in turn according to those data processing steps.
In practice, when the processor 32 in the terminal 3 is developed from an FPGA chip, the convolutional neural network implementation apparatus 33 and the processor 32 in the terminal 3 can be implemented by a single module. Specifically, in addition to controlling the terminal's existing functions, the CPU developed from the FPGA chip also implements the functionality of the convolutional neural network implementation apparatus 33; concretely, program code is designed that implements all the module or step functions of the convolutional neural network implementation apparatus 33, so that the terminal realizes the role of the convolutional neural network implementation apparatus 33 of this embodiment simply by controlling the CPU to read and execute that program code.
An embodiment of the present invention further provides a computer-readable storage medium storing one or more programs, the one or more programs being executed to implement the steps of the methods provided by all the embodiments of the present invention.
From the implementation of the above embodiments, it can be seen that the embodiments of the present invention have the following advantageous effects:
The embodiments of the present invention provide an FPGA-based convolutional neural network implementation method and apparatus. The method first initializes the editable resources of the FPGA to generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; it then loads the weight data of the network layers of the convolutional neural network model to be implemented into the memory storage of the FPGA, associates the status registers of the FPGA with the network layers, and stores the data to be processed into the memory storage through the memory controller of the FPGA; finally, the operation control module reads the parameters of the status registers, determines the network layer to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the network layer to be run, until all the network layers of the convolutional neural network model to be implemented have been run in sequence, and outputs the processing result corresponding to the data to be processed. Throughout this process the convolutional neural network is implemented entirely by the FPGA hardware and no longer depends on software, which solves the problem that existing convolutional neural network techniques rely on software implementation.
The above content is a further detailed description of the embodiments of the present invention in connection with specific implementations, and the specific implementation of the embodiments of the present invention shall not be considered to be limited to these descriptions. For those of ordinary skill in the technical field to which the embodiments of the present invention belong, several simple deductions or substitutions may be made without departing from the concept of the embodiments of the present invention, and all of these shall be considered to fall within the protection scope of the embodiments of the present invention.
 

Claims (13)

  1. An FPGA-based convolutional neural network implementation method, characterized by comprising:
    initializing the editable resources of an FPGA, and generating an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module;
    loading the weight data of each network layer of a convolutional neural network model to be implemented into the memory storage of the FPGA, and setting the correspondence between the status registers of the FPGA and the network layers;
    storing data to be processed into the memory storage through the memory controller of the FPGA;
    the operation control module determining, according to the correspondence between the status registers and the network layers, the processing layer currently to be run, and controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the current network layer to be run, until all the network layers of the convolutional neural network model to be implemented have been run in sequence, and outputting the processing result corresponding to the data to be processed.
     
  2. The convolutional neural network implementation method according to claim 1, characterized in that the network layers comprise, in order: a convolution computation layer, a pooling operation layer, a routing operation layer, a reorganization operation layer and a classification operation layer.
  3. The convolutional neural network implementation method according to claim 1 or 2, characterized in that, when the processing layer currently to be run is the convolution computation layer, the operation control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the current network layer to be run comprises:
    controlling the data reading module to read, through the memory controller, the weight data and input data corresponding to the convolution computation layer stored in the memory storage, and to store them into the input buffer module;
    controlling the input control module to input the weight data and input data stored in the input buffer module into the neural network processing unit;
    controlling the neural network processing unit to compute on the input data using the weight data and to output a computation result;
    controlling the output control module to store the computation result into the output buffer module;
    controlling the memory controller to read the computation result in the output buffer module and to store the computation result into the memory storage.
     
  4. The convolutional neural network implementation method according to any one of claims 2 to 3, characterized in that, when the network layer to be run is the pooling operation layer, the operation control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the current network layer to be run comprises:
    controlling the data reading module to read, through the memory controller, the input data corresponding to the pooling operation layer stored in the memory storage, and to store it into the input buffer module;
    controlling the input control module to divide the input data stored in the input buffer module into multiple pooling windows, and to input it from the pooling windows into the neural network processing unit in order;
    controlling the neural network processing unit to perform the max-pooling comparison on the input data and to output a comparison result;
    controlling the output control module to store the comparison result into the output buffer module;
    controlling the memory controller to read the comparison result in the output buffer module and to store the comparison result into the memory storage.
     
  5. The convolutional neural network implementation method according to any one of claims 2 to 4, characterized in that, when the network layer to be run is the routing operation layer, the operation control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the current network layer to be run comprises:
    determining the output data of the other network layers corresponding to the input data of the current network layer, the other network layers including at least one network layer other than the current network layer;
    configuring the storage addresses, in the memory storage, of the output data of those other network layers as the input addresses of the input data of the current network layer;
    controlling the data reading module to read, from the memory storage according to the input addresses, the data corresponding to the input addresses, and to store it into the input buffer module.
     
  6. The convolutional neural network implementation method according to any one of claims 2 to 5, characterized in that, when the network layer to be run is the reorganization operation layer, the operation control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the current network layer to be run comprises:
    controlling the data reading module to read, through the memory controller, the input data corresponding to the reorganization operation layer stored in the memory storage, and to store it into the input buffer module;
    controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit;
    controlling the neural network processing unit to perform a reorganization operation on the input data and to output a reorganization result;
    controlling the output control module to store the reorganization result into the output buffer module;
    controlling the memory controller to read the reorganization result in the output buffer module and to store the reorganization result into the memory storage;
    establishing a mapping between the storage address of the input data in the memory storage and the storage address of the reorganization result in the memory storage.
  7. The convolutional neural network implementation method according to any one of claims 2 to 6, characterized in that, when the network layer to be run is the classification operation layer, the operation control module controlling the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the current network layer to be run comprises:
    controlling the data reading module to read, through the memory controller, the input data corresponding to the classification operation layer stored in the memory storage, and to store it into the input buffer module;
    controlling the input control module to input the input data stored in the input buffer module into the neural network processing unit as an input feature vector;
    controlling the neural network processing unit to perform the classification computation on the input data and to output a detection result;
    controlling the output control module to store the detection result into the output buffer module;
    controlling the memory controller to read the detection result in the output buffer module and to output the detection result.
     
  8. The convolutional neural network implementation method according to any one of claims 1 to 7, characterized in that the data to be processed is source data used by a terminal to implement face recognition.
  9. An FPGA-based convolutional neural network implementation apparatus, characterized by comprising:
    an initialization module, configured to initialize the editable resources of an FPGA and generate an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; the operation control module being configured to read the parameters of the status registers, determine the network layer to be run, and control the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the current network layer to be run, until all the network layers of the convolutional neural network model to be implemented have been run in sequence, and output the processing result corresponding to the data to be processed;
    a loading module, configured to load the weight data of each network layer of the convolutional neural network model to be implemented into the memory storage of the FPGA, set the correspondence between the status registers of the FPGA and the network layers, and store the data to be processed into the memory storage through the memory controller of the FPGA.
     
  10. The convolutional neural network implementation apparatus according to claim 9, characterized in that the neural network processing unit comprises a plurality of processing elements configured to process data in parallel.
  11. The convolutional neural network implementation apparatus according to claim 9 or 10, characterized in that the input buffer module comprises two input storage units configured to buffer the input data and/or weight data of the neural network processing unit in a ping-pong double-buffering manner; and/or, the output buffer module comprises two output storage units configured to buffer the output data of the neural network processing unit in a ping-pong double-buffering manner.
  12. A terminal, characterized by comprising: a source data input module for inputting data to be processed, a processor, and the convolutional neural network implementation apparatus according to any one of claims 9 to 11;
    the convolutional neural network implementation apparatus initializes the editable resources of the FPGA and generates, in the terminal, an input buffer module, an output buffer module, an input control module, an output control module, a neural network processing unit, a data reading module and an operation control module; it loads the weight data of each network layer of the convolutional neural network model to be implemented into the memory storage of the FPGA and sets the correspondence between the status registers of the FPGA and the network layers;
    the data to be processed is acquired through the source data input module and input into the convolutional neural network implementation apparatus;
    the convolutional neural network implementation apparatus preprocesses the data to be processed and then stores it into the memory storage through the memory controller of the FPGA, the data to be processed being data containing a face image;
    the operation control module determines, according to the correspondence between the status registers and the network layers, the processing layer currently to be run, and controls the input buffer module, output buffer module, input control module, output control module, neural network processing unit and data reading module to complete the processing of the data by the current network layer to be run, until all the network layers of the convolutional neural network model to be implemented have been run in sequence, and outputs the processing result corresponding to the data to be processed to the processor;
    the processor identifies the face image information from the data according to the processing result.
     
  13. A storage medium, characterized in that the computer storage medium stores one or more programs, the one or more programs being executable by one or more processors to implement the steps of the FPGA-based convolutional neural network implementation method according to any one of claims 1 to 8.
     
PCT/CN2018/074999 2017-12-29 2018-02-01 Convolutional neural network implementation method and apparatus, terminal, and storage medium WO2019127838A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711485144 2017-12-29
CN201711485144.0 2017-12-29

Publications (1)

Publication Number Publication Date
WO2019127838A1 true WO2019127838A1 (zh) 2019-07-04

Family

ID=63126240

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/074999 WO (zh) Convolutional neural network implementation method and apparatus, terminal, and storage medium 2017-12-29 2018-02-01

Country Status (2)

Country Link
CN (1) CN108416422B (zh)
WO (1) WO2019127838A1 (zh)


Also Published As

Publication number Publication date
CN108416422A (zh) 2018-08-17
CN108416422B (zh) 2024-03-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18896352

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 171120)

122 Ep: pct application non-entry in european phase

Ref document number: 18896352

Country of ref document: EP

Kind code of ref document: A1