CN115034351A - Data processing method, convolutional neural network training method and device, and FPGA

Info

Publication number: CN115034351A
Application number: CN202110241125.3A
Authority: CN (China)
Prior art keywords: convolution, layer, neural network, target, convolutional
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈汝丹, 张渊
Assignee (original and current): Hangzhou Hikvision Digital Technology Co Ltd
Priority: CN202110241125.3A

Classifications

    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08 Learning methods

Abstract

The embodiment of the application provides a data processing method, a convolutional neural network training method and device, and an FPGA (Field Programmable Gate Array). A picture to be processed can be acquired and stored in a preset storage space; the FPGA acquires the picture to be processed from the preset storage space, processes it through convolutional layer operation cores based on the target parameter matrix corresponding to each convolutional layer in a target convolutional neural network to obtain a final feature map, and stores the final feature map in the preset storage space. The elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one to one to the convolution kernels in that convolutional layer. The convolution kernels in a convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel. The final feature map can then be acquired from the preset storage space. In this way, the efficiency of data processing based on a convolutional neural network can be improved.

Description

Data processing method, convolutional neural network training method and device and FPGA
Technical Field
The application relates to the technical field of deep learning, and in particular to a data processing method, a convolutional neural network training method and device, and an FPGA.
Background
With the rapid development of Convolutional Neural Networks (CNNs) in the field of computer vision, running a convolutional neural network on a Central Processing Unit (CPU) to process images has become increasingly common in image processing tasks.
However, a convolutional neural network typically contains many convolutional layers, and each convolutional layer contains multiple convolution kernels, so the number of network parameters of the convolutional neural network is large. Moreover, if the convolution kernels in a convolutional layer are large, for example 3 × 3 or 5 × 5 in size, the number of network parameters increases further.
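For a sense of scale (the channel counts below are illustrative, not taken from the application), the parameter count of a single convolutional layer is the product of its output-channel count, input-channel count, and kernel area, so it grows quadratically with kernel size:

```python
# Parameter count of one convolutional layer: C_out * C_in * k * k
# (bias terms omitted; the channel counts are hypothetical examples).
def conv_params(c_out: int, c_in: int, k: int) -> int:
    return c_out * c_in * k * k

print(conv_params(512, 256, 1))  # 1 x 1 kernels:   131,072 parameters
print(conv_params(512, 256, 3))  # 3 x 3 kernels: 1,179,648 parameters
print(conv_params(512, 256, 5))  # 5 x 5 kernels: 3,276,800 parameters
```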
The large number of network parameters of the convolutional neural network increases the amount of computation. In addition, running a convolutional neural network on a CPU is a serial processing method, that is, convolution is performed sequentially, one kernel at a time; as a result, the efficiency of data processing by the convolutional neural network is reduced.
Disclosure of Invention
The embodiment of the application aims to provide a data processing method, a convolutional neural network training method and device, and an FPGA (Field Programmable Gate Array), so as to improve the efficiency of data processing based on a convolutional neural network. The specific technical scheme is as follows:
In a first aspect, to achieve the above object, an embodiment of the present application discloses a data processing method, where the method includes:
acquiring a picture to be processed;
storing the picture to be processed into a preset storage space, so that an FPGA (Field Programmable Gate Array) can acquire the picture to be processed from the preset storage space, process it through a convolutional layer operation core based on the target parameter matrix corresponding to each convolutional layer in a target convolutional neural network to obtain a final feature map, and store the final feature map into the preset storage space;
the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one to one to the convolution kernels in that convolutional layer; the convolution kernels in the convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel; the numbers of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is a network obtained by, for each convolutional layer in an initial convolutional neural network, compressing specified partial convolution kernels in the convolutional layer into compressed convolution kernels and then performing model training; and the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and the output parallelism of the convolutional layer operation core of the FPGA;
and acquiring the final feature map from the preset storage space.
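A minimal host-side sketch of this first aspect is given below. The SharedStore class and its method names are placeholders for the preset storage space shared with the FPGA; they are not an interface defined in the application.

```python
import numpy as np

class SharedStore:
    """Stand-in for the preset storage space shared with the FPGA."""
    def __init__(self):
        self._slots = {}
    def write(self, key, value):
        self._slots[key] = value
    def read(self, key):
        return self._slots.get(key)

def process_picture(store: SharedStore, picture: np.ndarray) -> np.ndarray:
    store.write("picture", picture)          # store the picture to be processed
    # ... the FPGA reads "picture", runs the target convolutional neural
    # network through its convolutional layer operation cores, and writes
    # "final_feature_map" back to the store ...
    return store.read("final_feature_map")   # acquire the final feature map
```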
Optionally, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; the convolution kernels of at least two of the element groups have the same size, and the convolution kernels of at least two of the element groups have different sizes; M is the input parallelism of the convolutional layer operation core, and N is the output parallelism of the convolutional layer operation core, so that the FPGA performs convolution processing in a parallel manner through the convolutional layer operation core based on the target parameter matrix corresponding to the convolutional layer.
Optionally, the process of obtaining the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network includes:
for each convolutional layer in the initial convolutional neural network, determining a compression step for compressing the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer and the input parallelism and the output parallelism of the convolutional layer operation core; wherein the compression step is used to indicate the proportion of convolution kernels in the convolutional layer that need to be compressed when the convolutional layer is compressed;
compressing the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step, to obtain an alternative convolutional neural network;
training the alternative convolutional neural network to obtain a target convolutional neural network;
and generating a target parameter matrix corresponding to each convolution layer in the target convolution neural network.
Optionally, the determining, for each convolutional layer in the initial convolutional neural network, a compression step for compressing the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer, and the input parallelism and the output parallelism of the convolutional layer operation core includes:
for each convolutional layer in the initial convolutional neural network, calculating the ratio of the number of input channels of the convolutional layer to the input parallelism of the convolutional layer operation core as a first ratio, and calculating the ratio of the number of output channels of the convolutional layer to the output parallelism of the convolutional layer operation core as a second ratio;
determining a first compression step and a second compression step for compressing the convolutional layer based on the first ratio and the second ratio; wherein the first compression step is used to indicate the proportion of convolution kernels, among the convolution kernels corresponding to each output channel, that need to be compressed when the convolutional layer is compressed; and the second compression step is used to indicate the proportion of convolution kernels, among the convolution kernels corresponding to each input channel, that need to be compressed when the convolutional layer is compressed.
Optionally, the determining, based on the first ratio and the second ratio, a first compression step and a second compression step when compressing the convolutional layer includes:
if the input parallelism of the convolutional layer operation core is equal to the output parallelism, determining the first compression step and the second compression step for compressing the convolutional layer based on the smaller of the first ratio and the second ratio;
and if the input parallelism of the convolutional layer operation core is not equal to the output parallelism, determining the first compression step based on the first ratio, and determining the second compression step based on the second ratio.
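The sketch below illustrates one plausible reading of this rule. The application does not spell out how a ratio maps to a step size, so using the integer ratios directly as the compression steps is an assumption made here for illustration.

```python
# Assumed mapping: the compression step equals the (integer) ratio of
# channel count to parallelism; with equal parallelism, both steps come
# from the smaller ratio.
def compression_steps(c_in: int, c_out: int, m: int, n: int):
    r1 = c_in // m    # first ratio: input channels / input parallelism
    r2 = c_out // n   # second ratio: output channels / output parallelism
    if m == n:
        step = min(r1, r2)   # equal parallelism: use the smaller ratio
        return step, step    # (first compression step, second compression step)
    return r1, r2            # otherwise, one step per ratio

print(compression_steps(64, 128, 8, 8))   # (8, 8)
print(compression_steps(64, 128, 8, 4))   # (8, 32)
```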
Optionally, the compressing, according to the determined compression step, the specified partial convolution kernels in the convolutional layer into target convolution kernels to obtain an alternative convolutional neural network includes:
dividing the convolution kernels corresponding to each output channel in the convolutional layer evenly into a plurality of first convolution kernel groups, in the order of their corresponding input channels, where the number of convolution kernels in each first convolution kernel group is the first compression step; for each first convolution kernel group, determining the convolution kernels other than the convolution kernel at a specified position in that group as first convolution kernels to be processed; and compressing the first convolution kernels to be processed in each first convolution kernel group into target convolution kernels to obtain the alternative convolutional neural network; or,
dividing the convolution kernels corresponding to each input channel in the convolutional layer evenly into a plurality of second convolution kernel groups, in the order of their corresponding output channels, where the number of convolution kernels in each second convolution kernel group is the second compression step; for each second convolution kernel group, determining the convolution kernels other than the convolution kernel at a specified position in that group as second convolution kernels to be processed; and compressing the second convolution kernels to be processed in each second convolution kernel group into target convolution kernels to obtain the alternative convolutional neural network.
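A minimal sketch of the first grouping scheme above, assuming the "specified position" is the first kernel of each group (the application leaves that position open):

```python
# Per output channel, the kernels (one per input channel) are split, in
# input-channel order, into groups of `step` kernels; every kernel except
# the one at `keep_pos` is marked for compression into a target kernel.
def kernels_to_compress(num_in_channels: int, step: int, keep_pos: int = 0):
    to_compress = []
    for group_start in range(0, num_in_channels, step):
        group = list(range(group_start, group_start + step))
        to_compress += [c for i, c in enumerate(group) if i != keep_pos]
    return to_compress

print(kernels_to_compress(8, 4))  # [1, 2, 3, 5, 6, 7]; kernels 0 and 4 stay
```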
Optionally, the generating a target parameter matrix corresponding to each convolutional layer in the target convolutional neural network includes:
for each convolutional layer in the target convolutional neural network, acquiring an initial parameter matrix corresponding to the convolutional layer; wherein the elements in the initial parameter matrix correspond one to one to the convolution kernels in the convolutional layer, and are arranged according to the order of the input channels and output channels to which the corresponding convolution kernels belong;
and adjusting the positions of rows and columns in the initial parameter matrix based on the first compression step and the second compression step, to obtain the target parameter matrix corresponding to the convolutional layer.
Optionally, the adjusting the positions of rows and columns in the initial parameter matrix based on the first compression step and the second compression step to obtain the target parameter matrix corresponding to the convolutional layer includes:
determining, as first elements, the elements corresponding to the uncompressed convolution kernels in each candidate row of the initial parameter matrix; wherein the candidate rows comprise the rows in which the elements corresponding to the uncompressed convolution kernels in the first column of the initial parameter matrix are located;
moving the columns of every adjacent first-compression-step first elements to adjacent positions;
determining, as second elements, the elements corresponding to the uncompressed convolution kernels in each candidate column of the moved initial parameter matrix; wherein the candidate columns comprise the columns in which the elements corresponding to the uncompressed convolution kernels in the first row of the initial parameter matrix are located;
and moving the rows of every adjacent second-compression-step second elements to adjacent positions, to obtain the target parameter matrix corresponding to the convolutional layer.
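The sketch below is a deliberately simplified stand-in for this rearrangement: instead of the pairwise adjacency moves described above, it groups the columns, and then the rows, that hold uncompressed kernels with a stable sort, which yields the same kind of same-size grouping in simple cases.

```python
import numpy as np

# matrix: the initial parameter matrix (rows = output channels, columns =
# input channels); uncompressed: a boolean mask of the same shape, True
# where the element corresponds to an uncompressed convolution kernel.
def regroup(matrix: np.ndarray, uncompressed: np.ndarray) -> np.ndarray:
    # Bring columns holding uncompressed kernels (per the first row) together.
    col_order = np.argsort(~uncompressed[0, :], kind="stable")
    matrix, uncompressed = matrix[:, col_order], uncompressed[:, col_order]
    # Then bring rows holding uncompressed kernels (per the first column) together.
    row_order = np.argsort(~uncompressed[:, 0], kind="stable")
    return matrix[row_order, :]
```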
In a second aspect, in order to achieve the above object, an embodiment of the present application discloses a data processing method, applied to a Field Programmable Gate Array (FPGA), the method including:
acquiring a picture to be processed from a preset storage space;
processing the picture to be processed through the convolutional layer operation core, based on the target parameter matrix of each convolutional layer in the target convolutional neural network, to obtain a final feature map;
the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one to one to the convolution kernels in that convolutional layer; the convolution kernels in the convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel; the numbers of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is a network obtained by, for each convolutional layer in an initial convolutional neural network, compressing specified partial convolution kernels in the convolutional layer into compressed convolution kernels and then performing model training; and the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and the output parallelism of the convolutional layer operation core of the FPGA;
and storing the final feature map into the preset storage space.
Optionally, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; the convolution kernels of at least two of the element groups have the same size, and the convolution kernels of at least two of the element groups have different sizes; M is the input parallelism of the convolutional layer operation core, and N is the output parallelism of the convolutional layer operation core;
and the processing of the picture to be processed through the convolutional layer operation core, based on the target parameter matrix of each convolutional layer in the target convolutional neural network, to obtain a final feature map includes:
performing convolution processing on the picture to be processed in a parallel manner through the convolutional layer operation core, based on the target parameter matrix of each convolutional layer in the target convolutional neural network, to obtain the final feature map.
In a third aspect, in order to achieve the above object, an embodiment of the present application discloses a convolutional neural network training method, including:
for each convolutional layer in the initial convolutional neural network, determining a compression step for compressing the convolution kernels in the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer and the input parallelism and the output parallelism of the convolutional layer operation core of a Field Programmable Gate Array (FPGA); wherein the compression step is used to indicate the proportion of convolution kernels in the convolutional layer that need to be compressed when the convolutional layer is compressed;
compressing the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step, to obtain an alternative convolutional neural network;
training the alternative convolutional neural network to obtain a target convolutional neural network;
generating a target parameter matrix corresponding to each convolutional layer in the target convolutional neural network; wherein the elements in the target parameter matrix corresponding to each convolutional layer correspond one to one to the convolution kernels in that convolutional layer, and the numbers of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively.
Optionally, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; the convolution kernels of at least two of the element groups have the same size, and the convolution kernels of at least two of the element groups have different sizes; M is the input parallelism of the convolutional layer operation core, and N is the output parallelism of the convolutional layer operation core.
In a fourth aspect, in order to achieve the above object, an embodiment of the present application discloses a Field Programmable Gate Array (FPGA), where the FPGA includes a DDR (Double Data Rate SDRAM) controller, an AXI (Advanced eXtensible Interface) bus, a state controller, a feature map cache region, a parameter cache region, an output cache region, an adder, a read/write controller, and at least one convolutional layer operation core with input parallelism M and output parallelism N, wherein:
the DDR controller is used for acquiring a picture to be processed from a preset storage space and storing the picture to be processed to the feature map cache region;
the state controller is used for controlling the read/write controller to read the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network from the parameter cache region, read a candidate map from the feature map cache region, and send the candidate map to the convolutional layer operation core; the candidate map is the picture to be processed or a feature map stored in the feature map cache region; the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one to one to the convolution kernels in that convolutional layer; the convolution kernels in the convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel; the numbers of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is a network obtained by, for each convolutional layer in an initial convolutional neural network, compressing specified partial convolution kernels in the convolutional layer into compressed convolution kernels and then performing model training; and the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and the output parallelism of the convolutional layer operation core of the FPGA;
the convolutional layer operation core is used for performing convolution processing on the received candidate map based on the target parameter matrix corresponding to each convolutional layer, according to a preset clock cycle;
the state controller is further configured to control the read/write controller to store, for each convolutional layer of the target convolutional neural network, the convolution calculation result of each convolutional layer operation core in each clock cycle to the output cache region; and, for each output channel of the convolutional layer, to read the offset value of the output channel from the parameter cache region, control the adder to accumulate the convolution calculation results of the output channel over the clock cycles stored in the output cache region, calculate the sum of the accumulation result and the offset value of the output channel to obtain the feature map corresponding to the output channel, and store the feature map to the output cache region;
the DDR controller is further configured to store, via the AXI bus, the feature map obtained based on each target convolutional layer in the output cache region to the feature map cache region, and to acquire, from the output cache region, the final feature map obtained based on the last convolutional layer in the target convolutional neural network and store it in the preset storage space; wherein the target convolutional layers comprise the convolutional layers in the target convolutional neural network other than the last convolutional layer.
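As a software model of the accumulation the state controller drives (the array shapes and names here are assumptions for illustration):

```python
import numpy as np

# partial_results: the per-clock-cycle convolution results stored in the
# output cache region for one output channel; bias: that channel's offset
# value read from the parameter cache region.
def accumulate_channel(partial_results, bias):
    acc = np.zeros_like(partial_results[0])
    for partial in partial_results:   # one partial result per clock cycle
        acc += partial                # accumulate via the adder
    return acc + bias                 # feature map for this output channel
```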
Optionally, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; and the convolution kernels of at least two of the element groups have the same size, and the convolution kernels of at least two of the element groups have different sizes;
and the convolutional layer operation core is used for performing convolution processing on the received candidate map in a parallel manner, based on the target parameter matrix corresponding to each convolutional layer, according to a preset clock cycle.
In a fifth aspect, to achieve the above object, an embodiment of the present application discloses a data processing apparatus, including:
the to-be-processed picture acquisition module is used for acquiring a picture to be processed;
the storage module is used for storing the picture to be processed into a preset storage space, so that a Field Programmable Gate Array (FPGA) can acquire the picture to be processed from the preset storage space, process it through a convolutional layer operation core based on the target parameter matrix corresponding to each convolutional layer in a target convolutional neural network to obtain a final feature map, and store the final feature map into the preset storage space;
the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one to one to the convolution kernels in that convolutional layer; the convolution kernels in the convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel; the numbers of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is a network obtained by, for each convolutional layer in an initial convolutional neural network, compressing specified partial convolution kernels in the convolutional layer into compressed convolution kernels and then performing model training; and the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and the output parallelism of the convolutional layer operation core of the FPGA;
and the feature map acquisition module is used for acquiring the final feature map from the preset storage space.
Optionally, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; the convolution kernels of at least two of the element groups have the same size, and the convolution kernels of at least two of the element groups have different sizes; M is the input parallelism of the convolutional layer operation core, and N is the output parallelism of the convolutional layer operation core, so that the FPGA performs convolution processing in a parallel manner through the convolutional layer operation core based on the target parameter matrix corresponding to the convolutional layer.
Optionally, the apparatus further comprises:
a compression step determining module, configured to determine, for each convolutional layer in the initial convolutional neural network, a compression step for compressing the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer, and the input parallelism and the output parallelism of the convolutional layer operation core; wherein the compression step is used to indicate the proportion of convolution kernels in the convolutional layer that need to be compressed when the convolutional layer is compressed;
the convolution kernel compression module is used for compressing the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step, to obtain an alternative convolutional neural network;
the training module is used for training the alternative convolutional neural network to obtain a target convolutional neural network;
and the target parameter matrix generating module is used for generating a target parameter matrix corresponding to each convolution layer in the target convolutional neural network.
Optionally, the compression step determining module includes:
the first calculation submodule is used for calculating the ratio of the number of input channels of each convolutional layer in the initial convolutional neural network to the input parallelism of the convolutional layer operation core as a first ratio, and calculating the ratio of the number of output channels of each convolutional layer to the output parallelism of the convolutional layer operation core as a second ratio;
a compression step determining submodule, configured to determine a first compression step and a second compression step for compressing the convolutional layer based on the first ratio and the second ratio; wherein the first compression step is used to indicate the proportion of convolution kernels, among the convolution kernels corresponding to each output channel, that need to be compressed when the convolutional layer is compressed; and the second compression step is used to indicate the proportion of convolution kernels, among the convolution kernels corresponding to each input channel, that need to be compressed when the convolutional layer is compressed.
Optionally, the compression step determining submodule is specifically configured to: if the input parallelism of the convolutional layer operation core is equal to the output parallelism, determine the first compression step and the second compression step for compressing the convolutional layer based on the smaller of the first ratio and the second ratio;
and if the input parallelism of the convolutional layer operation core is not equal to the output parallelism, determine the first compression step based on the first ratio, and determine the second compression step based on the second ratio.
Optionally, the convolution kernel compression module is specifically configured to divide the convolution kernels corresponding to each output channel in the convolutional layer evenly into a plurality of first convolution kernel groups, in the order of their corresponding input channels, where the number of convolution kernels in each first convolution kernel group is the first compression step; for each first convolution kernel group, determine the convolution kernels other than the convolution kernel at a specified position in that group as first convolution kernels to be processed; and compress the first convolution kernels to be processed in each first convolution kernel group into target convolution kernels to obtain an alternative convolutional neural network; or,
divide the convolution kernels corresponding to each input channel in the convolutional layer evenly into a plurality of second convolution kernel groups, in the order of their corresponding output channels, where the number of convolution kernels in each second convolution kernel group is the second compression step; for each second convolution kernel group, determine the convolution kernels other than the convolution kernel at a specified position in that group as second convolution kernels to be processed; and compress the second convolution kernels to be processed in each second convolution kernel group into target convolution kernels to obtain the alternative convolutional neural network.
Optionally, the target parameter matrix generating module includes:
the initial parameter matrix acquisition submodule is used for acquiring, for each convolutional layer in the target convolutional neural network, an initial parameter matrix corresponding to the convolutional layer; wherein the elements in the initial parameter matrix correspond one to one to the convolution kernels in the convolutional layer, and are arranged according to the order of the input channels and output channels to which the corresponding convolution kernels belong;
and the target parameter matrix generation submodule is used for adjusting the positions of rows and columns in the initial parameter matrix based on the first compression step and the second compression step, to obtain the target parameter matrix corresponding to the convolutional layer.
Optionally, the target parameter matrix generating sub-module includes:
a first element determining unit, configured to determine, as first elements, the elements corresponding to the uncompressed convolution kernels in each candidate row of the initial parameter matrix; wherein the candidate rows comprise the rows in which the elements corresponding to the uncompressed convolution kernels in the first column of the initial parameter matrix are located;
a first moving unit, configured to move the columns of every adjacent first-compression-step first elements to adjacent positions;
a second element determining unit, configured to determine, as second elements, the elements corresponding to the uncompressed convolution kernels in each candidate column of the moved initial parameter matrix; wherein the candidate columns comprise the columns in which the elements corresponding to the uncompressed convolution kernels in the first row of the initial parameter matrix are located;
and a second moving unit, configured to move the rows of every adjacent second-compression-step second elements to adjacent positions, to obtain the target parameter matrix corresponding to the convolutional layer.
In a sixth aspect, in order to achieve the above object, an embodiment of the present application discloses a convolutional neural network training apparatus, including:
the compression step determining module is used for determining, for each convolutional layer in the initial convolutional neural network, a compression step for compressing the convolution kernels in the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer and the input parallelism and the output parallelism of the convolutional layer operation core of a Field Programmable Gate Array (FPGA); wherein the compression step is used to indicate the proportion of convolution kernels in the convolutional layer that need to be compressed when the convolutional layer is compressed;
the convolution kernel compression module is used for compressing the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step, to obtain an alternative convolutional neural network;
the training module is used for training the alternative convolutional neural network to obtain a target convolutional neural network;
the target parameter matrix generation module is used for generating the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network; wherein the elements in the target parameter matrix corresponding to each convolutional layer correspond one to one to the convolution kernels in that convolutional layer, and the numbers of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively.
Optionally, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; the convolution kernels of at least two of the element groups have the same size, and the convolution kernels of at least two of the element groups have different sizes; M is the input parallelism of the convolutional layer operation core, and N is the output parallelism of the convolutional layer operation core.
In another aspect of this application, in order to achieve the above object, an embodiment of this application further discloses an electronic device, where the electronic device includes a processor, a communication interface, a memory, a field programmable gate array FPGA, and a communication bus, where the processor, the communication interface, the FPGA, and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the data processing method according to any one of the first aspect when executing the program stored in the memory;
and the FPGA is configured to execute the data processing method according to any one of the second aspect.
In another aspect of this application, in order to achieve the above object, an embodiment of this application further discloses an electronic device, where the electronic device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the convolutional neural network training method according to any one of the third aspect when executing the program stored in the memory.
In yet another aspect of the embodiments of the present application, there is further provided a computer-readable storage medium having instructions stored therein which, when run on a computer, implement the data processing method according to any one of the first aspect described above, or the convolutional neural network training method according to any one of the third aspect described above.
In yet another aspect of the embodiments of the present application, there is further provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data processing method according to any one of the first aspect described above, or the convolutional neural network training method according to any one of the third aspect described above.
The embodiment of the application provides a data processing method, which can acquire a picture to be processed and store it into a preset storage space. Correspondingly, the FPGA acquires the picture to be processed from the preset storage space, processes it through a convolutional layer operation core based on the target parameter matrix corresponding to each convolutional layer in a target convolutional neural network to obtain a final feature map, and stores the final feature map into the preset storage space. The elements in the target parameter matrix corresponding to each convolutional layer correspond one to one to the convolution kernels in that convolutional layer. The convolution kernels in the convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel. The numbers of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively. The target convolutional neural network is a network obtained by, for each convolutional layer in an initial convolutional neural network, compressing specified partial convolution kernels in the convolutional layer into compressed convolution kernels and then performing model training. The specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and the output parallelism of the convolutional layer operation core of the FPGA.
Based on this processing, compressing the specified partial convolution kernels in a convolutional layer reduces the number of network parameters of the convolutional layer, and therefore the number of network parameters of the target convolutional neural network, which reduces the amount of computation. Furthermore, since the FPGA performs convolution processing on the picture to be processed based on the compressed convolutional layers, the efficiency of data processing based on the convolutional neural network can be improved.
Of course, it is not necessary for any product or method of the present application to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present application;
fig. 2 is a flowchart of a method for obtaining a target parameter matrix in a data processing process according to an embodiment of the present application;
fig. 3 is a flowchart of another method for obtaining a target parameter matrix in a data processing process according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a structure of a convolutional layer provided in the present embodiment;
FIG. 5 is a schematic diagram of a convolution kernel arrangement resulting from compressing the convolutional layer of FIG. 4;
FIG. 6 is a schematic diagram of another arrangement of convolution kernels resulting from compressing the convolutional layer of FIG. 4;
FIG. 7 is a diagram illustrating another exemplary convolution kernel arrangement provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of another convolution kernel arrangement provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a convolution kernel arrangement resulting from compressing the convolutional layer corresponding to FIG. 8;
FIG. 10 is a schematic diagram of a convolution kernel arrangement derived from a column shift based on the arrangement of convolution kernels of FIG. 9;
FIG. 11 is a schematic diagram of a convolution kernel arrangement derived from a row shift based on the arrangement of convolution kernels of FIG. 10;
FIG. 12 is a schematic diagram of a convolution kernel arrangement resulting from a shift in rows and columns based on the arrangement of convolution kernels of FIG. 9;
fig. 13 is another flowchart of a data processing method according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of an FPGA provided in the embodiment of the present application;
fig. 15 is a schematic diagram illustrating a principle of data processing by running a convolutional neural network based on an FPGA according to an embodiment of the present application;
fig. 16 is a block diagram of a data processing apparatus according to an embodiment of the present application;
fig. 17 is a structural diagram of a convolutional neural network training device according to an embodiment of the present application;
fig. 18 is a block diagram of an electronic device according to an embodiment of the present application;
fig. 19 is another structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
In the related art, the number of network parameters of a convolutional neural network is large, and running the convolutional neural network on a CPU processes data in a serial manner, which reduces the efficiency of data processing based on the convolutional neural network.
In order to solve the above problem, an embodiment of the present application provides a data processing method, and referring to fig. 1, fig. 1 is a flowchart of the data processing method provided by the embodiment of the present application, and the method may include the following steps:
S101: acquiring a picture to be processed.
S102: storing the picture to be processed into a preset storage space, so that the FPGA acquires the picture to be processed from the preset storage space, processes it through a convolutional layer operation core based on the target parameter matrix corresponding to each convolutional layer in a target convolutional neural network to obtain a final feature map, and stores the final feature map into the preset storage space.
The elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one to one to the convolution kernels in that convolutional layer. The convolution kernels in the convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel. The numbers of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively. The target convolutional neural network is a network obtained by, for each convolutional layer in an initial convolutional neural network, compressing specified partial convolution kernels in the convolutional layer into compressed convolution kernels and then performing model training. The specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and the output parallelism of the convolutional layer operation core of the FPGA.
S103: acquiring the final feature map from the preset storage space.
According to the data processing method provided by the embodiment of the application, the specified partial convolution kernels in the convolution layer are compressed, so that the number of network parameters of the convolution layer can be reduced, the number of network parameters of a target convolution neural network can be reduced, and the calculation amount can be reduced. Furthermore, the FPGA performs convolution processing on the picture to be processed based on the compressed convolution layer, and the efficiency of performing data processing based on the convolution neural network can be improved.
In one embodiment, the method may be applied to an electronic device configured to process the acquired picture to be processed; for example, the electronic device may be a monitoring device, or a server capable of communicating with a monitoring device.
For step S102, a target convolutional neural network may be trained in advance, and the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network may be obtained and written into a cache of the FPGA (for example, the parameter cache region mentioned later in this embodiment of the present application). Correspondingly, after acquiring the picture to be processed from the preset storage space, the FPGA can acquire the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network from the parameter cache region and process the picture to be processed.
In one embodiment, part of the convolution kernels in a convolutional layer can be deleted, which further reduces the number of network parameters of the convolutional layer, and thus the number of network parameters of the target convolutional neural network and the amount of computation. Accordingly, the number of rows and the number of columns of the target parameter matrix may be smaller than the number of output channels and the number of input channels of the convolutional layer, respectively.
The FPGA may include one convolutional layer operation core or a plurality of convolutional layer operation cores. The input parallelism and the output parallelism of a convolutional layer operation core indicate the number of computing units it contains: for example, a convolutional layer operation core with input parallelism M and output parallelism N may include M × N computing units.
In one embodiment, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels of the convolutional layer. Each element group contains M × N elements, and each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels. The convolution kernels of at least two of the element groups have the same size, and the convolution kernels of at least two of the element groups have different sizes.
The size of a convolution kernel indicates the number of elements it contains; for example, a convolution kernel of size 1 × 1 contains one element, and a convolution kernel of size 3 × 3 contains 9 elements.
It can be understood that the uncompressed convolution kernels are the original convolution kernels in the convolutional neural network; accordingly, if an uncompressed convolution kernel in the convolutional neural network is compressed, a corresponding compressed convolution kernel is obtained, that is, a compressed convolution kernel results from compressing an uncompressed convolution kernel. Compressing an uncompressed convolution kernel means reducing the number of elements it contains. For example, an uncompressed convolution kernel of size 3 × 3 may be compressed into a compressed convolution kernel of size 1 × 1, and an uncompressed convolution kernel of size 5 × 5 into a compressed convolution kernel of size 3 × 3.
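The application specifies the size reduction (for example, 3 × 3 to 1 × 1) but not which elements are retained; the sketch below assumes, purely for illustration, that the central window of the kernel is kept.

```python
import numpy as np

# Reduce a k x k kernel to target x target by keeping its central window.
# Center-cropping is an assumption; the application only fixes the sizes.
def compress_kernel(kernel: np.ndarray, target: int) -> np.ndarray:
    k = kernel.shape[0]
    start = (k - target) // 2
    return kernel[start:start + target, start:start + target]

k5 = np.arange(25, dtype=float).reshape(5, 5)
print(compress_kernel(k5, 3).shape)  # (3, 3): 5 x 5 compressed to 3 x 3
print(compress_kernel(k5, 1).shape)  # (1, 1): 5 x 5 compressed to 1 x 1
```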
That is, the convolution kernels corresponding to the M × N elements contained in each element group of the target parameter matrix have the same size, so the data of those convolution kernels can be written into the M × N computing units of the convolutional layer operation core. Correspondingly, each computing unit performs convolution processing based on the convolution kernel data written into it, so convolution processing can be carried out in a parallel manner within one clock cycle of the FPGA, improving the efficiency of data processing.
In addition, if the FPGA includes a plurality of convolutional layer operation cores, convolution operations can be performed simultaneously on the plurality of operation cores within one clock cycle, further improving the efficiency of data processing.
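A toy software model of one such clock cycle is sketched below; the real behavior is fixed in FPGA logic, and this only mirrors the data flow of M × N computing units working in parallel.

```python
import numpy as np

# windows: M input-channel patches, each k x k; kernels: N lists of M
# k x k kernels (one element group). All M x N multiply-accumulates
# happen in parallel in hardware; the loops here only model the math.
def one_clock_cycle(windows, kernels):
    out = np.zeros(len(kernels))
    for j, row in enumerate(kernels):         # N output channels
        for i, window in enumerate(windows):  # M input channels
            out[j] += np.sum(row[i] * window)
    return out                                # N partial sums for this cycle

wins = [np.ones((3, 3)) for _ in range(2)]           # M = 2
ks = [[np.full((3, 3), 0.1) for _ in range(2)]] * 4  # N = 4
print(one_clock_cycle(wins, ks))                     # [1.8 1.8 1.8 1.8]
```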
In one embodiment, referring to fig. 2, the process of obtaining the target parameter matrix may include the following steps:
S201: for each convolutional layer in the initial convolutional neural network, determining a compression step for compressing the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer and the input parallelism and the output parallelism of the convolutional layer operation core.
Wherein the compression step is used to indicate the proportion of convolution kernels in the convolutional layer that need to be compressed when the convolutional layer is compressed.
S202: compressing the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step, to obtain the alternative convolutional neural network.
S203: and training the alternative convolutional neural network to obtain the target convolutional neural network.
S204: and generating a target parameter matrix corresponding to each convolution layer in the target convolution neural network.
In an embodiment of the present application, the convolutional layers are compressed, i.e., a portion of the convolutional kernels in the convolutional layers are reduced in size, so that the convolutional layers contain convolutional kernels of different sizes.
In one embodiment, the size of the convolution kernels in the convolutional layer may be 3 × 3, or may be 5 × 5, but is not limited thereto. Accordingly, if the size of the convolution kernels in the convolutional layer is 3 × 3, the specified partial convolution kernels in the convolutional layer can be compressed into 1 × 1 target convolution kernels; if the size of the convolution kernels in the convolutional layer is 5 × 5, the specified partial convolution kernels may be compressed into 3 × 3 target convolution kernels, or into 1 × 1 target convolution kernels.
In one embodiment, the partial convolution kernels specified in the convolutional layer may all be compressed to target convolution kernels of the same size. For example, if the size of the convolution kernels in the convolutional layer is 5 × 5, the specified partial convolution kernels may be uniformly compressed to 3 × 3 target convolution kernels; alternatively, they may be uniformly compressed to 1 × 1 target convolution kernels.
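The text does not specify how a compressed kernel's weights are initialized (the compressed network is retrained in step S203 in any case), so the center-crop rule in the following sketch is only one plausible choice, shown for illustration:

```python
# Hedged sketch: the application does not state how a compressed kernel's
# weights are initialized (the network is retrained afterwards); center-
# cropping the original kernel is one plausible choice, used here only to
# illustrate "compressing a 5 x 5 kernel to 3 x 3 or 1 x 1".
import numpy as np

def compress_kernel(kernel: np.ndarray, target: int) -> np.ndarray:
    """Shrink a square k x k kernel to target x target by taking its center."""
    k = kernel.shape[0]
    assert target <= k and (k - target) % 2 == 0
    off = (k - target) // 2
    return kernel[off:off + target, off:off + target].copy()

k5 = np.arange(25, dtype=float).reshape(5, 5)
print(compress_kernel(k5, 3).shape)  # (3, 3)
print(compress_kernel(k5, 1))        # [[12.]] -- the center element
```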
In one embodiment, on the basis of fig. 2, referring to fig. 3, step S201 may include the steps of:
S2011: for each convolutional layer in the initial convolutional neural network, calculating the ratio of the number of input channels of the convolutional layer to the input parallelism of the convolutional layer operation core as a first ratio, and calculating the ratio of the number of output channels of the convolutional layer to the output parallelism of the convolutional layer operation core as a second ratio.
S2012: and determining a first compression step size and a second compression step size when the convolutional layer is compressed based on the first ratio and the second ratio.
Wherein the first compression step size is used to indicate: when the convolutional layer is compressed, the proportion of convolution kernels that need to be compressed among the convolution kernels corresponding to each output channel. The second compression step size is used to indicate: when the convolutional layer is compressed, the proportion of convolution kernels that need to be compressed among the convolution kernels corresponding to each input channel.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a convolutional layer provided in the embodiment of the present application. In fig. 4, N represents the number of output channels of the convolutional layer, and C represents the number of input channels of the convolutional layer. In a convolutional neural network, the number of output channels of one convolutional layer is the same as the number of input channels of the next convolutional layer of the convolutional layer. The number of input channels to a convolutional layer is the same as the number of feature maps to be input to the convolutional layer.
In fig. 4, the convolutional layer has 4 output channels and 4 input channels, that is, the convolutional layer includes 4 × 4 = 16 convolution kernels, denoted K00, K01, K02, …, K31, K32, K33.
K00, K01, K02 and K03 correspond to the same output channel; K10, K11, K12 and K13 correspond to the same output channel; K20, K21, K22 and K23 correspond to the same output channel; K30, K31, K32 and K33 correspond to the same output channel.
K00, K10, K20 and K30 correspond to the same input channel; K01, K11, K21 and K31 correspond to the same input channel; K02, K12, K22 and K32 correspond to the same input channel; K03, K13, K23 and K33 correspond to the same input channel.
Since the number of input channels of the convolutional layer is 4, 4 feature maps can be input to the convolutional layer. Correspondingly, each of the convolution kernels corresponding to the same output channel (e.g., K00, K01, K02 and K03) can perform convolution processing on one of the 4 feature maps; then, for the results of the 4 convolution operations, the values at corresponding positions can be added, and the added result can be added to the offset value of the output channel to obtain the output result corresponding to that output channel. Accordingly, the convolutional layer can output 4 feature maps to the next convolutional layer.
In order to match the structure of the compressed convolution layer with the convolution layer operation core of the FPGA, the number of input channels of the convolution layer may be calculated as a ratio (i.e., a first ratio) to the input parallelism M of the convolution layer operation core of the FPGA, and the number of output channels of the convolution layer may be calculated as a ratio (i.e., a second ratio) to the output parallelism N of the convolution layer operation core of the FPGA.
Further, a first compression step size and a second compression step size when compressing the convolutional layer may be determined based on the first ratio and the second ratio.
For example, the first compression step size may be denoted by P1, and the second compression step size by P2. For the structure of the convolutional layer in fig. 4, if the sizes of the convolution kernels in the convolutional layer are all 3 × 3 and P1 = 2, P2 = 2, then, when the convolutional layer is compressed, the proportion of convolution kernels that need to be compressed among the convolution kernels corresponding to each output channel is 1/2, and the proportion of convolution kernels that need to be compressed among the convolution kernels corresponding to each input channel is 1/2. Accordingly, the structure shown in fig. 5 can be obtained.
In fig. 5, the size of the white convolution kernels is 3 × 3, and the size of the gray convolution kernels is 1 × 1. It can be seen that, among the convolution kernels corresponding to each output channel, one of every two convolution kernels is compressed into a smaller convolution kernel (i.e., a 1 × 1 convolution kernel), and among the convolution kernels corresponding to each input channel, one of every two convolution kernels is compressed into a smaller convolution kernel. That is, the proportion of compressed convolution kernels among the convolution kernels corresponding to each output channel is 1/2, and the proportion of compressed convolution kernels among the convolution kernels corresponding to each input channel is 1/2.
For the structure of the convolutional layer in fig. 4, if the sizes of the convolution kernels in the convolutional layer are all 3 × 3 and P1 = 4, P2 = 4, then, when the convolutional layer is compressed, the proportion of convolution kernels that need to be compressed among the convolution kernels corresponding to each output channel is 3/4, and the proportion among the convolution kernels corresponding to each input channel is 3/4. Accordingly, the structure shown in fig. 6 can be obtained.
In fig. 6, the size of the white convolution kernels is 3 × 3, and the size of the gray convolution kernels is 1 × 1. It can be seen that, among the convolution kernels corresponding to each output channel, three of every four convolution kernels are compressed into a smaller convolution kernel (i.e., a 1 × 1 convolution kernel), and among the convolution kernels corresponding to each input channel, three of every four convolution kernels are compressed into a smaller convolution kernel. That is, the proportion of compressed convolution kernels among the convolution kernels corresponding to each output channel is 3/4, and the proportion among the convolution kernels corresponding to each input channel is 3/4.
In one embodiment, the first compression step size and the second compression step size may be determined in different manners by combining the magnitude relationship between the input parallelism and the output parallelism of the convolutional layer arithmetic core.
The method I comprises the following steps: and if the input parallelism of the convolution layer operation core is equal to the output parallelism, determining a first compression step size and a second compression step size when the convolution layer is compressed based on the smaller value of the first ratio and the second ratio.
In one embodiment, if the input parallelism of the convolution layer arithmetic core is equal to the output parallelism, the first compression step size and the second compression step size may be determined based on a smaller value of the first ratio and the second ratio.
For example, if the number of output channels of the convolutional layer is 96, the number of input channels is 128, the input parallelism M of the convolutional layer operation core is 32, and the output parallelism N is 32, then the first ratio is 128/32 = 4 and the second ratio is 96/32 = 3; further, it can be determined that P1 = P2 = 3. Accordingly, see fig. 7.
In fig. 7, compression is performed with the step size P = 3 in both the direction of the input channels and the direction of the output channels. Since the number of input channels is greater than the number of output channels, the convolution kernels corresponding to the surplus input channels may be left uncompressed, which helps ensure that the compressed convolutional neural network retains higher accuracy.
The second method comprises the following steps: and if the input parallelism of the convolution layer operation core is not equal to the output parallelism, determining a first compression step length when the convolution layer is compressed based on a first ratio, and determining a second compression step length when the convolution layer is compressed based on a second ratio.
In one embodiment, if the input parallelism of the convolutional layer arithmetic core is not equal to the output parallelism, a first compression step size may be determined based on a first ratio and a second compression step size may be determined based on a second ratio.
For example, if the number of output channels of the convolutional layer is 192, the number of input channels is 64, the input parallelism M of the convolutional layer operation core is 32, and the output parallelism N is 64, then the first ratio is 64/32 = 2 and the second ratio is 192/64 = 3; further, it can be determined that P1 = 2 and P2 = 3.
In one embodiment, in order to avoid a large number of compressed convolution kernels and thus a low accuracy of the target convolutional neural network, when the determined compression step size is larger than a preset step size, a divisor of the determined compression step size may be calculated as a final compression step size.
For example, if the preset step size is 5, the number of output channels of the convolutional layer is 128, the number of input channels is 192, the input parallelism M of the convolutional layer operation core is 32, and the output parallelism N is 64, then the first ratio is 192/32 = 6 and the second ratio is 128/64 = 2; further, it can be determined that P1 = 6 and P2 = 2. Since the calculated P1 is greater than the preset step size, a divisor of 6 (2 or 3) may be taken as the final compression step size, i.e., P1 = 2 and P2 = 2, or P1 = 3 and P2 = 2.
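Putting modes one and two together with the divisor fallback, a sketch of the step-size selection might look as follows; the integer-ratio assumption and the "largest valid divisor" tie-break are assumptions of this sketch, not requirements of the text:

```python
# Sketch of the step-size selection (S2011-S2012 plus the divisor fallback).
# Assumptions of this sketch: channel counts divide the parallelism evenly,
# and when a step exceeds the preset limit we take the largest divisor that
# fits -- the text allows any divisor (e.g. 2 or 3 for a step of 6).
def compression_steps(c_in, c_out, m, n, preset=5):
    r1 = c_in // m         # first ratio: input channels / input parallelism
    r2 = c_out // n        # second ratio: output channels / output parallelism
    if m == n:
        p1 = p2 = min(r1, r2)   # mode one: equal parallelism, use the smaller ratio
    else:
        p1, p2 = r1, r2         # mode two: use each ratio directly

    def cap(p):
        if p <= preset:
            return p
        # fall back to a divisor of p within the preset limit (assumed to exist)
        return max(d for d in range(2, preset + 1) if p % d == 0)

    return cap(p1), cap(p2)

print(compression_steps(128, 96, 32, 32))   # (3, 3)  -- the fig. 7 example
print(compression_steps(64, 192, 32, 64))   # (2, 3)  -- the mode-two example
print(compression_steps(192, 128, 32, 64))  # (3, 2)  -- P1 = 6 capped to 3
```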
In one embodiment, if the number of input channels of the convolutional layer is not an integer multiple of the input parallelism of the convolutional layer arithmetic core, the input channels of the convolutional layer can be further expanded so that the number of input channels of the convolutional layer after expansion is an integer multiple of the input parallelism of the convolutional layer arithmetic core. The extended input channel may have a convolution kernel value of 0.
Similarly, if the number of output channels of the convolution layer is not an integer multiple of the output parallelism of the convolution layer operation core, the output channels of the convolution layer can be expanded, so that the number of the output channels of the convolution layer after expansion is an integer multiple of the output parallelism of the convolution layer operation core. The extended output channel may have a convolution kernel value of 0.
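A minimal sketch of this channel-expansion rule (the function name is illustrative): pad the channel count up to the next integer multiple of the corresponding parallelism, with the padded channels given all-zero convolution kernels so the result is unaffected.

```python
# Minimal sketch of the channel-expansion rule (function name is illustrative).
def pad_channels(channels: int, parallelism: int) -> int:
    remainder = channels % parallelism
    return channels if remainder == 0 else channels + (parallelism - remainder)

print(pad_channels(96, 32))   # 96  -- already an integer multiple
print(pad_channels(100, 32))  # 128 -- 28 zero-kernel channels appended
```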
In one embodiment, the compression may be performed in different ways, resulting in an alternative convolutional neural network.
The method I comprises the following steps: averagely dividing each convolution kernel corresponding to each output channel in the convolution layer into a plurality of first convolution kernel groups according to the sequence of corresponding input channels; for each first convolution kernel group, determining other convolution kernels except the convolution kernel at the specified position in the first convolution kernel group as first convolution kernels to be processed; and compressing the first to-be-processed convolution kernels in each first convolution kernel group into a target convolution kernel to obtain the alternative convolution neural network.
Wherein, the number of convolution kernels contained in each first convolution kernel group is a first compression step size.
In one embodiment, referring to FIG. 8, FIG. 8 is a schematic diagram of the arrangement of convolution kernels in a 12 × 12 convolutional layer. Each row represents a convolution kernel for one output channel and each column represents a convolution kernel for one input channel.
If the input parallelism of the convolutional layer operation core is 3 and the output parallelism is 3, then P1 = P2 = 4.
Therefore, each convolution kernel corresponding to each output channel in the convolution layer can be divided into 3 convolution kernel groups (i.e., the first convolution kernel group) on average according to the order of the corresponding input channels. Each first set of convolution kernels contains 4 convolution kernels.
For example, for the convolution kernels of the first row, K00, K01, K02 and K03 may be determined as a first convolution kernel group; K04, K05, K06 and K07 may be determined as a first convolution kernel group; and K08, K09, K010 and K011 may be determined as a first convolution kernel group.
Then, for each first convolution kernel group, determining other convolution kernels except the convolution kernel at the specified position in the first convolution kernel group as the first convolution kernel to be processed.
In one embodiment, the designated position in the first row may be the first position in each first convolution kernel group. For the above example, the convolution kernels at the designated positions include K00, K04 and K08.
Further, K01, K02, K03, K05, K06, K07, K09, K010 and K011 may be compressed into smaller target convolution kernels.
Similarly, the designated position in the second row may be the second position in each first convolution kernel group; accordingly, K10, K12, K13, K14, K16, K17, K18, K110 and K111 may be compressed into smaller target convolution kernels. By analogy, the arrangement of convolution kernels shown in fig. 9 can be obtained. In fig. 9, the gray convolution kernels represent compressed convolution kernels and the white convolution kernels represent uncompressed convolution kernels.
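The staggered pattern of fig. 9 can be reproduced with a small sketch; the rule that row r keeps position r mod P1 within each group is inferred from the first- and second-row examples above and is an assumption for the remaining rows:

```python
# Sketch of "mode one" on the 12 x 12 example. The rule that row r keeps the
# position r mod P1 within each first convolution kernel group is inferred
# from the first- and second-row examples above (an assumption beyond them).
import numpy as np

rows = cols = 12
P1 = 4
keep = np.zeros((rows, cols), dtype=bool)  # True = kernel stays uncompressed
for r in range(rows):
    for g in range(cols // P1):            # walk the first convolution kernel groups
        keep[r, g * P1 + r % P1] = True

print(keep.astype(int))
# Row 0 keeps columns 0, 4, 8 (K00, K04, K08); row 1 keeps columns 1, 5, 9; ...
```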
The second method comprises the following steps: averagely dividing each convolution kernel corresponding to each input channel in the convolution layer into a plurality of second convolution kernel groups according to the sequence of corresponding output channels; for each second convolution kernel group, determining other convolution kernels except the convolution kernel at the specified position in the second convolution kernel group as second convolution kernels to be processed; and compressing the second convolution kernels to be processed in each second convolution kernel group into a target convolution kernel to obtain the alternative convolution neural network.
Wherein the number of convolution kernels contained in each second convolution kernel group is the second compression step size. In the embodiment of the present application, the processing is similar to that of the first mode, and reference may be made to the related description.
In the actual operation process, the compression of the convolution kernel can be performed from the direction of the input channel, and also can be performed from the direction of the output channel. As long as it is ensured that the direction of the input channel and the direction of the output channel are compressed according to their respective compression step lengths.
In one embodiment, step S204 may include the steps of:
Step 1: for each convolutional layer in the target convolutional neural network, acquiring an initial parameter matrix corresponding to the convolutional layer.
Wherein, the elements in the initial parameter matrix correspond to the convolution kernels in the convolution layer one by one. The elements in the initial parameter matrix are arranged according to the sequence of the input channel and the output channel to which the corresponding convolution kernel belongs.
In one embodiment, based on the arrangement of convolution kernels shown in fig. 9, a corresponding parameter matrix can be obtained, i.e., the initial parameter matrix of the convolutional layer.
Step 2: and adjusting the positions of rows and columns in the initial parameter matrix based on the first compression step length and the second compression step length to obtain a target parameter matrix corresponding to the convolutional layer.
In one embodiment, the step 2 may include the steps of:
Step one: determining, for each alternative row in the initial parameter matrix, the elements corresponding to uncompressed convolution kernels in that row as first elements.
Wherein the alternative rows include: the row in the first column of the initial parameter matrix where the element corresponding to the uncompressed convolution kernel is located.
Step two: for every group of the first compression step size adjacent first elements, moving the columns where those elements are located to adjacent positions.
In the embodiment of the present application, for the initial parameter matrix shown in fig. 9, the uncompressed convolution kernels in the first column are K00, K40 and K80. Thus, it may be determined that the alternative rows include the first row, the fifth row, and the ninth row.
The uncompressed convolution kernels in the first row are K00, K04 and K08. Thus, the first element, the fifth element, and the ninth element in the first row may be determined to be first elements.
Further, the first column, the fifth column, and the ninth column may be moved to adjacent positions, and a parameter matrix corresponding to the configuration diagram shown in fig. 10 may be obtained.
Step three: determining, for each alternative column in the moved initial parameter matrix, the elements corresponding to uncompressed convolution kernels in that column as second elements.
Wherein the alternative columns include: the columns in the first row of the initial parameter matrix where elements corresponding to uncompressed convolution kernels are located.
Step four: for every group of the second compression step size adjacent second elements, moving the rows where those elements are located to adjacent positions, to obtain the target parameter matrix corresponding to the convolutional layer.
In the embodiment of the present application, in fig. 10, the uncompressed convolution kernels in the first row are K00, K04 and K08. Thus, it may be determined that the alternative columns include the first column, the fifth column, and the ninth column.
The uncompressed convolution kernels in the first column are K00, K40 and K80. Thus, the first element, the fifth element, and the ninth element in the first column may be determined to be second elements. Further, the first row, the fifth row, and the ninth row may be moved to adjacent positions, and a parameter matrix corresponding to the configuration diagram shown in fig. 11 can be obtained.
Similarly, the first elements in the fifth row may be determined and the corresponding columns moved; the second elements in the fifth column may be determined and the corresponding rows moved; the first elements in the ninth row may be determined and the corresponding columns moved; and the second elements in the ninth column may be determined and the corresponding rows moved, so as to finally obtain the arrangement of convolution kernels shown in fig. 12. The parameter matrix corresponding to the schematic diagram shown in fig. 12 is the target parameter matrix.
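The patent describes the rearrangement as a sequence of row and column moves; for the staggered pattern above, grouping columns by c mod P1 and rows by r mod P2 reaches the same end state, which the following hedged sketch illustrates (it is a shortcut, not the patent's step-by-step procedure):

```python
# Hedged shortcut: instead of the patent's stepwise row/column moves, group
# columns by c mod P1 and rows by r mod P2; for the staggered pattern above
# this reaches the same end state, gathering uncompressed kernels into blocks.
import numpy as np

P1 = P2 = 4
rows = cols = 12
# keep[r, c] is True where the kernel stayed uncompressed (pattern of fig. 9)
keep = np.fromfunction(lambda r, c: (c % P1) == (r % P2), (rows, cols))

col_order = np.argsort([c % P1 for c in range(cols)], kind='stable')
row_order = np.argsort([r % P2 for r in range(rows)], kind='stable')
rearranged = keep[np.ix_(row_order, col_order)]

print(rearranged.astype(int))  # uncompressed kernels now form contiguous 3 x 3 blocks
```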
In one embodiment, after the target parameter matrix is determined, the target parameter matrix can be converted from a floating point number format to a fixed point number format, so that the method is suitable for the computing environment of the FPGA and the data processing efficiency is improved.
For example, the format of the data may be converted based on formula (1).
W_int = (-1)^S × 2^(-fl) × round(2^fl × W_f)    (1)
Wherein W_f represents the floating point number to be converted, fl represents a preset number, round represents a rounding operation, S represents the sign bit, and W_int represents the converted fixed point number.
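A direct transcription of formula (1) in Python follows; treating fl as the fractional bit width and separating the sign from the magnitude before rounding are assumptions of this sketch:

```python
# A direct transcription of formula (1). Assumptions of this sketch: fl is the
# fractional bit width, and the sign bit S is taken from W_f so that round()
# is applied to the magnitude only.
def to_fixed_point(w_f: float, fl: int) -> float:
    s = 0 if w_f >= 0 else 1               # sign bit S
    magnitude = round(2 ** fl * abs(w_f))  # round(2^fl * |W_f|)
    return (-1) ** s * 2 ** (-fl) * magnitude

print(to_fixed_point(0.7321, 8))   # 0.73046875, i.e. 187 / 256
print(to_fixed_point(-0.7321, 8))  # -0.73046875
```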
Based on the same inventive concept, the embodiment of the present application further provides a convolutional neural network training method, which may include the following steps:
for each convolutional layer in the initial convolutional neural network, determining a compression step size for compressing the convolution kernels in the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer and the input parallelism and the output parallelism of a convolutional layer operation core of an FPGA; wherein the compression step size is used to indicate: when the convolutional layer is compressed, the proportion of convolution kernels in the convolutional layer that need to be compressed;
compressing part of the specified convolution kernels in the convolution layer into target convolution kernels according to the determined compression step length to obtain an alternative convolution neural network; the target convolution kernel is smaller than the specified partial convolution kernel;
training the alternative convolutional neural network to obtain a target convolutional neural network;
generating a target parameter matrix corresponding to each convolution layer in the target convolution neural network; the elements in the target parameter matrix corresponding to each convolution layer in the target convolution neural network correspond to convolution kernels in the convolution layers one to one; the number of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively.
According to the convolutional neural network training method, the specified partial convolutional kernels in the convolutional layers are compressed, the number of network parameters of the convolutional layers can be reduced, the number of network parameters of the target convolutional neural network can be reduced, and the calculated amount can be reduced. Furthermore, the FPGA performs convolution processing on the picture to be processed based on the compressed convolution layer, and the efficiency of performing data processing based on the convolution neural network can be improved.
In one embodiment, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns of the target parameter matrix is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a group of compressed convolution kernels or to a group of uncompressed convolution kernels; the convolution kernels of at least two of the plurality of element groups are the same in size, and the convolution kernels of at least two of the element groups are different in size.
Based on the above processing, the convolution kernels corresponding to the M × N elements included in each element group in the target parameter matrix have the same size, that is, the data of the convolution kernels corresponding to the M × N elements included in the element group can be written into the M × N calculation units of the convolution layer operation kernel. Correspondingly, each computing unit can perform convolution processing on the basis of the written data of the convolution kernel, and further can realize the convolution processing in a parallel mode in one clock cycle of the FPGA so as to improve the efficiency of data processing.
Based on the same inventive concept, the embodiment of the present application further provides a data processing method, which can be applied to an FPGA, and referring to fig. 13, the method can include the following steps:
S1301: acquiring the picture to be processed from a preset storage space.
S1302: and processing the picture to be processed by the convolutional layer operation kernel based on the target parameter matrix of each convolutional layer in the target convolutional neural network to obtain a final characteristic diagram.
The elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond to the convolution kernels in that convolutional layer one to one. The convolution kernels in a convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel. The number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively. The target convolutional neural network is obtained by: for each convolutional layer in the initial convolutional neural network, compressing part of the convolution kernels specified in the convolutional layer to obtain compressed convolution kernels, and then performing model training. The specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and the input parallelism and the output parallelism of the convolutional layer operation core of the FPGA.
S1303: and storing the final characteristic diagram into a preset storage space.
According to the data processing method provided by the embodiment of the application, the specified partial convolution kernels in the convolution layer are compressed, so that the number of network parameters of the convolution layer can be reduced, the number of network parameters of a target convolution neural network can be reduced, and the calculation amount can be reduced. Furthermore, the FPGA performs convolution processing on the picture to be processed based on the compressed convolution layer, and the efficiency of performing data processing based on the convolution neural network can be improved.
In one embodiment, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns of the target parameter matrix is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; the convolution kernels of at least two of the plurality of element groups are the same in size, and the convolution kernels of at least two of the element groups are different in size.
Accordingly, S1302 may include:
and carrying out convolution processing on the picture to be processed in a parallel mode through the convolution layer operation core based on the target parameter matrix of each convolution layer in the target convolution neural network to obtain a final characteristic diagram.
In the embodiment of the application, the FPGA may obtain the picture to be processed from the preset storage space and, through the convolutional layer operation core, perform convolution processing on it based on the target parameter matrix corresponding to the first convolutional layer according to a preset clock cycle. In addition, for each output channel of the first convolutional layer, the convolution calculation results of each clock cycle of that output channel may be accumulated, and the sum of the accumulated result and the offset value of the output channel may be calculated to obtain the feature map corresponding to the output channel.
Then, the FPGA may perform convolution processing on the feature map obtained by the first convolution layer based on the target parameter matrix corresponding to the second convolution layer according to a preset clock cycle through the convolution layer operation core. In addition, for each output channel of the second convolutional layer, the convolution calculation results of each clock period of the output channel may be accumulated, and the sum of the accumulation result and the offset value of the output channel may be calculated to obtain the feature map corresponding to the output channel.
And the like until a final feature map determined based on the last convolutional layer is obtained.
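The per-output-channel bookkeeping described above can be summarized in a few lines (names are illustrative; numpy arrays stand in for the buffered per-cycle results):

```python
# Minimal sketch of the per-output-channel bookkeeping described above (names
# are illustrative; numpy arrays stand in for the buffered per-cycle results).
import numpy as np

def finish_output_channel(partials, bias):
    """Sum per-clock-cycle convolution results, then add the channel's offset."""
    acc = np.zeros_like(partials[0])
    for p in partials:        # one entry per clock cycle's convolution result
        acc += p
    return acc + bias         # the output channel's feature map

cycle_results = [np.ones((4, 4)) for _ in range(3)]
print(finish_output_channel(cycle_results, bias=0.5)[0, 0])  # 3.5
```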
Based on the above processing, the convolution kernels corresponding to the M × N elements included in each element group in the target parameter matrix have the same size, that is, the data of the convolution kernels corresponding to the M × N elements included in the element group can be written into the M × N calculation units of the convolution layer operation kernel. Correspondingly, each computing unit can perform convolution processing on the basis of the written data of the convolution kernel, and further can realize the convolution processing in a parallel mode in one clock cycle of the FPGA so as to improve the efficiency of data processing.
Based on the same inventive concept, an embodiment of the present application further provides an FPGA, which includes a DDR controller, an AXI bus, a state controller, a feature map cache region, a parameter cache region, an output cache region, an adder, a read/write controller, and at least one convolutional layer arithmetic core with an input parallelism M and an output parallelism N, wherein:
and the DDR controller is used for acquiring the picture to be processed from the preset storage space and storing the picture to be processed to the feature map cache region.
And the state controller is used for controlling the read-write controller, reading the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network from the parameter cache region, reading the alternative graph from the feature map cache region, and sending the alternative graph to the convolutional layer operation core.
The alternative graph is the picture to be processed or a feature map stored in the feature map cache region. Elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond to the convolution kernels in that convolutional layer one to one. The convolution kernels in a convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel. The number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively. The target convolutional neural network is obtained by: for each convolutional layer in the initial convolutional neural network, compressing part of the convolution kernels specified in the convolutional layer to obtain compressed convolution kernels, and then performing model training. The specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and the input parallelism and the output parallelism of the convolutional layer operation core of the FPGA.
And the convolutional layer operation core is used for performing convolution processing on the received alternative graph based on the target parameter matrix corresponding to each convolutional layer according to a preset clock period.
The state controller is also used for controlling the read-write controller, and respectively storing convolution calculation results of each convolution layer of the convolution layer operation core in each clock period to an output cache region aiming at each convolution layer of the target convolution neural network; and for each output channel of the convolutional layer, reading the offset value of the output channel from the parameter cache region, controlling an adder to accumulate convolution calculation results of each clock period of the output channel stored in the output cache region, calculating the sum of the accumulated result and the offset value of the output channel, obtaining a characteristic diagram corresponding to the output channel, and storing the characteristic diagram to the output cache region.
The DDR controller is also used for respectively storing the characteristic diagrams obtained on the basis of each target convolution layer in the output cache region into a characteristic diagram cache region through an AXI bus; and acquiring a final characteristic diagram obtained based on the last convolutional layer in the target convolutional neural network from the output buffer area, and storing the final characteristic diagram in a preset storage space.
Wherein the target convolutional layer comprises: the convolutional layers in the target convolutional neural network except the last convolutional layer.
Based on the above processing, compressing the partial convolution kernels specified in the convolution layer can reduce the number of network parameters of the convolution layer, thereby reducing the number of network parameters of the target convolutional neural network and reducing the amount of calculation. Furthermore, the FPGA performs convolution processing on the picture to be processed based on the compressed convolution layer, and the efficiency of performing data processing based on the convolution neural network can be improved.
In one embodiment, the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns of the target parameter matrix is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a group of compressed convolution kernels or to a group of uncompressed convolution kernels; the convolution kernels of at least two of the plurality of element groups are the same in size, and the convolution kernels of at least two of the element groups are different in size.
Correspondingly, the convolutional layer operation core can perform convolutional processing on the received alternative graph in a parallel mode based on the target parameter matrix corresponding to each convolutional layer according to a preset clock period.
Based on the above processing, the convolution kernels corresponding to the M × N elements included in each element group in the target parameter matrix have the same size, that is, the data of the convolution kernels corresponding to the M × N elements included in the element group can be written into the M × N calculation units of the convolution layer operation kernel. Correspondingly, each computing unit can perform convolution processing on the basis of the written data of the convolution kernel, and then the convolution processing in a parallel mode can be realized in one clock cycle of the FPGA so as to improve the data processing efficiency.
In an embodiment, referring to fig. 14, fig. 14 is a schematic structural diagram of an FPGA provided in the embodiment of the present application.
The target parameter matrix corresponding to each convolutional layer in the target convolutional neural network and the offset values of the output channels of each convolutional layer may be written into the parameter buffer 1404 in advance. The written data may be in fixed point number format, and the format of the pictures to be processed stored in the preset storage space may also be fixed point number format. The convolutional layer operation core 1408 in fig. 14 may include: convolutional layer operation core 1, convolutional layer operation core 2, convolutional layer operation core 3, convolutional layer operation core 4, convolutional layer operation core 5, and convolutional layer operation core 6, but is not limited thereto.
The DDR controller 1401 may obtain the to-be-processed picture from the preset storage space and store the to-be-processed picture in the feature map cache region 1405.
When processing an image to be processed based on the first convolution layer, the state controller 1409 may control the read/write controller 1407 to read the target parameter matrix corresponding to the first convolution layer from the parameter buffer 1404, read the picture to be processed from the feature map buffer 1405, and send the picture to be processed to the convolution layer arithmetic core 1408.
The convolutional layer arithmetic core 1408 may perform convolution processing on the picture to be processed based on the target parameter matrix corresponding to the first convolutional layer according to a preset clock cycle.
The state controller 1409 can control the read/write controller 1407 to store the convolution calculation results of the convolutional layer operation core 1408 for each clock cycle into the output buffer 1403. In addition, for each output channel of the first convolutional layer, the offset value of the output channel is read from the parameter buffer 1404, and the adder 1406 is controlled to accumulate the convolution calculation results of that output channel stored in the output buffer 1403 for each clock cycle, calculate the sum of the accumulated result and the offset value of the output channel, obtain the feature map corresponding to the output channel, and store the feature map into the output buffer 1403.
The DDR controller 1401 may store the signature map obtained by the first convolution layer in the signature map buffer 1405 via the AXI bus 1402.
Then, the state controller 1409 may control the read/write controller 1407 to read the target parameter matrix corresponding to the second convolution layer from the parameter buffer 1404, read the signature obtained by the first convolution layer from the signature buffer 1405, and send the signature to the convolution layer arithmetic core 1408.
The convolutional layer arithmetic core 1408 may perform convolution processing on the feature map obtained by the first convolutional layer based on the target parameter matrix corresponding to the second convolutional layer according to a preset clock cycle.
The state controller 1409 can control the read/write controller 1407 to store the convolution calculation results of the convolutional layer operation core 1408 for each clock cycle into the output buffer 1403. In addition, for each output channel of the second convolutional layer, the offset value of the output channel is read from the parameter buffer 1404, and the adder 1406 is controlled to accumulate the convolution calculation results of that output channel stored in the output buffer 1403 for each clock cycle, calculate the sum of the accumulated result and the offset value of the output channel, obtain the feature map corresponding to the output channel, and store the feature map into the output buffer 1403.
The DDR controller 1401 may store the signature obtained by the second convolution layer in the signature buffer 1405 via the AXI bus 1402.
And so on until a final characteristic map determined based on the last convolutional layer is obtained. The DDR controller 1401 may retrieve the final feature map from the output buffer and store the feature map in the preset storage space.
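As a software-level summary of this dataflow, the control loop might read as follows; every object and method name here (fpga, param_cache, conv_cores, and so on) is invented for illustration, the actual design being hardware on the FPGA rather than a Python API:

```python
# Software-level summary of the fig. 14 dataflow. Every object and method name
# here (fpga, param_cache, conv_cores, ...) is invented for illustration; the
# actual design is hardware on the FPGA, not a Python API.
def run_network(fpga, num_layers):
    feature = fpga.feature_cache.read()  # starts as the picture to be processed
    for layer in range(num_layers):
        params = fpga.param_cache.read_matrix(layer)
        partials = fpga.conv_cores.convolve(params, feature)  # per-cycle results
        feature = fpga.adder.accumulate_and_bias(
            partials, fpga.param_cache.read_biases(layer))
        if layer < num_layers - 1:
            fpga.feature_cache.write(feature)  # feed the next convolutional layer
    fpga.ddr.store(feature)                    # final feature map to storage
    return feature
```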
Referring to fig. 15, fig. 15 is a schematic diagram illustrating a principle of data processing by running a convolutional neural network based on an FPGA according to an embodiment of the present application.
The parameter buffer area can store a target parameter matrix of each convolution layer of the target convolutional neural network and bias values of output channels of each convolution layer.
And the feature map cache region is used for storing the picture to be processed and the feature maps obtained based on each target convolutional layer, i.e., the convolutional layers in the target convolutional neural network other than the last convolutional layer.
The heterogeneous convolution controller is equivalent to the state controller in the above embodiment, and under the control of the heterogeneous convolution controller, the convolution layer operation core of the FPGA may obtain the picture to be processed and the target parameter matrix of each convolution layer, and perform convolution processing based on the feature map obtained by each target convolution layer.
In addition, for each convolutional layer of the target convolutional neural network, the convolution calculation results of each convolutional layer operation core in each clock cycle can be stored into the output buffer. For each output channel of the convolutional layer, the offset value of the output channel is read from the parameter buffer, the convolution calculation results of that output channel's clock cycles stored in the output buffer are accumulated, and the sum of the accumulated result and the offset value of the output channel is calculated to obtain the feature map corresponding to the output channel, which is then stored into the output buffer.
In fig. 15, the FPGA includes n convolutional layer arithmetic cores, that is, in one clock cycle, the feature map can be convolved based on the n convolutional layer arithmetic cores at the same time.
Based on the same inventive concept, an embodiment of the present application further provides a data processing apparatus, and referring to fig. 16, fig. 16 is a structural diagram of the data processing apparatus provided in the embodiment of the present application, and the apparatus may include:
a to-be-processed picture acquiring module 1601 configured to acquire a to-be-processed picture;
a storage module 1602, configured to store the to-be-processed picture in a preset storage space, so that a field programmable gate array FPGA acquires the to-be-processed picture from the preset storage space, and process the to-be-processed picture through a convolutional layer operation kernel based on a target parameter matrix corresponding to each convolutional layer in a target convolutional neural network to obtain a final feature map, and store the final feature map in the preset storage space;
the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond to the convolution kernels in that convolutional layer one to one; the convolution kernels in a convolutional layer comprise compressed convolution kernels and uncompressed convolution kernels, and the size of a compressed convolution kernel is smaller than that of an uncompressed convolution kernel; the number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is obtained by: for each convolutional layer in the initial convolutional neural network, compressing part of the convolution kernels specified in the convolutional layer to obtain compressed convolution kernels, and then performing model training; the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and the input parallelism and the output parallelism of the convolutional layer operation core of the FPGA;
a feature map obtaining module 1603, configured to obtain the final feature map from the preset storage space.
Optionally, elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns of the target parameter matrix is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; the sizes of convolution kernels of at least two element groups in the plurality of element groups are the same, and the sizes of the convolution kernels of at least two element groups are different; and M is the input parallelism of the convolution layer operation core, and N is the output parallelism of the convolution layer operation core, so that the FPGA carries out convolution processing in a parallel mode through the convolution layer operation core based on the target parameter matrix corresponding to the convolution layer.
Optionally, the apparatus further comprises:
a compression step determining module, configured to determine, for each convolutional layer in the initial convolutional neural network, a compression step for compressing the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer, and the input parallelism and the output parallelism of the convolutional layer operation core; wherein the compression step size is used to indicate: when the convolution layer is compressed, the proportion of convolution kernels needing to be compressed in the convolution layer is occupied;
the convolution kernel compression module is used for compressing part of convolution kernels appointed in the convolution layer into target convolution kernels according to the determined compression step length to obtain an alternative convolution neural network;
the training module is used for training the alternative convolutional neural network to obtain a target convolutional neural network;
and the target parameter matrix generating module is used for generating a target parameter matrix corresponding to each convolution layer in the target convolutional neural network.
Optionally, the compression step determining module includes:
the first calculation submodule is used for calculating, for each convolutional layer in the initial convolutional neural network, the ratio of the number of input channels of the convolutional layer to the input parallelism of the convolutional layer operation core as a first ratio, and the ratio of the number of output channels of the convolutional layer to the output parallelism of the convolutional layer operation core as a second ratio;
a compression step determining submodule, configured to determine a first compression step size and a second compression step size when the convolutional layer is compressed, based on the first ratio and the second ratio; wherein the first compression step size is used to indicate: when the convolutional layer is compressed, the proportion of convolution kernels that need to be compressed among the convolution kernels corresponding to each output channel; the second compression step size is used to indicate: when the convolutional layer is compressed, the proportion of convolution kernels that need to be compressed among the convolution kernels corresponding to each input channel.
Optionally, the compression step size determining submodule is specifically configured to determine, if the input parallelism of the convolutional layer arithmetic core is equal to the output parallelism, a first compression step size and a second compression step size when the convolutional layer is compressed based on a smaller value of the first ratio and the second ratio;
and if the input parallelism of the convolution layer operation core is not equal to the output parallelism, determining a first compression step length when the convolution layer is compressed based on the first ratio, and determining a second compression step length when the convolution layer is compressed based on the second ratio.
Optionally, the convolution kernel compression module is specifically configured to: evenly divide the convolution kernels corresponding to each output channel in the convolutional layer into a plurality of first convolution kernel groups according to the order of the corresponding input channels, wherein the number of convolution kernels contained in each first convolution kernel group is the first compression step size; for each first convolution kernel group, determine the convolution kernels other than the convolution kernel at the designated position in the first convolution kernel group as first convolution kernels to be processed; and compress the first convolution kernels to be processed in each first convolution kernel group into target convolution kernels to obtain the alternative convolutional neural network; or,
averagely dividing each convolution kernel corresponding to each input channel in the convolution layer into a plurality of second convolution kernel groups according to the sequence of the corresponding output channels; wherein each second convolution kernel group comprises the number of convolution kernels of the second compression step size; for each second convolution kernel group, determining other convolution kernels except the convolution kernel at the specified position in the second convolution kernel group as second convolution kernels to be processed; and compressing the second convolution kernels to be processed in each second convolution kernel group into a target convolution kernel to obtain the alternative convolution neural network.
Optionally, the target parameter matrix generating module includes:
the initial parameter matrix acquisition submodule is used for acquiring an initial parameter matrix corresponding to each convolutional layer in the target convolutional neural network; wherein, the elements in the initial parameter matrix correspond to convolution kernels in the convolution layer one by one; the elements in the initial parameter matrix are arranged according to the sequence of the input channel and the output channel to which the corresponding convolution kernel belongs;
and the target parameter matrix generation submodule is used for adjusting the positions of rows and columns in the initial parameter matrix based on the first compression step length and the second compression step length to obtain a target parameter matrix corresponding to the convolutional layer.
Optionally, the target parameter matrix generating sub-module includes:
a first element determining unit, configured to determine, for each candidate row in the initial parameter matrix, an element corresponding to an uncompressed convolution kernel in the candidate row, as a first element; wherein the alternative row comprises: a row of the first column of the initial parameter matrix in which an element corresponding to an uncompressed convolution kernel is located;
the first moving unit is used for moving, for every group of the first compression step size adjacent first elements, the columns where those elements are located to adjacent positions;
a second element determining unit, configured to determine, for each alternative column in the moved initial parameter matrix, the elements corresponding to uncompressed convolution kernels in that column as second elements; wherein the alternative columns include: the columns in the first row of the initial parameter matrix where elements corresponding to uncompressed convolution kernels are located;
and the second moving unit is used for moving, for every group of the second compression step size adjacent second elements, the rows where those elements are located to adjacent positions, to obtain the target parameter matrix corresponding to the convolutional layer.
Based on the same inventive concept, an embodiment of the present application further provides a convolutional neural network training device, referring to fig. 17, where fig. 17 is a structural diagram of the convolutional neural network training device provided in the embodiment of the present application, and the device may include:
a compression step size determination module 1701, configured to determine, for each convolutional layer in the initial convolutional neural network, a compression step size for compressing a convolutional core in the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer, and the input parallelism and the output parallelism of a convolutional layer operation core of the field programmable gate array FPGA; wherein the compression step size is used to indicate: when the convolution layer is compressed, the proportion of convolution kernels needing to be compressed in the convolution layer is occupied;
a convolution kernel compression module 1702, configured to compress, according to the determined compression step, a part of convolution kernels specified in the convolution layer into a target convolution kernel, so as to obtain an alternative convolution neural network;
a training module 1703, configured to train the alternative convolutional neural network to obtain a target convolutional neural network;
a target parameter matrix generation module 1704, configured to generate a target parameter matrix corresponding to each convolutional layer in the target convolutional neural network; the elements in the target parameter matrix corresponding to each convolution layer in the target convolution neural network correspond to convolution kernels in the convolution layers one to one; the number of rows and columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively.
Optionally, elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns of the target parameter matrix is the number of input channels of the convolutional layer; each element group contains M × N elements; each element group corresponds to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; the convolution kernels of at least two element groups in the plurality of element groups have the same size, and the convolution kernels of at least two element groups have different sizes; m is the input parallelism of the convolution layer operation core, and N is the output parallelism of the convolution layer operation core.
An embodiment of the present application further provides an electronic device. As shown in fig. 18, it includes a processor 1801, a communication interface 1802, a memory 1803, an FPGA 1805 and a communication bus 1804, where the processor 1801, the communication interface 1802, the memory 1803 and the FPGA 1805 communicate with one another via the communication bus 1804;
a memory 1803 for storing a computer program;
the processor 1801 is configured to implement the following steps when executing the program stored in the memory 1803:
acquiring a picture to be processed;
storing the picture to be processed into a preset storage space, so that the FPGA 1805 acquires the picture to be processed from the preset storage space, processes it through a convolutional layer operation core based on a target parameter matrix corresponding to each convolutional layer in a target convolutional neural network to obtain a final feature map, and stores the final feature map into the preset storage space;
wherein the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one-to-one to the convolution kernels in that convolutional layer; the convolution kernels in a convolutional layer include compressed convolution kernels and uncompressed convolution kernels, and a compressed convolution kernel is smaller than an uncompressed convolution kernel; the number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is obtained by compressing, for each convolutional layer in an initial convolutional neural network, the specified partial convolution kernels in the convolutional layer, and then performing model training on the compressed network; the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and output parallelism of the convolutional layer operation core of the FPGA 1805;
and acquiring the final feature map from the preset storage space; this host-side flow is sketched below.
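Sketched as host-side pseudocode, the three steps form a simple producer/consumer handshake through the preset storage space; `storage`, `fpga` and the key names are hypothetical stand-ins, not an API defined by the application:

def run_inference(picture, storage, fpga):
    storage.write("picture_to_be_processed", picture)  # step 1: store the picture
    fpga.start()        # the FPGA reads the picture, runs every convolutional
    fpga.wait_done()    # layer through its operation core, and writes back
    return storage.read("final_feature_map")           # step 3: fetch the result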
The FPGA 1805 is configured to perform the following steps:
acquiring a picture to be processed from a preset storage space;
processing the picture to be processed through the convolutional layer operation core, based on the target parameter matrix of each convolutional layer in the target convolutional neural network, to obtain a final feature map;
and storing the final feature map into the preset storage space.
The embodiment of the present application further provides an electronic device. As shown in fig. 19, it includes a processor 1901, a communication interface 1902, a memory 1903 and a communication bus 1904, where the processor 1901, the communication interface 1902 and the memory 1903 communicate with one another via the communication bus 1904;
a memory 1903 for storing computer programs;
the processor 1901 is configured to implement the following steps when executing the program stored in the memory 1903:
for each convolutional layer in an initial convolutional neural network, determining a compression step size for compressing the convolution kernels in the convolutional layer, based on the number of input channels and the number of output channels of the convolutional layer and on the input parallelism and output parallelism of a convolutional layer operation core of a field programmable gate array (FPGA); wherein the compression step size indicates the proportion of convolution kernels in the convolutional layer that need to be compressed when the convolutional layer is compressed (see the sketch after these steps);
compressing the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step size, to obtain an alternative convolutional neural network;
training the alternative convolutional neural network to obtain a target convolutional neural network;
generating a target parameter matrix corresponding to each convolutional layer in the target convolutional neural network; the elements in the target parameter matrix corresponding to each convolutional layer correspond one-to-one to the convolution kernels in that convolutional layer; the number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively.
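One plausible concrete form of the step-size determination (the ratio rule also appears in claims 4 and 5 below); that a step simply equals its channel-count/parallelism ratio is an assumption of this sketch, not something the application states:

def compression_steps(in_channels, out_channels, M, N):
    # M, N: input and output parallelism of the convolutional layer
    # operation core of the FPGA.
    first_ratio = in_channels // M      # ratio along the input channels
    second_ratio = out_channels // N    # ratio along the output channels
    if M == N:                          # equal parallelism: both steps
        s = min(first_ratio, second_ratio)  # follow the smaller ratio
        return s, s
    return first_ratio, second_ratio    # otherwise one step per ratio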
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this is not intended to represent only one bus or type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a random access memory (RAM) or a non-volatile memory, for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Embodiments of the present application further provide a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the data processing method or the convolutional neural network training method provided in the embodiments of the present application.
It should be noted that other implementation manners of the data processing method and the convolutional neural network training method are the same as those of the foregoing method embodiments, and are not described herein again.
Embodiments of the present application further provide another computer program product containing instructions, which when executed on a computer, cause the computer to perform the data processing method or the convolutional neural network training method provided in the embodiments of the present application.
It should be noted that other implementation manners of the above data processing method and the convolutional neural network training method are partially the same as those of the foregoing method embodiments, and are not described herein again.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, from one website, computer, server, or data center to another via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
It is noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, the electronic device, the computer-readable storage medium, and the computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (18)

1. A method of data processing, the method comprising:
acquiring a picture to be processed;
storing the picture to be processed into a preset storage space, so that a field programmable gate array (FPGA) acquires the picture to be processed from the preset storage space, processes it through a convolutional layer operation core based on a target parameter matrix corresponding to each convolutional layer in a target convolutional neural network to obtain a final feature map, and stores the final feature map into the preset storage space;
wherein the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one-to-one to the convolution kernels in that convolutional layer; the convolution kernels in a convolutional layer include compressed convolution kernels and uncompressed convolution kernels, and a compressed convolution kernel is smaller than an uncompressed convolution kernel; the number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is obtained by compressing, for each convolutional layer in an initial convolutional neural network, the specified partial convolution kernels in the convolutional layer, and then performing model training on the compressed network; the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and output parallelism of the convolutional layer operation core of the FPGA;
and acquiring the final feature map from the preset storage space.
2. The method of claim 1, wherein the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels; each element group contains M × N elements; each element group corresponds either to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; among the plurality of element groups, the convolution kernels of at least two element groups have the same size, and the convolution kernels of at least two element groups have different sizes; M is the input parallelism of the convolutional layer operation core and N is the output parallelism, so that the FPGA performs convolution processing in parallel through the convolutional layer operation core based on the target parameter matrix corresponding to the convolutional layer.
3. The method of claim 2, wherein the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network is obtained by:
for each convolutional layer in an initial convolutional neural network, determining a compression step size for compressing the convolutional layer, based on the number of input channels and the number of output channels of the convolutional layer and on the input parallelism and output parallelism of the convolutional layer operation core; wherein the compression step size indicates the proportion of convolution kernels in the convolutional layer that need to be compressed when the convolutional layer is compressed;
compressing the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step size, to obtain an alternative convolutional neural network;
training the alternative convolutional neural network to obtain a target convolutional neural network;
and generating a target parameter matrix corresponding to each convolution layer in the target convolution neural network.
4. The method of claim 3, wherein determining, for each convolutional layer in the initial convolutional neural network, a compression step size for compressing the convolutional layer based on the number of input channels and the number of output channels of the convolutional layer and the input parallelism and output parallelism of the convolutional layer operation core comprises:
for each convolutional layer in the initial convolutional neural network, calculating the ratio of the number of input channels of the convolutional layer to the input parallelism of the convolutional layer operation core as a first ratio, and calculating the ratio of the number of output channels of the convolutional layer to the output parallelism of the convolutional layer operation core as a second ratio;
and determining, based on the first ratio and the second ratio, a first compression step size and a second compression step size for compressing the convolutional layer; wherein the first compression step size indicates the proportion of convolution kernels that need to be compressed among the convolution kernels corresponding to each output channel when the convolutional layer is compressed, and the second compression step size indicates the proportion of convolution kernels that need to be compressed among the convolution kernels corresponding to each input channel when the convolutional layer is compressed.
5. The method of claim 4, wherein determining the first compression step size and the second compression step size for compressing the convolutional layer based on the first ratio and the second ratio comprises:
if the input parallelism of the convolutional layer operation core is equal to the output parallelism, determining the first compression step size and the second compression step size for compressing the convolutional layer based on the smaller of the first ratio and the second ratio;
and if the input parallelism of the convolutional layer operation core is not equal to the output parallelism, determining the first compression step size for compressing the convolutional layer based on the first ratio, and determining the second compression step size based on the second ratio.
6. The method of claim 4, wherein compressing the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step size, to obtain the alternative convolutional neural network, comprises:
evenly dividing the convolution kernels corresponding to each output channel in the convolutional layer into a plurality of first convolution kernel groups in the order of their corresponding input channels, wherein each first convolution kernel group contains a number of convolution kernels equal to the first compression step size; for each first convolution kernel group, determining the convolution kernels other than the convolution kernel at the specified position in the group as first convolution kernels to be processed; and compressing the first convolution kernels to be processed in each first convolution kernel group into target convolution kernels to obtain the alternative convolutional neural network; or,
evenly dividing the convolution kernels corresponding to each input channel in the convolutional layer into a plurality of second convolution kernel groups in the order of their corresponding output channels, wherein each second convolution kernel group contains a number of convolution kernels equal to the second compression step size; for each second convolution kernel group, determining the convolution kernels other than the convolution kernel at the specified position in the group as second convolution kernels to be processed; and compressing the second convolution kernels to be processed in each second convolution kernel group into target convolution kernels to obtain the alternative convolutional neural network.
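A toy illustration of the first branch of claim 6 above (the input-channel direction; the output-channel branch is symmetric). Two points are assumptions of this sketch, not fixed by the claim: the kept kernel sits at position 0 of each group, and "compressing into a target convolution kernel" is realized as reducing a k × k kernel to its centre tap.

import numpy as np

def compress_input_groups(weights, first_step, keep_pos=0):
    # weights: (out_ch, in_ch, k, k) array of convolution kernels.
    # Within every group of `first_step` consecutive input channels, keep one
    # full kernel and shrink the others to a 1x1 kernel (stored zero-padded);
    # a trailing group shorter than `first_step` is left untouched.
    out_ch, in_ch, k, _ = weights.shape
    c = k // 2
    for g in range(0, in_ch - in_ch % first_step, first_step):
        for i in range(g, g + first_step):
            if i - g != keep_pos:
                centre = weights[:, i, c, c].copy()
                weights[:, i] = 0.0
                weights[:, i, c, c] = centre
    return weights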
7. The method of claim 6, wherein generating the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network comprises:
for each convolutional layer in the target convolutional neural network, acquiring an initial parameter matrix corresponding to the convolutional layer, wherein the elements in the initial parameter matrix correspond one-to-one to the convolution kernels in the convolutional layer and are arranged in the order of the input channels and output channels to which the corresponding convolution kernels belong;
and adjusting the positions of rows and columns in the initial parameter matrix based on the first compression step size and the second compression step size, to obtain the target parameter matrix corresponding to the convolutional layer.
8. The method of claim 7, wherein adjusting the positions of rows and columns in the initial parameter matrix based on the first compression step size and the second compression step size to obtain the target parameter matrix corresponding to the convolutional layer comprises:
determining, in each candidate row of the initial parameter matrix, the elements corresponding to uncompressed convolution kernels as first elements; wherein the candidate rows are the rows in which the elements of the first column of the initial parameter matrix that correspond to uncompressed convolution kernels are located;
moving the columns of every group of adjacent first elements of the first compression step size to adjacent positions;
determining, in each candidate column of the moved initial parameter matrix, the elements corresponding to uncompressed convolution kernels as second elements; wherein the candidate columns are the columns in which the elements of the first row of the moved initial parameter matrix that correspond to uncompressed convolution kernels are located;
and moving the rows of every group of adjacent second elements of the second compression step size to adjacent positions, to obtain the target parameter matrix corresponding to the convolutional layer.
9. A data processing method, applied to a field programmable gate array (FPGA), the method comprising:
acquiring a picture to be processed from a preset storage space;
processing the picture to be processed through a convolutional layer operation core, based on the target parameter matrix of each convolutional layer in a target convolutional neural network, to obtain a final feature map;
wherein the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one-to-one to the convolution kernels in that convolutional layer; the convolution kernels in a convolutional layer include compressed convolution kernels and uncompressed convolution kernels, and a compressed convolution kernel is smaller than an uncompressed convolution kernel; the number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is obtained by compressing, for each convolutional layer in an initial convolutional neural network, the specified partial convolution kernels in the convolutional layer, and then performing model training on the compressed network; the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and output parallelism of the convolutional layer operation core of the FPGA;
and storing the final feature map into the preset storage space.
10. The method of claim 9, wherein the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels; each element group contains M × N elements; each element group corresponds either to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; among the plurality of element groups, the convolution kernels of at least two element groups have the same size, and the convolution kernels of at least two element groups have different sizes; M is the input parallelism of the convolutional layer operation core, and N is the output parallelism;
and wherein processing the picture to be processed through the convolutional layer operation core based on the target parameter matrix of each convolutional layer in the target convolutional neural network to obtain the final feature map comprises:
performing convolution processing on the picture to be processed in parallel through the convolutional layer operation core, based on the target parameter matrix of each convolutional layer in the target convolutional neural network, to obtain the final feature map.
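The parallel mode of claims 2 and 10 above can be modelled in scalar form as one pass of an M-input/N-output operation core over a single element group. A minimal sketch, assuming stride-1 "valid" convolution and the (M, H, W) / (N, M, k, k) array layouts shown, neither of which is fixed by the application:

import numpy as np

def core_pass(in_maps, tile):
    # in_maps: (M, H, W) input feature maps; tile: (N, M, k, k) kernels of one
    # element group. Returns N partial output maps, each summed over M inputs.
    N, M, k, _ = tile.shape
    H, W = in_maps.shape[1] - k + 1, in_maps.shape[2] - k + 1
    out = np.zeros((N, H, W))
    for n in range(N):                 # N output channels in parallel
        for m in range(M):             # M input channels in parallel
            for y in range(H):
                for x in range(W):
                    out[n, y, x] += np.sum(in_maps[m, y:y+k, x:x+k] * tile[n, m])
    return out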
11. A convolutional neural network training method, the method comprising:
for each convolutional layer in an initial convolutional neural network, determining a compression step size for compressing the convolution kernels in the convolutional layer, based on the number of input channels and the number of output channels of the convolutional layer and on the input parallelism and output parallelism of a convolutional layer operation core of a field programmable gate array (FPGA); wherein the compression step size indicates the proportion of convolution kernels in the convolutional layer that need to be compressed when the convolutional layer is compressed;
compressing the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step size, to obtain an alternative convolutional neural network;
training the alternative convolutional neural network to obtain a target convolutional neural network;
generating a target parameter matrix corresponding to each convolutional layer in the target convolutional neural network; wherein the elements in the target parameter matrix corresponding to each convolutional layer correspond one-to-one to the convolution kernels in that convolutional layer; the number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively.
12. A field programmable gate array (FPGA), comprising a double data rate synchronous dynamic random access memory (DDR) controller, an advanced extensible interface (AXI) bus, a state controller, a feature map buffer, a parameter buffer, an output buffer, an adder, a read-write controller, and at least one convolutional layer operation core with input parallelism M and output parallelism N, wherein:
the DDR controller is configured to acquire a picture to be processed from a preset storage space and store the picture to be processed into the feature map buffer;
the state controller is configured to control the read-write controller to read, from the parameter buffer, a target parameter matrix corresponding to each convolutional layer in a target convolutional neural network, to read a candidate map from the feature map buffer, and to send the candidate map to the convolutional layer operation core; the candidate map is the picture to be processed or a feature map stored in the feature map buffer; the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one-to-one to the convolution kernels in that convolutional layer; the convolution kernels in a convolutional layer include compressed convolution kernels and uncompressed convolution kernels, and a compressed convolution kernel is smaller than an uncompressed convolution kernel; the number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is obtained by compressing, for each convolutional layer in an initial convolutional neural network, the specified partial convolution kernels in the convolutional layer, and then performing model training on the compressed network; the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and output parallelism of the convolutional layer operation core of the FPGA;
the convolutional layer operation core is configured to perform convolution processing on the received candidate map based on the target parameter matrix corresponding to each convolutional layer, according to a preset clock cycle;
the state controller is further configured to control the read-write controller to store, for each convolutional layer of the target convolutional neural network, the convolution calculation result of each convolutional layer operation core in each clock cycle into the output buffer; and, for each output channel of the convolutional layer, to read the offset value of the output channel from the parameter buffer, control the adder to accumulate the convolution calculation results of the clock cycles of the output channel stored in the output buffer, and calculate the sum of the accumulation result and the offset value of the output channel, to obtain a feature map corresponding to the output channel and store it into the output buffer;
and the DDR controller is further configured to store, via the AXI bus, the feature map obtained based on each target convolutional layer in the output buffer into the feature map buffer, and to acquire, from the output buffer, the final feature map obtained based on the last convolutional layer in the target convolutional neural network and store the final feature map into the preset storage space; wherein the target convolutional layers comprise the convolutional layers in the target convolutional neural network other than the last convolutional layer.
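The accumulate-then-bias step performed by the adder in claim 12 above, sketched per output channel (each entry of `partial_results` stands for the convolution calculation result of one clock cycle, and the offset value is the channel's bias read from the parameter buffer; array shapes are illustrative assumptions):

import numpy as np

def accumulate_channel(partial_results, offset_value):
    # Fold the per-clock-cycle partial sums together, then add the bias to
    # obtain the feature map of this output channel.
    acc = np.zeros_like(partial_results[0])
    for p in partial_results:
        acc += p
    return acc + offset_value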
13. The FPGA of claim 12, wherein the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network are divided into a plurality of element groups; the number of rows of the target parameter matrix is the number of output channels of the convolutional layer, and the number of columns is the number of input channels; each element group contains M × N elements; each element group corresponds either to a set of compressed convolution kernels or to a set of uncompressed convolution kernels; among the plurality of element groups, the convolution kernels of at least two element groups have the same size, and the convolution kernels of at least two element groups have different sizes;
and the convolutional layer operation core is configured to perform convolution processing on the received candidate map in parallel, based on the target parameter matrix corresponding to each convolutional layer, according to a preset clock cycle.
14. A data processing apparatus, characterized in that the apparatus comprises:
a to-be-processed picture acquisition module, configured to acquire a picture to be processed;
a storage module, configured to store the picture to be processed into a preset storage space, so that a field programmable gate array (FPGA) acquires the picture to be processed from the preset storage space, processes it through a convolutional layer operation core based on a target parameter matrix corresponding to each convolutional layer in a target convolutional neural network to obtain a final feature map, and stores the final feature map into the preset storage space;
wherein the elements in the target parameter matrix corresponding to each convolutional layer in the target convolutional neural network correspond one-to-one to the convolution kernels in that convolutional layer; the convolution kernels in a convolutional layer include compressed convolution kernels and uncompressed convolution kernels, and a compressed convolution kernel is smaller than an uncompressed convolution kernel; the number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively; the target convolutional neural network is obtained by compressing, for each convolutional layer in an initial convolutional neural network, the specified partial convolution kernels in the convolutional layer, and then performing model training on the compressed network; the specified partial convolution kernels are set based on the number of input channels and the number of output channels of the convolutional layer, and on the input parallelism and output parallelism of the convolutional layer operation core of the FPGA;
and a feature map acquisition module, configured to acquire the final feature map from the preset storage space.
15. An apparatus for convolutional neural network training, the apparatus comprising:
a compression step size determination module, configured to determine, for each convolutional layer in an initial convolutional neural network, a compression step size for compressing the convolution kernels in the convolutional layer, based on the number of input channels and the number of output channels of the convolutional layer and on the input parallelism and output parallelism of a convolutional layer operation core of a field programmable gate array (FPGA); wherein the compression step size indicates the proportion of convolution kernels in the convolutional layer that need to be compressed when the convolutional layer is compressed;
a convolution kernel compression module, configured to compress the specified partial convolution kernels in the convolutional layer into target convolution kernels according to the determined compression step size, to obtain an alternative convolutional neural network;
a training module, configured to train the alternative convolutional neural network to obtain a target convolutional neural network;
and a target parameter matrix generation module, configured to generate a target parameter matrix corresponding to each convolutional layer in the target convolutional neural network; wherein the elements in the target parameter matrix corresponding to each convolutional layer correspond one-to-one to the convolution kernels in that convolutional layer; the number of rows and the number of columns of the target parameter matrix are preset based on the number of output channels and the number of input channels of the convolutional layer, respectively.
16. An electronic device, comprising a processor, a communication interface, a memory, a field programmable gate array (FPGA) and a communication bus, wherein the processor, the communication interface, the FPGA and the memory communicate with one another via the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method steps of any one of claims 1-8 when executing the program stored in the memory;
and the FPGA is configured to perform the method steps of any one of claims 9-10.
17. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another via the communication bus;
the memory is used for storing a computer program;
and the processor is configured to implement the method steps of claim 11 when executing the program stored in the memory.
18. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the method steps of any one of claims 1 to 8 or claim 11.
CN202110241125.3A 2021-03-04 2021-03-04 Data processing method, convolutional neural network training method and device and FPGA Pending CN115034351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110241125.3A CN115034351A (en) 2021-03-04 2021-03-04 Data processing method, convolutional neural network training method and device and FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110241125.3A CN115034351A (en) 2021-03-04 2021-03-04 Data processing method, convolutional neural network training method and device and FPGA

Publications (1)

Publication Number Publication Date
CN115034351A true CN115034351A (en) 2022-09-09

Family

ID=83117798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110241125.3A Pending CN115034351A (en) 2021-03-04 2021-03-04 Data processing method, convolutional neural network training method and device and FPGA

Country Status (1)

Country Link
CN (1) CN115034351A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600652B (en) * 2022-11-29 2023-04-07 深圳市唯特视科技有限公司 Convolutional neural network processing device, high-speed target detection method and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination