CN108573305B - Data processing method, equipment and device - Google Patents

Data processing method, equipment and device

Info

Publication number
CN108573305B
CN108573305B (application CN201710152660.5A)
Authority
CN
China
Prior art keywords
data
row
operated
preset
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710152660.5A
Other languages
Chinese (zh)
Other versions
CN108573305A (en)
Inventor
胡睿
方颉翔
张铧铧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201710152660.5A priority Critical patent/CN108573305B/en
Publication of CN108573305A publication Critical patent/CN108573305A/en
Application granted granted Critical
Publication of CN108573305B publication Critical patent/CN108573305B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The embodiments of the invention provide a data processing method, equipment and a device. The data processing method comprises the following steps: acquiring a preset convolution kernel and determining the width of its convolution frame; determining the width of a data column according to the cache capacity of the chip, a preset data amount and a first preset number of rows; dividing a data matrix to be processed into columns according to the data column width to obtain a plurality of column regions; for any column region, extracting a second preset number of rows of data to be operated, sending the data to be operated to the chip cache, and performing a convolution operation on the data to be operated using the preset convolution kernel; after the first row of the data to be operated has participated in the convolution operation, deleting the first row, extracting the next row of data in the corresponding column region, and updating the data to be operated; and performing the convolution operation on the updated data to be operated until all the row data in the region have participated in the convolution operation. The invention can reduce the power consumption generated by the chip during data processing and improve processing performance.

Description

Data processing method, equipment and device
Technical Field
The present invention relates to the field of chip design technologies, and in particular, to a data processing method, device, and apparatus.
Background
A convolutional neural network (CNN) is a deep learning algorithm that extracts information from data by simulating the working mode of the brain's neural network. The algorithm performs the initial extraction of information using convolution calculations and combines them with nonlinear operations to achieve high-performance target detection. With the continuous development of deep learning algorithms, CNNs are widely used in image processing fields such as target detection, data classification, and information extraction and matching. Because of the characteristics of the CNN algorithm, a large amount of data needs to participate in the operations repeatedly, which places high demands on the cache space in a chip: a sufficiently large storage space is required to store all the information needed by the CNN operations, and most existing chips cannot directly store all the required information.
To address the problem that most chips cannot directly store all the required information on chip, the prior art provides a CNN implementation in which, before each convolution operation, all the data required by that operation are imported from the memory again for calculation.
Because a large amount of data is reused in CNN calculations, the data blocks imported from the memory each time in the prior art contain a large amount of repeated data. A great deal of bandwidth is therefore wasted during reading, and the chip generates high power consumption and suffers degraded processing performance when processing data.
Disclosure of Invention
Embodiments of the present invention provide a data processing method, device, and apparatus, so as to reduce power consumption generated by a chip during data processing and improve processing performance. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:
acquiring a preset convolution kernel, and determining the width of a convolution frame of the preset convolution kernel;
acquiring and determining the width of a data column according to the cache capacity of a chip, a preset data amount and a first preset row number, wherein the width of the data column is greater than or equal to the width of the convolution frame;
dividing a data matrix to be processed into columns according to the data column width to obtain a plurality of column regions, wherein the data matrix to be processed is a matrix which is stored in a memory and contains all the data to be processed;
when a data processing instruction is received, extracting data to be operated of a second preset number of lines aiming at any one column region in all the column regions, and sending the data to be operated to the chip cache so as to carry out convolution operation on the cached data to be operated by utilizing the preset convolution kernel, wherein the second preset number of lines is greater than or equal to the width of the convolution frame and is less than or equal to the first preset number of lines;
after the first line of data in the data to be operated participates in convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated;
and performing convolution operation on the updated data to be operated by using the preset convolution kernel until all the row data in the region have participated in the convolution operation, and sending all operation results obtained from the convolution operations to the memory.
Optionally, the step of obtaining and determining the width of the data column according to the cache capacity of the chip, the preset data amount, and the first preset number of rows includes:
acquiring the cache capacity of a chip and a preset data amount, and dividing the chip cache capacity by the preset data amount to obtain the maximum number of data items cached by the chip;
acquiring a first preset line number, and dividing the maximum value of the data number by the first preset line number to obtain the number of each line of data cached by the chip;
and determining the number of each row of data cached by the chip as the width of a data column.
Optionally, before the step of performing column division on the data matrix to be processed according to the data column width to obtain a multi-column region, the method further includes:
subtracting a preset value from the width of the convolution frame to obtain the width of a superposition area, wherein the superposition area is an area where any data column is superposed with an adjacent data column;
and determining that the data column width includes the width of the overlapping area.
Optionally, before the step of receiving a data processing instruction, and extracting, for any column of regions in all the regions, a second preset number of rows of data to be operated, and sending the data to be operated to the chip cache, the method further includes:
adding a first empty row before the data of the first row aiming at any column of regions in all the regions, and setting the data of the first empty row to be 0;
adding a second empty row after the last row of data, and setting the data of the second empty row to be 0;
the step of extracting the data to be operated of the second preset number of lines and sending the data to be operated to the chip cache comprises the following steps:
and extracting data to be operated of a second preset line number from the first empty line and sending the data to be operated to the chip cache.
Optionally, after the first line of data in the data to be operated participates in the convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache, and updating the data to be operated as the last line of data of the data to be operated, the method includes:
when the first line of data in the data to be operated participates in convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated;
alternatively,
after any convolution operation is performed, deleting the first row of data from the chip cache, and when any convolution operation is performed after the first row of data has been deleted, extracting the next row of data from the corresponding column region and sending it to the chip cache as the last row of data of the data to be operated, and updating the data to be operated.
Optionally, the step of performing convolution operation on the updated data to be operated by using the preset convolution kernel until all the row data in the region participate in the convolution operation, and sending all operation results obtained after the convolution operation to the memory includes:
performing convolution operation on the updated data to be operated by using the preset convolution kernel to obtain a convolution result;
storing the convolution result into the column of the next convolution layer with the same column number as the corresponding column area;
and sending the next convolution layer to the memory.
In a second aspect, an embodiment of the present invention provides a data processing apparatus, where the apparatus includes:
the main control unit is used for receiving the data processing instruction and sending a control command to the rolling cache and calculation unit so as to control the rolling cache to extract data from the memory and control the calculation unit to carry out convolution operation on the extracted data;
the rolling cache is used for acquiring a preset convolution kernel and determining the width of a convolution frame of the preset convolution kernel; acquiring and determining the width of a data column according to the cache capacity of a chip, a preset data amount and a first preset row number, wherein the width of the data column is greater than or equal to the width of the convolution frame; after receiving a control command sent by the main control unit, extracting data to be operated of a second preset number of rows for any column area in a plurality of column areas obtained by column division of a data matrix to be processed in the memory according to the data column width; after the first row of data in the data to be operated participates in convolution operation, deleting the first row of data, extracting the next row of data from the corresponding column area to serve as the last row of data of the data to be operated, and updating the data to be operated;
and the calculation unit is used for performing convolution operation on the data to be operated or the updated data to be operated by utilizing the preset convolution kernel after receiving the data to be operated sent by the rolling cache until all the data in the area participate in the convolution operation, and sending all operation results obtained after the convolution operation to the memory.
Optionally, the rolling cache is further specifically configured to:
acquiring the cache capacity of a chip and a preset data amount, and dividing the chip cache capacity by the preset data amount to obtain the maximum number of data items held in the rolling cache;
acquiring a first preset line number, and dividing the maximum value of the data number by the first preset line number to obtain the number of each line of data of the rolling cache;
and determining the number of each row of data of the rolling cache as the width of a data column.
Optionally, the rolling cache is further specifically configured to:
subtracting a preset value from the width of the convolution frame to obtain the width of a superposition area, wherein the superposition area is an area where any data column is superposed with an adjacent data column;
and determining that the data column width includes the width of the overlapping area.
Optionally, the rolling cache is further specifically configured to:
aiming at any column of regions in all the regions, before extracting first row data, adding a first empty row before the first row data, and setting the data of the first empty row as 0;
before the last line of data is extracted, adding a second empty line after the last line of data, and setting the data of the second empty line to be 0;
and extracting the data to be operated of a second preset line number from the first empty line.
Optionally, the rolling cache is further specifically configured to:
when the first row of data in the data to be operated participates in convolution operation, deleting the first row of data, extracting the next row of data from the corresponding column area, using the next row of data as the last row of data of the data to be operated, and updating the data to be operated;
alternatively,
after any convolution operation is performed, deleting the first row of data from the chip cache, and when any convolution operation is performed after the first row of data has been deleted, extracting the next row of data from the corresponding column region and sending it to the chip cache as the last row of data of the data to be operated, and updating the data to be operated.
Optionally, the computing unit is specifically further configured to:
performing convolution operation on the data to be operated or the updated data to be operated by utilizing the preset convolution kernel to obtain a convolution result;
storing the convolution result into the column of the next convolution layer with the same column number as the corresponding column area;
and sending the next convolution layer to the memory.
In a third aspect, an embodiment of the present invention provides a data processing apparatus, where the apparatus includes:
the first determining module is used for acquiring a preset convolution kernel and determining the width of a convolution frame of the preset convolution kernel;
the second determining module is used for acquiring and determining the width of a data column according to the cache capacity of a chip, the preset data amount and the first preset row number, wherein the width of the data column is larger than or equal to the width of the convolution frame;
the dividing module is used for dividing a data matrix to be processed into columns according to the data column width to obtain a plurality of column regions, wherein the data matrix to be processed is a matrix which is stored in the memory and contains all the data to be processed;
the extracting module is used for extracting data to be operated of a second preset line number aiming at any column area in all the column areas and sending the data to be operated to the chip cache when a data processing instruction is received, and performing convolution operation on the cached data to be operated by utilizing the preset convolution kernel, wherein the second preset line number is greater than or equal to the width of the convolution frame and is less than or equal to the first preset line number;
the updating module is used for deleting the first line of data from the chip cache after the first line of data in the data to be operated participates in convolution operation, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated;
and the first operation module is used for performing convolution operation on the updated data to be operated by utilizing the preset convolution kernel until all the row data in the region have participated in the convolution operation, and sending all operation results obtained from the convolution operations to the memory.
Optionally, the second determining module includes:
the first operation submodule is used for acquiring the cache capacity of a chip and a preset data amount, and dividing the chip cache capacity by the preset data amount to obtain the maximum number of data items cached by the chip;
the second operation submodule is used for acquiring a first preset line number, and dividing the maximum value of the data number by the first preset line number to obtain the number of each line of data cached by the chip;
and the determining submodule is used for determining the number of each row of data cached by the chip as the width of the data column.
Optionally, the data processing apparatus further includes:
the second operation module is used for subtracting a preset value from the width of the convolution frame to obtain the width of a superposition area, wherein the superposition area is an area where any data column is superposed with an adjacent data column;
and the third determining module is used for determining that the data column width includes the width of the overlapping area.
Optionally, the data processing apparatus further includes:
the device comprises a first setting module, a second setting module and a third setting module, wherein the first setting module is used for adding a first empty row before first row data aiming at any column region in all regions and setting the data of the first empty row to be 0;
the second setting module is used for adding a second empty line after the last line of data and setting the data of the second empty line to be 0;
the extraction module is specifically further configured to:
and extracting data to be operated of a second preset line number from the first empty line and sending the data to be operated to the chip cache.
Optionally, the update module is specifically configured to:
when the first line of data in the data to be operated participates in convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated;
alternatively,
after any convolution operation is performed, deleting the first row of data from the chip cache, and when any convolution operation is performed after the first row of data has been deleted, extracting the next row of data from the corresponding column region and sending it to the chip cache as the last row of data of the data to be operated, and updating the data to be operated.
Optionally, the first operation module further includes:
the third operation submodule is used for carrying out convolution operation on the updated data to be operated by utilizing the preset convolution kernel to obtain a convolution result;
the storage submodule is used for storing the convolution result into the column of the next convolution layer with the same column number as the corresponding column area;
and the sending submodule is used for sending the next convolution layer to the memory.
According to the data processing method, equipment and device provided by the embodiments of the invention, when the data required by an operation are obtained from the memory, data of the determined column width are obtained each time from the to-be-processed data stored in the memory, the column width being determined by the size of the convolution kernel, the cache capacity of the chip, the amount of data that needs to be cached and the first preset number of rows. By exploiting the large amount of data reuse in convolution operations, only the data that actually need to participate in each calculation are fetched from the memory, which reduces the bandwidth requirement on the off-chip memory, thereby reducing the power consumption generated when the chip processes data and improving processing performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a data column partitioning method of an application example according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a convolution operation according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to reduce power consumption generated by a chip during data processing and improve processing performance, embodiments of the present invention provide a data processing method, device and apparatus.
First, a data processing method provided in an embodiment of the present invention is described below.
It should be noted that the execution body of the data processing method provided in the embodiment of the present invention may be a chip for data processing, such as a DSP (Digital Signal Processor), an ARM (Advanced RISC Machine) processor, or an FPGA (Field Programmable Gate Array), or it may be a data processing controller or another device with data processing capability, which is not limited herein. The data processing method provided by the embodiment of the present invention may be implemented by software, a hardware circuit and/or a logic circuit disposed in the data processing chip or the data processing controller. The application scenario of the embodiment of the present invention may be image processing or radar scanning; of course, other application scenarios that use a convolutional neural network are also applicable to the embodiment of the present invention.
As shown in fig. 1, a data processing method provided in an embodiment of the present invention may include the following steps:
s101, acquiring a preset convolution kernel, and determining the width of a convolution frame of the preset convolution kernel.
It should be noted that, in this embodiment, the convolutional neural network needs to perform convolution operations, and the preset convolution kernel may be set in advance or determined according to a selected preset operation strategy. The preset operation strategy may be any convolutional-neural-network operation strategy such as nonlinear rectification activation or pooling; each operation strategy is given a convolution kernel for performing the convolution operation, so the preset convolution kernel can be determined according to the selected preset operation strategy.
It should be emphasized that the key to performing the convolution operation is the selection of the convolution operator, i.e. the coefficient matrix, which is the convolution kernel; the width of this coefficient matrix is the convolution frame width. For example, for a commonly used 3 × 3 convolution kernel, the convolution frame width is 3.
And S102, obtaining and determining the width of a data column according to the cache capacity of the chip, the preset data quantity and the first preset row number.
Wherein the data column width is greater than or equal to the convolution box width; the data column comprises a plurality of data; the chip comprises a cache used for storing data participating in operation; the preset data volume can be obtained according to the characteristics of data needing to participate in operation, and can also be preset, and the preset data volume represents the data volume needing to be cached for performing one convolution operation. It should be noted that, when the chip cache capacity is large, a column of extracted data may include more data for caching, so as to perform convolution operation.
Optionally, the step of obtaining and determining the width of the data column according to the chip cache capacity, the preset data amount, and the first preset row number may include:
firstly, obtaining the cache capacity of the chip and the preset data amount, and dividing the chip cache capacity by the preset data amount to obtain the maximum number of data items cached by the chip.
It should be noted that, in this embodiment, when data are extracted from the memory, the data to be processed must first be divided into multiple column regions that have overlapping areas, and data are read from only one column region at a time. The width of each column region is determined by the chip cache capacity, the preset data amount and the first preset number of rows. First, the maximum number of data items that the chip can cache is determined from the chip cache capacity and the preset data amount. For example, if the chip cache capacity is 100 KB and the data amount of each datum is 256 B, the chip can cache at most 400 data items.
And secondly, acquiring a first preset line number, and dividing the maximum value of the data number by the first preset line number to obtain the number of each line of data cached by the chip.
And finally, determining the number of each row of data cached by the chip as the width of the data column.
The data column width includes the width of the overlapping area. The first preset number of rows may be the number of rows of data stored in the memory, a preset number of rows that can be extracted, or the convolution frame width of the preset convolution kernel; specifically, the first preset number of rows may be determined according to the cache capacity of the chip. After the maximum number of data items cached by the chip has been determined, this maximum is divided by the first preset number of rows to obtain the number of data items in each cached row, and this number is used as the data column width. For example, if the chip can cache at most 400 data items and the first preset number of rows is 4, the data column width may be 100.
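For illustration only (this sketch is not part of the patent text), the column-width calculation described above can be expressed in a few lines of Python. The parameter names cache_capacity_bytes, bytes_per_datum and first_preset_rows are hypothetical names chosen for this sketch, and integer division is assumed so that the cached data never exceed the cache capacity.

# Sketch of the data-column-width calculation of step S102 (assumed parameter names).
def data_column_width(cache_capacity_bytes, bytes_per_datum, first_preset_rows):
    # Maximum number of data items the chip cache can hold, e.g. 100 KB / 256 B = 400.
    max_cached_items = cache_capacity_bytes // bytes_per_datum
    # Number of data items per cached row, e.g. 400 items / 4 rows = 100.
    return max_cached_items // first_preset_rows

# Example from the text: a 100 KB cache, 256 B per datum and a first preset
# row number of 4 give a data column width of 100.
assert data_column_width(100 * 1024, 256, 4) == 100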
And S103, dividing the rows of the data matrix to be processed according to the width of the data rows to obtain a multi-row area.
The data matrix to be processed is a matrix which is stored in the memory and contains all the data to be processed. In general, raw data are stored in the memory; for example, in an image processing system, the captured original image is stored in the memory. Of course, the stored data may also be two-dimensional data obtained by converting the original data. It should be noted that, when data are extracted from the memory, if they are original data, the original data are first converted into two-dimensional form; the data to be processed are then divided into a plurality of column regions according to the obtained data column width, and data are read from only one column region at a time. After the data column width is obtained, in order to reduce the amount of data cached each time, the data matrix to be processed in the memory may be divided according to the data column width. For example, if the obtained data column width is 6, the data matrix to be processed may be divided by taking every 6 columns of data as a column region.
It should be emphasized that, in order to ensure that the result of performing convolution operation on the divided data is completely consistent with the result of performing convolution operation on the original data matrix to be processed, a certain overlapping area may exist between two adjacent columns of areas when the data matrix to be processed is divided.
Optionally, before the step of performing column division on the data matrix to be processed according to the data column width to obtain the multi-column region, the data processing method may further include:
First, a preset value is subtracted from the convolution frame width to obtain the width of the overlapping area.
Next, it is determined that the data column width includes the width of the overlapping area.
The overlapping area is the area where any data column overlaps with an adjacent data column. Because the first column of data in each column region participates only in the convolution operations of the current column region while the other columns of data also need to participate in the convolution operations of other column regions, the preset value is generally 1. For example, when the convolution frame width is 3, the width of the overlapping area is 2; when the convolution frame width is 5, the width of the overlapping area is 4; and so on. The overlapping area ensures that the required data can still be obtained when the convolution operation reaches a boundary, so that the result is completely consistent with the normal convolution process.
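As a minimal sketch (again, not part of the patent text), the column division with overlapping areas can be written as follows, assuming the matrix is given as a list of rows and the overlap equals the convolution frame width minus 1:

def split_into_column_regions(matrix, column_width, frame_width):
    # Adjacent column regions overlap by (frame_width - 1) columns so that every
    # convolution window near a region boundary still finds all the data it needs.
    overlap = frame_width - 1
    step = column_width - overlap
    total_cols = len(matrix[0])
    regions, start = [], 0
    while start < total_cols:
        end = min(start + column_width, total_cols)
        regions.append([row[start:end] for row in matrix])
        if end == total_cols:
            break
        start += step
    return regions

# Example: a 4 x 8 matrix with column width 6 and a 3 x 3 kernel is split into
# a region of columns 0-5 and a region of columns 4-7 (widths 6 and 4).
m = [[c + 10 * r for c in range(8)] for r in range(4)]
print([len(region[0]) for region in split_into_column_regions(m, 6, 3)])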
And S104, when a data processing instruction is received, extracting the data to be calculated of a second preset row number aiming at any column area in all the column areas, sending the data to be calculated to a chip cache, and performing convolution operation on the cached data to be calculated by utilizing a preset convolution kernel.
The second preset number of rows is greater than or equal to the convolution frame width and less than or equal to the first preset number of rows. It should be noted that the data processing instruction is used to start the data processing operation; a data processing instruction sent by the user may be received and data processing started accordingly, which increases the interaction between the user and the data processing and improves the user experience. Of course, after the original data are acquired, the acquisition module may generate the data processing instruction to start data processing; for example, in image processing, when the original image has been acquired and a notification that it has been stored in the memory is received, data processing is started.
It should be emphasized that, after a data processing instruction is received, data are extracted from the to-be-processed data in a divided region of the memory. Because the cache space of the chip is limited, when the data column width is calculated, the first preset number of rows equals the maximum number of rows of that column width that the chip can cache. When data are extracted from the memory for the first time, the maximum number of rows that can be extracted from a column region is therefore the first preset number of rows; and because a convolution operation is to be performed, the minimum number of rows extracted is the convolution frame width of the preset convolution kernel. The data to be operated are thus extracted according to the second preset number of rows. After the first extraction, the convolution operation can be performed directly on the data to be operated using the preset convolution kernel; the convolution operation itself is prior art and is not described here again.
Optionally, before the step of extracting, for any column of the regions in all the regions, the data to be computed of the second preset number of rows and sending the extracted data to the chip cache when the data processing instruction is received, the data processing method may further include:
firstly, adding a first empty row before first row data for any column area in all areas, and setting the data of the first empty row as 0;
secondly, a second empty line is added after the last line of data, and the data of the second empty line is set to 0.
It should be noted that, for the first row and the last row of a column of data (for example, the edge point information of an image), directly extracting the first or last row for convolution would cause data loss at the boundary. Therefore, an empty row whose data are 0 is added before the first row and after the last row respectively, ensuring the completeness of the extracted data.
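A corresponding sketch of the zero-row padding (the helper name pad_column_region is hypothetical, and a column region is assumed to be given as a list of rows):

def pad_column_region(region):
    # Add an all-zero row before the first row and after the last row so that the
    # boundary rows of the region can still sit inside a full convolution window.
    zero_row = [0] * len(region[0])
    return [zero_row] + region + [zero_row]

# A 2-row region becomes a 4-row region with zero rows at both ends.
print(pad_column_region([[1, 2], [3, 4]]))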
Optionally, the step of extracting the data to be calculated of the second preset number of lines and sending the data to be calculated to the chip cache may include:
and extracting the data to be operated of the second preset line number from the first empty line and sending the data to be operated to the chip cache.
It should be noted that the convolutional neural network performs convolution operations sequentially from the first row, according to the convolution frame width, until all the data in the extracted data column have been convolved; only then is the convolution of that data column finished, and the convolution of the next data column starts only after the operations on the current data column are complete.
And S105, after the first line of data in the data to be operated participates in the convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated.
The next row of data in the corresponding column region is the first row of that region that has not yet been extracted. When the convolution operation is performed on the data to be operated that were extracted first, the number of rows extracted may be equal to or greater than the convolution frame width. If the number of rows of the data to be operated equals the convolution frame width, the first row is deleted after the convolution operation is finished, and the next row is extracted from the corresponding column region in the memory. If the number of rows is greater than the convolution frame width, the next row may be extracted from the corresponding column region after the first row has participated in the convolution operation and has been deleted, or the first row may be deleted and the next row extracted while the first row is participating in the convolution operation; this is not specifically limited here. In any case, the next row may be extracted from the corresponding column region only after the first row has participated in the convolution operation and has been deleted, otherwise the chip cache capacity may be exceeded. After a new row of data is received, it is placed as the last row of the data to be operated according to the row order, and the data to be operated are updated.
Optionally, after the first line of data in the data to be operated participates in the convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column region, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated may include:
and when the first line of data in the data to be operated participates in the convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated.
It should be noted that, in order to ensure the utilization of the chip cache, make maximum use of its capacity, and ensure the efficiency of the data operation, one specific implementation of this embodiment deletes the first row of the data to be operated while it is participating in the convolution operation and immediately extracts the next row from the corresponding column region of the memory, updating the data to be operated. For example, suppose 5 rows of data (rows 0 to 4 of a certain column region) have been extracted from the memory and stored in the chip cache, and the data to be operated are convolved with a 3 × 3 convolution kernel. While row 0 participates in the convolution operation, row 0 is deleted from the chip cache, row 5 is extracted from the corresponding column region of the memory and sent to the chip cache, and row 5 becomes the last row of the data to be operated.
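The row-by-row update of the cached data can be sketched as follows, under the simplifying assumptions that the window held on chip is exactly the convolution frame width and that the whole (already padded) column region is available to fetch from; the helper name rolling_windows is chosen for this sketch.

from collections import deque

def rolling_windows(padded_region, window_rows):
    # Model of the rolling cache for one column region: hold window_rows rows,
    # hand them out for convolution, then drop the oldest row and fetch the next
    # row of the region; deque(maxlen=...) evicts the oldest row automatically.
    cache = deque(padded_region[:window_rows], maxlen=window_rows)
    yield list(cache)
    for next_row in padded_region[window_rows:]:
        cache.append(next_row)
        yield list(cache)

# With a 3 x 3 kernel (window_rows = 3), a 6-row region yields 4 successive windows.
region = [[r] * 3 for r in range(6)]
print(sum(1 for _ in rolling_windows(region, 3)))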
Alternatively,
after any convolution operation is performed, the first row of data is deleted from the chip cache, and when any convolution operation is performed after the first row of data has been deleted, the next row of data is extracted from the corresponding column region and sent to the chip cache as the last row of the data to be operated, and the data to be operated are updated.
It should be noted that, when the second preset number of rows is greater than the convolution frame width, another implementation of this embodiment is possible: the first row need not be deleted, and a new row need not be extracted, immediately after each convolution operation. The first row may be deleted after any convolution operation following the one in which it participated, and the next row may be extracted from the corresponding column region during any convolution operation performed after the first row has been deleted (for example, only after all the cached rows have participated in a convolution operation). For example, when the second preset number of rows is 5, rows 0 to 4 of a certain column region are extracted from the memory and stored in the chip cache, and a 3 × 3 convolution kernel is used. Rows 0 to 2 are convolved first; after this convolution operation is completed, row 0 may be deleted immediately or during any later convolution operation, and row 5 may be extracted during any convolution operation performed after row 0 has been deleted, as long as the chip cache capacity is never exceeded.
And S106, performing the convolution operation on the updated data to be operated by using the preset convolution kernel until all the row data in the region have participated in the convolution operation, and sending all operation results obtained from the convolution operations to the memory.
It should be noted that the convolutional neural network performs convolution operations on the data successively using the convolution kernel; a convolution result obtained from a convolution operation can serve as the input of the next convolution operation, or as the basis for judging or comparing output feature data. For example, in image processing, a set of image features and a plurality of attribute features of the image obtained by the convolution operations may be output, and judgments such as target detection may be made from the obtained image features. It should be emphasized that the operation results of the convolution operations are usually stored in the memory.
Optionally, the step of performing convolution operation on the updated data to be operated by using a preset convolution kernel, and sending all operation results obtained after the convolution operation to the memory may include:
firstly, carrying out convolution operation on updated data to be operated by utilizing a preset convolution kernel to obtain a convolution result;
secondly, storing the convolution result into the column of the next convolution layer with the same column number as the corresponding column region;
and finally, sending the next convolution layer to the memory.
It should be noted that the data stored in the memory are the operation results obtained by performing the convolution operations on all the rows of data. An operation result matrix is therefore built up according to the column number of the data column currently being convolved, and this stored operation result matrix is the next convolution layer. For example, the convolution operation is performed on the data of the 1st data column, and the operation results are stored in the first column of the next convolution layer. The convolution operation process itself is prior art and is not described here again.
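As a simplified sketch of this bookkeeping (treating the results of one data column as a single output column, as in the example above; next_layer is a hypothetical pre-allocated result matrix):

def store_column_results(next_layer, column_index, column_results):
    # Write the convolution results of one data column into the column of the
    # next convolution layer that has the same column number.
    for row_index, value in enumerate(column_results):
        next_layer[row_index][column_index] = value

# Hypothetical 3 x 2 next layer; the results of data column 0 go into its column 0.
layer = [[0, 0] for _ in range(3)]
store_column_results(layer, 0, [7, 8, 9])
print(layer)  # [[7, 0], [8, 0], [9, 0]]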
By applying this embodiment, when the data required by an operation are obtained from the memory, data of the determined column width are obtained each time from the to-be-processed data stored in the memory, according to the data column width determined by the size of the convolution kernel, the chip cache capacity, the amount of data to be cached and the first preset number of rows. By exploiting the large amount of data reuse in convolution operations, only the data that actually need to participate in each calculation are fetched from the memory, which reduces the bandwidth requirement on the off-chip memory, thereby reducing the power consumption generated when the chip processes data and improving processing performance. Because the data are divided into data columns that are extracted into the cache for operation, data of large volume can be processed with a small cache; and through the setting of the overlapping area, the required data can still be obtained when the convolution calculation reaches a boundary, ensuring that the result is completely consistent with the normal convolution process.
The data processing method provided by the embodiment of the invention is described below with reference to specific application examples.
Suppose an image with a resolution of 256 × 160 is stored in the memory and the data amount of each datum is 256 B, so that the storage space occupied by the image is 100 MB. The convolution algorithm of the convolutional neural network is selected, the preset convolution kernel is a 5 × 5 convolution kernel, and the chip cache capacity is assumed to be 100 KB. The manner of data column division is shown in fig. 2: because the convolution frame width is 5, the width of the overlapping region 202 is 2; and because the data amount of each datum is 256 B and the chip cache capacity is 100 KB, with a convolution kernel of size 5 × 5 the data column width 201 is set to 100 × 1024/256/(5+1) ≈ 66.
In order to reduce the number of data columns, and thus the number of times data are repeatedly loaded, the amount of data that needs to be repeatedly loaded for the image is given by a formula that appears in the original filing only as an embedded image (BDA0001246068690000151) and is omitted here.
Therefore, under this condition, for every 100 MB image with a resolution of 256 × 160, 640 KB of data is loaded a second time, and the proportion of invalidly loaded data is 640/(100 × 1024) ≈ 0.63%.
Through analysis, the data of the 1st to 66th columns of the original two-dimensional data are extracted from the memory the first time and convolved; the data of the 65th to 130th columns are extracted from the memory the second time and convolved; and data columns of width 66 are extracted in this way in sequence and convolved.
Specifically, as shown in fig. 3, the process of extracting a data column and performing the convolution operations each time is as follows. Assume that the preset convolution kernel is a 5 × 5 convolution kernel, the first preset number of rows equals the convolution frame width, and the data matrix to be processed has 50 rows of data in total. The specific steps are as follows:
Firstly, the data in the 0th row 311, the 1st row 312, the 2nd row 313, the 3rd row 314 and the 4th row 315 of the first data column 310 are read; the 0th row 311 is an empty row, so its data are set to 0. The convolution operation is performed on the 0th row 311 to the 4th row 315, and the result is stored in row 1 of column 1 of the next convolution layer. While the convolution operation is being performed, the data in the 5th row 316 are read and stored in the cache.
Secondly, the data in the 1st row 312, the 2nd row 313, the 3rd row 314, the 4th row 315 and the 5th row 316 of the first data column 310 are read, the convolution operation is performed on the 1st row 312 to the 5th row 316, and the result is stored in row 2 of column 1 of the next convolution layer; meanwhile, the data in the 6th row 317 are read and stored in the cache.
Thirdly, the remaining rows of data of the first data column 310 are read in sequence in the manner of the second step.
Fourthly, after 47 rounds of calculation, the convolution operation is performed on the data of the 47th row 318, the 48th row 319, the 49th row 3110, the 50th row 3111 and the 51st row 3112; the 51st row 3112 is an empty row, so its data are set to 0, and the result is stored in row 50 of column 1 of the next convolution layer.
Fifthly, the data in the 0th row 321, the 1st row 322, the 2nd row 323, the 3rd row 324 and the 4th row 325 of the second data column 320 are loaded; the 0th row 321 is an empty row, so its data are set to 0. The convolution operation is performed on the 0th row 321 to the 4th row 325, and the result is stored in row 1 of column 2 of the next convolution layer. While the convolution operation is being performed, the data in the 5th row 326 are read and stored in the cache.
Sixthly, the fourth and fifth steps are executed repeatedly until the convolution operation is performed on the data of the 47th row 331, the 48th row 332, the 49th row 333, the 50th row 334 and the 51st row 335 of the Nth data column 330; the 51st row 335 is an empty row, so its data are set to 0, and the result is stored in row 50 of column N of the next convolution layer, where column N is the last column of the data stored in the memory.
Finally, the obtained next convolution layer is sent to the memory.
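To illustrate the claim that the column-wise rolling scheme reproduces a normal convolution exactly, the following sketch (not from the patent; it reuses the split_into_column_regions, pad_column_region and rolling_windows helpers defined in the earlier sketches) compares the stitched column-wise result with a direct convolution over the zero-row-padded matrix:

def conv2d_valid(rows, kernel):
    # Plain "valid" 2-D convolution (really cross-correlation, as computed by
    # most CNN implementations); rows and kernel are lists of lists.
    k = len(kernel)
    out = []
    for i in range(len(rows) - k + 1):
        out_row = []
        for j in range(len(rows[0]) - k + 1):
            out_row.append(sum(kernel[di][dj] * rows[i + di][j + dj]
                               for di in range(k) for dj in range(k)))
        out.append(out_row)
    return out

def rolling_column_conv(matrix, kernel, column_width):
    k = len(kernel)
    per_region = []
    for region in split_into_column_regions(matrix, column_width, k):
        padded = pad_column_region(region)
        # One output row per window handed out by the rolling cache.
        per_region.append([conv2d_valid(window, kernel)[0]
                           for window in rolling_windows(padded, k)])
    # Stitch the per-region output columns side by side.
    return [[v for region_out in per_region for v in region_out[i]]
            for i in range(len(per_region[0]))]

kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
matrix = [[(3 * c + 7 * r) % 11 for c in range(10)] for r in range(6)]
direct = conv2d_valid(pad_column_region(matrix), kernel)
assert rolling_column_conv(matrix, kernel, column_width=6) == direct

Under these simplifying assumptions the assertion holds, mirroring the statement above that the overlapping areas keep the column-wise result identical to the normal convolution process.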
Compared with the prior art, in this scheme, when the data required by an operation are obtained from the memory, data of the determined column width are obtained each time from the to-be-processed data stored in the memory, according to the data column width determined by the size of the convolution kernel, the chip cache capacity, the amount of data to be cached and the first preset number of rows. By exploiting the large amount of data reuse in convolution operations, only the data that actually need to participate in each calculation are fetched from the memory, which reduces the bandwidth requirement on the off-chip memory, thereby reducing the power consumption generated when the chip processes data and improving processing performance. Because the data are divided into data columns that are extracted into the cache for operation, data of large volume can be processed with a small cache; and through the setting of the overlapping area, the required data can still be obtained when the convolution calculation reaches a boundary, ensuring that the result is completely consistent with the normal convolution process.
Corresponding to the foregoing embodiments, an embodiment of the present invention provides a data processing apparatus, and as shown in fig. 4, the data processing apparatus may include:
the main control unit 410 is configured to receive a data processing instruction, and send a control command to the rolling cache and calculation unit, so as to control the rolling cache to extract data from the memory and control the calculation unit to perform convolution operation on the extracted data;
a rolling cache 420, configured to obtain a preset convolution kernel and determine a convolution frame width of the preset convolution kernel; acquiring and determining the width of a data column according to the cache capacity of a chip, a preset data amount and a first preset row number, wherein the width of the data column is greater than or equal to the width of the convolution frame; after receiving a control command sent by the main control unit, extracting data to be operated of a second preset number of rows for any column area in a plurality of column areas obtained by column division of a data matrix to be processed in the memory according to the data column width; after the first row of data in the data to be operated participates in convolution operation, deleting the first row of data, extracting the next row of data from the corresponding column area to serve as the last row of data of the data to be operated, and updating the data to be operated;
and the calculating unit 430 is configured to, after receiving the data to be operated sent by the rolling cache, perform convolution operation on the data to be operated or the updated data to be operated by using the preset convolution kernel until all the data in the area participate in the convolution operation, and send all operation results obtained after the convolution operation to the memory.
By applying this embodiment, when the data required by an operation are obtained from the memory, data of the determined column width are obtained each time from the to-be-processed data stored in the memory, according to the data column width determined by the size of the convolution kernel, the chip cache capacity, the amount of data to be cached and the first preset number of rows. By exploiting the large amount of data reuse in convolution operations, only the data that actually need to participate in each calculation are fetched from the memory, which reduces the bandwidth requirement on the off-chip memory, thereby reducing the power consumption generated when the chip processes data and improving processing performance. Because the data are divided into data columns that are extracted into the cache for operation, data of large volume can be processed with a small cache; and through the setting of the overlapping area, the required data can still be obtained when the convolution calculation reaches a boundary, ensuring that the result is completely consistent with the normal convolution process.
Optionally, the rolling cache 420 may be specifically configured to:
acquiring the chip cache capacity and a preset data amount, and dividing the chip cache capacity by the preset data amount to obtain the maximum number of data items the rolling cache can hold;
acquiring a first preset number of rows, and dividing this maximum number by the first preset number of rows to obtain the number of data items in each row of the rolling cache;
and determining the number of data items in each row of the rolling cache as the data column width.
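For illustration only, this width calculation can be sketched as follows (a minimal Python sketch; the function and parameter names, and the interpretation of the preset data amount as the cache footprint of a single data item, are assumptions of this sketch and are not taken from the embodiment):

def data_column_width(cache_capacity_bytes, bytes_per_item, first_preset_rows):
    # Maximum number of data items the rolling cache can hold at once
    # (chip cache capacity divided by the per-item cache requirement).
    max_items = cache_capacity_bytes // bytes_per_item
    # Spread those items over the first preset number of rows; each row
    # then holds this many items, which is taken as the data column width.
    return max_items // first_preset_rows

# Example: a 64 KiB cache, 2 bytes per item, 8 cached rows -> 4096 items per row.
print(data_column_width(64 * 1024, 2, 8))  # 4096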
Optionally, the rolling cache 420 may be specifically configured to:
subtracting a preset value from the convolution frame width to obtain the width of the overlap region, wherein the overlap region is the area in which any data column overlaps an adjacent data column;
and determining that the data column width includes the width of the overlap region.
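A minimal sketch of this column division with overlap, assuming a stride-1 convolution so that the preset value is 1 and adjacent column regions overlap by the convolution frame width minus 1 (the function and parameter names are illustrative, not taken from the embodiment):

import numpy as np

def split_into_column_regions(matrix, column_width, frame_width, preset_value=1):
    # Width of the overlap region shared by adjacent data columns.
    overlap = frame_width - preset_value
    # Each new region contributes (column_width - overlap) fresh columns.
    step = column_width - overlap
    regions, start = [], 0
    while start < matrix.shape[1]:
        regions.append(matrix[:, start:start + column_width])
        if start + column_width >= matrix.shape[1]:
            break
        start += step
    return regions

# Example: 12 columns, data column width 6, 3x3 kernel -> 3 overlapping regions.
regions = split_into_column_regions(np.arange(96).reshape(8, 12), 6, 3)
print([r.shape for r in regions])  # [(8, 6), (8, 6), (8, 4)]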
Optionally, the rolling cache 420 may be specifically configured to:
for any one of all the column regions, before extracting the first row of data, adding a first empty row before the first row of data and setting the data of the first empty row to 0;
before extracting the last row of data, adding a second empty row after the last row of data and setting the data of the second empty row to 0;
and extracting the data to be operated of the second preset number of rows starting from the first empty row.
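The boundary handling can be sketched as follows; this simplified illustration pads a whole column region in one step, whereas the embodiment adds the empty rows immediately before the first and last rows are extracted (names are illustrative):

import numpy as np

def pad_column_region(region):
    # Add a first empty row before the first row and a second empty row
    # after the last row, both set to 0, so that the convolution window
    # still finds data at the top and bottom boundaries.
    rows, cols = region.shape
    padded = np.zeros((rows + 2, cols), dtype=region.dtype)
    padded[1:-1, :] = region
    return padded

# Extraction of the data to be operated then starts from the first (zero) row.
print(pad_column_region(np.arange(12).reshape(4, 3)).shape)  # (6, 3)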
Optionally, the rolling cache 420 may be specifically configured to:
after the first row of data in the data to be operated has participated in the convolution operation, deleting the first row of data, extracting the next row of data from the corresponding column region as the last row of the data to be operated, and updating the data to be operated;
alternatively,
after any convolution operation has been carried out, deleting the first row of data from the chip cache, and, when any convolution operation is carried out after the first row of data has been deleted, extracting the next row of data from the corresponding column area and sending it to the chip cache as the last row of the data to be operated, and updating the data to be operated.
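A minimal sketch of this row-by-row rolling behaviour, using a fixed-length deque as a stand-in for the on-chip rolling cache (the deque-based realisation and all names are illustrative assumptions, not the embodiment's implementation):

from collections import deque
import numpy as np

def rolling_row_windows(column_region, second_preset_rows):
    # The deque models the chip cache: once it is full, appending the next
    # row automatically drops the first (oldest) row, mirroring "delete the
    # first row of data, extract the next row as the last row".
    window = deque(maxlen=second_preset_rows)
    for row in column_region:              # each row is read from memory exactly once
        window.append(row)
        if len(window) == second_preset_rows:
            yield np.stack(list(window))   # the data to be operated for this step

# Example: sweep 5-row windows over a 10x6 column region.
for window in rolling_row_windows(np.zeros((10, 6)), 5):
    pass  # the computing unit would convolve `window` here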
Optionally, the calculating unit 430 may be further specifically configured to:
performing convolution operation on the data to be operated or the updated data to be operated by utilizing the preset convolution kernel to obtain a convolution result;
storing the convolution result into the column of the next convolution layer with the same column number as the corresponding column area;
and sending the next convolution layer to the memory.
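Putting the pieces together, the following sketch processes one column region at a time and assembles the partial results into the next convolution layer. scipy.signal.convolve2d is used only as a stand-in for the computing unit, and a stride-1 'valid' convolution is assumed; these choices and all names are illustrative, not taken from the embodiment. The final assertion checks the stated property that the column-wise result is completely consistent with a normal full-matrix convolution:

import numpy as np
from scipy.signal import convolve2d

def column_wise_convolution(matrix, kernel, column_width):
    k = kernel.shape[1]
    overlap = k - 1                          # overlap between adjacent column regions
    out_columns = []
    start = 0
    while start + overlap < matrix.shape[1]:
        region = matrix[:, start:start + column_width]
        # The computing unit would sweep this region with the rolling row
        # window; convolve2d stands in for that inner loop.
        out_columns.append(convolve2d(region, kernel, mode='valid'))
        start += column_width - overlap
    # Results are stored column region by column region in the next layer.
    return np.hstack(out_columns)

# Consistency check against the normal convolution process.
x = np.arange(8 * 12, dtype=float).reshape(8, 12)
w = np.ones((3, 3))
assert np.allclose(column_wise_convolution(x, w, column_width=6),
                   convolve2d(x, w, mode='valid'))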
It should be noted that, the data processing apparatus according to the embodiment of the present invention is an apparatus to which the data processing method is applied, and all embodiments of the data processing method are applicable to the apparatus and can achieve the same or similar beneficial effects.
Corresponding to the above embodiments, an embodiment of the present invention provides a data processing apparatus, and as shown in fig. 5, the data processing apparatus may include:
a first determining module 510, configured to obtain a preset convolution kernel, and determine a convolution frame width of the preset convolution kernel;
a second determining module 520, configured to obtain the chip cache capacity, a preset data amount, and a first preset number of rows, and determine a data column width accordingly, where the data column width is greater than or equal to the convolution frame width;
a dividing module 530, configured to perform column division on a to-be-processed data matrix according to the data column width to obtain a multi-column region, where the to-be-processed data matrix is a matrix stored in a memory and includes all to-be-processed data;
the extracting module 540 is configured to, when a data processing instruction is received, extract, for any one of all the column regions, data to be operated of a second preset number of lines, send the data to be operated to the chip cache, and perform convolution operation on the cached data to be operated by using the preset convolution kernel, where the second preset number of lines is greater than or equal to the width of the convolution frame and is less than or equal to the first preset number of lines;
the updating module 550 is configured to delete the first line of data from the chip cache after the first line of data in the data to be operated participates in the convolution operation, extract the next line of data from the corresponding column region, send the next line of data to the chip cache, serve as the last line of data of the data to be operated, and update the data to be operated;
the first operation module 560 is configured to perform convolution operation on the updated data to be operated by using the preset convolution kernel until all the row data in the area participate in the convolution operation, and send all operation results obtained after the convolution operation to the memory.
By applying this embodiment, when the data required for the operation is acquired from the memory, a row of data of the data column width is acquired each time from the data to be processed stored in the memory, the data column width being determined by the size of the convolution kernel, the cache capacity of the chip, the amount of data to be cached and the first preset number of rows. By exploiting the large amount of data reuse in convolution operations, the data required to participate in the operation is read from the memory only once for each calculation, which reduces the bandwidth requirement on the off-chip memory, thereby reducing the power consumption generated when the chip processes the data and improving the processing performance. Because the data is divided into data columns and extracted to the cache for operation, data of a large volume can be processed with a small cache; and through the setting of the overlap region, the required data can still be acquired when the convolution calculation reaches the boundary, ensuring that the result is completely consistent with the normal convolution process.
Optionally, the second determining module 520 may further include:
the first operation submodule is used for acquiring the chip cache capacity and a preset data amount, and dividing the chip cache capacity by the preset data amount to obtain the maximum number of data items cached by the chip;
the second operation submodule is used for acquiring a first preset number of rows, and dividing this maximum number by the first preset number of rows to obtain the number of data items in each row cached by the chip;
and the determining submodule is used for determining the number of data items in each row cached by the chip as the data column width.
Optionally, the data processing apparatus may further include:
the second operation module is used for subtracting a preset value from the convolution frame width to obtain the width of the overlap region, wherein the overlap region is the area in which any data column overlaps an adjacent data column;
and the third determining module is used for determining that the data column width includes the width of the overlap region.
Optionally, the data processing apparatus may further include:
the first setting module is used for adding, for any one of all the column regions, a first empty row before the first row of data, and setting the data of the first empty row to 0;
the second setting module is used for adding a second empty line after the last line of data and setting the data of the second empty line to be 0;
Optionally, the extracting module 540 may be specifically further configured to:
and extracting data to be operated of a second preset line number from the first empty line and sending the data to be operated to the chip cache.
Optionally, the update module 550 may be specifically configured to:
when the first line of data in the data to be operated participates in convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated;
alternatively,
and after any convolution operation is carried out, deleting the first row of data from the chip cache, and when any convolution operation is carried out after the first row of data is deleted, extracting the next row of data from the corresponding column area and sending the next row of data to the chip cache to serve as the last row of data of the data to be operated, and updating the data to be operated.
Optionally, the first operation module 560 may further include:
the third operation submodule is used for carrying out convolution operation on the updated data to be operated by utilizing the preset convolution kernel to obtain a convolution result;
the storage submodule is used for storing the convolution result into the column of the next convolution layer with the same column number as the corresponding column area;
and the sending submodule is used for sending the next convolution layer to the memory.
It should be noted that, the data processing apparatus according to the embodiment of the present invention is an apparatus applying the data processing method, and all embodiments of the data processing method are applicable to the apparatus and can achieve the same or similar beneficial effects.
It is understood that, in another embodiment of the present invention, the data processing apparatus may include: the device comprises a first determining module 510, a second determining module 520, a dividing module 530, an extracting module 540, an updating module 550, a first operating module 560, a second operating module, a third determining module, a first setting module and a second setting module.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (15)

1. A method of data processing, the method comprising:
acquiring a preset convolution kernel, and determining the width of a convolution frame of the preset convolution kernel;
obtaining the cache capacity of a chip and a preset data volume, and dividing the cache capacity of the chip by the preset data volume to obtain the maximum value of the number of data cached by the chip, wherein the preset data volume represents the data volume required for caching in one convolution operation;
acquiring a first preset line number, and dividing the maximum value of the data number by the first preset line number to obtain the number of each line of data cached by the chip;
determining the number of each row of data cached by the chip as a data column width, wherein the data column width is greater than or equal to the width of the convolution frame;
dividing a data matrix to be processed into a plurality of rows according to the data row width to obtain a plurality of rows of areas, wherein the data matrix to be processed is a matrix which is stored in a memory and contains all data to be processed;
when a data processing instruction is received, extracting data to be operated of a second preset number of lines aiming at any one column region in all the column regions, sending the data to be operated to the chip cache, and performing convolution operation on the cached data to be operated by utilizing the preset convolution kernel, wherein the second preset number of lines is greater than or equal to the width of the convolution frame and is less than or equal to the first preset number of lines;
after the first line of data in the data to be operated participates in convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated;
and performing convolution operation on the updated data to be operated by using the preset convolution kernel until all the row data in the area participate in the convolution operation, and sending all operation results obtained after the convolution operation to the memory.
2. The data processing method according to claim 1, wherein before the step of performing column division on the data matrix to be processed according to the data column width to obtain a plurality of column regions, the method further comprises:
subtracting a preset value from the width of the convolution frame to obtain the width of a superposition area, wherein the superposition area is an area where any data column is superposed with an adjacent data column;
and determining the width of the data column width including the overlapped area.
3. The data processing method according to claim 1, wherein before the step of extracting, for any column of the regions, a second preset number of rows of data to be calculated and sending the extracted data to the chip cache when the data processing instruction is received, the method further comprises:
adding a first empty row before the data of the first row aiming at any column of regions in all the regions, and setting the data of the first empty row to be 0;
adding a second empty row after the last row of data, and setting the data of the second empty row to be 0;
the step of extracting the data to be operated of the second preset number of lines and sending the data to be operated to the chip cache comprises the following steps:
and extracting data to be operated of a second preset line number from the first empty line and sending the data to be operated to the chip cache.
4. The data processing method according to claim 1, wherein the step of deleting the first line of data from the chip cache after the first line of data in the data to be operated participates in the convolution operation, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data in the data to be operated, and updating the data to be operated comprises:
when the first line of data in the data to be operated participates in convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated;
alternatively,
and after any convolution operation is carried out, deleting the first row of data from the chip cache, and when any convolution operation is carried out after the first row of data is deleted, extracting the next row of data from the corresponding column area and sending the next row of data to the chip cache to serve as the last row of data of the data to be operated, and updating the data to be operated.
5. The data processing method according to claim 1, wherein the step of performing convolution operation on the updated data to be operated by using the preset convolution kernel until all the line data in the area participate in the convolution operation, and sending all operation results obtained after the convolution operation to the memory comprises:
performing convolution operation on the updated data to be operated by using the preset convolution kernel to obtain a convolution result;
storing the convolution result into the column of the next convolution layer with the same column number as the corresponding column area;
and sending the next convolution layer to the memory.
6. A data processing apparatus, characterized in that the apparatus comprises:
the main control unit is used for receiving the data processing instruction and sending a control command to the rolling cache and calculation unit so as to control the rolling cache to extract data from the memory and control the calculation unit to carry out convolution operation on the extracted data;
the rolling cache is used for acquiring a preset convolution kernel and determining the width of a convolution frame of the preset convolution kernel; acquiring the cache capacity of a chip and a preset data volume, and dividing the cache capacity of the chip by the preset data volume to obtain the maximum value of the number of the data of the rolling cache, wherein the preset data volume represents the data volume of the cache required by carrying out the convolution operation for one time; acquiring a first preset line number, and dividing the maximum value of the data number by the first preset line number to obtain the number of each line of data of the rolling cache; determining the number of each row of data of the rolling cache as a data column width, wherein the data column width is greater than or equal to the width of the convolution frame; after receiving a control command sent by the main control unit, extracting data to be operated of a second preset number of rows for any column area in a plurality of column areas obtained by column division of a data matrix to be processed in the memory according to the data column width; after the first row of data in the data to be operated participates in convolution operation, deleting the first row of data, extracting the next row of data from the corresponding column area to serve as the last row of data of the data to be operated, and updating the data to be operated;
and the calculation unit is used for performing convolution operation on the data to be operated or the updated data to be operated by utilizing the preset convolution kernel after receiving the data to be operated sent by the rolling cache until all the data in the area participate in the convolution operation, and sending all operation results obtained after the convolution operation to the memory.
7. The data processing device of claim 6, wherein the rolling cache is further specifically configured to:
subtracting a preset value from the width of the convolution frame to obtain the width of a superposition area, wherein the superposition area is an area where any data column is superposed with an adjacent data column;
and determining the width of the data column width including the overlapped area.
8. The data processing device of claim 6, wherein the rolling cache is further specifically configured to:
aiming at any column of regions in all the regions, before extracting first row data, adding a first empty row before the first row data, and setting the data of the first empty row as 0;
before the last line of data is extracted, adding a second empty line after the last line of data, and setting the data of the second empty line to be 0;
and extracting the data to be operated of a second preset line number from the first empty line.
9. The data processing device of claim 6, wherein the rolling cache is further specifically configured to:
when the first row of data in the data to be operated participates in convolution operation, deleting the first row of data, extracting the next row of data from the corresponding column area, using the next row of data as the last row of data of the data to be operated, and updating the data to be operated;
alternatively,
and after any convolution operation is carried out, deleting the first row of data from the chip cache, and when any convolution operation is carried out after the first row of data is deleted, extracting the next row of data from the corresponding column area and sending the next row of data to the chip cache to serve as the last row of data of the data to be operated, and updating the data to be operated.
10. The data processing device of claim 6, wherein the computing unit is further configured to:
performing convolution operation on the data to be operated or the updated data to be operated by utilizing the preset convolution kernel to obtain a convolution result;
storing the convolution result into the column of the next convolution layer with the same column number as the corresponding column area;
and sending the next convolution layer to the memory.
11. A data processing apparatus, characterized in that the apparatus comprises:
the first determining module is used for acquiring a preset convolution kernel and determining the width of a convolution frame of the preset convolution kernel;
the first operation submodule is used for acquiring the cache capacity of a chip and a preset data volume, and dividing the cache capacity of the chip by the preset data volume to obtain the maximum value of the number of data cached by the chip, wherein the preset data volume represents the data volume required to be cached for performing one convolution operation;
the second operation submodule is used for acquiring a first preset line number, and dividing the maximum value of the data number by the first preset line number to obtain the number of each line of data cached by the chip;
the determining submodule is used for determining the number of each row of data cached by the chip as the width of a data column, wherein the width of the data column is greater than or equal to the width of the convolution frame;
the dividing module is used for dividing a data matrix to be processed into a plurality of rows according to the data row width to obtain a plurality of rows of areas, wherein the data matrix to be processed is a matrix which is stored in the memory and contains all data to be processed;
the extracting module is used for extracting data to be operated of a second preset line number aiming at any column area in all the column areas and sending the data to be operated to the chip cache when a data processing instruction is received, and performing convolution operation on the cached data to be operated by utilizing the preset convolution kernel, wherein the second preset line number is greater than or equal to the width of the convolution frame and is less than or equal to the first preset line number;
the updating module is used for deleting the first line of data from the chip cache after the first line of data in the data to be operated participates in convolution operation, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated;
and the first operation module is used for performing convolution operation on the updated data to be operated by utilizing the preset convolution kernel until all the row data in the area participate in the convolution operation, and sending all operation results obtained after the convolution operation to the memory.
12. The data processing apparatus of claim 11, further comprising:
the second operation module is used for subtracting a preset value from the width of the convolution frame to obtain the width of a superposition area, wherein the superposition area is an area where any data column is superposed with an adjacent data column;
and the third determining module is used for determining the width of the overlapping area contained in the data column width.
13. The data processing apparatus of claim 11, further comprising:
the first setting module is used for adding, for any one of all the column regions, a first empty row before the first row of data and setting the data of the first empty row to 0;
the second setting module is used for adding a second empty line after the last line of data and setting the data of the second empty line to be 0;
the extraction module is specifically further configured to:
and extracting data to be operated of a second preset line number from the first empty line and sending the data to be operated to the chip cache.
14. The data processing apparatus according to claim 11, wherein the update module is specifically configured to:
when the first line of data in the data to be operated participates in convolution operation, deleting the first line of data from the chip cache, extracting the next line of data from the corresponding column area, sending the next line of data to the chip cache as the last line of data of the data to be operated, and updating the data to be operated;
alternatively,
and after any convolution operation is carried out, deleting the first row of data from the chip cache, and when any convolution operation is carried out after the first row of data is deleted, extracting the next row of data from the corresponding column area and sending the next row of data to the chip cache to serve as the last row of data of the data to be operated, and updating the data to be operated.
15. The data processing apparatus of claim 11, wherein the first operation module further comprises:
the third operation submodule is used for carrying out convolution operation on the updated data to be operated by utilizing the preset convolution kernel to obtain a convolution result;
the storage submodule is used for storing the convolution result into the column of the next convolution layer with the same column number as the corresponding column area;
and the sending submodule is used for sending the next convolution layer to the memory.
CN201710152660.5A 2017-03-15 2017-03-15 Data processing method, equipment and device Active CN108573305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710152660.5A CN108573305B (en) 2017-03-15 2017-03-15 Data processing method, equipment and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710152660.5A CN108573305B (en) 2017-03-15 2017-03-15 Data processing method, equipment and device

Publications (2)

Publication Number Publication Date
CN108573305A CN108573305A (en) 2018-09-25
CN108573305B true CN108573305B (en) 2020-07-24

Family

ID=63575806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710152660.5A Active CN108573305B (en) 2017-03-15 2017-03-15 Data processing method, equipment and device

Country Status (1)

Country Link
CN (1) CN108573305B (en)


Also Published As

Publication number Publication date
CN108573305A (en) 2018-09-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant