CN111832713B - Parallel computing method and computing device based on line buffer Linebuffer - Google Patents


Publication number
CN111832713B
Authority
CN
China
Prior art keywords
template
linebuffer
computing
data
calculation
Prior art date
Legal status
Active
Application number
CN201910317455.9A
Other languages
Chinese (zh)
Other versions
CN111832713A
Inventor
张伟豪
李涵
王封
丁瑞强
Current Assignee
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd
Priority to CN201910317455.9A
Priority to PCT/CN2020/082960 (WO2020211654A1)
Publication of CN111832713A
Application granted
Publication of CN111832713B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a parallel computing method and computing device based on a line buffer (Linebuffer), applied to a template (stencil) computing structure. The method comprises: determining a template computation object; constructing a preset template of the Linebuffer according to the template parameters of the template computation object and the number of computing units; and transmitting template data to the plurality of computing units simultaneously through the preset template of the Linebuffer, each computing unit processing its own computing task in parallel. Because the preset template of the Linebuffer fetches the template data required by all computing units at the same time, the computing units can execute their computations synchronously, which gives higher computing efficiency and a faster computing speed than the conventional scheme.

Description

Parallel computing method and computing device based on line buffer Linebuffer
Technical Field
The invention relates to the field of convolutional neural networks, and in particular to a parallel computing method and computing device based on a line buffer (Linebuffer).
Background
In recent years, with the development of artificial intelligence, convolutional neural networks have found ever wider application, and accelerator architectures designed specifically for convolutional neural networks keep emerging.
At present, a convolutional neural network must be supplied with the data required for each computation it performs. In the traditional approach, either all input data are stored on chip, or the input data are fetched by repeatedly accessing off-chip storage; the first approach increases on-chip storage pressure, the second increases I/O access pressure. A Linebuffer structure is therefore commonly used to buffer intermediate data on chip, but a conventional Linebuffer does not allow its consumers to execute in parallel and in synchrony.
Disclosure of Invention
In view of the foregoing, the present invention provides a Linebuffer-based parallel computing method and computing device that overcome, or at least partially solve, the above problems.
According to one aspect of the present invention, there is provided a parallel computing method based on line buffering Linebuffer, applied to a template computing structure, the method comprising:
determining a template calculation object;
constructing a preset template of the line buffer Linebuffer according to the template parameters of the template computation object and the number of computing units;
and transmitting template data to the plurality of computing units simultaneously through the preset template of the line buffer Linebuffer, each computing unit processing its own computing task in parallel. Because the preset template of the Linebuffer is constructed from the template parameters of the template computation object and the number of computing units, the template data required by all the computing units can be fetched at the same time, so that the plurality of computing units execute their computations synchronously.
Optionally, when the template calculation object is a convolutional neural network, the method includes:
Determining a network layer to be processed in parallel;
Assigning a plurality of computing units to the network layer;
a preset template of the line buffer Linebuffer is constructed according to the template parameters of the network layer and the number of the calculation units;
and simultaneously transmitting template data to the plurality of computing units through a preset template of the line buffer Linebuffer, wherein the template data are original template data defined by the template parameters, and processing the tasks of the network layer in parallel by each computing unit.
When selecting the network layers to be processed in parallel in the convolutional neural network, one or more layers can be chosen from all the network layers according to the computation load of each layer. A preset template of the Linebuffer is then constructed from the template parameters of the selected layer and the number of computing units, and the template data is transmitted to the plurality of computing units based on this preset template, so that each computing unit processes the tasks of that layer in parallel. Compared with the traditional scheme, this gives higher computing efficiency and a faster computing speed.
Optionally, the preset template of Linebuffer is composed of a plurality of original templates with specified sizes, and the number of the original templates is equal to the number of the computing units;
The original templates are connected in sequence within the preset template and at least partially overlap. By combining several original templates into one enlarged preset template, the data required by every computing unit can be fetched at the same time, which enables parallel computation by the plurality of computing units.
Optionally, the transmitting the template data to the plurality of computing units through the preset template of the line buffer Linebuffer simultaneously, where each computing unit processes the task of the network layer in parallel, includes:
The template data of each original template is simultaneously transmitted to the plurality of computing units through the plurality of original templates of the line buffer Linebuffer, and the task of the network layer is processed in parallel by each computing unit.
Optionally, the transmitting the template data of each original template to the plurality of computing units through the plurality of original templates of the line buffer Linebuffer simultaneously, where each computing unit processes the tasks of the network layer in parallel, includes:
Dividing an input feature map of the convolutional neural network into a plurality of data blocks in advance;
Simultaneously acquiring template data required by each computing unit to execute convolution operation based on the plurality of data blocks by using the plurality of original templates, and transmitting the acquired template data to the corresponding computing units;
and continuously moving the plurality of original templates by a preset step length according to a specified direction, simultaneously acquiring new template data required by each calculation unit for currently executing convolution operation after each movement of the plurality of original templates, and transmitting the new template data to the corresponding calculation unit until all the plurality of data blocks are read.
Optionally, the method further comprises:
when the template data currently required by the plurality of computing units is being acquired, acquiring, based on the input feature map, the new template data required by the plurality of computing units for the next template computation;
and storing the new template data into a preset data buffer. While the templates of the preset template are fetching the data each computing unit needs for its convolution computation, the Linebuffer buffer keeps reading the data produced by the preceding data layer; multiple sets of template data can therefore be sent at the same time and multiple consumers can compute on them in parallel, which shortens the time the templates spend acquiring data and further improves computing efficiency.
Optionally, when the data buffer is full, the preset template is moved by a preset step length.
Optionally, the preset step size is p×stride_x, where p represents the number of computing units and stride_x represents the horizontal step size of the original template.
Optionally, the Linebuffer is implemented by a set of registers;
Each original template in the preset template comprises a plurality of registers, which read in, based on the data blocks of the input feature map, the template data required by the corresponding computing unit for each template computation and write it out to that computing unit.
According to another aspect of the present invention, there is also provided a computing device including:
a processor for performing the Linebuffer-based parallel computing method as described in any of the above.
Optionally, the computing device further comprises:
a storage device for storing a computer program that is loaded and executed by the processor when run in the computing device.
The invention provides a more efficient synchronous computing method based on a line buffer (Linebuffer): after the network layer that is to be computed in parallel has been determined, a plurality of computing units are assigned to it, a preset template of the Linebuffer is constructed from the template parameters of that layer and the number of computing units, the template data is transmitted to the plurality of computing units simultaneously through the preset template, and the computation is then executed synchronously by the plurality of computing units.
The foregoing is merely an overview of the technical solution of the present invention. In order that the technical means of the invention may be understood more clearly and implemented in accordance with the content of the specification, and in order to make the above and other objects, features and advantages of the invention more readily apparent, specific embodiments of the invention are set forth below.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments of the invention, read in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
Fig. 1 shows a schematic diagram of the working principle of Linebuffer;
FIG. 2 shows a schematic diagram of a Linebuffer-based convolution calculation;
FIG. 3 shows a schematic diagram of a Linebuffer preserving intermediate results between layers of a convolutional neural network;
FIG. 4 shows a schematic diagram of a Linebuffer implementation using a shift register;
FIG. 5 shows a line feed schematic of Linebuffer shown in FIG. 4;
FIG. 6 illustrates a split schematic of a neural network layer;
FIG. 7 is a schematic diagram of template computation assigned to each computation unit shown in FIG. 6;
FIG. 8 is a schematic diagram showing the computation timing of the computing units under the conventional scheme;
FIG. 9 is a flow chart of a Linebuffer-based parallel computing method according to an embodiment of the present invention;
FIG. 10 shows a schematic diagram of the original template of a conventional Linebuffer;
FIG. 11 shows a schematic diagram of composing a preset template from a plurality of original templates;
FIG. 12 shows a buffer setup diagram of the first embodiment;
fig. 13 shows a schematic diagram of the synchronous calculation time of each calculation unit of the first embodiment;
FIG. 14 shows a buffer arrangement diagram of a second embodiment;
FIG. 15 shows a synchronous Linebuffer operation schematic;
FIG. 16 shows a synchronous Linebuffer movement operation schematic; and
Fig. 17 shows a synchronous Linebuffer linefeed operation schematic.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Linebuffer, also called line buffering, is a technique widely used in template (stencil) computation, and template computation is in turn heavily used in fields such as image processing and artificial intelligence. In general, a Linebuffer reduces both the number of memory accesses and the amount of on-chip storage, and is a common structure in pipelined template computation. The convolution operation in a convolutional neural network is itself a template computation, so the Linebuffer technique is also frequently used in convolution accelerator architectures, which has led to a large number of Linebuffer applications in recent years.
FIG. 1 shows a schematic diagram of the working principle of a Linebuffer. In the figure, the input feature map has a size of 5×5, and a template (window) slides continuously over it; one template computation is performed for each sliding step. The non-white cells (01-21) in FIG. 1 represent the data stored by the Linebuffer, and the dark grey cells (01, 10, 11, 12, 21) represent the template of the current template computation, i.e. the input data involved in this computation. For each template computation, the Linebuffer must provide the computing unit with the data required for that computation. After one template computation is completed, the Linebuffer needs to be updated: new data is read in and data that will never be reused is discarded. In this example, after the first computation the template moves horizontally by a step of 1, the Linebuffer discards data 01 and reads in data 22. Without a Linebuffer, either all input data must be stored on chip, which increases on-chip storage pressure, or off-chip storage must be accessed continually to obtain input data, which increases I/O access pressure. Using a Linebuffer greatly reduces the on-chip storage pressure or the external access pressure. The template in the example of FIG. 1 is cross-shaped; in practice the template may have any shape, and in a typical convolutional neural network the template is preferably rectangular.
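By way of illustration only (this sketch is not part of the patent), the sliding behaviour described above can be modelled in a few lines of Python, assuming a 3×3 rectangular template, stride 1 and the 5×5 input of FIG. 1; a real Linebuffer would hold only the (k-1) full rows plus k extra values that the window can still reach, rather than the whole feature map.

    import numpy as np

    def stencil_scan(feature_map, k=3, stride=1):
        """Yield the window (template data) used by each template computation."""
        h, w = feature_map.shape
        for i in range(0, h - k + 1, stride):
            for j in range(0, w - k + 1, stride):
                # A Linebuffer keeps only (k - 1) * w + k values at a time,
                # discarding the oldest value and reading one new value per step.
                yield (i, j), feature_map[i:i + k, j:j + k]

    fmap = np.arange(25).reshape(5, 5)          # the 5 x 5 input feature map of FIG. 1
    for (i, j), window in stencil_scan(fmap):
        print(f"template at ({i}, {j}):\n{window}")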
FIG. 2 shows a conventional Linebuffer-based convolution computation, with a 3×3 template and a step size of 1. In pipelined template computations such as convolutional neural network computations, a Linebuffer often serves as a buffer between layers, preserving intermediate results at minimal storage cost. In the pipeline, adjacent layers usually operate in a producer-consumer fashion: as soon as the front layer has computed all the data that the back layer needs for one computation, the back layer starts that computation. The Linebuffer sends the template data to the later network layer once it has received all the data required for that computation, and the later layer then starts computing. In the embodiment shown in FIG. 3, Linebuffers mainly realize the data transfer from Layer0 (network layer 0) to Layer1 (network layer 1), from Layer1 to Layer2 (network layer 2), and from Layer2 to Layer3 (network layer 3).
In hardware, a Linebuffer may be implemented with a block of memory or a set of registers; FIG. 4 shows a schematic diagram of a Linebuffer built from shift registers, taking the Linebuffer of FIG. 2 as an example. Within a row, for each template computation the registers shift left by 1 position (the horizontal stride of the template) along the black line: register R00 discards one value and register R22 reads in a new value. Registers R00 through R22 output the data contained in the template. At each line feed, the registers shift left by 3 positions (the horizontal width of the template) and read in three new values, as shown in FIG. 5.
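The shift-register organisation of FIGS. 4 and 5 can likewise be sketched in software; the following model is illustrative only (register capacity and naming follow the figures, the rest is assumed), and it does not model the line-feed special case.

    from collections import deque

    class ShiftRegisterLinebuffer:
        """Rough model of the register chain R00 ... R22 of FIG. 4."""

        def __init__(self, row_width, k=3):
            self.row_width, self.k = row_width, k
            self.capacity = (k - 1) * row_width + k   # (k - 1) full rows plus k values
            self.regs = deque(maxlen=self.capacity)   # oldest value falls off when full

        def push(self, value):
            # One horizontal step of stride 1: discard one value, read one new value.
            self.regs.append(value)

        def ready(self):
            return len(self.regs) == self.capacity

        def template(self):
            # The k x k values currently output to the computing unit.
            data = list(self.regs)
            return [data[r * self.row_width : r * self.row_width + self.k]
                    for r in range(self.k)]

    lb = ShiftRegisterLinebuffer(row_width=5, k=3)
    for value in range(15):                  # stream values from the producing layer
        lb.push(value)
        if lb.ready():
            print(lb.template())             # one 3 x 3 template per horizontal step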
In pipelined template computation such as a convolutional neural network, the computation loads of different layers may differ greatly, so that faster layers often have to wait for slower ones, and the slow layers become the bottleneck of the whole network computation.
In this case, the slow layers can be parallelized. Taking Layer1 in FIG. 3 as an example, assume Layer0 is not parallelized and is computed on a single computing unit, while Layer1 is split into three parts that are assigned to computing unit 1, computing unit 2 and computing unit 3 and computed in parallel on the three units, as shown in FIG. 6.
In this case, the three computing units share the convolution computations of Layer1 equally. Assume the allocation is as shown in FIG. 7: Layer1 needs to perform 9 template computations in total, and each computing unit is responsible for three of them. Each template computation is denoted stencil[i][j], and the data it requires is denoted data[i][j].
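One allocation consistent with the first-row walk-through below, written out for concreteness (the modulo rule is an assumption, since FIG. 7 is not reproduced here), is:

    # Assumed allocation: within each output row, stencil[i][j] is handled by
    # computing unit (j % p) + 1, so each of the p = 3 units handles 3 of the 9 stencils.
    p = 3
    allocation = {(i, j): (j % p) + 1 for i in range(3) for j in range(3)}
    print(allocation)   # stencil[0][0] -> unit 1, stencil[0][1] -> unit 2, stencil[0][2] -> unit 3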
After Layer1 is split, the Linebuffer must provide data to each computing unit; take the first row as an example. When Layer0 has computed data 00 through data 22, the Linebuffer sends the data computed by Layer0, i.e. data[0][0], to computing unit 1, which starts computing stencil[0][0]. At this moment, however, computing units 2 and 3 cannot start, because data 23 and data 24 have not yet been computed by Layer0. When Layer0 completes one more template computation, the Linebuffer obtains data 23, updates once, and sends data[0][1] to computing unit 2, which starts computing stencil[0][1]. Similarly, after the Linebuffer obtains data 24, it sends data[0][2] to computing unit 3, which starts computing stencil[0][2]. Clearly the three computing units cannot start at the same time, i.e. they are not synchronized. Assume the time for Layer0 to compute one template, i.e. to produce one value, is S_0, and the time for Layer1 to compute one template is S_1, ignoring the time the Linebuffer spends reading, updating and sending data. This process is shown in FIG. 8.
It can be seen that computing unit 2 must wait S_0 before it starts computing, while computing unit 3 must wait 2S_0. The computations of the three computing units are unsynchronized and will remain unsynchronized in all subsequent computations. If the underlying hardware architecture is a strongly synchronized one, such asynchronous operation causes great difficulty for algorithm scheduling, and the hardware may not support such operation at all.
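The start-time offsets can be made explicit with a small illustrative calculation (the numeric value of S_0 below is an arbitrary assumption for the example):

    # Under the conventional Linebuffer, unit i must wait for i additional producer
    # outputs before its first template computation can begin.
    S0 = 1.0                                    # time for Layer0 to produce one value
    p = 3                                       # number of parallel computing units
    first_start = [i * S0 for i in range(p)]    # unit 1 at 0, unit 2 at S0, unit 3 at 2*S0
    print(first_start)                          # [0.0, 1.0, 2.0] -- the offsets never disappear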
One could let the three computing units negotiate among themselves to synchronize, but this would certainly increase the communication cost between them and also complicate the synchronization logic.
The embodiment of the invention provides a Linebuffer-based parallel computing method that can be applied to template computation, giving the Linebuffer the ability to regulate synchronization so that the consumers of the Linebuffer can compute synchronously. Optionally, the method may comprise: first, determining a template computation object; second, constructing a preset template of the Linebuffer according to the template parameters of the template computation object and the number of computing units; and finally, transmitting the template data to the plurality of computing units simultaneously through the preset template of the Linebuffer, each computing unit processing its own computing task in parallel. Taking a convolutional neural network as the template computation object as an example, and referring to FIG. 9, the Linebuffer-based parallel computing method provided by the embodiment of the invention may comprise:
step S901, determining a network layer to be processed in parallel;
Step S902, allocating a plurality of computing units to the network layer. Taking fig. 6 as an example, three calculation units, namely calculation unit 1, calculation unit 2 and calculation unit 3, are allocated to Layer 1. When the network layers to be processed in parallel are selected from the convolutional neural network, one or more network layers can be selected from all network layers according to the calculated amount of each network layer, and the method is not limited.
Step S903, a preset template of the line buffer Linebuffer is constructed according to the template parameters of the network layer and the number of computing units.
In step S904, template data is simultaneously transmitted to the plurality of computing units through the preset template of the line buffer Linebuffer, and the task of the network layer is processed in parallel by each computing unit, where the template data is original template data defined by the template parameters.
FIG. 10 shows a schematic diagram of the conventional Linebuffer technique, expanding the input feature map relative to FIG. 6. Analysing FIG. 10 together with FIG. 6 shows that the conventional scheme would use, for each computing unit in turn, a single 3×3 template with a step size of 1.
In the embodiment of the invention, the template data may be transmitted simultaneously, through the preset template of the Linebuffer, to the plurality of computing units allocated to the network layer that requires parallel processing. Optionally, the preset template in the embodiment of the invention consists of a plurality of original templates of specified size, and the template data needed by a computing unit to perform its convolution operation lies within one original template. The number of original templates equals the number of computing units, and the original templates are connected in sequence within the preset template and at least partially overlap. The original templates may have the same size or different sizes; the invention is not limited in this respect.
That is, whereas the conventional scheme uses a single original template to fetch, one after another, the data each computing unit needs for its convolution computation, the embodiment of the invention combines a plurality of original templates into one enlarged preset template, so that all computing units obtain the data they need at the same time, thereby realizing parallel computation by the plurality of computing units. In practice, the computing templates of the Linebuffer are preferentially expanded in the horizontal direction.
Assume the consumer parallelism of the Linebuffer is p; then p templates that are consecutive in the horizontal direction form a new template, referred to as a large template (i.e. the preset template in the above embodiment). The horizontal step size of the large template is p×stride_x. FIG. 11 shows this process for p = 3. Referring to FIG. 11, each original template is rectangular with size 3×3; by expanding in the horizontal direction, three consecutive original templates form one large template, and these three original templates may have overlapping portions. The step S902 may further include: transmitting the template data of each original template simultaneously to the plurality of computing units through the plurality of original templates of the Linebuffer, each computing unit processing the tasks of the network layer in parallel. Optionally, this specifically includes:
S902-1, dividing the input feature map of the convolutional neural network into a plurality of data blocks in advance, for example the 8×6 data blocks shown in FIG. 11;
S902-2, simultaneously acquiring, by means of the plurality of original templates, the template data required by each computing unit to execute its convolution operation, and transmitting the acquired template data to the corresponding computing units for computation.
During operation, the Linebuffer extracts p sets of original-template data from the data contained in the large template and sends them to the p computing units for template computation.
After step S902-2, the method may further include: S902-3, continuously moving the plurality of original templates in a specified direction by a preset step length; after each movement, simultaneously acquiring, based on the plurality of data blocks, the new template data required by each computing unit for its current convolution operation and transmitting it to the corresponding computing unit, until the plurality of data blocks have all been read (a sketch of this scanning pattern is given below).
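The scanning pattern of steps S902-1 to S902-3 can be sketched as follows; the sizes are assumptions chosen to match FIG. 11 (p = 3 original 3×3 templates, horizontal stride 1, a 6-row by 8-column input), and the per-unit computation is replaced by a simple sum:

    import numpy as np

    def preset_template_scan(feature_map, k=3, p=3, stride_x=1, stride_y=1):
        """Yield, per position of the large template, the p overlapping original templates."""
        h, w = feature_map.shape
        big_w = k + (p - 1) * stride_x                        # width covered by the large template
        for i in range(0, h - k + 1, stride_y):
            for j in range(0, w - big_w + 1, p * stride_x):   # large template moves by p * stride_x
                yield [feature_map[i:i + k, j + u * stride_x : j + u * stride_x + k]
                       for u in range(p)]                     # one original template per computing unit

    fmap = np.arange(48).reshape(6, 8)                        # 6 x 8 data blocks, as in FIG. 11
    for templates in preset_template_scan(fmap):
        results = [t.sum() for t in templates]                # stand-in for the p parallel convolutions
        print(results)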
In an optional embodiment of the invention, while the template data currently required by the plurality of computing units is being acquired, the new template data that the plurality of computing units will need for the next template computation may also be acquired from the input feature map and stored in a preset data buffer; when the data buffer is full, the preset template moves by the preset step length. The hatched portions (data blocks 25, 26, 27) of FIG. 12 are the buffer.
That is, in the embodiment of the invention, p×stride_x buffer entries may be added at the tail of the Linebuffer. While the templates of the preset template are supplying the data that each computing unit needs for its convolution computation, the Linebuffer buffer keeps reading the data produced by the preceding layer, which shortens the time needed to fetch data into the templates and thus improves computing efficiency. When the Linebuffer buffer is full, the Linebuffer performs one movement of the preset template comprising the plurality of original templates. Adding the Linebuffer buffer makes it possible to send multiple sets of template data at the same time, so that multiple consumers can compute in parallel simultaneously. As shown in FIG. 13, once the Linebuffer has obtained all the data of the first large template, it sends three sets of template data to the three computing units at the same time, and the three computing units can start computing synchronously. Meanwhile the Linebuffer keeps receiving the data produced by Layer0 and stores it in the buffer. When the buffer is full, the Linebuffer sends the next three sets of template data, and each computing unit starts its next round of computation as soon as it receives them.
In the embodiment shown in FIG. 13, 3S_0 > S_1, so each computing unit waits for a while after finishing a round of computation. If the degree of parallelism exactly equals the ratio of the back-layer computation time to the front-layer computation time (i.e. p×S_0 = S_1), the back-layer computing units do not need to wait and can start computing immediately. In that case, ignoring the overhead of the Linebuffer and of communication and control, the computation utilization of both layers is 100%, and all parallel computations start synchronously. In summary, the synchronous Linebuffer achieves synchronized parallel computing at the cost of occupying somewhat more storage.
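A small illustrative calculation of this trade-off (the values of S_0 and S_1 are assumed for the example only):

    # One round of the back layer needs p new producer values (p * S0) while one
    # template computation takes S1, so the back layer is busy S1 out of max(S1, p * S0).
    S0, S1 = 1.0, 2.5
    for p in (2, 3):
        utilization = S1 / max(S1, p * S0)
        print(f"p = {p}: back-layer utilization = {utilization:.0%}")
    # With these numbers p = 2 keeps the back layer fully busy (the front layer idles),
    # while p = 3 makes the back layer wait; both layers reach 100% only when p * S0 == S1.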
The Linebuffer-based parallel computing method of the above embodiment suits the case where the control granularity is fine. In practice the underlying hardware can sometimes only provide coarser-grained control, for example control in units of rows, in which case the Linebuffer uses row pipelining. The Linebuffer also supports the parallel method when the control granularity is several rows, i.e. multi-row pipelining. In that case the buffer of the Linebuffer becomes stride_y rows, i.e. it buffers stride_y rows of data, as shown in FIG. 14.
Optionally, when performing Linebuffer-based computation, the embodiment of the invention implements the Linebuffer with a set of registers; each original template in the preset template comprises a plurality of registers, which read in, based on the data blocks of the input feature map, the template data required by the corresponding computing unit for each template computation and write it out to that computing unit. Optionally, one register may correspond to one data block.
As shown in FIG. 15, registers R00 to R24 are similar in part to those of FIGS. 4 and 5; they form three 3×3 templates, which are sent to the three computing units respectively. While the computing units compute on their templates, the synchronous Linebuffer keeps acquiring new template data via read-in 2, so that each computing unit can perform its template computation. At the same time the synchronous Linebuffer keeps reading new data (data 25, 26, 27 in FIG. 12) via read-in 1 and stores it in a buffer consisting of three shift registers B00, B01 and B02. A write controller directs the data arriving from read-in 1, writing it cyclically to B00, B01, B02 in turn.
When the buffer is full, the Linebuffer can perform one movement of the large template: all shift registers in the Linebuffer (including the buffer) shift left by 3 positions, and the Linebuffer reaches the state shown in FIG. 16. Registers R00 to R24 then send new templates to the computing units, and the buffers B00, B01, B02 wait for the new data 30, 31, 32 to be read in.
The Linebuffer then reaches the position where a line feed is required. It performs a line-feed operation: all registers shift left by 3 positions and 3 new values are read in. The Linebuffer then reaches the state shown in FIG. 17, and the buffers B00, B01, B02 wait for the new data 35, 36, 37 to be read in.
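The register-level behaviour of FIGS. 15 to 17 can be modelled roughly as below; this is a sketch under assumed parameters (p = 3, stride_x = 1, a row width of 8) with register naming that mirrors the figures, and line feeds are not modelled:

    from collections import deque

    ROW_W, K, P = 8, 3, 3                       # row width, template size, parallelism
    MAIN = (K - 1) * ROW_W + K + (P - 1)        # values visible to registers R00 ... R24
    chain = deque()                             # main registers followed by buffer B00 ... B02

    def issue_templates():
        data = list(chain)[:MAIN]               # contents of R00 ... R24
        return [[data[r * ROW_W + u : r * ROW_W + u + K] for r in range(K)]
                for u in range(P)]              # template u goes to computing unit u + 1

    for value in range(24):                     # stand-in for the values Layer0 produces
        chain.append(value)                     # new data enters via the B00/B01/B02 buffer
        if len(chain) == MAIN + P:              # buffer full -> one movement of the large template
            for tpl in issue_templates():
                print(tpl)                      # the three templates are sent simultaneously
            for _ in range(P):
                chain.popleft()                 # shift the whole chain left by p * stride_x = 3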
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including: a processor configured to perform the parallel computing method based on the line buffer Linebuffer according to any one of the above embodiments. In addition, the computing device may further include: a storage device for storing a computer program that is loaded and executed by the processor when run in the computing device.
The embodiment of the invention provides a more efficient Linebuffer-based synchronous computing method. For a neural network, once the network layer that needs to be computed in parallel has been determined, a plurality of computing units are assigned to it, a preset template of the Linebuffer is constructed from the template parameters of that layer and the number of computing units, the template data is transmitted to the plurality of computing units simultaneously through the preset template, and the computation is then executed by the plurality of computing units in parallel. The method provided by the embodiment of the invention can be implemented on the most common storage structures, such as a register file or a RAM. This Linebuffer-based synchronous computing method solves the loss of synchrony that arises when algorithms such as neural network algorithms and multi-step image processing algorithms are split for parallel execution, so the synchronous Linebuffer can be widely applied in hardware architectures such as many-core neural network accelerator architectures and many-core image processor architectures, and is especially suitable for hardware architectures that require strong synchronization.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
By now it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been shown and described herein in detail, many other variations or modifications of the invention consistent with the principles of the invention may be directly ascertained or inferred from the present disclosure without departing from the spirit and scope of the invention. Accordingly, the scope of the present invention should be understood and deemed to cover all such other variations or modifications.

Claims (6)

1. A parallel computing method based on line buffering Linebuffer, applied to a template computing structure, the template computing structure being a convolutional neural network, the method comprising:
Determining a network layer to be processed in parallel;
Assigning a plurality of computing units to the network layer;
A preset template of the line buffer Linebuffer is constructed according to the template parameters of the network layer and the number of the computing units; the preset template of the Linebuffer consists of a plurality of original templates of specified sizes, the number of the original templates being equal to the number of the computing units; and the original templates are sequentially connected in the preset template and at least partially overlap;
Transmitting template data to the plurality of computing units through a preset template of the line buffer Linebuffer at the same time, and processing tasks of the network layer in parallel by each computing unit, wherein the template data is original template data defined by the template parameters;
The transmitting the template data to the plurality of computing units through the preset template of the line buffer Linebuffer at the same time, and the parallel processing of the tasks of the network layer by each computing unit includes: transmitting template data of a plurality of original templates to a plurality of computing units through the plurality of original templates of the line buffer Linebuffer at the same time, and processing tasks of the network layer in parallel by each computing unit;
The transmitting the template data of the plurality of original templates to the plurality of computing units through the plurality of original templates of the line buffer Linebuffer simultaneously, and the parallel processing of the tasks of the network layer by each computing unit includes: dividing an input feature map of the convolutional neural network into a plurality of data blocks in advance; simultaneously acquiring template data required by each computing unit to execute convolution operation based on the plurality of data blocks by using the plurality of original templates, and transmitting the acquired template data to the corresponding computing units; and continuously moving the plurality of original templates by a preset step length according to a specified direction, simultaneously acquiring new template data required by each calculation unit for currently executing convolution operation after each movement of the plurality of original templates, and transmitting the new template data to the corresponding calculation unit until all the plurality of data blocks are read.
2. The method of claim 1, wherein the method further comprises:
When the template data required by the plurality of computing units to execute the current execution are acquired, acquiring new template data required by the plurality of computing units to execute the next template computation based on the input feature map;
and storing the new template data into a preset data buffer area.
3. The method of claim 2, wherein the preset template is moved by a preset step size when the data buffer is full.
4. A method according to any one of claims 1-3, wherein said Linebuffer is implemented by a set of registers;
each original template in the preset template comprises a plurality of registers, which read in, based on the data blocks of the input feature map, the template data required by the corresponding computing unit for each template computation and write it out to that computing unit.
5. A computing device, comprising:
a processor configured to perform the parallel computing method based on line buffering Linebuffer as claimed in any one of claims 1 to 4.
6. The computing device of claim 5, wherein the computing device further comprises:
a storage device for storing a computer program that is loaded and executed by the processor when run in the computing device.
CN201910317455.9A 2019-04-19 2019-04-19 Parallel computing method and computing device based on line buffer Linebuffer Active CN111832713B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910317455.9A CN111832713B (en) 2019-04-19 2019-04-19 Parallel computing method and computing device based on line buffer Linebuffer
PCT/CN2020/082960 WO2020211654A1 (en) 2019-04-19 2020-04-02 Linebuffer-based parallel computing method and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910317455.9A CN111832713B (en) 2019-04-19 2019-04-19 Parallel computing method and computing device based on line buffer Linebuffer

Publications (2)

Publication Number Publication Date
CN111832713A CN111832713A (en) 2020-10-27
CN111832713B 2024-06-18

Family

ID=72838012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910317455.9A Active CN111832713B (en) 2019-04-19 2019-04-19 Parallel computing method and computing device based on line buffer Linebuffer

Country Status (2)

Country Link
CN (1) CN111832713B (en)
WO (1) WO2020211654A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764182A (en) * 2018-06-01 2018-11-06 阿依瓦(北京)技术有限公司 A kind of acceleration method and device for artificial intelligence of optimization

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102842048A (en) * 2011-06-20 2012-12-26 苏州科雷芯电子科技有限公司 Hardware implementation method of related parallel computation of groups in image recognition
CN104346622A (en) * 2013-07-31 2015-02-11 富士通株式会社 Convolutional neural network classifier, and classifying method and training method thereof
TWI634490B (en) * 2016-11-14 2018-09-01 美商耐能股份有限公司 Convolution operation device and convolution operation method
KR102642853B1 (en) * 2017-01-05 2024-03-05 한국전자통신연구원 Convolution circuit, application processor having the same, and operating methoe thereof
CN108229645B (en) * 2017-04-28 2021-08-06 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing method and device, electronic equipment and storage medium
CN107862650B (en) * 2017-11-29 2021-07-06 中科亿海微电子科技(苏州)有限公司 Method for accelerating calculation of CNN convolution of two-dimensional image
CN108182471B (en) * 2018-01-24 2022-02-15 上海岳芯电子科技有限公司 Convolutional neural network reasoning accelerator and method
CN108388537B (en) * 2018-03-06 2020-06-16 上海熠知电子科技有限公司 Convolutional neural network acceleration device and method
CN109165728B (en) * 2018-08-06 2020-12-18 浪潮集团有限公司 Basic computing unit and computing method of convolutional neural network


Also Published As

Publication number Publication date
WO2020211654A1 (en) 2020-10-22
CN111832713A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
US10990650B1 (en) Reducing computations for data including padding
US20200327079A1 (en) Data processing method and device, dma controller, and computer readable storage medium
WO2020073211A1 (en) Operation accelerator, processing method, and related device
CN111310904A (en) Apparatus and method for performing convolutional neural network training
US11734788B2 (en) Task execution in a SIMD processing unit with parallel groups of processing lanes
JP2010527194A (en) Dynamic motion vector analysis method
CN111028360B (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
EP3844610B1 (en) Method and system for performing parallel computation
WO2019184888A1 (en) Image processing method and apparatus based on convolutional neural network
CN112905530A (en) On-chip architecture, pooled computational accelerator array, unit and control method
US20140225902A1 (en) Image pyramid processor and method of multi-resolution image processing
CN107909537A (en) A kind of image processing method and mobile terminal based on convolutional neural networks
CN112799599A (en) Data storage method, computing core, chip and electronic equipment
JP2022137247A (en) Processing for a plurality of input data sets
CN109726798B (en) Data processing method and device
JP2013008270A (en) Parallel arithmetic unit and microcomputer
CN106934757B (en) Monitoring video foreground extraction acceleration method based on CUDA
CN111832713B (en) Parallel computing method and computing device based on line buffer Linebuffer
JPWO2020003345A1 (en) Arithmetic processing unit
JP7278150B2 (en) Image processing device, imaging device, image processing method
CN110008436B (en) Fast Fourier transform method, system and storage medium based on data stream architecture
US6771271B2 (en) Apparatus and method of processing image data
CN112016665B (en) Method and device for calculating running time of neural network on processor
CN111860772A (en) Device and method for executing artificial neural network posing operation
JP2008102599A (en) Processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant