CN110955530A - Deep learning engine parallel processing data method, device, equipment and storage medium - Google Patents

Deep learning engine parallel processing data method, device, equipment and storage medium

Info

Publication number
CN110955530A
Authority
CN
China
Prior art keywords
data sets
node
sets
engines
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010114094.0A
Other languages
Chinese (zh)
Inventor
李远超 (Li Yuanchao)
蔡权雄 (Cai Quanxiong)
牛昕宇 (Niu Xinyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Corerain Technologies Co Ltd
Original Assignee
Shenzhen Corerain Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Corerain Technologies Co Ltd filed Critical Shenzhen Corerain Technologies Co Ltd
Priority to CN202010114094.0A priority Critical patent/CN110955530A/en
Publication of CN110955530A publication Critical patent/CN110955530A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

Embodiments of the present application disclose a method, apparatus, device and storage medium for parallel processing of data by a deep learning engine. The method for parallel processing of data by a deep learning engine includes: acquiring a plurality of stored data sets, parameter sets and bias sets; splitting the plurality of data sets into a plurality of node data sets; preprocessing the plurality of node data sets based on the parameter set; and computing the plurality of node data sets simultaneously, based on a plurality of compute engines, according to the parameter set and the bias set, so as to output a plurality of computation results in parallel. By introducing multi-engine acceleration into a hardware dataflow architecture, the method reduces wasted device resources, speeds up hardware-accelerated computation of the AI data stream, and provides multi-input, multi-output hardware acceleration of the data flow for data sets.

Description

Deep learning engine parallel processing data method, device, equipment and storage medium
Technical Field
The present invention relates to the technical field of deep learning computation, and in particular to a method, apparatus, device and storage medium for parallel processing of data by a deep learning engine.
Background
In recent years, Artificial Intelligence (AI) has advanced in successive waves, driven by the growth of device computing power and deep learning network structures. When floating-point arithmetic is used for the computation of an entire network, a heavy computational load is placed on the Central Processing Unit (CPU). Network computing capacity can be increased if the data are converted from floating-point to fixed-point values and the fixed-point computations are processed in parallel on a hardware device.
However, a fixed processing-flow structure for the data flow greatly compromises the flexibility of network processing. Most AI accelerators on the market rely on instruction sets, adapting the acceleration configuration of each network in the form of instructions. In AI dataflow hardware acceleration, the reference data and reference parameters are fed into a hard-wired engine, which computes the actual result. In the related art, a fixed dataflow processing structure wastes resources when applied to different devices, a dataflow engine can only process a single data set, and multiple input data must be queued for processing.
Disclosure of Invention
The present application provides a method, apparatus, device and storage medium for parallel processing of data by a deep learning engine, so as to realize hardware-accelerated computation of AI data streams.
In an embodiment, an embodiment of the present application provides a method for parallel processing data by a deep learning engine, where the method includes:
acquiring a plurality of stored data sets, parameter sets and bias sets;
splitting the plurality of data sets into a plurality of node data sets;
preprocessing the plurality of node datasets based on the parameter set;
and simultaneously computing a plurality of node data sets according to the parameter set and the bias set based on a plurality of computing engines so as to output a plurality of computing results in parallel.
Optionally, the parameter set is used to adjust the weights of the multiple data sets in the deep learning model, and the bias set is used to adjust the error between the data in the deep learning model and the actual data.
Optionally, the performing, based on a plurality of computing engines, simultaneous computation on the plurality of node data sets according to the parameter set and the bias set to output a plurality of computation results in parallel includes:
acquiring the number of calculation engines;
and simultaneously calculating the plurality of node data sets according to the number of the calculation engines and a preset rule according to the parameter set and the bias set so as to output a plurality of calculation results in parallel.
Optionally, the preset rule includes:
judging whether the number of the computing engines is larger than the number of the node data sets;
in response to a determination that the number of compute engines is greater than the number of node datasets, using the same number of compute engines as the number of node datasets.
Optionally, after the determining whether the number of computing engines is greater than the number of node data sets, the method further includes:
in response to a determination that the number of compute engines is less than the number of node datasets, using the same number of node datasets as the number of compute engines;
continuing to execute the judgment whether the number of the computing engines is larger than the number of the node data sets; in response to a determination that the number of compute engines is greater than the number of node datasets, using the same number of compute engines as the number of node datasets;
until all the node data sets are calculated.
Optionally, after the simultaneously computing the multiple node data sets according to the parameter set and the bias set based on the multiple computing engines to output multiple computing results in parallel, the method further includes: adjusting and storing the plurality of calculation results, wherein the adjustment comprises one of the following: adjusting the output position; adjusting the data structure.
In an embodiment, an embodiment of the present application further provides an apparatus for parallel processing data by a deep learning engine, where the apparatus includes:
a storage module configured to obtain a plurality of stored data sets, parameter sets, and bias sets;
a scheduling module configured to split the plurality of data sets into a plurality of node data sets;
a processing module configured to pre-process the plurality of node datasets based on the parameter set;
and the calculation module is arranged for simultaneously calculating a plurality of node data sets according to the parameter set and the bias set based on a plurality of calculation engines so as to output a plurality of calculation results in parallel.
In an embodiment, an embodiment of the present application further provides an apparatus, including: one or more processors;
a storage device arranged to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the methods described above.
In an embodiment, the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program including program instructions, which when executed by a processor, implement the method as described in any one of the above.
Embodiments of the present application disclose a method, apparatus, device and storage medium for parallel processing of data by a deep learning engine. The method for parallel processing of data by a deep learning engine includes: acquiring a plurality of stored data sets, parameter sets and bias sets; splitting the plurality of data sets into a plurality of node data sets; preprocessing the plurality of node data sets based on the parameter set; and computing the plurality of node data sets simultaneously, based on a plurality of compute engines, according to the parameter set and the bias set, so as to output a plurality of computation results in parallel. By introducing multi-engine acceleration into a hardware dataflow architecture, the method reduces wasted device resources, speeds up hardware-accelerated computation of the AI data stream, and provides multi-input, multi-output hardware acceleration of the data flow for data sets.
Drawings
FIG. 1 is a schematic flow chart of a method for parallel processing data by a deep learning engine in an embodiment of the present application;
FIG. 2 is a schematic flow chart of another method for parallel processing data by a deep learning engine in the embodiment of the present application;
FIG. 3 is a schematic structural diagram of an apparatus for parallel processing data by a deep learning engine in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a computer device in an embodiment of the present application.
Detailed Description
The present application will be described with reference to the accompanying drawings and examples. The embodiments described herein are merely illustrative and are not intended to limit the present application. For the purpose of illustration, only some, but not all, of the structures associated with the present application are shown in the drawings.
Some example embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. Further, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.
Furthermore, the terms "first," "second," and the like may be used herein to describe various orientations, actions, steps, elements, or the like, but the orientations, actions, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element. The terms "first", "second", etc. are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means at least two, e.g., two, three, etc., unless expressly defined otherwise.
Example one
Fig. 1 is a flowchart of a method for parallel processing of data by a deep learning engine according to an embodiment of the present application. The method is applicable where multiple engines are used to process multiple data sets simultaneously, and includes steps 100 to 130.
Step 100, a plurality of stored data sets, parameter sets and bias sets are obtained.
In this embodiment, during the deep learning process the neural network is trained on a large amount of data sets until it reaches the required model form. A data set is a collection of a large amount of data, typically input to the neural network in the form of a matrix; the input data may be an image, which is represented in a computer as a large three-dimensional array of numbers. Illustratively, an image that is 248 pixels wide and 400 pixels high, with three color channels of red, green and blue (RGB), consists of 248 × 400 × 3 numbers, i.e. 297,600 numbers in total. Each number is an integer ranging from 0 (black) to 255 (white). The parameter set and the bias set are parameters derived by the neural network model through computation and inverse adjustment (back-propagation) over a large amount of training data, and their values are continuously updated as the neural network is trained. In this embodiment, before the data sets are computed, a plurality of data sets and their corresponding parameter sets and bias sets are obtained from the neural network; once obtained, they are ready to be input to the compute engines for computation of the data sets.
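As a rough illustration of the data layout described above (a hypothetical sketch, not taken from the patent), the 248 × 400 RGB image can be built as a NumPy array as follows:

    import numpy as np

    # Hypothetical example: a 248-pixel-wide, 400-pixel-high RGB image,
    # stored as a three-dimensional array of integers in the range 0..255.
    height, width, channels = 400, 248, 3
    image = np.random.randint(0, 256, size=(height, width, channels), dtype=np.uint8)

    print(image.size)                # 297600 numbers in total (248 * 400 * 3)
    print(image.min(), image.max())  # each value lies between 0 (black) and 255 (white)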
In other alternative embodiments, the parameter set is used to adjust the weights of the plurality of data sets in the deep learning model, and the bias set is used to adjust the error between the data in the deep learning model and the actual data; for example, the bias set may adjust the error between the target value and the actual value in the deep learning model.
Multiple engines can share the parameter sets and bias sets, i.e. different data sets can use the same parameter set and bias set simultaneously. In a deep learning neural network model, each neuron receives the outputs of a plurality of preceding-layer neurons during computation, and each preceding-layer output is multiplied by a weight coefficient in a linear transformation; these weight coefficients form the parameter set, which is continuously updated and refined as the network is trained so that the model requirements are better met. The bias set corrects the inherent error that may arise between the data in the neural network model and the actual data; after adjustment by the parameter set and the bias set, the data set is passed through the activation function to produce a computation result.
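The roles of the parameter set (weight coefficients) and the bias set described above can be summarized in a minimal sketch; the array shapes and the ReLU activation below are illustrative assumptions, not details taken from the patent:

    import numpy as np

    def layer(inputs, weights, bias):
        # Each neuron multiplies the preceding-layer outputs by the shared
        # parameter set (weights), corrects them with the bias set, then applies
        # an activation function (ReLU is assumed here for illustration).
        return np.maximum(0.0, weights @ inputs + bias)

    x = np.random.rand(128)      # outputs of 128 preceding-layer neurons
    W = np.random.rand(64, 128)  # parameter set: one weight per connection
    b = np.random.rand(64)       # bias set: corrects the inherent error
    y = layer(x, W, b)           # result passed on through the network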
And 110, splitting the plurality of data sets into a plurality of node data sets.
In this embodiment, when a large number of data sets, parameter sets, and bias sets are input to the hardware, the compute engines may adjust the dimensionality, shape, and storage location of the data according to the number of vectors in the data sets and the number of compute engines, and split the multiple data sets into multiple node data sets. These node data sets are then computed and processed simultaneously on multiple compute engines. Splitting the data sets ensures that as many compute engines as possible process data at the same time, which increases the processing speed of the data and lets the compute engines process the data sets with the highest efficiency.
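A minimal sketch of the splitting step, under the assumption that the data sets arrive as one array and are simply partitioned across the available engines (the shapes and the use of np.array_split are illustrative choices, not mandated by the patent):

    import numpy as np

    def split_into_node_data_sets(data_sets, num_engines):
        # Split the input into at most `num_engines` node data sets so that
        # as many compute engines as possible can work at the same time.
        return np.array_split(data_sets, min(num_engines, len(data_sets)))

    data_sets = np.random.rand(1000, 3072)            # assumed: 1000 samples, 3072 features
    node_data_sets = split_into_node_data_sets(data_sets, num_engines=8)
    print([nds.shape[0] for nds in node_data_sets])   # eight roughly equal shards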
And 120, preprocessing the plurality of node data sets based on the parameter sets.
In this embodiment, preprocessing of the data set can take various forms, such as mean subtraction, normalization, Principal Component Analysis (PCA), and whitening; mean subtraction is taken as the example here. Mean subtraction is the most common form of preprocessing. It subtracts the mean of each feature in the data and has a geometric interpretation: it centers the data cloud around the origin along every dimension. On the data, the operation is implemented as X -= np.mean(X, axis=0), where np.mean() is the averaging function and axis=0 averages over the columns, returning one mean per feature. In one embodiment, for an image, a single value is usually subtracted from all pixels for convenience (e.g. X -= np.mean(X)), or this is done separately for the three color channels. Preprocessing the data set with the parameter set normalizes the pixels of each datum and ensures fast and accurate processing of the data.
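A minimal sketch of the mean-subtraction preprocessing described above, using the same NumPy calls as the text (the data shape is an assumption for illustration):

    import numpy as np

    X = np.random.rand(1000, 3072)   # assumed data set: 1000 samples, 3072 features

    # Per-feature mean subtraction: np.mean(X, axis=0) averages each column,
    # giving one mean per feature; subtracting it centers the data cloud
    # around the origin along every dimension.
    X_centered = X - np.mean(X, axis=0)

    # For images, a single scalar mean is often subtracted from all pixels
    # instead (or one mean per color channel):
    X_simple = X - np.mean(X)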
And step 130, simultaneously calculating a plurality of node data sets according to the parameter sets and the bias sets based on a plurality of calculation engines so as to output a plurality of calculation results in parallel.
In this embodiment, a plurality of compute engines compute a plurality of node data sets simultaneously. In the computation of the neural network, each neuron receives the outputs of a plurality of preceding-layer neurons, multiplies them by the corresponding parameter set, corrects the node data with the bias set, computes the node data according to the activation function of the neural network, and outputs a corresponding computation result. In one embodiment, the multiple compute engines share the parameter set and the bias set and can process multiple data paths simultaneously, which speeds up hardware-accelerated computation of the AI data stream.
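The behaviour described above can be sketched with a thread pool standing in for the hardware compute engines; this is an illustrative software analogy (the pool size, shapes, and ReLU activation are assumptions), not the patent's hardware implementation:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def engine_compute(node_data, weights, bias):
        # One compute engine: apply the shared parameter set and bias set to
        # its own node data set and return the calculation result.
        return np.maximum(0.0, node_data @ weights + bias)

    W = np.random.rand(3072, 10)                        # shared parameter set
    b = np.random.rand(10)                              # shared bias set
    node_data_sets = [np.random.rand(250, 3072) for _ in range(4)]

    # All engines read the same parameter set and bias set and each processes
    # its own node data set at the same time, producing results in parallel.
    with ThreadPoolExecutor(max_workers=4) as engines:
        results = list(engines.map(lambda d: engine_compute(d, W, b), node_data_sets))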
According to the method for parallel processing of data by a deep learning engine, a plurality of stored data sets, parameter sets and bias sets are obtained; the plurality of data sets are split into a plurality of node data sets; the plurality of node data sets are preprocessed based on the parameter set; and the plurality of node data sets are computed simultaneously, based on a plurality of compute engines, according to the parameter set and the bias set, so as to output a plurality of computation results in parallel. Introducing multi-engine acceleration into a hardware dataflow architecture reduces wasted device resources, speeds up hardware-accelerated computation of the AI data stream, and provides multi-input, multi-output AI hardware acceleration for data sets.
Example two
Fig. 2 is a flowchart of another method for parallel processing of data by a deep learning engine according to an embodiment of the present application. This embodiment expands on the first embodiment, and the method includes steps 200 to 250.
Step 200, obtaining a plurality of stored data sets, parameter sets and bias sets.
And 210, splitting the plurality of data sets into a plurality of node data sets.
Step 220, preprocessing the plurality of node data sets based on the parameter sets.
Step 230, obtain the number of compute engines.
In this embodiment, after the plurality of node data sets have been split, the number of compute engines in the hardware device is determined. The more compute engines there are, the more node data sets can be processed and computed simultaneously, which improves the processing speed of the data as a whole; the fewer compute engines there are, the fewer node data sets can be processed simultaneously, and the remaining unprocessed node data sets must wait for the preceding ones to finish. In this embodiment, the number of compute engines is not limited, but at least one compute engine computes each node data set, and at least two compute engines compute a plurality of node data sets at the same time.
And step 240, according to the number of the calculation engines, simultaneously calculating the plurality of node data sets according to the parameter set and the bias set according to a preset rule so as to output a plurality of calculation results in parallel.
In this embodiment, a plurality of compute engines compute a plurality of node data sets simultaneously according to a preset rule. In the computation of the neural network, each neuron receives the outputs of a plurality of preceding-layer neurons, multiplies them by the corresponding parameter set, corrects the node data with the bias set, computes the node data sets according to the activation function of the neural network, and outputs a corresponding computation result.
In one embodiment, the preset rule includes: judging whether the number of the computing engines is larger than the number of the node data sets; in response to a determination that the number of compute engines is greater than the number of node datasets, using the same number of compute engines as the number of node datasets.
If the number of compute engines is larger than the number of node data sets, all the node data sets can be input to the neural network for computation at the same time. For example, if 10000 node data sets are input and the hardware has 12000 compute engines, 10000 of the compute engines are used to compute the 10000 node data sets simultaneously, achieving simultaneous processing of multiple node data sets.
In one embodiment, in response to a determination that the number of compute engines is less than the number of node datasets, using the same number of node datasets as the number of compute engines; continuing to execute the judgment whether the number of the computing engines is larger than the number of the node data sets; in response to a determination that the number of compute engines is greater than the number of node datasets, using the same number of compute engines as the number of node datasets; until all the calculations of the plurality of node data sets are completed.
If the number of compute engines is smaller than the number of node data sets, all the compute engines first compute a subset of node data sets equal in number to the compute engines; the number of node data sets still unprocessed is then determined, and the process repeats. For example, if 10000 node data sets are input and the hardware has 4000 compute engines, the 4000 compute engines first compute 4000 node data sets simultaneously. When that computation finishes, 6000 node data sets remain, so the 4000 compute engines compute another 4000 node data sets simultaneously, leaving 2000. Finally, 2000 compute engines compute the remaining 2000 node data sets simultaneously, and all node data sets have then been processed. The preset rule ensures that all engines are utilized at the same time, improves the computation speed of the neural network in deep learning, and reduces wasted hardware resources. A minimal sketch of this rule is given below.
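The following sketch captures the preset rule in software form; the engine_fn callable and the batching loop are illustrative assumptions, since the patent describes hardware compute engines rather than a software scheduler:

    def schedule(node_data_sets, num_engines, engine_fn):
        # Preset rule: if there are more engines than node data sets, only as
        # many engines as data sets are used; otherwise every engine processes
        # one node data set, and the rule is applied again to the remainder.
        results = []
        remaining = list(node_data_sets)
        while remaining:
            batch, remaining = remaining[:num_engines], remaining[num_engines:]
            # e.g. 10000 node data sets with 4000 engines -> batches of 4000, 4000, 2000
            results.extend(engine_fn(d) for d in batch)  # this batch runs on the engines concurrently
        return results

    # Usage sketch: ten node data sets scheduled onto four engines.
    print(schedule(list(range(10)), num_engines=4, engine_fn=lambda d: d * 2))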
And step 250, adjusting and storing the plurality of calculation results.
In an embodiment, the adjusting comprises one of the following: adjusting the output position; adjusting the data structure.
In this embodiment, the results of the plurality of compute engines are adjusted so that the plurality of computation results can be written to memory at the same time. When a single compute engine finishes its operation, it outputs the computation result of the current layer. If the layer is the final output layer, the output position of the computation result is adjusted so that the result is output externally. If the computation result is used as the input of the next layer, its data structure is adjusted to fit; this is equivalent to a linear transformation in which the ordering and position of the previous layer's structure are modified with the help of the parameter set, so that the data can be fed directly into the next layer without additional software intervention.
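A minimal sketch of the two kinds of adjustment mentioned above; the (N, H, W, C) layout and the transpose are assumptions about what "adjusting the data structure" could look like, not details given in the patent:

    import numpy as np

    def adjust_result(result, is_last_layer):
        if is_last_layer:
            # Adjust the output position: hand the result to external output/memory.
            return result.copy()
        # Adjust the data structure so the next layer can consume it directly,
        # e.g. reorder from (N, H, W, C) to (N, C, H, W).
        return np.transpose(result, (0, 3, 1, 2))

    layer_out = np.random.rand(4, 28, 28, 16)
    next_in = adjust_result(layer_out, is_last_layer=False)   # shape (4, 16, 28, 28)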
According to the method provided by this embodiment of the application, a plurality of stored data sets, parameter sets and bias sets are obtained; the plurality of data sets are split into a plurality of node data sets; the plurality of node data sets are preprocessed based on the parameter set; and the plurality of node data sets are computed simultaneously, based on a plurality of compute engines, according to the parameter set and the bias set, so as to output a plurality of computation results in parallel. The method provides processing rules for multiple compute engines and introduces multi-engine acceleration into a hardware dataflow architecture, reducing wasted device resources, speeding up hardware-accelerated computation of the AI data stream, and providing multi-input, multi-output hardware acceleration of the AI data stream.
Example three
The apparatus for parallel processing of data by a deep learning engine can execute the method provided by any embodiment of the application, and has functional modules corresponding to the executed method and its effects. Fig. 3 is a schematic structural diagram of an apparatus 300 for parallel processing of data by a deep learning engine in an embodiment of the present application. Referring to Fig. 3, the apparatus 300 may include: a storage module 310, a scheduling module 320, a processing module 330, and a calculation module 340.
The storage module 310 is configured to obtain a plurality of stored data sets, parameter sets, and bias sets.
A scheduling module 320 configured to split the plurality of data sets into a plurality of node data sets.
A processing module 330 configured to pre-process the plurality of node data sets based on the parameter set.
The calculation module 340 is configured to perform simultaneous calculation on a plurality of node data sets according to the parameter set and the bias set based on a plurality of calculation engines to output a plurality of calculation results in parallel.
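The cooperation of these four modules can be pictured with a small sketch; the class and method names below are assumptions for illustration and do not appear in the patent:

    class DeepLearningParallelApparatus:
        # Hypothetical wiring of the modules shown in Fig. 3.
        def __init__(self, storage, scheduler, processor, calculator):
            self.storage = storage        # storage module 310
            self.scheduler = scheduler    # scheduling module 320
            self.processor = processor    # processing module 330
            self.calculator = calculator  # calculation module 340

        def run(self):
            data_sets, params, biases = self.storage.fetch()     # obtain stored sets
            node_data_sets = self.scheduler.split(data_sets)     # split into node data sets
            node_data_sets = self.processor.preprocess(node_data_sets, params)
            return self.calculator.compute(node_data_sets, params, biases)  # parallel results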
In an embodiment, the parameter set is used to adjust the weights of the plurality of data sets in the deep learning model, and the bias set is used to adjust the error between the data in the deep learning model and the actual data.
In one embodiment, the calculation module 340 further comprises:
the get number of engines sub-module 341 is configured to get the number of compute engines.
The computation submodule 342 is configured to compute the plurality of node data sets according to the parameter set and the bias set according to a preset rule according to the number of the computation engines, so as to output a plurality of computation results in parallel.
The preset rules include:
judging whether the number of the computing engines is larger than the number of the node data sets;
in response to a determination that the number of compute engines is greater than the number of node datasets, using the same number of compute engines as the number of node datasets.
After the determining whether the number of computing engines is greater than the number of node data sets, the preset rule further includes:
in response to a determination that the number of compute engines is less than the number of node datasets, using the same number of node datasets as the number of compute engines;
continuing to execute the judgment whether the number of the computing engines is larger than the number of the node data sets;
in response to a determination that the number of compute engines is greater than the number of node datasets, using the same number of compute engines as the number of node datasets;
until all the node data sets are calculated.
In one embodiment, the method further comprises:
an adaptation module configured to adapt and store the plurality of calculation results, the adaptation including one of: adjusting the output position; the data structure is adjusted.
The apparatus for parallel processing of data by a deep learning engine obtains a plurality of stored data sets, parameter sets and bias sets; splits the plurality of data sets into a plurality of node data sets; preprocesses the plurality of node data sets with the parameter set; and computes the plurality of node data sets simultaneously, based on a plurality of compute engines, according to the parameter set and the bias set, so as to output a plurality of computation results in parallel. Introducing multi-engine acceleration into a hardware dataflow architecture reduces wasted device resources, speeds up hardware-accelerated computation of the AI data stream, and provides multi-input, multi-output AI hardware acceleration for data sets.
Example four
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application; the computer device may include a plurality of compute engines. As shown in Fig. 4, the computer device includes a memory 410 and a processor 420. The number of processors 420 in the computer device may be one or more; one processor 420 is taken as the example in Fig. 4. The memory 410 and the processor 420 in the device may be connected by a bus or other means; connection by a bus is taken as the example in Fig. 4.
The memory 410, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present application (e.g., the storage module 310, scheduling module 320, processing module 330, and calculation module 340 of the apparatus for parallel processing of data by a deep learning engine). By running the software programs, instructions, and modules stored in the memory 410, the processor 420 executes the various functional applications and data processing of the device/terminal/apparatus, i.e. implements the method described above.
Wherein the processor 420 is arranged to run the computer program stored in the memory 410, to carry out the steps of:
acquiring a plurality of stored data sets, parameter sets and bias sets;
splitting the plurality of data sets into a plurality of node data sets;
preprocessing the plurality of node datasets based on the parameter set;
and simultaneously computing a plurality of node data sets according to the parameter set and the bias set based on a plurality of computing engines so as to output a plurality of computing results in parallel.
In one embodiment, the computer program of the computer device provided in the embodiments of the present application is not limited to the above method operations, and may also execute the method provided in any embodiment of the present application.
The memory 410 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 410 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 410 may include memory located remotely from processor 420, which may be connected to devices/terminals/devices through a network. Examples of such networks include the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The computer device comprises a plurality of compute engines, which share the parameter set and the bias set, can process multiple data paths simultaneously, and speed up hardware-accelerated computation of the AI data stream.
According to the device provided by the embodiment of the application, a plurality of stored data sets, parameter sets and bias sets are obtained; the plurality of data sets are split into a plurality of node data sets; the plurality of node data sets are preprocessed based on the parameter set; and the plurality of node data sets are computed simultaneously, based on a plurality of compute engines, according to the parameter set and the bias set, so as to output a plurality of computation results in parallel. Introducing multi-engine acceleration into a hardware dataflow architecture reduces wasted device resources, speeds up hardware-accelerated computation of the AI data stream, and provides multi-input, multi-output AI hardware acceleration for data sets.
Example five
A storage medium containing computer-executable instructions, which when executed by a computer processor, perform the above method, the method including:
acquiring a plurality of stored data sets, parameter sets and bias sets;
splitting the plurality of data sets into a plurality of node data sets;
preprocessing the plurality of node datasets based on the parameter set;
and simultaneously computing a plurality of node data sets according to the parameter set and the bias set based on a plurality of computing engines so as to output a plurality of computing results in parallel.
The storage medium containing the computer-executable instructions provided by the embodiments of the present application is not limited to the method operations described above, and may also execute the method provided by the embodiments of the present application.
The computer-readable storage media of the embodiments of the present application may take the form of a combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of the foregoing. The computer-readable storage medium includes: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-only memory (ROM), an erasable programmable Read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc Read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or a suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a storage medium may be transmitted over any suitable medium, including wireless, wireline, optical fiber cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the case of a remote computer, the remote computer may be connected to the user's computer through one or more networks, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the internet using an internet service provider).
According to the storage medium provided by the embodiment of the application, a plurality of stored data sets, parameter sets and bias sets are obtained; the plurality of data sets are split into a plurality of node data sets; the plurality of node data sets are preprocessed based on the parameter set; and the plurality of node data sets are computed simultaneously, based on a plurality of compute engines, according to the parameter set and the bias set, so as to output a plurality of computation results in parallel. Introducing multi-engine acceleration into a hardware dataflow architecture reduces wasted device resources, speeds up hardware-accelerated computation of the AI data stream, and provides multi-input, multi-output AI hardware acceleration for data sets.

Claims (10)

1. A method for parallel processing data by a deep learning engine is characterized by comprising the following steps:
acquiring a plurality of stored data sets, parameter sets and bias sets;
splitting the plurality of data sets into a plurality of node data sets;
preprocessing the plurality of node datasets based on the parameter set;
simultaneously computing the plurality of node data sets according to the parameter set and the bias set based on a plurality of computing engines to output a plurality of computing results in parallel.
2. The method of claim 1, wherein the parameter set is used to adjust the weights of the plurality of data sets in a deep learning model, and wherein the bias set is used to adjust the error between the data in the deep learning model and the actual data in the deep learning model.
3. The method of claim 1, wherein said simultaneously computing the plurality of node data sets in accordance with the parameter set and the bias set based on a plurality of compute engines to output a plurality of compute results in parallel, comprises:
acquiring the number of calculation engines;
and simultaneously calculating the plurality of node data sets according to the number of the calculation engines and a preset rule according to the parameter set and the bias set so as to output a plurality of calculation results in parallel.
4. The method of claim 3, wherein the preset rules comprise:
judging whether the number of the computing engines is larger than the number of the node data sets;
in response to a determination that the number of compute engines is greater than the number of node datasets, using the same number of compute engines as the number of node datasets.
5. The method of claim 4, wherein after said determining whether the number of compute engines is greater than the number of node datasets, further comprising:
in response to a determination that the number of compute engines is less than the number of node datasets, using the same number of node datasets as the number of compute engines;
continuing to execute the judgment whether the number of the computing engines is larger than the number of the node data sets; in response to a determination that the number of compute engines is greater than the number of node datasets, using the same number of compute engines as the number of node datasets;
until all the node data sets are calculated.
6. The method of claim 1, wherein after said simultaneously computing the plurality of node data sets based on the plurality of compute engines according to the parameter set and the bias set to output a plurality of computed results in parallel, further comprising: adjusting and storing the plurality of calculation results, wherein the adjustment comprises one of the following steps: adjusting the output position; the data structure is adjusted.
7. An apparatus for parallel processing data by a deep learning engine, comprising:
a storage module configured to obtain a plurality of stored data sets, parameter sets, and bias sets;
a scheduling module configured to split the plurality of data sets into a plurality of node data sets;
a processing module configured to pre-process the node data set based on the parameter set;
a calculation module configured to perform simultaneous calculation on the plurality of node data sets according to the parameter set and the bias set based on a plurality of calculation engines to output a plurality of calculation results in parallel.
8. The apparatus of claim 7, further comprising: an adaptation module configured to adapt and store the plurality of calculation results, the adaptation including one of: adjusting the output position; the data structure is adjusted.
9. A computer device, the device comprising:
one or more processors;
a storage device arranged to store one or more programs,
the one or more programs are executed by the one or more processors such that the one or more processors implement the method of any of claims 1-6.
10. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program comprises program instructions which, when executed by a processor, implement the method according to any of claims 1-6.
CN202010114094.0A 2020-02-25 2020-02-25 Deep learning engine parallel processing data method, device, equipment and storage medium Pending CN110955530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010114094.0A CN110955530A (en) 2020-02-25 2020-02-25 Deep learning engine parallel processing data method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010114094.0A CN110955530A (en) 2020-02-25 2020-02-25 Deep learning engine parallel processing data method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110955530A true CN110955530A (en) 2020-04-03

Family

ID=69985752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010114094.0A Pending CN110955530A (en) 2020-02-25 2020-02-25 Deep learning engine parallel processing data method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110955530A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113918507A (en) * 2021-12-09 2022-01-11 之江实验室 Method and device for adapting deep learning framework to AI acceleration chip

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805798A (en) * 2017-05-05 2018-11-13 英特尔公司 Fine granularity for deep learning frame calculates communication and executes
CN109643336A (en) * 2018-01-15 2019-04-16 深圳鲲云信息科技有限公司 Artificial intelligence process device designs a model method for building up, system, storage medium, terminal
CN109635922A (en) * 2018-11-20 2019-04-16 华中科技大学 A kind of distribution deep learning parameter quantization communication optimization method and system
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805798A (en) * 2017-05-05 2018-11-13 英特尔公司 Fine granularity for deep learning frame calculates communication and executes
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study
CN109643336A (en) * 2018-01-15 2019-04-16 深圳鲲云信息科技有限公司 Artificial intelligence process device designs a model method for building up, system, storage medium, terminal
CN109635922A (en) * 2018-11-20 2019-04-16 华中科技大学 A kind of distribution deep learning parameter quantization communication optimization method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113918507A (en) * 2021-12-09 2022-01-11 之江实验室 Method and device for adapting deep learning framework to AI acceleration chip
CN113918507B (en) * 2021-12-09 2022-04-08 之江实验室 Method and device for adapting deep learning framework to AI acceleration chip

Similar Documents

Publication Publication Date Title
US11870947B2 (en) Generating images using neural networks
US20190130255A1 (en) Method and apparatus for generating fixed-point type neural network
KR20180073118A (en) Convolutional neural network processing method and apparatus
WO2022006919A1 (en) Activation fixed-point fitting-based method and system for post-training quantization of convolutional neural network
JP2022502733A (en) Data representation for dynamic accuracy in neural network cores
US20210019555A1 (en) Generating video frames using neural networks
US20200389182A1 (en) Data conversion method and apparatus
KR20200094056A (en) Convolution neural network parameter optimization method, neural network computing method and apparatus
KR20190098671A (en) High speed processing method of neural network and apparatus using thereof
CN114418121A (en) Model training method, object processing method and device, electronic device and medium
CN115759237A (en) End-to-end deep neural network model compression and heterogeneous conversion system and method
CN114580636A (en) Neural network lightweight deployment method based on three-target joint optimization
CN110503149B (en) Method and system for classifying local features in image
CN116229226A (en) Dual-channel image fusion target detection method suitable for photoelectric pod
CN110955530A (en) Deep learning engine parallel processing data method, device, equipment and storage medium
EP4195105A1 (en) System and method of using neuroevolution-enhanced multi-objective optimization for mixed-precision quantization of deep neural networks
CN112446461A (en) Neural network model training method and device
CN112232477A (en) Image data processing method, apparatus, device and medium
CN115496181A (en) Chip adaptation method, device, chip and medium of deep learning model
CN116644783A (en) Model training method, object processing method and device, electronic equipment and medium
CN112561050A (en) Neural network model training method and device
KR20210082993A (en) Quantized image generation method and sensor debice for perfoming the same
CN113971454A (en) Deep learning model quantification method and related device
KR102384588B1 (en) Method for operating a neural network and for producing weights for the neural network
US20240036816A1 (en) Systems and methods for identifying scaling factors for deep neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wang Shaojun

Inventor after: Li Yuanchao

Inventor after: Cai Quanxiong

Inventor after: Niu Xinyu

Inventor before: Li Yuanchao

Inventor before: Cai Quanxiong

Inventor before: Niu Xinyu

RJ01 Rejection of invention patent application after publication

Application publication date: 20200403