CN114970814A - Processing method and processing device of neural network computation graph - Google Patents


Info

Publication number: CN114970814A
Authority: CN (China)
Application number: CN202210540587.XA
Other languages: Chinese (zh)
Prior art keywords: subgraph, computation, computational, graph, subgraphs
Legal status: Pending
Inventors: 吴欣洋, 李涵, 张抗, 丁瑞强
Current Assignee: Beijing Lynxi Technology Co Ltd
Original Assignee: Beijing Lynxi Technology Co Ltd

Events
- Application filed by Beijing Lynxi Technology Co Ltd
- Priority to CN202210540587.XA; publication of CN114970814A
- Priority to PCT/CN2023/094832 (WO2023222047A1)

Classifications

    • G06N 3/04 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06F 18/25 — Pattern recognition; analysing; fusion techniques
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present disclosure provides a processing method and a processing apparatus for a neural network computational graph, an electronic device, and a computer-readable medium. The neural network computational graph includes a plurality of operator nodes, and the processing method includes: determining all target operator nodes in the neural network computational graph according to the output connection relationships of the operator nodes, where each target operator node has a first output end and a second output end, and the operator node connected to the first output end is a data output node; splitting the neural network computational graph into a plurality of serially connected computation subgraphs, taking the second output end of each target operator node as a splitting point, where the output end of the operator node connected to the second output end is connected to other operator nodes; and sequentially generating an executable file corresponding to each computation subgraph. This technical scheme enables automatic splitting of an arbitrary neural network computational graph, improves the efficiency and effect of compiling such graphs, and reduces the compiling difficulty.

Description

Processing method and processing device of neural network computation graph
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a processing method and a processing apparatus for a neural network computation graph, an electronic device, and a computer-readable storage medium.
Background
Many-core chips based on integrated storage and computation place both compute and memory on the chip, which reduces data-movement time and power consumption; such architectures are an important development direction for many-core chips.
Deep learning frameworks (e.g., TensorFlow or ONNX) typically use computational graphs to express the computation of deep learning models (neural networks). For specific acceleration hardware, a compiler must compile the neural network computational graph into an instruction stream that can run on that hardware. The hardware may be a many-core chip with integrated storage and computation, which typically includes multiple physical cores.
Disclosure of Invention
The disclosure provides a processing method and a processing device for a neural network computation graph, an electronic device and a computer-readable storage medium.
In a first aspect, the present disclosure provides a processing method of a neural network computational graph, the neural network computational graph including a plurality of operator nodes, the processing method including:
determining all target operator nodes in the neural network computational graph according to the output connection relationships of the operator nodes, wherein each target operator node has a first output end and a second output end, and the operator node connected to the first output end is a data output node;
taking a second output end of the target operator node as a splitting point, splitting the neural network computational graph into a plurality of serially connected computational subgraphs, wherein the output end of the operator node connected with the second output end is connected with other operator nodes;
and sequentially generating an executable file corresponding to each computation subgraph according to each computation subgraph.
In a second aspect, the present disclosure provides a processing apparatus for processing a neural network computational graph to be processed, the neural network computational graph comprising a plurality of operator nodes, the processing apparatus comprising:
a determining module, configured to determine all target operator nodes in the neural network computational graph according to output connection relationships of a plurality of operator nodes, where the target operator nodes have a first output end and a second output end, and the operator nodes connected to the first output end are data output nodes;
the first splitting module is used for splitting the neural network computational graph into a plurality of serially connected computational subgraphs by taking a second output end of the target operator node as a splitting point, and the output end of the operator node connected with the second output end is connected with other operator nodes;
and the generating module is used for sequentially generating an executable file corresponding to each computation subgraph according to each computation subgraph.
In a third aspect, the present disclosure provides an electronic device comprising:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the processing method of the neural network computational graph described above.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor/processing core, implements the processing method of the neural network computation graph described above.
According to the technical scheme of the processing method of the neural network computational graph provided by the embodiments of the present disclosure, on one hand, the method applies to any neural network computational graph and realizes its automatic splitting, so that the graph can be compiled in segments. This reduces the compiling difficulty, improves the compiling efficiency and effect, and effectively reduces the demand that compilation places on the chip's hardware storage resources; it thereby helps solve the problem that the storage resources required to compile the whole graph exceed what the actual chip hardware provides, enabling reasonable utilization of chip hardware storage resources and improving their utilization efficiency. On the other hand, the output of each computation subgraph obtained by automatic splitting forms part of the output result of the neural network computational graph and can be stored in an external memory or a host corresponding to the chip, rather than being kept on the chip (or kept there for a long time), which effectively saves on-chip storage resources and improves their utilization rate.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. The above and other features and advantages will become more apparent to those skilled in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
fig. 1 is a schematic flowchart of a processing method of a neural network computational graph according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a neural network computational graph;
FIG. 3 is a flowchart illustrating an embodiment of step S13 in FIG. 1;
FIG. 4 is a schematic flow chart of another specific implementation of step S13 in FIG. 1;
fig. 5 is a schematic flow chart of another processing method of a neural network computation graph according to an embodiment of the present disclosure;
fig. 6 is a schematic flow chart of another processing method of a neural network computation graph according to an embodiment of the present disclosure;
fig. 7 is a schematic flow chart of another processing method of a neural network computation graph according to an embodiment of the present disclosure;
FIG. 8 is a schematic flow chart diagram illustrating another method for processing a neural network computational graph according to an embodiment of the present disclosure;
fig. 9 is a block diagram of a processing device according to an embodiment of the disclosure;
fig. 10 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To facilitate a better understanding of the technical aspects of the present disclosure, exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, wherein various details of the embodiments of the present disclosure are included to facilitate an understanding, and they should be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the related art, the computational graph of a large neural network usually involves a large amount of computation and data, and the computation and storage resources of a chip often cannot meet the resource requirements of the entire neural network computational graph; as a result, compiling the graph is difficult and inefficient.
Therefore, the embodiment of the present disclosure provides a processing method and a processing apparatus for a neural network computation graph, an electronic device, and a computer-readable storage medium, which are intended to effectively solve at least one of the technical problems in the related art.
The processing method of the embodiments of the present disclosure may be executed by a processing apparatus as the execution subject. The processing apparatus may be integrated, in software and/or hardware, in an electronic device such as a terminal device or a server; the terminal device may be, for example, an in-vehicle device, User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, or a wearable device. In some embodiments, the processing method may be implemented by a processor calling computer-readable program instructions stored in a memory, or may be executed by a server.
Fig. 1 is a schematic flowchart of a processing method of a neural network computational graph according to an embodiment of the present disclosure.
The embodiments of the present disclosure provide a processing method of a neural network computational graph, used to automatically split a to-be-processed neural network computational graph and generate executable files that run on a corresponding many-core chip. The to-be-processed graph may include a plurality of operator nodes; an operator node is a basic computing unit of the neural network, such as a convolution or pooling operation. The neural network may be any type of deep learning network and may be used to execute any of an image processing task, a voice processing task, a text processing task, and a video processing task; its input data may accordingly be any of image data, voice data, text data, and video data.
Referring to fig. 1, the processing method may include: step S11 to step S13.
Step S11: determining all target operator nodes in the neural network computational graph according to the output connection relationships of the operator nodes, where each target operator node has a first output end and a second output end, and the operator node connected to the first output end is a data output node.
Step S12: taking the second output end of each target operator node as a splitting point, splitting the neural network computational graph into a plurality of serially connected computation subgraphs, where the output end of the operator node connected to the second output end is connected to other operator nodes.
Step S13: sequentially generating an executable file corresponding to each computation subgraph.
In this embodiment of the present disclosure, for a to-be-processed neural network computational graph, before determining all target operator nodes in the neural network computational graph according to the output connection relationships of the plurality of operator nodes, that is, before step S11, the processing method further includes: and acquiring node information of each operator node in the neural network computational graph.
The node information of an operator node may include its input connection relationship, its output connection relationship, the parameter information it requires, its attribute information, its execution order, and the like. The input connection relationship describes how the node's inputs connect to the outputs of other operator nodes in the graph; the output connection relationship describes how its output connects to the inputs of other operator nodes. The required parameter information includes, but is not limited to, pre-configured weight parameters needed to perform the node's operation. The attribute information characterizes the node's feature attributes and may include, but is not limited to, the node's operator type and the computation and storage amounts it requires. The execution order represents the time sequence in which the operator node runs.
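The node-information record described above can be sketched as a simple data structure; the field names below are illustrative assumptions for this sketch, not terminology from the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class OperatorNode:
    """Hypothetical per-node record mirroring the node information above."""
    name: str                                    # node identifier, e.g. "conv1"
    op_type: str                                 # operator type, e.g. "Conv", "Pool", "Output"
    inputs: list = field(default_factory=list)   # input connection relationship (upstream nodes)
    outputs: list = field(default_factory=list)  # output connection relationship (downstream nodes)
    params: dict = field(default_factory=dict)   # required parameters, e.g. pre-configured weights
    order: int = 0                               # execution order of the node

# Example: a convolution node feeding a pooling node.
conv = OperatorNode(name="conv1", op_type="Conv", outputs=["pool1"], order=1)
```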
In step S11, the screening condition is that an operator node has a first output end and a second output end, and that the result emitted at the first output end forms part of the output result of the neural network computational graph. Using the output connection relationships of the operator nodes, all target operator nodes in the graph that satisfy this condition can be determined.
Specifically, for each operator node in the neural network computational graph, according to the output connection relationship of the operator node, the number of other operator nodes connected to the output of the operator node in the neural network computational graph can be determined, so as to determine the number of output branches of the operator node, that is, determine that the operator node is a single output node, a double output node, or a multiple output node.
Further, according to the output connection relationships and node information of the operator nodes in the graph, the output connection relationships and operator types of the other operator nodes connected to each output end of every operator node can be determined. From this, it can be determined whether the operator node connected to each output end is a data output node, and hence whether each operator node is a target operator node. A data output node is an operator node whose operator type serves to output data: the data it outputs forms part of the output result of the neural network computational graph, and a node of this type performs no data operation and has no output connection relationship with other operator nodes.
When an operator node is a double-output or multi-output node, one of its output ends may connect to an operator node that has no output connection relationship with any other operator node; that connected node serves as a data output node. If another output end connects to an operator node that does have an output connection relationship with other operator nodes, i.e., a data operation node used for data computation, then the output result of the node under consideration both forms part of the output result of the neural network computational graph and feeds subsequent operator nodes. Such a node therefore satisfies the screening condition for a target operator node and is determined to be one. For the sake of distinction, the former output end of the operator node is defined herein as the first output end, and the latter output end as the second output end.
Fig. 2 is a schematic structural diagram of a neural network computational graph. Referring to fig. 2, operator node 1 is a single-output node and does not meet the screening condition of the target operator node. Operator node 2 is a double-output node: operator node 3, connected to the first output end 21 of operator node 2, has no output connection relationship with other operator nodes and is thus a data output node, while the output end of operator node 4, connected to the second output end 22, has an output connection relationship with operator node 6; operator node 2 therefore satisfies the screening condition. Similarly, operator nodes 5 and 7 are both data output nodes, so operator node 4 and operator node 6 also satisfy the screening condition, while the remaining operator nodes do not. Accordingly, operator nodes 2, 4, and 6 in the neural network computational graph shown in fig. 2 are each taken as a target operator node.
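The screening condition illustrated with fig. 2 can be sketched as follows; the adjacency-list representation and function names are assumptions made for illustration, not part of the disclosure.

```python
def is_data_output_node(graph, node):
    # A data output node feeds no other operator node; its data forms part
    # of the whole graph's output result.
    return len(graph[node]) == 0

def find_target_nodes(graph):
    """Return nodes with >= 2 output branches where at least one successor is
    a data output node and at least one successor feeds further nodes."""
    targets = []
    for node, succs in graph.items():
        if len(succs) < 2:
            continue  # single-output nodes never satisfy the condition
        if any(is_data_output_node(graph, s) for s in succs) and \
           any(len(graph[s]) > 0 for s in succs):
            targets.append(node)
    return targets

# Adjacency list (node -> downstream nodes) matching the topology of Fig. 2.
fig2 = {1: [2], 2: [3, 4], 3: [], 4: [5, 6], 5: [],
        6: [7, 8], 7: [], 8: [9], 9: []}
print(find_target_nodes(fig2))  # → [2, 4, 6]
```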
In step S12, the neural network computational graph is split with the second output end of each target operator node as a splitting point, yielding a plurality of serially connected computation subgraphs. Illustratively, for the graph shown in fig. 2, step S11 determines that the target operator nodes are operator node 2, operator node 4, and operator node 6. Splitting the graph at the second output end 22 of operator node 2, the second output end 42 of operator node 4, and the second output end 62 of operator node 6 yields 4 serially connected computation subgraphs, denoted computation subgraph 1 through computation subgraph 4: computation subgraph 1 includes operator nodes 1, 2, and 3; computation subgraph 2 includes operator nodes 4 and 5; computation subgraph 3 includes operator nodes 6 and 7; and computation subgraph 4 includes operator nodes 8 and 9.
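Assuming the nodes are listed in execution order, the split of fig. 2 into four serial subgraphs can be sketched like this (the helper name and cut representation are illustrative):

```python
def split_at_targets(order, cut_after):
    """Partition an execution-ordered node list into serially connected
    subgraphs, closing a subgraph after each node in `cut_after` (here, the
    data output node paired with each target operator node)."""
    subgraphs, current = [], []
    for node in order:
        current.append(node)
        if node in cut_after:        # a splitting point: close this subgraph
            subgraphs.append(current)
            current = []
    if current:                      # trailing nodes form the last subgraph
        subgraphs.append(current)
    return subgraphs

# Fig. 2: the cuts fall after data output nodes 3, 5 and 7.
print(split_at_targets([1, 2, 3, 4, 5, 6, 7, 8, 9], {3, 5, 7}))
# → [[1, 2, 3], [4, 5], [6, 7], [8, 9]]
```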
It should be noted that the number of first output ends of a target operator node may be one, and the number of second output ends may be one or more; this disclosure does not limit this.
For the multiple computation subgraphs split by step S12, each computation subgraph has one or more operator nodes respectively.
In step S13, for each computation subgraph obtained by splitting in step S12, an executable file corresponding to the computation subgraph is generated. The executable file is execution code that can run on the corresponding chip; the computation subgraph is thereby compiled for that chip, and the chip executes the computation task corresponding to the subgraph.
According to the technical scheme of the processing method of the neural network computational graph provided by the embodiments of the present disclosure, on one hand, the method applies to any neural network computational graph and realizes its automatic splitting, so that the graph can be compiled in segments. This reduces the compiling difficulty, improves the compiling efficiency and effect, and effectively reduces the demand that compilation places on the chip's hardware storage resources; it thereby helps solve the problem that the storage resources required to compile the whole graph exceed what the actual chip hardware provides, enabling reasonable utilization of chip hardware storage resources and improving their utilization efficiency. On the other hand, the output of each computation subgraph obtained by automatic splitting forms part of the output result of the neural network computational graph and can be stored in an external memory or a host corresponding to the chip, rather than being kept on the chip (or kept there for a long time), which effectively saves on-chip storage resources and improves their utilization rate.
In some embodiments, after sequentially generating an executable file corresponding to each computation subgraph, the processing method further comprises: loading the executable file corresponding to each computation subgraph onto the corresponding chip; and, in response to the executable file corresponding to a computation subgraph completing its run on the chip, storing the output result corresponding to that computation subgraph in an external memory or a host corresponding to the chip.
The output result corresponding to the computation subgraph is stored in an external memory (such as a double data rate synchronous dynamic random access memory) or a host corresponding to the chip and is read back to the chip when needed, which effectively saves the chip's on-chip storage resources.
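The run-and-offload flow described here can be sketched as below; the load/run/offload callbacks stand in for a real chip driver and are purely hypothetical.

```python
def run_subgraphs(executables, load, run, offload):
    """Run each subgraph's executable in series, moving its output off-chip."""
    results = []
    for exe in executables:
        load(exe)            # load this subgraph's executable onto the chip
        out = run(exe)       # run it; `out` is part of the whole graph's output
        offload(out)         # store the result in external memory / the host,
                             # freeing on-chip storage for the next subgraph
        results.append(out)
    return results

# Stub callbacks standing in for a chip driver.
host_memory = []
outs = run_subgraphs(["sub1.exe", "sub2.exe"],
                     load=lambda e: None,
                     run=lambda e: e + ":result",
                     offload=host_memory.append)
```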
Fig. 3 is a schematic flowchart of a specific implementation manner of step S13 in fig. 1. Referring to fig. 3, in some embodiments, step S13 of sequentially generating an executable file corresponding to each computation subgraph may further include: step S31 to step S32.
Step S31: in response to a failure error occurring when generating the executable file corresponding to the current computation subgraph, further splitting the current computation subgraph into a plurality of serially connected computation subgraphs.
Step S32: sequentially generating an executable file corresponding to each computation subgraph obtained by the further splitting.
In step S31, a failure error occurring while generating the executable file for the current computation subgraph indicates that the subgraph cannot be supported by the corresponding chip and cannot run normally on it; that is, the chip cannot support compiling and executing the current computation subgraph. For example, the computing power, computation amount, or storage amount required by the subgraph may exceed the chip's limits, causing the failure error. The current computation subgraph is therefore further split into multiple serially connected computation subgraphs.
In some embodiments, the current computation subgraph may be further split in the splitting manner described in steps S11 and S12 above. In some embodiments, it may be further split by configuring another splitting manner as needed, by combining another configured splitting manner with the manner of steps S11 and S12, or manually. Another splitting manner may, for example, screen and split target operator nodes according to other configured screening conditions; the embodiments of the present disclosure do not specifically limit the manner of further splitting the current computation subgraph.
For the multiple computational subgraphs further split by step S31, each computational subgraph may have one or more operator nodes, respectively.
In step S32, for each computation subgraph obtained by further splitting in step S31, an executable file corresponding to the computation subgraph is generated according to the computation subgraph, where the executable file is an executable code executable on a corresponding chip, so as to implement compiling the computation subgraph to run on the corresponding chip, so that the corresponding chip executes a computation task corresponding to the computation subgraph.
Through steps S31 and S32, a failure error reported while generating the executable file for the current computation subgraph indicates that the subgraph may not be supported by the corresponding chip, so the subgraph is further split. This improves the compiling effect, helps ensure that each resulting computation subgraph can be supported by the chip hardware while utilizing the chip's hardware resources as fully as possible, and improves the chip's execution efficiency in processing the neural network computational graph.
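Steps S31 and S32 amount to a compile-with-fallback loop. A minimal recursive sketch follows, where `try_compile` and `resplit` are hypothetical hooks: a real compiler would report resource-limit failures, and the re-splitter would apply the manner of steps S11/S12 or another configured manner.

```python
def compile_subgraph(subgraph, try_compile, resplit):
    """Compile a subgraph; on a failure error, split it further into serial
    pieces and compile each piece in turn (steps S31/S32)."""
    try:
        return [try_compile(subgraph)]
    except RuntimeError:
        executables = []
        for piece in resplit(subgraph):
            executables.extend(compile_subgraph(piece, try_compile, resplit))
        return executables

# Stubs: "compilation" fails when a subgraph needs more than the chip holds.
def try_compile(subgraph):
    if len(subgraph) > 2:
        raise RuntimeError("required storage exceeds chip resources")
    return tuple(subgraph)           # stand-in for an executable file

def resplit(subgraph):
    mid = len(subgraph) // 2         # naive halving as a stand-in splitter
    return [subgraph[:mid], subgraph[mid:]]

print(compile_subgraph([1, 2, 3, 4, 5], try_compile, resplit))
# → [(1, 2), (3,), (4, 5)]
```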
Fig. 4 is a flowchart illustrating another specific implementation manner of step S13 in fig. 1. Referring to fig. 4, in some embodiments, in step S13, sequentially generating a chip executable file corresponding to each computation subgraph may further include steps S41 to S42.
Step S41: in response to no failure error being reported when generating the executable file corresponding to the current computation subgraph, taking the current computation subgraph as a target computation subgraph;
and step S42: loading the executable file corresponding to each target computation subgraph onto the corresponding chip.
In step S41, if no failure error is reported when generating the executable file corresponding to the current computation subgraph, the current computation subgraph is supported by the corresponding chip and can run normally on it; that is, the corresponding chip supports compiling and executing the current computation subgraph on the chip. The current computation subgraph can therefore be used as a target computation subgraph without further processing, and step S42 is executed.
In step S42, since no failure error occurred when generating the executable file corresponding to the target computation subgraph, that executable file is supported by the corresponding chip and runs normally on it. For each target computation subgraph, the corresponding executable file can therefore be loaded onto the corresponding chip, where it runs to perform the corresponding computation task.
It should be noted that both the computation subgraphs obtained by the splitting in step S12 and those obtained by the further splitting in step S31 are subgraphs of the neural network computation graph. The embodiment of the present disclosure therefore makes no specific distinction in name or definition between them; the two are substantially equal in identity and function. For the specific description of generating an executable file in step S32, reference may be made to the related description of generating an executable file in step S13 in the embodiment of the present disclosure, which is not repeated here.
In the embodiment of the present disclosure, when no failure error is reported while generating the executable file corresponding to each computation subgraph, every computation subgraph can be supported by the corresponding chip and can run normally on it. However, some computation subgraphs may actually require few resources while others require many. If the executable files of all computation subgraphs are loaded directly onto the corresponding chip to run, the chip hardware resources may not be utilized reasonably: some hardware resources end up lightly loaded while others end up heavily loaded, which is not conducive to the load balance of the chip hardware resources. Therefore, in some embodiments, in order to utilize the chip hardware resources reasonably, improve their utilization efficiency, and achieve load balance, some of the computation subgraphs are first fused and spliced into one computation subgraph, after which the corresponding executable file is generated and loaded onto the corresponding chip.
Fig. 5 is a schematic flowchart of another processing method of a neural network computation graph according to an embodiment of the present disclosure, and referring to fig. 5, in some embodiments, after an executable file corresponding to each computation sub-graph is sequentially generated according to each computation sub-graph, that is, after step S13, the processing method may further include: step S51 to step S54.
Step S51: in response to no failure error being reported when generating the executable file corresponding to each computation subgraph, acquiring at least one group of computation subgraphs, where each group of computation subgraphs includes at least two serially connected computation subgraphs.
Step S52: performing fusion processing on the at least two computation subgraphs in each group of computation subgraphs to obtain an alternative computation subgraph corresponding to that group.
Step S53: taking the alternative computation subgraph as a target computation subgraph in the case that no failure error is reported when generating the executable file corresponding to the alternative computation subgraph.
Step S54: loading the executable file corresponding to each target computation subgraph onto the corresponding chip.
In step S51, in order to achieve reasonable utilization of chip hardware resources, improve utilization efficiency of chip hardware resources, and achieve load balancing of chip hardware resources, when no failure error report occurs during generation of an executable file corresponding to each computation subgraph, at least one group of computation subgraphs is first obtained from all computation subgraphs connected in series in sequence, where each group of computation subgraphs includes at least two computation subgraphs connected in series.
Exemplarily, referring to the neural network computation graph shown in fig. 2, assuming that the neural network computation graph shown in fig. 2 is currently split into computation subgraph 1, computation subgraph 2, computation subgraph 3 and computation subgraph 4, in step S51, a set of computation subgraphs may be obtained, which may include, for example, computation subgraph 1 and computation subgraph 2 connected in series, or computation subgraph 2 and computation subgraph 3 connected in series, or computation subgraph 2, computation subgraph 3 and computation subgraph 4 connected in series.
In step S52, the fusion process is to connect the output of the previous computational subgraph and the input of the next computational subgraph in every two adjacent computational subgraphs in a group of computational subgraphs in a one-to-one correspondence, so as to merge at least two computational subgraphs in a group of computational subgraphs into one computational subgraph as a candidate computational subgraph.
Exemplarily, referring to the neural network computation graph shown in fig. 2, a group of computation subgraphs includes computation subgraph 1 and computation subgraph 2 connected in series, computation subgraph 1 includes operator node 1, operator node 2 and operator node 3, computation subgraph 2 includes operator node 4 and operator node 5, and in step S52, the computation subgraph 1 and the computation subgraph 2 connected in series are fused to obtain an alternative computation subgraph, where the alternative computation subgraph includes operator node 1 to operator node 5.
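The fusion of step S52, applied to the example above, can be sketched as follows. The dict layout (`nodes`, `inputs`, `outputs`) is an assumed representation for illustration, not the actual data structure of the disclosed method; the key point is the one-to-one connection of the previous subgraph's outputs to the next subgraph's inputs.

```python
def fuse(prev, nxt):
    """Step S52: connect the outputs of the earlier subgraph one-to-one
    to the inputs of the later subgraph, merging the two serially
    connected subgraphs into a single alternative computation subgraph."""
    assert len(prev["outputs"]) == len(nxt["inputs"])
    internal_edges = list(zip(prev["outputs"], nxt["inputs"]))
    return {
        "nodes": prev["nodes"] + nxt["nodes"],
        "inputs": prev["inputs"],    # exposed inputs of the fused subgraph
        "outputs": nxt["outputs"],   # exposed outputs of the fused subgraph
        "internal_edges": internal_edges,
    }

# Computation subgraph 1 (operator nodes 1-3) and subgraph 2 (nodes 4-5).
sub1 = {"nodes": ["op1", "op2", "op3"], "inputs": ["x"], "outputs": ["t1", "t2"]}
sub2 = {"nodes": ["op4", "op5"], "inputs": ["t1", "t2"], "outputs": ["y"]}
candidate = fuse(sub1, sub2)
print(candidate["nodes"])  # ['op1', 'op2', 'op3', 'op4', 'op5']
```

The resulting alternative computation subgraph contains operator nodes 1 to 5, matching the example in the text.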
After the alternative computation subgraph corresponding to each group is obtained by the fusion processing, a corresponding executable file can be generated from it. In step S53, if no failure error is reported when generating the executable file corresponding to the fused alternative computation subgraph, the alternative computation subgraph can still be supported by the corresponding chip and can run normally on it; that is, the corresponding chip supports compiling and executing the at least two computation subgraphs of the group together on the chip, so the alternative computation subgraph can be used as a target computation subgraph.
In some embodiments, for any group of computation subgraphs, when the group includes at least two serially connected computation subgraphs and a failure error is reported when generating the executable file corresponding to the alternative computation subgraph obtained by the fusion processing, the corresponding chip cannot support compiling and executing those at least two computation subgraphs together on the chip. Therefore, the at least two computation subgraphs of the group as they were before the fusion processing can each be used as a target computation subgraph.
For example, assuming that the group of computation subgraphs acquired in step S51 includes a computation subgraph a and a computation subgraph B, a candidate computation subgraph C is obtained after the fusion processing, and when a failure error occurs when generating an executable file corresponding to the candidate computation subgraph C obtained by the fusion processing, the computation subgraph a before the fusion processing may be used as a target computation subgraph, and the computation subgraph B before the fusion processing may be used as a target computation subgraph.
In some embodiments, for any group of computation subgraphs, in the case that the group of computation subgraphs includes multiple (e.g., 3 or 4) computation subgraphs connected in series and a failure error occurs when generating an executable file corresponding to an alternative computation subgraph obtained by fusion processing, the group of computation subgraphs may be further divided into one or more groups of computation subgraphs and the operations in steps S52 and S53 described above are performed, and the computation subgraphs not divided into groups may be respectively used as one target computation subgraph.
For example, assume that the group of computation subgraphs acquired in step S51 includes computation subgraph D, computation subgraph E and computation subgraph F, and that fusing them yields alternative computation subgraph G. If a failure error is reported when generating the executable file corresponding to G, computation subgraph D and computation subgraph E may be re-partitioned into a group, fused again, and subjected to the operation in step S53, while computation subgraph F, not being partitioned into any group, is used alone as a target computation subgraph.
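The fallback behavior in the two examples above can be sketched as follows, with `compile_ok` standing in for "no failure error reported" (here simulated as a size limit on the fused subgraph); the function and data names are illustrative assumptions:

```python
def fuse_group(group):
    """Merge the operator-node lists of all subgraphs in a group."""
    return [node for sub in group for node in sub]

def targets_for_group(group, compile_ok):
    """Try to compile the fused candidate; on a failure error, fall back
    to using the pre-fusion subgraphs as targets themselves."""
    candidate = fuse_group(group)
    if compile_ok(candidate):
        return [candidate]   # step S53: the fused candidate is the target
    return list(group)       # fallback: each member is its own target

# Simulated chip that rejects subgraphs larger than 3 operator nodes.
ok = lambda sub: len(sub) <= 3

print(targets_for_group([["opA1", "opA2"], ["opB1"]], ok))
# fused size 3 compiles: [['opA1', 'opA2', 'opB1']]
print(targets_for_group([["opD1", "opD2"], ["opE1", "opE2"], ["opF1"]], ok))
# fused size 5 fails: [['opD1', 'opD2'], ['opE1', 'opE2'], ['opF1']]
```

For a failed group of three or more members, the text notes that smaller regroupings (e.g. D with E) may then be retried in the same way.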
In addition, reference may be made to the above description of step S42 for the description of step S54, and details are not repeated here.
In some embodiments, in step S51, obtaining at least one set of computational subgraphs may further include: determining at least one group of computational subgraphs which can be subjected to fusion processing in all the computational subgraphs according to the subgraph attribute parameters corresponding to the computational subgraphs; the subgraph attribute parameters corresponding to the computational subgraph can include the computation amount corresponding to the computational subgraph, weight information and the number of nodes of the corresponding vector acceleration unit graph.
The computation amount corresponding to a computation subgraph may be the sum of the computation amounts required by all operator nodes contained in it, and the weight information corresponding to a computation subgraph may comprise the sum of the weight sizes required by all operator nodes contained in it. Before the neural network computation graph is split, it may be mapped in advance onto specific chip hardware to obtain a corresponding vector acceleration unit (APU) graph, which characterizes the mapping relation of the operator nodes of the neural network computation graph on the specific chip hardware (such as a many-core chip and the physical cores on the chip). Correspondingly, the APU graph corresponding to a computation subgraph split from the neural network computation graph characterizes the mapping relation of that subgraph's operator nodes on the specific chip hardware.
In some embodiments, the step of determining, according to the subgraph attribute parameters corresponding to the computation subgraphs, at least one group of computation subgraphs in all the computation subgraphs that can be fused may further include: according to the serial connection relation and the execution sequence of all the computation subgraphs, sequentially checking whether at least two serially connected computation subgraphs satisfy the fusion condition; and when at least two serially connected computation subgraphs satisfy the fusion condition, taking those at least two computation subgraphs as a group of computation subgraphs.
Wherein the fusion condition may include: the sum of the calculated amount corresponding to the at least two calculated subgraphs is greater than or equal to a minimum calculated amount threshold value and less than or equal to a maximum calculated amount threshold value; the sum of the weight information corresponding to the at least two computation subgraphs is greater than or equal to a minimum weight threshold value and is less than or equal to a maximum weight threshold value; the sum of the number of nodes of the APU graph corresponding to the at least two computation subgraphs is greater than or equal to a minimum number threshold and less than or equal to a maximum number threshold.
The minimum calculation threshold and the maximum calculation threshold may be configured according to actual needs, and similarly, the minimum weight threshold and the maximum weight threshold may also be configured according to actual needs, and the minimum number threshold and the maximum number threshold may also be configured according to actual needs.
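The fusion condition can be sketched as a predicate over the subgraph attribute parameters. All threshold values and field names here are illustrative assumptions; in practice they would be configured according to actual needs:

```python
# Illustrative thresholds (configured according to actual needs).
FLOPS_MIN, FLOPS_MAX = 100, 1000   # min/max calculation thresholds
WEIGHT_MIN, WEIGHT_MAX = 10, 500   # min/max weight thresholds
NODES_MIN, NODES_MAX = 2, 16       # min/max APU-graph node-count thresholds

def meets_fusion_condition(subs):
    """subs: list of per-subgraph attribute dicts. All three sums must
    fall within their [min, max] ranges for the group to be fusible."""
    flops = sum(s["flops"] for s in subs)
    weight = sum(s["weight"] for s in subs)
    apu_nodes = sum(s["apu_nodes"] for s in subs)
    return (FLOPS_MIN <= flops <= FLOPS_MAX
            and WEIGHT_MIN <= weight <= WEIGHT_MAX
            and NODES_MIN <= apu_nodes <= NODES_MAX)

sub_h = {"flops": 80, "weight": 20, "apu_nodes": 3}
sub_i = {"flops": 150, "weight": 40, "apu_nodes": 4}
print(meets_fusion_condition([sub_h, sub_i]))  # True: 230, 60, 7 all in range
```

The lower bounds keep a fused group from being too small to be worth fusing, and the upper bounds keep it within what the chip can support.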
For example, assume that all current computation subgraphs include computation subgraph H, computation subgraph I, computation subgraph J, computation subgraph K and computation subgraph L, serially connected in sequence. When the serially connected computation subgraph H and computation subgraph I are determined to satisfy the above fusion condition, they are taken as one group of computation subgraphs; when the serially connected computation subgraph J, computation subgraph K and computation subgraph L are determined to satisfy the above fusion condition, they are taken as another group of computation subgraphs.
In some embodiments, judging according to the fusion condition which at least two serially connected computation subgraphs can be fused improves the efficiency of obtaining at least one group of fusible computation subgraphs, improves the compiling effect of the fused subgraphs, and reduces the probability that generating the executable files of the fused subgraphs fails with an error.
In some embodiments, in step S51, obtaining at least one set of computational subgraphs may further include: and according to the serial connection relation and the execution sequence of all the computation subgraphs, respectively taking every two serially connected computation subgraphs in all the computation subgraphs as a group of computation subgraphs.
For example, assuming that all current computation subgraphs include a computation subgraph H, a computation subgraph I, a computation subgraph J and a computation subgraph K which are serially connected in sequence, the serially connected computation subgraph H and computation subgraph I are used as a group of computation subgraphs, and the serially connected computation subgraph J and computation subgraph K are used as a group of computation subgraphs.
In some embodiments, by directly taking every two serially connected computation subgraphs in all the computation subgraphs as a group of computation subgraphs, the efficiency of obtaining at least one group of computation subgraphs can be effectively improved.
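The pairing strategy above can be sketched in a line. How an odd trailing subgraph is handled is not spelled out in the text; leaving it ungrouped is an assumption of this sketch:

```python
def pair_groups(subgraphs):
    """Following the serial execution order, take every two consecutive
    computation subgraphs as one group. With an odd count, the last
    subgraph is left ungrouped (an assumption of this sketch)."""
    return [subgraphs[i:i + 2] for i in range(0, len(subgraphs) - 1, 2)]

chain = ["H", "I", "J", "K"]
print(pair_groups(chain))  # [['H', 'I'], ['J', 'K']]
```

This matches the example in the text: H with I forms one group and J with K forms another.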
Fig. 6 is a schematic flowchart of another processing method of a neural network computation graph according to an embodiment of the present disclosure, and referring to fig. 6, in some embodiments, after an executable file corresponding to each computation sub-graph is sequentially generated according to each computation sub-graph, that is, after step S13, the processing method may further include: step S61 to step S64.
Step S61: in response to no failure error being reported when generating the executable file corresponding to each computation subgraph, fusing the current computation subgraph and the next computation subgraph into an alternative computation subgraph according to the direction of the execution sequence of all computation subgraphs.
Step S62: taking the current computation subgraph as a target computation subgraph when a failure error is reported when generating the executable file corresponding to the alternative computation subgraph.
Step S63: taking the next computation subgraph as the current computation subgraph, and returning to the step of fusing the current computation subgraph and the next computation subgraph into an alternative computation subgraph.
Step S64: loading the executable file corresponding to each target computation subgraph onto the corresponding chip.
In step S61, when no failure error is reported when generating the executable file corresponding to each computation subgraph, in order to utilize the chip hardware resources reasonably, improve their utilization efficiency, and achieve load balance, the current computation subgraph and the next computation subgraph are first fused into an alternative computation subgraph according to the direction of the execution sequence of all serially connected computation subgraphs (e.g., the top-to-bottom direction in fig. 2). Here, the current computation subgraph and the next computation subgraph are two adjacent serially connected computation subgraphs in the neural network computation graph, and the next computation subgraph is the one following the current computation subgraph along the direction of the execution sequence, that is, the computation subgraph connected to the output of the current computation subgraph.
For example, referring to the neural network computation graph shown in fig. 2, assuming that the neural network computation graph shown in fig. 2 is currently split into a computation subgraph 1, a computation subgraph 2, a computation subgraph 3 and a computation subgraph 4, and according to the direction of the execution sequence, the current computation subgraph is the computation subgraph 1, in step S61, the computation subgraph 1 and the computation subgraph 2 are first fused to obtain an alternative computation subgraph.
After the alternative computation subgraph is obtained, a corresponding executable file can be generated from it. In step S62, when a failure error is reported when generating the executable file corresponding to the alternative computation subgraph, the fused subgraph cannot be supported by the corresponding chip and cannot run normally on it. The current computation subgraph is therefore used alone as a target computation subgraph, and step S63 is executed to continue the fusion and judgment along the direction of the execution sequence.
In step S63, the next computational sub-graph is taken as the current computational sub-graph, and the step of merging the current computational sub-graph and the next computational sub-graph into the alternative computational sub-graph is executed, that is, the step S61 is executed, so as to continue to merge and judge the current computational sub-graph and the next computational sub-graph of the current computational sub-graph according to the execution sequence direction until all the computational sub-graphs are traversed.
For example, if a failure error is reported when generating the executable file corresponding to the alternative computation subgraph obtained by fusing computation subgraph 1 and computation subgraph 2, the two are not suitable for fusion. Computation subgraph 1 is therefore used as a target computation subgraph, computation subgraph 2 is taken as the current computation subgraph, the fusion and judgment continue with computation subgraph 2 and computation subgraph 3, and so on until all computation subgraphs are traversed.
In addition, for the description of step S64, reference may be made to the above description of step S42, which is not repeated herein.
Fig. 7 is a schematic flowchart of another processing method of a neural network computation graph according to an embodiment of the present disclosure, and referring to fig. 7, in some embodiments, after an executable file corresponding to each computation sub-graph is sequentially generated according to each computation sub-graph, that is, after step S13, the processing method may further include: step S71 to step S74.
Step S71: in response to no failure error being reported when generating the executable file corresponding to each computation subgraph, fusing the current computation subgraph and the next computation subgraph into an alternative computation subgraph according to the direction of the execution sequence of all computation subgraphs.
Step S72: taking the alternative computation subgraph as a target computation subgraph in the case that no failure error is reported when generating the executable file corresponding to the alternative computation subgraph.
Step S73: taking the computation subgraph whose execution sequence follows the next computation subgraph as the current computation subgraph, and returning to the step of fusing the current computation subgraph and the next computation subgraph into an alternative computation subgraph.
Step S74: loading the executable file corresponding to each target computation subgraph onto the corresponding chip.
For the description of step S71, reference may be made to the above description of step S61, which is not repeated herein.
After the alternative computation subgraph is obtained, a corresponding executable file can be generated from it. In step S72, in the case that no failure error is reported when generating the executable file corresponding to the alternative computation subgraph, the fused subgraph can still be supported by the corresponding chip and can run normally on it. The alternative computation subgraph is therefore used as a target computation subgraph, and step S73 is executed to continue the fusion and judgment along the direction of the execution sequence.
In step S73, the computation subgraph located after the next computation subgraph in the direction of the execution sequence and adjacent to the next computation subgraph is taken as the current computation subgraph, and the step of fusing the current computation subgraph and the next computation subgraph into the alternative computation subgraph is executed, that is, the step S71 is executed again to continue to fuse and judge the current computation subgraph and the next computation subgraph of the current computation subgraph in the direction of the execution sequence until all computation subgraphs are traversed.
For example, assume that all current computation subgraphs include computation subgraph H, computation subgraph I, computation subgraph J and computation subgraph K, serially connected in sequence, with the current computation subgraph being H and the next computation subgraph being I. When no failure error is reported when generating the executable file corresponding to the alternative computation subgraph obtained by fusing H and I, the two are suitable for fusion, so the fused alternative computation subgraph is used as a target computation subgraph. Computation subgraph J is then taken as the current computation subgraph, the fusion and judgment continue with computation subgraph J and computation subgraph K, and so on until all computation subgraphs are traversed.
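Taken together, the traversals of figs. 6 and 7 amount to one greedy pass over the serially connected subgraphs, which can be sketched as follows (subgraphs as lists of operator-node names, `compile_ok` standing in for "no failure error reported"; both names are illustrative assumptions):

```python
def greedy_fuse(subgraphs, compile_ok):
    """Walk the chain in execution order, trying to fuse each current
    subgraph with the next one: on success the fused pair becomes a
    target (fig. 7); on failure the current subgraph alone does (fig. 6)."""
    targets, i = [], 0
    while i < len(subgraphs):
        if i + 1 < len(subgraphs):
            candidate = subgraphs[i] + subgraphs[i + 1]  # trial fusion
            if compile_ok(candidate):
                targets.append(candidate)  # fused pair is a target
                i += 2                     # continue after the next subgraph
                continue
        targets.append(subgraphs[i])       # current subgraph alone is a target
        i += 1
    return targets

ok = lambda sub: len(sub) <= 3             # simulated chip capacity
chain = [["h1", "h2"], ["i1"], ["j1", "j2"], ["k1", "k2"]]
print(greedy_fuse(chain, ok))
# [['h1', 'h2', 'i1'], ['j1', 'j2'], ['k1', 'k2']]
```

In the printed run, H and I fuse successfully, while the trial fusion of J and K exceeds the simulated capacity, so each remains its own target subgraph.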
In addition, for the description of step S74, reference may be made to the above description of step S42, which is not repeated herein.
Fig. 8 is a schematic flowchart of another processing method of a neural network computation graph according to an embodiment of the present disclosure, and referring to fig. 8, in some embodiments, after an executable file corresponding to each computation sub-graph is sequentially generated according to each computation sub-graph, that is, after step S13, the processing method may further include: step S81 to step S84.
Step S81: in response to no failure error being reported when generating the executable file corresponding to each computation subgraph, sequentially fusing two adjacent computation subgraphs according to the direction of the execution sequence of all computation subgraphs.
Step S82: sequentially fusing two adjacent computation subgraphs in the direction opposite to the direction of the execution sequence.
Step S83: for any computation subgraph obtained by the fusion processing, taking that computation subgraph as a target computation subgraph in the case that no failure error is reported when generating the executable file corresponding to it.
Step S84: loading the executable file corresponding to each target computation subgraph onto the corresponding chip.
In step S81, when no failure error is reported when generating the executable file corresponding to each computation subgraph, in order to utilize the chip hardware resources reasonably, improve their utilization efficiency, and achieve load balance, adjacent computation subgraphs are first fused according to the direction of the execution sequence of all serially connected computation subgraphs (e.g., the top-to-bottom direction in fig. 2) to obtain alternative computation subgraphs. After each alternative computation subgraph is obtained, a corresponding executable file can be generated from it.
Meanwhile, in step S82, the adjacent computation subgraphs are fused according to the direction opposite to the above-mentioned execution sequence direction (e.g. the direction from bottom to top in fig. 2), so as to obtain alternative computation subgraphs, and after each alternative computation subgraph is obtained, a corresponding executable file may be generated according to the alternative computation subgraph obtained through the fusion processing.
For example, assume that all the current computation subgraphs include computation subgraph H, computation subgraph I, computation subgraph J, computation subgraph K, computation subgraph L and computation subgraph M connected in series in this order, the direction of execution order is from computation subgraph H to computation subgraph M, and the direction opposite to the direction of execution order is from computation subgraph M to computation subgraph H. In step S81, according to the execution sequence direction, first performing fusion processing on the computational sub-graph H and the computational sub-graph I; at the same time, in step S82, the computation sub-graph M and the computation sub-graph L are first fused in the direction opposite to the execution sequence direction.
After each alternative computation subgraph is obtained, a corresponding executable file can be generated from it. In step S83, for a computation subgraph obtained by fusion processing in either direction, if no failure error is reported when generating the corresponding executable file, the fused subgraph can still be supported by the corresponding chip and can run normally on it. The fused computation subgraph is therefore used as a target computation subgraph, and the fusion processing and judgment of two adjacent computation subgraphs continue simultaneously in the two directions, for example continuing with computation subgraph J and computation subgraph K, and so on until all computation subgraphs are traversed.
In some embodiments, for a computation subgraph obtained by fusion processing in any direction, when a failure error occurs when an executable file corresponding to the computation subgraph obtained by the fusion processing is generated, it indicates that two adjacent computation subgraphs before the fusion processing are not suitable for fusion, so the processing method may further include:
when the fused computation subgraph was obtained by fusion along the direction of the execution sequence, taking the earlier of the two adjacent pre-fusion computation subgraphs as a target computation subgraph, and continuing to fuse the later one, along the direction of the execution sequence, with its adjacent next computation subgraph;
and when the fused computation subgraph was obtained by fusion along the direction opposite to the execution sequence, taking the later of the two adjacent pre-fusion computation subgraphs as a target computation subgraph, and continuing to fuse the earlier one, along the direction opposite to the execution sequence, with its adjacent previous computation subgraph.
In addition, for the description of step S84, reference may be made to the above description of step S42, which is not repeated herein.
In some embodiments, trial fusion and judgment are sequentially performed on adjacent computation subgraphs along different directions, so that the efficiency of trial fusion and judgment can be effectively improved.
In some embodiments, after the executable file corresponding to each target computation subgraph is loaded to the corresponding chip, the processing method further includes: in response to the executable file corresponding to the target computation subgraph completing its run on the corresponding chip, storing the output result corresponding to the target computation subgraph to an external memory or a host corresponding to the chip.
The output result corresponding to the target computation subgraph is stored in an external memory or a host corresponding to the chip, and is read from the external memory (such as a double data rate synchronous dynamic random access memory, DDR SDRAM) or the host back to the chip when needed, which can effectively save the on-chip storage resources of the chip.
In some embodiments, the chip may be any one of the many-core chips in a many-core system. The many-core system may include one or more many-core chips based on a many-core architecture with integrated computation and storage, and each many-core chip may include multiple physical cores (also referred to as compute cores), each having independent memory.
In some embodiments, for the plurality of serially connected target computation subgraphs obtained through the above processing, the chips corresponding to the target computation subgraphs may all be the same many-core chip, may each be a different many-core chip, or some target computation subgraphs may correspond to different many-core chips while the others correspond to the same many-core chip. This may be determined according to actual needs and is not particularly limited in the embodiments of the present disclosure.
It can be understood that the above method embodiments of the present disclosure can be combined with one another to form combined embodiments without departing from principle and logic; for reasons of space, the details are not repeated in the present disclosure. Those skilled in the art will appreciate that, in the methods of the specific embodiments above, the specific execution order of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a processing apparatus, an electronic device, and a computer-readable storage medium, each of which can be used to implement the processing method of the neural network computation graph provided by the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method sections, which are not repeated here.
Fig. 9 is a block diagram of a processing apparatus according to an embodiment of the present disclosure, and referring to fig. 9, an embodiment of the present disclosure provides a processing apparatus 90, where the processing apparatus 90 is configured to process a neural network computation graph to be processed, where the neural network computation graph includes a plurality of operator nodes, and the processing apparatus 90 includes: a determination module 91, a first splitting module 92 and a generation module 93.
The determining module 91 is configured to determine all target operator nodes in the neural network computational graph according to the output connection relationships of the plurality of operator nodes, where the target operator nodes have a first output end and a second output end, and the operator nodes connected to the first output end are data output nodes.
The first splitting module 92 is configured to split the neural network computation graph into multiple computation subgraphs connected in series by using the second output end of the target operator node as a splitting point, where the output end of the operator node connected to the second output end is connected to other operator nodes.
The generating module 93 is configured to generate an executable file corresponding to each computation subgraph in sequence according to each computation subgraph.
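The splitting performed by the determining module and the first splitting module can be illustrated with a minimal sketch. It assumes the operator nodes are available in topological (execution) order and that a predicate identifies the target operator nodes; these names are illustrative only and are not part of the disclosure.

```python
def split_at_targets(op_order, is_target):
    """Split a topologically ordered operator list into serially connected
    computation subgraphs, cutting after each target operator node (a node
    whose first output feeds a data-output node while its second output
    feeds later operator nodes, per the description above)."""
    subgraphs, current = [], []
    for op in op_order:
        current.append(op)
        if is_target(op):
            subgraphs.append(current)   # close the subgraph at the split point
            current = []
    if current:
        subgraphs.append(current)       # trailing operators form the last subgraph
    return subgraphs
```

Each returned sublist corresponds to one computation subgraph, and the sublists are serially connected in the original execution order, matching the "plurality of computation subgraphs connected in series" described above.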
In some embodiments, the processing device 90 may further include a loading module (not shown in the figure) and a storage module (not shown in the figure). The loading module is configured to load the executable file corresponding to each computation subgraph to the corresponding chip; and the storage module is configured to, in response to the executable file corresponding to the computation subgraph completing its run on the chip, store the output result corresponding to the computation subgraph to an external memory or a host corresponding to the chip.
In some embodiments, the processing device 90 may further include a second splitting module (not shown). The generating module 93 is configured to: in response to a failure or error occurring when the executable file corresponding to the current computation subgraph is generated, trigger the second splitting module to further split the current computation subgraph into a plurality of serially connected computation subgraphs; and sequentially generate, from each computation subgraph obtained by the further splitting, a corresponding executable file.
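The generate-or-split-further behavior of the generating module and the second splitting module can be sketched as follows; `compile_subgraph` (returning `None` on failure or error) and `split_further` are hypothetical helpers standing in for the compiler back end and the second splitting module, respectively.

```python
def generate_executables(subgraphs, compile_subgraph, split_further):
    """Generate one executable per subgraph; on a compile failure or error,
    split that subgraph further and retry on the smaller pieces.

    compile_subgraph: hypothetical helper returning an executable, or None
                      when generation fails or reports an error
    split_further:    hypothetical helper returning a list of smaller
                      serially connected subgraphs
    """
    executables = []
    for sub in subgraphs:
        exe = compile_subgraph(sub)
        if exe is None:
            # Failure or error while generating the executable file:
            # recurse on the further-split pieces, in order.
            executables.extend(
                generate_executables(split_further(sub),
                                     compile_subgraph, split_further))
        else:
            executables.append(exe)
    return executables
```

The recursion preserves the serial execution order, so the resulting executables can still be loaded and run one after another on the corresponding chip.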
In some embodiments, the generating module 93 is configured to: and in response to no failure and no error when generating the executable file corresponding to the current computational sub-graph, taking the current computational sub-graph as a target computational sub-graph.
In some embodiments, the processing device 90 may further include a loading module (not shown in the figure) configured to load the executable file corresponding to each target computation subgraph to the corresponding chip.
In some embodiments, the processing device 90 may further include an obtaining module (not shown), a first fusion module (not shown), and a judging module (not shown). The obtaining module is configured to obtain at least one group of computation subgraphs in response to no failure or error occurring when the generating module 93 generates the executable file corresponding to each computation subgraph, where each group of computation subgraphs includes at least two serially connected computation subgraphs; the first fusion module is configured to perform fusion processing on the at least two computation subgraphs in each group of computation subgraphs to obtain an alternative computation subgraph corresponding to each group of computation subgraphs; and the judging module is configured to take the alternative computation subgraph as a target computation subgraph when no failure or error occurs while the generating module 93 generates the executable file corresponding to the alternative computation subgraph.
In some embodiments, the obtaining module is configured to determine, according to the subgraph attribute parameters corresponding to each computation subgraph, at least one group of computation subgraphs that can be subjected to fusion processing among all the computation subgraphs. The subgraph attribute parameters corresponding to a computation subgraph include the computation amount corresponding to the computation subgraph, the weight information, and the number of nodes of the corresponding vector acceleration unit (APU) graph.
In some embodiments, the obtaining module is configured to: sequentially check, according to the serial connection relationship and the execution order of all the computation subgraphs, whether at least two serially connected computation subgraphs meet the fusion condition; and when determining that the at least two serially connected computation subgraphs meet the fusion condition, take the at least two computation subgraphs as a group of computation subgraphs. The fusion condition includes: the sum of the computation amounts corresponding to the at least two computation subgraphs is greater than or equal to a minimum computation amount threshold and less than or equal to a maximum computation amount threshold; the sum of the weight information corresponding to the at least two computation subgraphs is greater than or equal to a minimum weight threshold and less than or equal to a maximum weight threshold; and the sum of the numbers of nodes of the APU graphs corresponding to the at least two computation subgraphs is greater than or equal to a minimum number threshold and less than or equal to a maximum number threshold.
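The three-part fusion condition can be sketched as a simple threshold check. The attribute names `compute`, `weights`, and `apu_nodes` and the dict-based subgraph representation are assumptions for illustration and do not come from the disclosure.

```python
def meets_fusion_condition(group, limits):
    """Check the fusion condition for a candidate group of serially
    connected subgraphs.

    group:  list of subgraphs, each a dict of subgraph attribute
            parameters (attribute names assumed for this sketch)
    limits: attribute name -> (minimum threshold, maximum threshold)
    """
    for key, (lo, hi) in limits.items():
        # Each condition bounds the SUM of the attribute over the group.
        total = sum(sub[key] for sub in group)
        if not lo <= total <= hi:
            return False
    return True
```

A group passes only when every summed attribute (computation amount, weight information, APU-graph node count) falls inside its [minimum, maximum] window, mirroring the three clauses of the fusion condition above.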
In some embodiments, the obtaining module is configured to respectively use every two serially connected computation subgraphs in all computation subgraphs as a group of computation subgraphs according to the serial connection relationship and the execution order of all computation subgraphs.
In some embodiments, the processing device 90 may further include a second fusion module (not shown), which is configured to: in response to no failure or error occurring when the generating module 93 generates the executable file corresponding to each computation subgraph, fuse the current computation subgraph and the next computation subgraph into an alternative computation subgraph according to the execution-order direction of all the computation subgraphs; when a failure or error occurs while the generating module 93 generates the executable file corresponding to the alternative computation subgraph, take the current computation subgraph as a target computation subgraph, take the next computation subgraph as the current computation subgraph, and return to the step of fusing the current computation subgraph and the next computation subgraph into an alternative computation subgraph; and when no failure or error occurs while the generating module 93 generates the executable file corresponding to the alternative computation subgraph, take the alternative computation subgraph as a target computation subgraph, take the computation subgraph whose execution order follows the next computation subgraph as the current computation subgraph, and return to the step of fusing the current computation subgraph and the next computation subgraph into an alternative computation subgraph.
In some embodiments, the processing device 90 may further include a third fusion module (not shown), which is configured to: in response to no failure or error occurring when the generating module 93 generates the executable file corresponding to each computation subgraph, sequentially fuse two adjacent computation subgraphs according to the execution-order direction of all the computation subgraphs and, at the same time, sequentially fuse two adjacent computation subgraphs in the direction opposite to the execution order; for any computation subgraph obtained through fusion processing, when no failure or error occurs while generating the corresponding executable file, take the computation subgraph obtained by the fusion processing as a target computation subgraph; and when a failure or error occurs while generating the corresponding executable file, if the computation subgraph was fused in the execution-order direction, take the earlier of the two adjacent computation subgraphs before the fusion processing as a target computation subgraph, and if the computation subgraph was fused in the direction opposite to the execution order, take the later of the two adjacent computation subgraphs before the fusion processing as a target computation subgraph.
In some embodiments, the processing device 90 may further include a storage module (not shown). The storage module is configured to, in response to the executable file corresponding to the target computation subgraph completing its run on the chip, store the output result corresponding to the target computation subgraph to an external memory or a host corresponding to the chip.
The processing apparatus provided in the embodiments of the present disclosure is used to implement the processing method provided in the embodiments, and specific descriptions may refer to relevant descriptions in the processing method of the embodiments, and are not described herein again.
Fig. 10 is a block diagram of an electronic device according to an embodiment of the present disclosure, and referring to fig. 10, an embodiment of the present disclosure provides an electronic device including: at least one processor 101; at least one memory 102, and one or more I/O interfaces 103 coupled between the processor 101 and the memory 102; the memory 102 stores one or more computer programs executable by the at least one processor 101, and the one or more computer programs are executed by the at least one processor 101 to enable the at least one processor 101 to execute the processing method of the neural network computation graph.
The disclosed embodiments also provide a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the processing method of the neural network computation graph. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium, among others.
The disclosed embodiments also provide a computer program product including computer readable code or a non-volatile computer readable storage medium carrying computer readable code, when the computer readable code is executed in a processor of an electronic device, the processor in the electronic device executes the processing method of the neural network computation graph.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable storage media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable program instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Read Only Memory (EPROM), Static Random Access Memory (SRAM), flash memory or other memory technology, portable compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer. In addition, communication media typically embodies computer readable program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry that can execute the computer-readable program instructions, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by utilizing state information of the computer-readable program instructions, thereby implementing aspects of the present disclosure.
The computer program product described herein may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise, as would be apparent to one skilled in the art. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (15)

1. A method of processing a neural network computational graph, the neural network computational graph comprising a plurality of operator nodes, the method comprising:
determining all target operator nodes in the neural network calculation graph according to the output connection relations of the operator nodes, wherein the target operator nodes are provided with a first output end and a second output end, and the operator nodes connected with the first output end are data output nodes;
splitting the neural network computation graph into a plurality of computation subgraphs which are connected in series by taking a second output end of the target operator node as a splitting point, wherein the output end of the operator node connected with the second output end is connected with other operator nodes;
and sequentially generating an executable file corresponding to each computation subgraph according to each computation subgraph.
2. The processing method according to claim 1, wherein after said sequentially generating an executable file corresponding to each said computational sub-graph from each said computational sub-graph, the processing method further comprises:
loading the executable file corresponding to each computation subgraph to a corresponding chip;
and responding to the completion of the running of the executable file corresponding to the computation subgraph on the chip, and storing the output result corresponding to the computation subgraph to an external memory or a host corresponding to the chip.
3. The processing method of claim 1, wherein the sequentially generating an executable file corresponding to each of the computation subgraphs according to each of the computation subgraphs comprises:
in response to a failure error report occurring while generating an executable file corresponding to the current compute subgraph, further splitting the current compute subgraph into a plurality of serially connected compute subgraphs;
and sequentially generating an executable file corresponding to each computation subgraph obtained by further splitting according to each computation subgraph obtained by further splitting.
4. The processing method of claim 1, wherein the sequentially generating an executable file corresponding to each of the computation subgraphs according to each of the computation subgraphs comprises:
in response to no failure and no error being reported when the executable file corresponding to the current computational sub-graph is generated, taking the current computational sub-graph as a target computational sub-graph;
and loading the executable file corresponding to each target computing subgraph to a corresponding chip.
5. The processing method according to claim 1, wherein after said sequentially generating an executable file corresponding to each said computational sub-graph from each said computational sub-graph, the processing method further comprises:
in response to no failure or error being reported when the executable file corresponding to each of the computation subgraphs is generated, acquiring at least one group of computation subgraphs, wherein each group of computation subgraphs comprises at least two serially connected computation subgraphs;
performing fusion processing on at least two computational subgraphs in each group of computational subgraphs to obtain alternative computational subgraphs corresponding to each group of computational subgraphs;
taking the alternative computation subgraph as a target computation subgraph under the condition that no failure and error are reported when an executable file corresponding to the alternative computation subgraph is generated;
and loading the executable file corresponding to each target computing subgraph to a corresponding chip.
6. The processing method of claim 5, wherein the obtaining at least one set of computational subgraphs comprises:
determining at least one group of computational subgraphs which can be subjected to fusion processing in all the computational subgraphs according to the subgraph attribute parameters corresponding to the computational subgraphs;
and the subgraph attribute parameters corresponding to the computation subgraph comprise the computation amount corresponding to the computation subgraph, weight information and the number of nodes of the corresponding vector acceleration unit APU graph.
7. The processing method according to claim 6, wherein the determining at least one group of the computational subgraphs which can be fused from among all the computational subgraphs according to the subgraph attribute parameters corresponding to each computational subgraph comprises:
according to the serial connection relation and the execution sequence of all the computation subgraphs, sequentially checking whether at least two computation subgraphs in serial connection meet the fusion condition;
when determining that the at least two serially connected computation subgraphs meet the fusion condition, taking the at least two computation subgraphs as a group of computation subgraphs;
wherein the fusion condition comprises: the sum of the computation amounts corresponding to the at least two computation subgraphs is greater than or equal to a minimum computation amount threshold and less than or equal to a maximum computation amount threshold; the sum of the weight information corresponding to the at least two computation subgraphs is greater than or equal to a minimum weight threshold and less than or equal to a maximum weight threshold; and the sum of the numbers of nodes of the APU graphs corresponding to the at least two computation subgraphs is greater than or equal to a minimum number threshold and less than or equal to a maximum number threshold.
8. The processing method of claim 5, wherein the obtaining at least one set of computational subgraphs comprises:
and according to the serial connection relation and the execution sequence of all the computation subgraphs, respectively taking every two serially connected computation subgraphs in all the computation subgraphs as a group of computation subgraphs.
9. The processing method according to claim 1, wherein after said sequentially generating an executable file corresponding to each said computational sub-graph from each said computational sub-graph, the processing method further comprises:
in response to no failure or error being reported when the executable file corresponding to each of the computation subgraphs is generated, fusing the current computation subgraph and the next computation subgraph into an alternative computation subgraph according to the execution-order direction of all the computation subgraphs;
taking the current computation sub-graph as a target computation sub-graph under the condition of failure error reporting when generating the executable file corresponding to the alternative computation sub-graph; and
taking the next computational subgraph as the current computational subgraph, and returning to execute the step of fusing the current computational subgraph and the next computational subgraph into an alternative computational subgraph;
and loading the executable file corresponding to each target computing subgraph to a corresponding chip.
10. The processing method of claim 9, wherein after said merging the current computational subgraph and the next computational subgraph into an alternative computational subgraph, the processing method further comprises:
taking the alternative computation subgraph as a target computation subgraph under the condition that no failure and error are reported when an executable file corresponding to the alternative computation subgraph is generated; and
and taking the computational subgraph with the execution sequence after the next computational subgraph as the current computational subgraph, and returning to execute the step of fusing the current computational subgraph and the next computational subgraph into the alternative computational subgraph.
11. The processing method according to claim 1, wherein after said sequentially generating an executable file corresponding to each said computational sub-graph from each said computational sub-graph, the processing method further comprises:
in response to no failure or error being reported when the executable file corresponding to each of the computation subgraphs is generated, sequentially fusing two adjacent computation subgraphs according to the execution-order direction of all the computation subgraphs;
simultaneously, sequentially fusing two adjacent computation subgraphs according to the direction opposite to the direction of the execution sequence;
for any computational sub-graph obtained through fusion processing, under the condition that no failure and error report exist when an executable file corresponding to the computational sub-graph obtained through fusion processing is generated, the computational sub-graph obtained through fusion processing is used as a target computational sub-graph;
and loading the executable file corresponding to each target computing subgraph to a corresponding chip.
12. The processing method according to claim 11, wherein in a case where a failure or error is reported when generating the executable file corresponding to the computational subgraph obtained by the fusion processing, the processing method further comprises:
when the computational subgraph obtained by the fusion processing was fused along the execution-order direction, taking the earlier (in execution order) of the two adjacent computational subgraphs before the fusion as a target computational subgraph; and
when the computational subgraph obtained by the fusion processing was fused along the direction opposite to the execution-order direction, taking the later of the two adjacent computational subgraphs before the fusion as a target computational subgraph.
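Again outside the claim language, the direction-dependent fallback of claims 11 and 12 for one fused adjacent pair can be sketched as follows, with `compile_ok` a hypothetical stand-in for executable-file generation succeeding without a failure or error:

```python
def pick_target(earlier, later, fused, compile_ok, direction):
    """Choose the target subgraph for one fused adjacent pair.

    earlier, later: the two adjacent subgraphs before fusion,
                    in execution order
    fused:          the subgraph obtained by fusing them
    direction:      "forward" if the pair was fused along the execution
                    order, "backward" if fused along the opposite direction
    """
    if compile_ok(fused):
        # Claim 11: a fused subgraph that compiles cleanly is the target.
        return fused
    # Claim 12: on failure, fall back to one member of the pair,
    # chosen according to the fusion direction.
    return earlier if direction == "forward" else later
```

With a size limit as the compile check, fusing `"a"` and `"b"` into `"ab"` under a limit of 2 keeps the fused pair; under a limit of 1, the forward pass falls back to `"a"` and the backward pass to `"b"`.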
13. A processing apparatus for processing a neural network computational graph, the neural network computational graph comprising a plurality of operator nodes, the processing apparatus comprising:
a determining module configured to determine all target operator nodes in the neural network computational graph according to output connection relationships of the plurality of operator nodes, wherein each target operator node has a first output end and a second output end, the operator node connected to the first output end is a data output node, and the output end of the operator node connected to the second output end is connected to other operator nodes;
a first splitting module configured to split the neural network computational graph into a plurality of serially connected computational subgraphs by taking the second output end of each target operator node as a split point; and
a generating module configured to sequentially generate, from each computational subgraph, an executable file corresponding to that computational subgraph.
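As an illustration of the determining module in claim 13 (not part of the claims), target operator nodes can be located from a successor map; the dict-of-lists graph representation and the `is_data_output` predicate are assumptions made for this sketch:

```python
def find_target_nodes(successors, is_data_output):
    """Find target operator nodes in a neural network computational graph.

    successors:     maps each operator node to the nodes on its output
                    ends, in output-port order (first, second, ...)
    is_data_output: returns True for data output nodes

    A target operator node has two output ends: the first feeds a data
    output node, while the second feeds a further operator node and
    serves as the split point for carving out serially connected
    computational subgraphs.
    """
    return [
        node
        for node, outs in successors.items()
        if len(outs) == 2
        and is_data_output(outs[0])
        # Simplification of "connected with other operator nodes":
        and not is_data_output(outs[1])
    ]
```

For instance, a node `"conv"` whose first output feeds the data output `"out1"` and whose second output feeds the operator `"relu"` would be reported as a target node, while `"relu"` (a single output) would not.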
14. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the processing method of any one of claims 1 to 12.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the processing method of any one of claims 1 to 12.
CN202210540587.XA 2022-05-17 2022-05-17 Processing method and processing device of neural network computation graph Pending CN114970814A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210540587.XA CN114970814A (en) 2022-05-17 2022-05-17 Processing method and processing device of neural network computation graph
PCT/CN2023/094832 WO2023222047A1 (en) 2022-05-17 2023-05-17 Processing method and processing unit for neural network computing graph, and device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210540587.XA CN114970814A (en) 2022-05-17 2022-05-17 Processing method and processing device of neural network computation graph

Publications (1)

Publication Number Publication Date
CN114970814A true CN114970814A (en) 2022-08-30

Family

ID=82983174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210540587.XA Pending CN114970814A (en) 2022-05-17 2022-05-17 Processing method and processing device of neural network computation graph

Country Status (1)

Country Link
CN (1) CN114970814A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023222047A1 (en) * 2022-05-17 2023-11-23 北京灵汐科技有限公司 Processing method and processing unit for neural network computing graph, and device and medium
WO2024067303A1 (en) * 2022-09-30 2024-04-04 深圳市中兴微电子技术有限公司 Simulation method, and electronic device and computer-readable medium
WO2024120050A1 (en) * 2022-12-09 2024-06-13 华为技术有限公司 Operator fusion method used for neural network, and related apparatus
CN116467061A (en) * 2023-06-19 2023-07-21 之江实验室 Task execution method and device, storage medium and electronic equipment
CN116467061B (en) * 2023-06-19 2023-09-19 之江实验室 Task execution method and device, storage medium and electronic equipment
US12039361B1 2023-06-19 2024-07-16 Zhejiang Lab Methods and apparatuses for executing tasks, storage mediums, and electronic devices

Similar Documents

Publication Publication Date Title
CN114970814A (en) Processing method and processing device of neural network computation graph
CN114841322A (en) Processing method and processing device of neural network computation graph
CN114841323A (en) Processing method and processing device of neural network computation graph
CN113703775B (en) Compiling method, compiling device, compiling equipment and storage medium
CN106547520B (en) Code path analysis method and device
CN105573734B (en) method and equipment for providing SDK file
CN110377340B (en) Operation method, device and related product
CN114818446B (en) Power service decomposition method and system facing 5G cloud edge terminal cooperation
CN111143446A (en) Data structure conversion processing method and device of data object and electronic equipment
CN104866556A (en) Database fault handling method and apparatus, and database system
CN114881214A (en) Processing method and processing device of neural network computation graph
CN115034358A (en) Processing method and processing device of neural network computation graph
US9262308B2 (en) Test paths generation for a physical system
CN118034924A (en) Data processing method and device based on many-core system, electronic equipment and medium
WO2023222047A1 (en) Processing method and processing unit for neural network computing graph, and device and medium
CN112633502A (en) Cross-platform execution method and device of deep learning model and electronic equipment
CN115809688B (en) Model debugging method and device, electronic equipment and storage medium
CN111027688A (en) Neural network calculator generation method and device based on FPGA
CN114881221A (en) Mapping scheme optimization method and device, electronic equipment and readable storage medium
CN110018831A (en) Program processing method, device and Related product
CN114449063B (en) Message processing method, device and equipment
CN113141407B (en) Page resource loading method and device and electronic equipment
CN114819106A (en) Calculation graph optimization method and device, electronic equipment and computer readable medium
CN114968225A (en) Micro-service unified construction method, environment generation method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination