CN110363700A - Custom-instruction parallel enumeration method based on deep graph partitioning - Google Patents

Custom-instruction parallel enumeration method based on deep graph partitioning

Info

Publication number
CN110363700A
CN110363700A (application CN201910627526.5A)
Authority
CN
China
Prior art keywords
subgraph
segmentation
custom instruction
time
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910627526.5A
Other languages
Chinese (zh)
Inventor
肖成龙
王珊珊
王心霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University filed Critical Liaoning Technical University
Priority to CN201910627526.5A priority Critical patent/CN110363700A/en
Publication of CN110363700A publication Critical patent/CN110363700A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 - Register arrangements
    • G06F9/30141 - Implementation provisions of register files, e.g. ports
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/20 - Processor architectures; Processor configuration, e.g. pipelining
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a custom-instruction parallel enumeration method based on deep graph partitioning, and relates to the field of electronic design automation (EDA). The method first adopts a master-slave parallel model: the master node of a computing cluster receives, as input, the data-flow graph produced by the intermediate-representation generation phase of the automatic generation flow for application-specific custom instruction sets. It then partitions the initial data-flow graph into several subgraphs using a deep graph-partitioning method based on a nonlinear-regression runtime prediction model, and distributes the partitioned subgraphs to idle compute nodes in the cluster. At the same time, the running time of each partitioned subtask is predicted, and, according to the predicted times of all subtasks and the number of compute nodes, the method decides whether complex subtasks need to be split further. Each compute node then enumerates custom instructions from the subgraphs it receives using a convex-subgraph enumeration algorithm. The method of the invention more effectively guarantees load balance among compute nodes and reaches near-linear speedup.

Description

Custom-instruction parallel enumeration method based on deep graph partitioning
Technical field
The present invention relates to the field of electronic design automation (EDA), and in particular to a custom-instruction parallel enumeration method based on deep graph partitioning.
Background technique
To meet the growing demand of embedded applications for high performance and low power consumption, custom computation using accelerators or custom function units (Custom Function Units) is applied more and more widely in embedded systems. Among these approaches, application-specific processors are one of the most important ways to realize custom computation.
An application-specific processor is a processor whose architecture and instruction set are optimized by design: through an extended instruction set, part of the target application's code executes on the base processor, while the remaining computation-intensive code executes on the hardware implementation of the custom instructions, the custom function units. An application-specific integrated circuit (ASIC), by contrast, is designed for one particular application and can run only that application, so it lacks flexibility. Relative to an ASIC, the base processor within an application-specific processor architecture preserves a degree of flexibility. Moreover, designing and fabricating an ASIC suffers from long design cycles and high verification costs.
Relative to a general-purpose processor, an application-specific processor packages sequences of basic operation instructions (for example additions, subtractions, multiplications, and logic operations) into custom instructions, so that basic operations linked by data dependences are chained automatically and basic instructions without data dependences execute in parallel, which greatly increases execution speed. In addition, because several basic instructions are packed into a single custom instruction, the number of instruction fetches and of data transfers between the register file and the processor decreases, so the power consumption of an application-specific processor is substantially lower than that of a general-purpose processor.
Automatic generation of the extended instruction set is the key to realizing application-specific processor design. As shown in Fig. 1, the automatic generation of an application-specific extended instruction set generally comprises four steps: intermediate-representation generation, custom-instruction enumeration, custom-instruction selection, and code generation. The intermediate-representation generation phase converts the application program into a suitable intermediate representation, for example a control/data-flow graph. The custom-instruction enumeration phase enumerates, under architectural constraints, all subgraphs satisfying the constraint conditions as candidate custom instructions. The custom-instruction selection phase selects, according to the design objective, a subset of the best subgraphs (the graphical representation of custom instructions) from those enumerated as the final custom instructions; the selected custom instructions constitute the final extended instruction set. The code-generation phase is responsible for automatically generating the hardware implementation code of the custom instructions and for converting the source code into new code that uses the custom instructions.
Custom-instruction enumeration and custom-instruction selection are the two most critical and most complex phases of the automatic generation of the extended instruction set. Much research has addressed custom-instruction enumeration, but all of it uses serial methods; when the problem is large, a serial method may fail to deliver an optimized design within a reasonable time, or fail to deliver one at all. The custom-instruction enumeration problem is to enumerate, from the data-flow graph of the application program, all convex subgraphs satisfying given design or user constraints as candidate custom instructions. The number of subgraphs of a given data-flow graph can be as large as 2^n, where n is the number of nodes in the data-flow graph, so custom-instruction enumeration is a problem of very high algorithmic complexity. To reduce the complexity of the problem, prior research either introduces microarchitectural constraints or artificially imposes constraint conditions. According to the constraint conditions used, prior research can be classified and analyzed as follows:
(1) Tree-shaped subgraphs (TS): to reduce enumeration complexity, early research focused mainly on enumerating all tree-shaped subgraphs. However, enumerating only tree-shaped subgraphs as custom instructions improves performance or reduces power consumption to a very limited extent.
(2) Multiple-input single-output subgraphs (MISO): this line of work focuses on enumerating all subgraphs with multiple inputs and a single output. Although enumerating only MISO subgraphs greatly reduces the complexity of the problem, the performance gain or power reduction that MISO subgraphs bring as custom instructions is also very limited.
(3) Multiple-input multiple-output subgraphs (MIMO): because the number of read/write ports (I/O) of the register file is limited, some recent research concentrates on enumerating subgraphs satisfying an I/O constraint. Pozzi et al. proposed an enumeration algorithm based on a binary decision tree, which prunes the search space using the monotonicity, in topological order, of the number of outputs of a growing subgraph in the data-flow graph. To better exploit the topological structure of the data-flow graph, Xiao et al. proposed a more efficient algorithm. Xiao et al. also proposed a custom-instruction parallel enumeration method based on one-shot graph partitioning; when the number of compute nodes is small, the method achieves near-linear speedup. Chen et al. first proved that the number of subgraphs satisfying the I/O constraint is bounded by n^{IN+OUT}, where n is the number of nodes in the data-flow graph, and also proposed an algorithm that, through parameter settings, can enumerate connected subgraphs and disjoint subgraphs separately. However, the running time of this algorithm when enumerating all subgraphs (both connected and disjoint) is comparable to that of the algorithm of Pozzi et al. Xiao et al. further proposed an algorithm that can flexibly enumerate connected or disjoint subgraphs as the user requires; experiments show that, on all subgraphs, it is one to two orders of magnitude faster than the algorithm of Chen et al.
(4) Maximal convex subgraphs (MaxMIMO): experiments have shown that enumerating custom instructions under a relaxed I/O constraint tends to bring higher performance gains. In recent years some research has therefore concentrated on enumerating maximal convex subgraphs without considering the I/O limit. Note that although MaxMIMO subgraphs as custom instructions can bring higher performance gains, they usually lack good reusability and generality.
Beyond the above types of custom-instruction enumeration, some scholars have recently proposed algorithms to enumerate all convex subgraphs (ACS). Gutin et al. first proved that the number of convex subgraphs of a data-flow graph is bounded by 2^n + n + 1 − d_n, where n is the number of nodes in the data-flow graph, with d_n = 2·2^{n/2} when n is even and d_n = 3·2^{(n−1)/2} when n is odd. Wang et al. proposed an algorithm that can enumerate all convex subgraphs, or the convex subgraphs satisfying a size constraint; it is on average 3.29 times faster than the algorithm proposed by Balistera et al. Wang et al. also proposed a parallel subgraph enumeration method based on one-shot graph partitioning; the experimental results show that when the number of compute nodes is small, this parallel method reaches near-linear speedup relative to the serial method. As the number of compute nodes grows, however, load imbalance gradually appears and the achieved speedup declines. Rezgui et al. proposed a parallel processing method that guarantees load balance among compute nodes by decomposing the initial problem into sufficiently many subproblems; through extensive experiments the authors found that assigning between 30 and 100 subproblems per compute node usually preserves load balance. However, because the appropriate number of subproblems per compute node is quite uncertain, this method can still leave the compute nodes load-imbalanced and so reduce the efficiency of parallel processing.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above shortcomings of the prior art, to provide a custom-instruction parallel enumeration method based on deep graph partitioning that realizes the parallel enumeration of custom instructions with a master-slave parallel model.
To solve the above technical problem, the technical solution adopted by the present invention is a custom-instruction parallel enumeration method based on deep graph partitioning, comprising the following steps:
Step 1: using the master-slave parallel model, the master node of the computing cluster receives as input the data-flow graph produced by the intermediate-representation generation phase of the automatic generation flow for the application-specific custom instruction set.
The data-flow graph is a directed acyclic graph G = (V, E), where the node set V = {v_1, …, v_n} represents basic instructions, n is the number of nodes of the data-flow graph, the edge set E = {e_1, …, e_m} ⊆ V × V represents data dependences between instructions, and m is the number of edges of the data-flow graph.
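As a sketch, the data-flow graph of Step 1 can be held in a small adjacency structure. The class below is illustrative (its names are not from the patent) and includes an acyclicity check, since a data-flow graph must be a DAG:

```python
class DataFlowGraph:
    """Minimal DAG G = (V, E): nodes are basic instructions,
    edges are data dependences. Illustrative sketch only."""

    def __init__(self):
        self.nodes = set()   # V
        self.edges = set()   # E subset of V x V

    def add_edge(self, u, v):
        self.nodes.update((u, v))
        self.edges.add((u, v))

    def successors(self, u):
        return {v for (a, v) in self.edges if a == u}

    def is_acyclic(self):
        # Kahn's algorithm: repeatedly remove nodes of in-degree 0;
        # the graph is a DAG iff every node gets removed.
        indeg = {v: 0 for v in self.nodes}
        for (_, v) in self.edges:
            indeg[v] += 1
        queue = [v for v in self.nodes if indeg[v] == 0]
        seen = 0
        while queue:
            u = queue.pop()
            seen += 1
            for v in self.successors(u):
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
        return seen == len(self.nodes)
```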
Step 2: partition the original data-flow graph into several subgraphs using the deep graph-partitioning method based on the nonlinear-regression runtime prediction model, and distribute the partitioned subgraphs to idle compute nodes in the computing cluster. Specifically:
Step 2.1: perform a one-shot partition of the custom-instruction enumeration task T into |V| subtasks:

    T = E(G_1, v_1) ∪ E(G_2, v_2) ∪ … ∪ E(G_{|V|}, v_{|V|})    (1)

where G_k = G − {v_1, v_2, …, v_{k−1}} is the k-th subgraph produced by partitioning the data-flow graph G, k = 1, 2, …, |V|, |V| is the number of nodes of G, and E(G_k, v_k) denotes enumerating from the k-th subgraph G_k all custom instructions that contain node v_k.
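The one-shot partition of Step 2.1 can be sketched as follows. Subtask k works on G_k = G − {v_1, …, v_{k−1}} and enumerates only subgraphs containing v_k, so the subtasks are disjoint and together cover every subgraph exactly once. The function is an illustrative stand-in, not the patent's implementation:

```python
def one_shot_partition(node_order):
    """One-shot partition of the enumeration task over nodes v_1..v_n.
    Returns one (nodes of G_k, mandatory node v_k) pair per node."""
    tasks = []
    for k, vk in enumerate(node_order):
        gk_nodes = set(node_order[k:])   # node set of G_k = G - {v_1..v_{k-1}}
        tasks.append((gk_nodes, vk))
    return tasks
```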
Step 2.2: establish the runtime prediction model based on nonlinear regression and predict the running time of the partitioned subtasks.
Under fixed constraint conditions, the custom-instruction enumeration time is related to the numbers of nodes and edges of the graph, and it grows as either the node count or the edge count increases. For a given subgraph enumeration task T_k and its corresponding data-flow graph G_k(V_k, E_k), the worst-case running time of custom-instruction enumeration is O(α^{|V_k|}), where α is a constant and |V_k| is the number of nodes of G_k. The exponent of this bound can be generalized in many forms; here it is assumed that

    t_k = α^{f(|V_k|, |E_k|)}    (2)

where f(|V_k|, |E_k|) is a polynomial function of the node count |V_k| and the edge count |E_k|.
Expanding formula (2) as a Taylor series gives the runtime prediction model:

    t_k = Σ_{0 ≤ i+j ≤ k'} a_{i,j} · |V_k|^i · |E_k|^j    (3)

where the parameter k' controls the order of the expansion, and the parameters a_{i,j} are obtained by fitting experimental data with the method of nonlinear regression.
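A least-squares fit of a polynomial model of this shape can be sketched as below. The patent fits the a_{i,j} by nonlinear regression on measured enumeration times; the helper here is only an illustrative stand-in using numpy, with hypothetical names:

```python
import numpy as np

def fit_runtime_model(samples, degree=2):
    """Fit t ~ sum_{i+j <= degree} a_ij * V^i * E^j by least squares.
    samples: list of (num_nodes, num_edges, measured_runtime) triples.
    Returns a predict(num_nodes, num_edges) function. Sketch only."""
    terms = [(i, j) for i in range(degree + 1)
                    for j in range(degree + 1 - i)]        # i + j <= degree
    X = np.array([[v ** i * e ** j for (i, j) in terms]
                  for (v, e, _) in samples], dtype=float)  # design matrix
    y = np.array([t for (_, _, t) in samples], dtype=float)
    coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

    def predict(v, e):
        return float(sum(a * v ** i * e ** j
                         for a, (i, j) in zip(coeffs, terms)))
    return predict
```

In the method itself, the samples would be (node count, edge count, running time) records measured on enumeration runs, as in the embodiment's evaluation.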
Step 2.3: according to the predicted times of all subtasks and the number of compute nodes, decide whether complex subtasks need further partitioning. If the predicted running time of a subtask exceeds the given time upper bound, continue to partition that subtask into several subtasks until the predicted running times of all subtasks are at most the given time upper bound; otherwise execute Step 3. The given time upper bound is the average predicted running time of all current subtasks.
Suppose the predicted running time of subtask T_k exceeds the given time upper bound; T_k is then partitioned further:

    T_k = E(G_{k,1}, {v_k, w_1}) ∪ E(G_{k,2}, {v_k, w_2}) ∪ … ∪ E(G_{k,h}, {v_k, w_h})    (4)

where G_{k,l} = G_k − {w_1, w_2, …, w_{l−1}} is the l-th subgraph produced by partitioning subtask T_k, and E(G_{k,l}, {v_k, w_l}) denotes enumerating from subgraph G_{k,l} all custom instructions containing both node v_k and node w_l. Here h serves as a cutoff: the predicted running time of T_{k,h−1} is greater than the current average predicted running time, and the predicted running time of T_{k,h} is less than the current average predicted running time. If the predicted running time of a task T_{k,l} still exceeds the upper bound, T_{k,l} is partitioned further into several subtasks.
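The deep-split step can be sketched as follows. A task is modeled as (node set, required nodes): splitting it drops the first l−1 free nodes and additionally requires w_l, mirroring the partition of T_k above, and splitting repeats while any subtask is predicted above the current average. The helpers are illustrative, not the patent's code:

```python
def deep_split(task):
    """Split one heavy subtask: task (N, M) enumerates subgraphs over
    node set N that contain every node in M. The l-th child is
    (N - {w_1..w_{l-1}}, M + {w_l}), so children are disjoint and
    cover the parent."""
    nodes, musts = task
    free = sorted(nodes - musts)
    return [(nodes - set(free[:l]), musts | {free[l]})
            for l in range(len(free))]

def balance(tasks, predict):
    """Keep splitting any subtask whose predicted time exceeds the
    current average predicted time of all subtasks (the 'given time
    upper bound'). Terminates because each child has strictly fewer
    free nodes than its parent."""
    while True:
        times = [predict(t) for t in tasks]
        bound = sum(times) / len(times)
        heavy = [t for t, p in zip(tasks, times)
                 if p > bound and len(t[0]) > len(t[1])]
        if not heavy:
            return tasks
        kept = [t for t, p in zip(tasks, times)
                if not (p > bound and len(t[0]) > len(t[1]))]
        tasks = kept + [c for t in heavy for c in deep_split(t)]
```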
Step 3: each compute node enumerates custom instructions from the subgraphs it receives, using a convex-subgraph enumeration algorithm.
Enumerating custom instructions means enumerating, from a given data-flow graph G = (V, E), all subgraphs S satisfying the following conditions: (1) subgraph S is a convex subgraph; (2) subgraph S is a connected graph.
Convex subgraph: for a subgraph S of G and any u, v ∈ V_S, if every path between u and v in G passes only through nodes of S, then S is a convex subgraph of G.
Connected graph: for a subgraph S of G and any u, v ∈ V_S, if there exists at least one path connecting u and v, then S is a connected graph.
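The convexity condition above can be checked directly: S is convex iff no path leaves S and re-enters it, i.e. there is no path u → x → … → v with u, v in S and x outside S. A small illustrative checker over explicit edge sets (not the patent's enumeration algorithm):

```python
def is_convex(edges, sub):
    """Return True iff node set `sub` induces a convex subgraph of the
    directed graph given by `edges`. Checks that no edge leaves `sub`
    toward an outside node from which `sub` is reachable again."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)

    def reach(src):
        # all nodes reachable from src by a nonempty path
        seen, stack = set(), [src]
        while stack:
            u = stack.pop()
            for v in succ.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    for u in sub:
        for x in succ.get(u, ()):
            if x not in sub and reach(x) & sub:
                return False   # a path u -> x -> ... -> v re-enters sub
    return True
```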
The beneficial effects of the above technical solution are the following. The custom-instruction parallel enumeration method based on deep graph partitioning provided by the invention takes into account the high complexity of the custom-instruction enumeration problem, divides the initial problem into several subproblems using a deep graph-partitioning method based on a task runtime prediction model, and has the compute nodes solve the subproblems independently. Compared with existing parallel methods, the method of the invention more effectively guarantees load balance among compute nodes and reaches near-linear speedup.
Detailed description of the invention
Fig. 1 is the automatic generation flow of an application-specific extended instruction set, as described in the Background;
Fig. 2 is a flow chart of the custom-instruction parallel enumeration method based on deep graph partitioning provided by an embodiment of the invention;
Fig. 3 is a framework diagram of the custom-instruction parallel enumeration method based on deep graph partitioning provided by an embodiment of the invention;
Fig. 4 compares the speedups obtained by three parallel methods on four test benchmark programs, where (a) is benchmark MP3, (b) is benchmark MESA, (c) is benchmark IIR, and (d) is benchmark DES3;
Fig. 5 compares the total running time of each compute node under the three parallel methods, where (a) is the MGP parallel method, (b) is the EPS parallel method, and (c) is the ODP parallel method;
Fig. 6 compares the times predicted by the runtime prediction model of the invention with the actual running times.
Specific embodiment
Specific embodiments of the invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, not to limit its scope.
In this embodiment, a custom-instruction parallel enumeration method based on deep graph partitioning, shown in Figs. 2 and 3, comprises the following steps:
Step 1: using the master-slave parallel model, the master node of the computing cluster receives as input the data-flow graph produced by the intermediate-representation generation phase of the automatic generation flow for the application-specific custom instruction set.
The data-flow graph is a directed acyclic graph G = (V, E), where the node set V = {v_1, …, v_n} represents basic instructions, n is the number of nodes of the data-flow graph, the edge set E = {e_1, …, e_m} ⊆ V × V represents data dependences between instructions, and m is the number of edges of the data-flow graph.
Step 2: partition the original data-flow graph into several subgraphs using the deep graph-partitioning method based on the nonlinear-regression runtime prediction model, and distribute the partitioned subgraphs to idle compute nodes in the computing cluster. Specifically:
Step 2.1: perform a one-shot partition of the custom-instruction enumeration task T into |V| subtasks:

    T = E(G_1, v_1) ∪ E(G_2, v_2) ∪ … ∪ E(G_{|V|}, v_{|V|})    (1)

where G_k = G − {v_1, v_2, …, v_{k−1}} is the k-th subgraph produced by partitioning the data-flow graph G, k = 1, 2, …, |V|, |V| is the number of nodes of G, and E(G_k, v_k) denotes enumerating from the k-th subgraph G_k all custom instructions that contain node v_k.
Step 2.2: among the subtasks produced by the above partition, some are clearly more complex than others, which cannot guarantee load balance among the compute nodes; complex subtasks must therefore be selected for further partitioning according to their predicted running times. Accordingly, a runtime prediction model based on nonlinear regression is established to predict the running time of the partitioned subtasks.
Under fixed constraint conditions, the custom-instruction enumeration time is related to the numbers of nodes and edges of the graph, and it grows as either the node count or the edge count increases. For a given subgraph enumeration task T_k and its corresponding data-flow graph G_k(V_k, E_k), the worst-case running time of custom-instruction enumeration is O(α^{|V_k|}), where α is a constant and |V_k| is the number of nodes of G_k. The exponent of this bound can be generalized in many forms; here it is assumed that

    t_k = α^{f(|V_k|, |E_k|)}    (2)

where f(|V_k|, |E_k|) is a polynomial function of the node count |V_k| and the edge count |E_k|.
Expanding formula (2) as a Taylor series gives the runtime prediction model:

    t_k = Σ_{0 ≤ i+j ≤ k'} a_{i,j} · |V_k|^i · |E_k|^j    (3)

where the parameter k' controls the order of the expansion, and the parameters a_{i,j} are obtained by fitting experimental data with the method of nonlinear regression.
Step 2.3: according to the predicted times of all subtasks and the number of compute nodes, decide whether complex subtasks need further partitioning. If the predicted running time of a subtask exceeds the given time upper bound, continue to partition that subtask into several subtasks until the predicted running times of all subtasks are at most the given time upper bound. The given time upper bound is the average predicted running time of all current subtasks.
Suppose the predicted running time of subtask T_k exceeds the given time upper bound; T_k is then partitioned further:

    T_k = E(G_{k,1}, {v_k, w_1}) ∪ E(G_{k,2}, {v_k, w_2}) ∪ … ∪ E(G_{k,h}, {v_k, w_h})    (4)

where G_{k,l} = G_k − {w_1, w_2, …, w_{l−1}} is the l-th subgraph produced by partitioning subtask T_k, and E(G_{k,l}, {v_k, w_l}) denotes enumerating from subgraph G_{k,l} all custom instructions containing both node v_k and node w_l. Here h serves as a cutoff: the predicted running time of T_{k,h−1} is greater than the current average predicted running time, and the predicted running time of T_{k,h} is less than the current average predicted running time. If the predicted running time of a task T_{k,l} still exceeds the upper bound, T_{k,l} is partitioned further into several subtasks.
Step 3: each compute node enumerates custom instructions from the subgraphs it receives, using a convex-subgraph enumeration algorithm.
Enumerating custom instructions means enumerating, from a given data-flow graph G = (V, E), all subgraphs S satisfying the following conditions: (1) subgraph S is a convex subgraph; (2) subgraph S is a connected graph.
Convex subgraph: for a subgraph S of G and any u, v ∈ V_S, if every path between u and v in G passes only through nodes of S, then S is a convex subgraph of G.
Connected graph: for a subgraph S of G and any u, v ∈ V_S, if there exists at least one path connecting u and v, then S is a connected graph.
This embodiment gives the pseudocode of the custom-instruction parallel enumeration method based on deep graph partitioning, as shown in Algorithm 1:
Algorithm 1: custom-instruction parallel enumeration method
Input: data-flow graph G(V, E)
Output: the set S of subgraphs satisfying the constraint conditions
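The body of Algorithm 1 is not reproduced in the text. Under the assumption that the master performs the one-shot partition of Step 2.1, deep-splits any subtask predicted to run longer than the current average (Step 2.3), and then dispatches the pieces, the master side can be sketched as follows. All names are illustrative; `enumerate_task` stands in for a worker running the convex-subgraph enumeration of Step 3, `predict` for the runtime model, and the sequential loop stands in for the cluster dispatch:

```python
def run_master(node_order, enumerate_task, predict):
    """Illustrative master-side sketch: one-shot partition, deep split
    of heavy subtasks, then dispatch. A task is (nodes of G_k, set of
    required nodes)."""
    tasks = [(set(node_order[k:]), {vk}) for k, vk in enumerate(node_order)]
    while True:
        times = [predict(t) for t in tasks]
        bound = sum(times) / len(times)          # given time upper bound
        split_next = [t for t, p in zip(tasks, times)
                      if p > bound and len(t[0]) > len(t[1])]
        if not split_next:
            break
        kept = [t for t in tasks if t not in split_next]
        for nodes, musts in split_next:
            free = sorted(nodes - musts)
            kept += [(nodes - set(free[:l]), musts | {free[l]})
                     for l in range(len(free))]
        tasks = kept
    results = []                                  # sequential stand-in
    for t in tasks:                               # for cluster dispatch
        results.extend(enumerate_task(t))
    return results
```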
The compute nodes used in this embodiment run on an i3-3240 3.4 GHz processor with 4 GB of main memory. The test benchmark set comes from MiBench; the benchmarks are front-end compiled and simulated with the generic compilation platform GeCoS. The characteristics of the data-flow graphs used in the embodiment and the running times of the serial algorithm are listed in Table 1, where columns NV, NE, and NS give, respectively, the number of nodes, the number of edges, and the number of enumerated subgraphs of each data-flow graph, and column Runtime records the running time of the serial algorithm in milliseconds.
Table 1: benchmark program characteristics and serial algorithm running times
Benchmark program NV NE NS Runtime
MP3 43 66 181,533,673 221,554
MESA 37 65 7,554,499 8,027
IIR 40 56 23,195,414 28,725
DES3 45 60 637,125,710 649,873
In this embodiment, the custom-instruction parallel enumeration method of the invention (denoted MGP) is compared with the parallel enumeration method proposed by Wang et al. (denoted ODP) and the parallel method proposed by Rezgui et al. (denoted EPS). The compute nodes involved are configured with an i3-3240 3.4 GHz processor and 4 GB of main memory, and all three parallel methods are implemented with Hadoop 1.0.0. Each compute node performs subgraph enumeration with the subgraph enumeration algorithm proposed by Wang et al. The test benchmark programs used in this embodiment are listed in Table 1. For the parallel method EPS, the number of subtasks assigned per compute node is set to 50, so the total number of subtasks is 50·w, where w is the number of compute nodes.
Fig. 4 shows the speedups obtained by the three parallel algorithms on the four test benchmark programs of Table 1 for different numbers of compute nodes. The comparison shows that when there are at most 10 compute nodes, the three parallel algorithms achieve similar, near-linear speedups. As the number of compute nodes increases, however, the speedup obtained by the MGP method is clearly better than that of the other two parallel algorithms, and the EPS method outperforms the ODP method; the speedups of the EPS and ODP methods cannot keep growing linearly with the number of compute nodes. For the ODP method in particular, beyond 18 compute nodes the growth of the speedup flattens and no longer increases linearly with the number of compute nodes. The comparison also shows that the MGP method handles load balancing better than the EPS and ODP methods.
To further analyze the speedup differences among the three parallel methods, this embodiment compares the total running time of each compute node when the three parallel methods enumerate custom instructions with the same number of compute nodes. On the basis of the earlier tests, the total running time of each compute node was recorded. Fig. 5 shows the per-node total running times of the three parallel methods for benchmark DES3 with 8 compute nodes. With the MGP method, the total running times of the compute nodes differ little (the gap between the maximum and minimum running time is 13.1%), while with the EPS or ODP method the differences between compute nodes are more pronounced (the gaps are 21.3% and 39.6%, respectively). The comparison reveals that the sizes of the subgraphs produced by the EPS and ODP methods vary widely: the EPS method is mainly concerned with the number of subgraphs produced by the partition, without considering subgraph sizes, and the size differences caused by the ODP method's one-shot partitioning are even more evident, whereas the MGP method, guided by the runtime prediction model, further splits larger subgraphs into several smaller ones, so the subgraph sizes it produces are more balanced. The comparison further shows that the MGP method of the invention gives the tasks assigned to the compute nodes similar complexity, which ensures load balance among the compute nodes and thus yields near-linear speedup.
Meanwhile the present embodiment also evaluates runing time prediction model.It has been first randomly generated 200 data flows Figure, the nodal point number of these data flow diagram | V | range is 20~55, the number on side | E |=1.5* | V |.For each random generation Data flow diagram, be configured to i3-3240 3.4GHz processor at one, subgraph piece used on the computer of 4-GB main memory It lifts algorithm and enumerates all convex portion figures for meeting constraint condition, and record corresponding actual algorithm runing time.Then, using wherein 185 parts of actual data (nodal point number, number of edges, runing time) carry out the parameter in formula (3) as training sample polynary non- Linear regression fit.In order to evaluate Runtime prediction model, the data flow diagram generated at random for 15 transports task The time of row time prediction model prediction compares with actual run time.Subgraph enumerates Runtime prediction mould The prediction runing time that type provides is as shown in Figure 6 compared with actual run time.Comparing result shows, prediction runing time with Actual run time is very close, and error range is 3%~12%.
Because graph partitioning and subtask running-time prediction incur extra time overhead, this embodiment analyzes in detail the proportion of the total running time of the parallel enumeration method spent on graph partitioning guided by the running-time prediction model. For the four benchmark programs, with the number of compute nodes fixed at 12, the proportion of the total running time taken by prediction-model-guided graph partitioning is shown in Table 2. The columns Benchmark, Pre.&Par. Time, Total Time, and Ratio give, respectively, the benchmark program name, the time required for graph partitioning and task prediction (in milliseconds), the total running time, and the ratio of the two (%). The results show that prediction-model-guided graph partitioning takes on average about 3.82% of the total running time. This indicates that the graph partitioning method based on the running-time prediction model of the present invention is highly efficient: its cost is far below the total running time and does not significantly affect the total running time of the parallel method.
Table 2 Time required for graph partitioning guided by the running-time prediction model compared with total running time
Finally, it should be noted that the above embodiments merely illustrate, and do not limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications or replacements do not take the essence of the corresponding technical solutions outside the scope defined by the claims of the present invention.

Claims (3)

1. A parallel custom instruction enumeration method based on multi-depth graph partitioning, characterized by comprising the following steps:
Step 1: using a master-slave parallel mode, the master node in the computing cluster receives, as input, the data-flow graph generated in the intermediate-representation generation phase of the application-specific custom instruction-set automatic generation flow;
The data-flow graph is a directed acyclic graph G = (V, E), where the node set V = {v1, ..., vn} represents basic instructions, n is the number of nodes of the data-flow graph, the edge set E = {e1, ..., em} ⊆ V × V represents the data-dependence relations between instructions, and m is the number of edges of the data-flow graph;
Step 2: partitioning the original data-flow graph into several subgraphs using the multi-depth graph partitioning method based on the nonlinear-regression running-time prediction model, and distributing the partitioned subgraphs to idle compute nodes in the computing cluster; specifically:
Step 2.1: the custom instruction enumeration task T is first partitioned once into |V| subtasks, as shown in the following formula:

T = E(G_1, v_1) ∪ E(G_2, v_2) ∪ ... ∪ E(G_|V|, v_|V|)   (1)

where G_k = G − {v1, v2, ..., v_{k−1}} is the k-th subgraph produced by partitioning the data-flow graph G, k = 1, 2, ..., |V|, |V| is the number of nodes in graph G, and E(G_k, v_k) denotes enumerating from the k-th subgraph G_k all custom instructions containing node v_k;
Step 2.2: establishing a running-time prediction model based on nonlinear regression to predict the running time of each partitioned subtask;
Step 2.3: according to the predicted times of all subtasks and the number of compute nodes, judging whether complex subtasks need to be partitioned further; if the predicted running time of a subtask exceeds a given time upper bound, that subtask is further partitioned into several subtasks, until the predicted running times of all subtasks are less than or equal to the given time upper bound; otherwise step 3 is executed; the given time upper bound is the average predicted running time of all current subtasks;
Step 3: each compute node enumerates custom instructions from the subgraphs it receives using a convex subgraph enumeration algorithm;
Said enumerating custom instructions is: enumerating, from a given data-flow graph G = (V, E), all subgraphs S satisfying the following conditions: (1) subgraph S is a convex subgraph; (2) subgraph S is connected;
Said convex subgraph is: for a subgraph S of G, ∀u, v ∈ S, if every path in G between u and v passes only through nodes of S, then S is called a convex subgraph of G;
Said connected subgraph is: for a subgraph S of G, ∀u, v ∈ S, there exists at least one path connecting u and v; then S is a connected subgraph.
2. The parallel custom instruction enumeration method based on multi-depth graph partitioning according to claim 1, characterized in that step 2.2 is specifically:
Under fixed constraint conditions, the custom instruction enumeration time is related to the numbers of nodes and edges of the graph, and increases as the node count or edge count grows; for a given subgraph enumeration task T_k and its corresponding data-flow graph G_k(V_k, E_k), the worst-case running time of custom instruction enumeration is O(α^{|V_k|}), where α is a constant and |V_k| is the number of nodes in graph G_k; the base in this formula admits many generalized forms, and here it is assumed that:

t(G_k) = α^{f(|V_k|, |E_k|)}   (2)
where f(|V_k|, |E_k|) is a polynomial function of the node count |V_k| and the edge count |E_k|;
Expanding formula (2) as a Taylor series yields the running-time prediction model, as shown in the following formula:

t(G_k) ≈ Σ_{i=0}^{k'} Σ_{j=0}^{k'} a_{i,j} · |V_k|^i · |E_k|^j   (3)
where parameter k' controls the order of the expansion, and the parameters a_{i,j} are obtained by fitting experimental data using the method of nonlinear regression.
3. The parallel custom instruction enumeration method based on multi-depth graph partitioning according to claim 2, characterized in that step 2.3 is specifically:
Assuming the predicted running time of subtask T_k exceeds the given time upper bound, T_k is further partitioned, as shown in the following formula:

T_k = E(G_{k,1}, {v_k, w_1}) ∪ E(G_{k,2}, {v_k, w_2}) ∪ ... ∪ E(G_{k,h}, {v_k, w_h})   (4)

where G_{k,l} = G_k − {w_1, w_2, ..., w_{l−1}} is the l-th subgraph produced by partitioning subtask T_k, and E(G_{k,l}, {v_k, w_l}) denotes enumerating from subgraph G_{k,l} all custom instructions containing both node v_k and node w_l; h serves as the cutoff value: the predicted running time of T_{k,h−1} is greater than the current average predicted running time, while the predicted running time of T_{k,h} is less than the current average predicted running time; if the predicted running time of a subtask T_{k,l} still exceeds the upper bound, T_{k,l} is further partitioned into several subtasks.
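The convexity condition of claim 1 can be tested directly: a node set S in a DAG fails to be convex exactly when some node outside S is both reachable from S and able to reach S, since such a node lies on a path that leaves S and re-enters it. The sketch below (function name and adjacency-dict representation are assumptions for illustration) implements only this membership test, not the enumeration algorithm of the claims.

```python
def is_convex(adj, S):
    """Return True iff node set S is convex in the DAG given as an
    adjacency dict {node: [successors]}: no path between two nodes of S
    may pass through a node outside S."""
    S = set(S)

    def reach(starts, graph):
        # All nodes reachable from `starts` via edges of `graph`.
        seen, stack = set(), list(starts)
        while stack:
            n = stack.pop()
            for m in graph.get(n, []):
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    # Build the reverse graph to find nodes that can reach S.
    radj = {}
    for u, vs in adj.items():
        for v in vs:
            radj.setdefault(v, []).append(u)

    down = reach(S, adj)   # nodes reachable from S
    up = reach(S, radj)    # nodes that can reach S
    # A violating node is outside S but both reachable from S and reaching S.
    return not ((down & up) - S)
```

For example, in the DAG a→b→c with the extra edge a→c, the set {a, c} is not convex because the path a→b→c passes through b, while {a, b, c} is convex.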
CN201910627526.5A 2019-07-12 2019-07-12 A kind of custom instruction parallel enumerating method based on depth map segmentation Pending CN110363700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910627526.5A CN110363700A (en) 2019-07-12 2019-07-12 A kind of custom instruction parallel enumerating method based on depth map segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910627526.5A CN110363700A (en) 2019-07-12 2019-07-12 A kind of custom instruction parallel enumerating method based on depth map segmentation

Publications (1)

Publication Number Publication Date
CN110363700A true CN110363700A (en) 2019-10-22

Family

ID=68218908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910627526.5A Pending CN110363700A (en) 2019-07-12 2019-07-12 A kind of custom instruction parallel enumerating method based on depth map segmentation

Country Status (1)

Country Link
CN (1) CN110363700A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024839A1 (en) * 2007-07-21 2009-01-22 Arssov Paul Plamen Variable instruction set microprocessor
US8972958B1 (en) * 2012-10-23 2015-03-03 Convey Computer Multistage development workflow for generating a custom instruction set reconfigurable processor
CN104796622A (en) * 2014-01-17 2015-07-22 宏达国际电子股份有限公司 Image segmentation device and image processing method
CN105574269A (en) * 2015-12-16 2016-05-11 青岛大学 Design verification method of special instruction processor
CN106919380A (en) * 2015-12-24 2017-07-04 英特尔公司 Programmed using the data flow of the computing device of the figure segmentation estimated based on vector
CN107329828A (en) * 2017-06-26 2017-11-07 华中科技大学 A kind of data flow programmed method and system towards CPU/GPU isomeric groups
CN107650123A (en) * 2017-06-30 2018-02-02 哈尔滨工大特种机器人有限公司 A kind of robotic programming method and apparatus of expansible instruction set
CN108664251A (en) * 2018-04-26 2018-10-16 北京控制工程研究所 A kind of telecommand code generating method based on multi-dimension feature extraction
CN108804383A (en) * 2018-05-30 2018-11-13 深圳大学 Supporting point parallel enumerating method and device based on metric space

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024839A1 (en) * 2007-07-21 2009-01-22 Arssov Paul Plamen Variable instruction set microprocessor
US8972958B1 (en) * 2012-10-23 2015-03-03 Convey Computer Multistage development workflow for generating a custom instruction set reconfigurable processor
CN104796622A (en) * 2014-01-17 2015-07-22 宏达国际电子股份有限公司 Image segmentation device and image processing method
CN105574269A (en) * 2015-12-16 2016-05-11 青岛大学 Design verification method of special instruction processor
CN106919380A (en) * 2015-12-24 2017-07-04 英特尔公司 Programmed using the data flow of the computing device of the figure segmentation estimated based on vector
CN107329828A (en) * 2017-06-26 2017-11-07 华中科技大学 A kind of data flow programmed method and system towards CPU/GPU isomeric groups
CN107650123A (en) * 2017-06-30 2018-02-02 哈尔滨工大特种机器人有限公司 A kind of robotic programming method and apparatus of expansible instruction set
CN108664251A (en) * 2018-04-26 2018-10-16 北京控制工程研究所 A kind of telecommand code generating method based on multi-dimension feature extraction
CN108804383A (en) * 2018-05-30 2018-11-13 深圳大学 Supporting point parallel enumerating method and device based on metric space

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENGLONG XIAO: "Parallel custom instruction identification for extensible processors", 《JOURNAL OF SYSTEMS ARCHITECTURE》 *
SHANSHAN WANG: "Parallel Enumeration of Custom Instructions Based on Multidepth Graph Partitioning", 《IEEE EMBEDDED SYSTEMS LETTERS》 *

Similar Documents

Publication Publication Date Title
Atasu et al. Automatic application-specific instruction-set extensions under microarchitectural constraints
Yazdanbakhsh et al. Flexigan: An end-to-end solution for fpga acceleration of generative adversarial networks
Clark et al. Automated custom instruction generation for domain-specific processor acceleration
Zuo et al. Improving high level synthesis optimization opportunity through polyhedral transformations
Breß et al. Efficient co-processor utilization in database query processing
Alexandrov et al. MapReduce and PACT-comparing data parallel programming models
Xu et al. PET: reducing database energy cost via query optimization
Kofler et al. Automatic data layout optimizations for gpus
KR20180034626A (en) Compile data processing graph
Selvitopi et al. Optimizing high performance Markov clustering for pre-exascale architectures
Chen et al. Efficient decision ordering techniques for SAT-based test generation
Jin et al. On fast enumeration of maximal cliques in large graphs
Verma et al. Fast, nearly optimal ISE identification with I/O serialization through maximal clique enumeration
Chen et al. {Locality-Aware} Software Throttling for Sparse Matrix Operation on {GPUs}
Engström et al. PageRank for networks, graphs, and Markov chains
Abraham et al. Efficient design space exploration in PICO
CN110363700A (en) A kind of custom instruction parallel enumerating method based on depth map segmentation
González-Álvarez et al. Automatic design of domain-specific instructions for low-power processors
JP2017111749A (en) Calculation code generation device, method and program
Krishna et al. Optimizing graph algorithms in asymmetric multicore processors
Park et al. XLA-NDP: Efficient Scheduling and Code Generation for Deep Learning Model Training on Near-Data Processing Memory
Ma et al. Parallel exact inference on multicore using mapreduce
Chen et al. Efficient resource constrained scheduling using parallel two-phase branch-and-bound heuristics
Werner et al. Automated composition and execution of hardware-accelerated operator graphs
Rajagopalan et al. Specification of software pipelining using petri nets

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191022