CN110187975B - Multi-core processor resource allocation calculation method, storage medium and terminal equipment - Google Patents


Info

Publication number
CN110187975B
CN110187975B (application CN201910482008.9A)
Authority
CN
China
Prior art keywords
core
slave
node
resource allocation
nodes
Prior art date
Legal status
Active
Application number
CN201910482008.9A
Other languages
Chinese (zh)
Other versions
CN110187975A (en)
Inventor
胡波
李一明
彭星洪
罗鸣
朱可
汪艳红
Current Assignee
Beijing Jieshi Zhitong Polytron Technologies Inc
Original Assignee
Chengdu Sunway Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Sunway Technology Co ltd filed Critical Chengdu Sunway Technology Co ltd
Priority to CN201910482008.9A priority Critical patent/CN110187975B/en
Publication of CN110187975A publication Critical patent/CN110187975A/en
Application granted granted Critical
Publication of CN110187975B publication Critical patent/CN110187975B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a resource allocation calculation method for a multi-core processor, together with a storage medium and a terminal device. The resource allocation calculation method comprises: inputting the scale of the application model and the number of available computing resources; and traversing, by way of region segmentation, to find the optimal resource allocation mode. For a given application model under specified computing resources, a resource allocation mode can be found quickly and accurately so that the application performance of the processor is maximized; different amounts of available computing resources may also be specified, allowing the user to strike a balance between computing resources and performance.

Description

Multi-core processor resource allocation calculation method, storage medium and terminal equipment
Technical Field
The invention relates to a computing resource allocation calculation method for a multi-core processor, and in particular to a calculation method, storage medium and terminal device that allocate resources according to the scale of an LBM application model and the available computing resources.
Background
Multi-core processors can be structurally divided into homogeneous multi-core processors and heterogeneous many-core processors. A heterogeneous many-core processor comprises master cores and slave cores: the master core performs general control and communication operations, while the slave cores perform intensive computation. Because the logic of the slave cores in a heterogeneous many-core processor is simpler, more cores can be integrated under the same process, improving computing performance more effectively. In a heterogeneous many-core processor the master and slave cores have different functions, and the slave-core computing speed, the intra-core-group inter-slave-core communication speed and the cross-core-group inter-slave-core communication speed all differ. The size of the model a user needs to compute, and the computing resources (in units of core groups) that can be provided to the user, also vary.
At present, nodes in the model are allocated to each slave-core computing resource according to the scale of the application model and the available computing resources, relying on the designer's experience supplemented by testing. A relatively optimized resource allocation mode may be found this way, but it requires rich design experience and still may not yield a well-optimized allocation scheme; moreover, each alternative scheme must be implemented in order to test its performance, so the cost of testing and screening is very high.
Therefore, given the scale of the application model and the computing resources, it is a difficult problem to allocate the nodes in the model to each slave core such that the slave-core loads are balanced and computation and communication proceed in parallel as far as possible, achieving the fastest operation speed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, provides a resource allocation calculation method for a multi-core processor, a storage medium and a terminal device, and solves the prior-art problem that load balancing of the LBM (Lattice Boltzmann Method) on the heterogeneous many-core processor 26010 is difficult.
The purpose of the invention is realized by the following technical scheme: a resource allocation calculation method of a multi-core processor comprises the following steps:
inputting the scale of the application model and the number of available computing resources;
and traversing and finding the optimal resource allocation mode in a region segmentation mode.
Before inputting the scale of the application model and the number of available computing resources, several basic speed values of the model are measured. These basic speed values include: the average calculation speed CalV of each node; the communication speed CoreOutComV at which each node's information is transferred out of the slave core; the communication speed CoreInComV at which each node's information is transferred into the slave core; and the inter-core-group communication speed GrpComV of each node's information.
Measuring the several basic speed values of the model comprises the following:
pre-storing n pieces of node information and node information around the n pieces of node information into a cache of a slave core;
calculating the interaction between the n nodes and their surrounding nodes and updating the information of the n nodes to obtain a time value, then dividing that time value by n to obtain the average calculation time per node, i.e., the calculation speed CalV;
counting a time value for transmitting n node information in a cache to a main memory by a slave core, and dividing the time value by n to obtain a communication speed CoreOutComV for transmitting each node information to the outside of the slave core;
counting a time value of reading n node information from a main memory to a cache by a slave core, and dividing the time value by n to obtain a communication speed CoreInComV of each node information transmitted to the slave core;
and counting the time value for one master core to transmit n pieces of node information to other master cores, and dividing the time value by n to obtain the inter-core-group communication speed GrpComV of each node's information.
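The four measurements above all reduce to dividing a measured wall-clock time by n. A minimal Python sketch, not part of the patent; the function name and example timings are illustrative assumptions:

```python
# Hypothetical helper: convert the four measured times for n nodes into the
# per-node speed values named in the text.
def per_node_speeds(n, t_calc, t_out, t_in, t_grp):
    return {
        "CalV": t_calc / n,        # average calculation time per node
        "CoreOutComV": t_out / n,  # slave core -> main memory, per node
        "CoreInComV": t_in / n,    # main memory -> slave core, per node
        "GrpComV": t_grp / n,      # master core -> other master cores, per node
    }

# Example with made-up timings (seconds) for n = 1000 nodes:
speeds = per_node_speeds(n=1000, t_calc=2.0, t_out=0.5, t_in=0.25, t_grp=1.5)
```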
To facilitate division of the model, the node area each slave core is responsible for is set to a cuboid or cube shape in three-dimensional space, and the core group area formed by all slave-core node areas in the same core group is likewise set to a cuboid or cube shape.
Traversing and finding the optimal resource allocation mode by way of region segmentation comprises the following:
traversing the cases in which each slave core is responsible for different numbers of nodes and different areas, and on that basis invoking the traversal of core-group combinations;
traversing the distribution of each core group once the node distribution each slave core is responsible for has been determined;
obtaining a particular resource combination from the node distribution each slave core is responsible for and the distribution of slave cores within the core group;
calculating the total time consumed by one iteration under that resource combination, according to the slave-core calculation speed CalV, the intra-core-group inter-slave-core communication speeds CoreOutComV and CoreInComV, and the cross-core-group inter-slave-core communication speed GrpComV;
and repeating the above steps to calculate the total time consumed by all resource combinations, obtaining the optimal resource allocation mode, i.e., the one with minimum time consumption.
The step of traversing the distribution of each core group includes eliminating, in advance, combinations that do not meet the requirements, so as to reduce the computation time needed to obtain the optimal resource allocation mode.
The step of eliminating, in advance, combinations that do not meet the requirements includes: judging whether the number of nodes a slave core is responsible for falls below the required minimum; judging whether the number of nodes in some dimension of the area a core group is responsible for is excessive; and judging whether the core-group resources occupied by the areas the core groups are responsible for exceed the core-group resources provided by the system.
The node allocation calculation method further comprises: when the optimal resource allocation mode is judged not to meet the requirements, repeating the step of inputting the model scale and the quantity of system resources and the step of traversing to find the optimal resource allocation mode, until an optimal resource allocation mode that meets the requirements is obtained.
A storage medium having stored thereon at least one computer-executable instruction which, when executed, performs the steps of a method of resource allocation computation for a multicore processor.
A terminal device comprises at least one memory and at least one processor, wherein the memory is stored with at least one computer executable instruction capable of running on the processor, and the processor executes the computer executable instruction to execute the steps of the resource allocation calculation method of the multi-core processor.
The invention has the beneficial effects that: the resource allocation calculation method, storage medium and terminal device for a multi-core processor can quickly and accurately find a resource allocation mode for an application model under specified computing resources, so that the application performance of the processor is maximized; different amounts of available computing resources may also be specified, allowing the user to strike a balance between computing resources and performance.
Drawings
FIG. 1 is a D3Q19 mesh model diagram of LBM;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "upper", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings or orientations or positional relationships that the products of the present invention conventionally use, which are merely for convenience of description and simplification of description, but do not indicate or imply that the devices or elements referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed," "mounted," and "connected" are to be construed broadly, e.g., as meaning fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following.
The discrete Lattice Boltzmann Method (LBM) is an important method in computational fluid dynamics, with advantages including a simple algorithm, the ability to handle complex boundary conditions, and high parallelism. The LBM has mesh models such as D2Q9, D3Q13, D3Q15 and D3Q19; different models correspond to different computational complexity and simulation accuracy. Among these, the D3Q19 mesh model has the highest computational complexity and the highest accuracy.
As shown in fig. 1, taking the D3Q19 model as an example, in three-dimensional geometric space a given node serves as the central node, and the 18 surrounding nodes exchange collision energy with it. Assuming the central node has coordinates (0,0,0) in the three-dimensional coordinate system (X, Y, Z), the 18 surrounding node coordinates are (0,0,1), (0,1,1), (0,1,0), (0,1,-1), (0,0,-1), (0,-1,-1), (0,-1,0), (0,-1,1), (1,0,0), (1,0,1), (1,1,0), (1,0,-1), (1,-1,0), (-1,0,0), (-1,0,1), (-1,1,0), (-1,0,-1) and (-1,-1,0).
In the D3Q19 mesh model, the central node interacts through collisions with its 18 surrounding nodes. The central node contains 19 vectors: the potential energy that remains stationary and the potential energies directed toward the other 18 nodes. In one iteration period, the forces of the 18 surrounding nodes on the central node are calculated from the potential energies and directions of the central node and its 18 surrounding nodes, and the central node's potential energy is updated. The interactions of every node in the model with its 18 surrounding nodes are calculated in turn, and each node's potential energy information is updated.
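The 19 vectors of the D3Q19 model can be enumerated directly: the rest vector plus the 18 neighbour offsets whose coordinate magnitudes sum to 1 (face neighbours) or 2 (edge neighbours). A short Python sketch, offered only as an illustration of the lattice described above:

```python
# D3Q19 velocity set: the stationary vector plus 18 interacting neighbours.
D3Q19 = [(0, 0, 0)] + [
    (x, y, z)
    for x in (-1, 0, 1) for y in (-1, 0, 1) for z in (-1, 0, 1)
    if abs(x) + abs(y) + abs(z) in (1, 2)  # 6 face + 12 edge neighbours
]
```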
In simulation analysis, accurate calculation requires the model to be large enough and the number of iterations to be sufficient, so the calculation time is very long, in some cases as long as several months. The conflict between simulation accuracy and simulation time severely restricts the application of the LBM; improving computational performance thus becomes an important means of resolving this conflict.
To improve computational performance, one can start from two aspects: raising the CPU operating frequency, or increasing parallelism. By the Pentium 4 era of integrated circuits, raising the operating frequency caused power consumption to rise sharply and heat dissipation to become an ever more serious problem. Improving CPU performance by raising the operating frequency therefore reached a bottleneck, and multi-core parallel processing became the way to improve computing performance.
Multi-core processors can be structurally divided into homogeneous multi-core processors and heterogeneous many-core processors. A heterogeneous many-core processor comprises master cores and slave cores: the master core performs general control and communication operations, while the slave cores perform intensive computation. Because the logic of the slave cores is simpler, more cores can be integrated under the same process, improving computing performance more effectively. The domestically produced Sunway (Shenwei) 26010 many-core processor is a high-performance heterogeneous many-core processor comprising 4 core groups, each containing 1 master core and 64 slave cores.
The computing resources of the 26010 many-core processor supercomputing platform are allocated in units of core groups. The master core in each core group is generally used for inter-core-group communication, slave-core task allocation and slave-core data collection, and does not participate in arithmetic operations; the high-density arithmetic computation is performed on the slave cores.
If the slave cores in each slave-core array perform no data interaction and no main-memory access, using only the code and data in their caches, they can run fully independently in parallel, and the degree of parallelism reaches its maximum. If data interaction is needed among the slave cores within a core group, the convenient and fast approach is DMA reads and writes to the shared main memory. Since each core group has only one memory controller, the DMA operations initiated by the 64 slave cores in a core group are executed serially, while the memory controllers and main memories of different core groups work independently in parallel. Therefore, the time a program spends on intra-core-group inter-slave-core communication is 64 times the time one slave core spends on inter-slave-core data communication.
If a slave core interacts with slave cores in other core groups, the convenient and fast approach is asynchronous MPI communication, because interaction between core groups passes through a non-blocking crossbar IB switch; that is, communication between any two core groups is unaffected by communication involving a third core group. The time a program spends on cross-core-group inter-slave-core communication is the time for one core group to perform one cross-core-group communication.
As shown in fig. 2, a resource allocation calculation method for the Sunway (Shenwei) 26010 many-core processor supercomputing platform comprises the following steps:
s1, inputting the scale of the application model and the number of available computing resources;
and S2, traversing through the region division mode to find the optimal resource allocation mode.
Furthermore, the general model is a cuboid whose lengths in the three dimensions are ModX, ModY and ModZ, giving ModX × ModY × ModZ nodes; in each iterative calculation, the interaction between every node in the model and its 18 surrounding nodes must be computed. For nodes at the edges of the model, interaction is with the nodes at the corresponding positions on the opposite side of the model, in a wrap-around manner.
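The wrap-around rule for edge nodes amounts to periodic (modular) indexing. A minimal sketch; the example dimensions are assumptions, not values from the patent:

```python
# A neighbour index that falls off one edge of the model re-enters on the
# opposite edge (periodic boundary treatment).
def wrap_neighbour(node, offset, dims):
    return tuple((c + d) % m for c, d, m in zip(node, offset, dims))

dims = (100, 80, 60)  # ModX, ModY, ModZ (illustrative sizes)
```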
Suppose a total of MaxGrp core-group resources are available; each core group contains 64 slave cores, so there are 64 × MaxGrp slave-core computing resources in total, of which part or all may be used.
further, in order to perform reasonable resource allocation, the sizes of several basic speed values of the calculated model need to be tested before inputting the scale of the application model and the number of available calculation resources; the several basic speed values include the average calculated speed CalV of each node, the communication speed CoreOutComV of each node information to the communication speed coreextracore, the communication speed corelncomv of each node information to the communication speed corelncomv to the communication speed and the communication speed GrpComV between the cores of each node information.
Measuring the several basic speed values of the model comprises the following:
a1, pre-storing n pieces of node information and the node information around the n pieces of node information into a cache of a slave core;
a2, calculating the interaction between the n nodes and their surrounding nodes and updating the information of the n nodes to obtain a time value, then dividing that time value by n to obtain the average calculation time per node, i.e., the calculation speed CalV;
a3, counting the time value for a slave core to transfer n pieces of node information from its cache to main memory by DMA, and dividing the time value by n to obtain the communication speed CoreOutComV at which each node's information is transferred out of the slave core;
a4, counting the time value of a slave core reading n node information from a main memory in a DMA mode to a cache, and dividing the time value by n to obtain the communication speed CoreInComV of each node information transmitted to the slave core;
a5, counting the time value for one master core to transmit n pieces of node information to other master cores by asynchronous MPI, and dividing the time value by n to obtain the inter-core-group communication speed GrpComV of each node's information.
Slave-core calculations can be carried out in parallel, so the calculation time of one iteration is the calculation time of one slave core. Intra-core-group inter-slave-core communication is serial within a core group but parallel across core groups, so the intra-core-group communication time of one iteration is 64 times the communication time of a single slave core. Cross-core-group inter-slave-core communication operates in parallel, so the cross-core-group communication time of one iteration is the communication time of a single core group.
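The timing rules above can be summarized in one cost expression: computation, intra-group communication and cross-group communication overlap, so an iteration costs their maximum, with the serialized DMA contributing the factor of 64. A hedged Python sketch; how the per-node counts map onto the speed values is an assumption on my part:

```python
def iteration_time(n_calc, n_intra, n_inter,
                   CalV, CoreOutComV, CoreInComV, GrpComV):
    t_calc = n_calc * CalV                               # one slave core's computation
    t_intra = 64 * n_intra * (CoreOutComV + CoreInComV)  # DMA serialized within a group
    t_inter = n_inter * GrpComV                          # cross-group MPI, fully parallel
    return max(t_calc, t_intra, t_inter)                 # the three phases overlap
```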
In order to facilitate the division of the model, each slave core responsible node area in a three-dimensional space is set to be in a cuboid or cube shape, and a core group area formed by all slave core node areas in the same core group is set to be in a cuboid or cube shape.
Further, determining the node-region shape proceeds as follows. Assume each slave core is responsible for N nodes, whose information is stored in its cache, and the slave core sequentially calculates the interaction between each of the N nodes and its 18 surrounding nodes; call the region where these N nodes are located region N.
When a node A is considered, the calculation with a surrounding node B falls into two cases: B is inside region N, or outside it. In the first case, when node B is inside node A's region N, the slave core can obtain the information of both nodes directly, calculate the force between them, and update both nodes' information. In the second case, when node B is outside region N, B's information must be transmitted to the slave core through communication, and the slave core then calculates the force between the two nodes and updates node A's information; similarly, the slave core to which B belongs acquires node A's information through communication, calculates the force, and updates node B's information. If A and B lie within one core group, intra-core-group inter-slave-core communication is used; if A and B lie across core groups, the slower cross-core-group inter-slave-core communication is used.
Obviously, calculation is fastest when both node A and node B lie within region N. Generalizing, the nodes surrounding the N nodes a slave core is responsible for should lie within region N as far as possible; that is, for a fixed volume of region N, its surface area should be as small as possible, so that the amount of communication with the outside is minimal.
According to geometry, a sphere has the minimum surface area for a given volume, but spheres cannot divide a cuboid model well: many edge nodes could not be assigned to any sphere-shaped region that a slave core is responsible for. To balance small surface area against convenient division of the model, the slave-core node region N is set to a cuboid or cube shape, which keeps the region's surface area small while dividing the model conveniently.
Between intra-core-group communication and cross-core-group communication, the cross-core-group communication speed is slower; similarly, the core-group region M composed of the node regions of the 64 slave cores in the same core group should also be set to a cuboid or cube shape, so that its surface area is small and the model can be divided conveniently.
The region surface area, i.e., the communication volume, should include both the inner and outer surface areas of the region. For example, for a region N that one slave core or one core group is responsible for, with sizes X, Y, Z in the three dimensions, the amount of node information on its inner surface that must be transferred to other slave cores or core groups is 2 × (X×Y + Y×Z + Z×X), and the amount of node information that must be acquired from other slave cores or core groups is 2 × ((X+2)×(Y+2) + (Y+2)×(Z+2) + (Z+2)×(X+2)).
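The two traffic formulas above, sketched in Python for concreteness (the function name is an assumption):

```python
# Per-iteration communication volume of a cuboid region of size X x Y x Z:
# node information sent out from its inner surface, and node information of
# the surrounding layer acquired from neighbouring regions.
def region_traffic(X, Y, Z):
    outbound = 2 * (X * Y + Y * Z + Z * X)
    inbound = 2 * ((X + 2) * (Y + 2) + (Y + 2) * (Z + 2) + (Z + 2) * (X + 2))
    return outbound, inbound
```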
If one slave core spans the whole model in a certain dimension, the two surfaces perpendicular to that dimension are not counted in the slave-core region's surface-area statistics, i.e., there is no intra-core-group inter-slave-core communication in that dimension. Similarly, if one core group spans the whole model in a certain dimension, the two surfaces perpendicular to that dimension are not counted in the core-group region's surface-area statistics, i.e., there is no cross-core-group inter-slave-core communication in that dimension.
Slave-core computation runs independently on each slave core. Intra-core-group inter-slave-core communication consists of a slave core initiating a DMA operation, the DMA controller transferring the data, and the slave core checking that the DMA operation has completed; the slave core's own share of this time is very short and negligible. Cross-core-group inter-master-core communication is asynchronous MPI communication initiated by the master core. These three operations can proceed in parallel, so the total running time of one iteration is the maximum of the running times of the three operations.
Step S2 finds the optimal resource allocation mode through region-segmentation traversal, which comprises the following:
s21, traversing the condition that each slave core is in charge of different node numbers and areas, and calling the combination condition of the traversed core group on the basis;
s22, traversing the distribution condition of each core group under the condition that the distribution of the slave core responsible node is determined;
further, traversing the area in charge of the slave core as an outer loop, and traversing the combined situation of 64 slave cores in the core group as an inner loop on the basis of a certain area in charge of the slave core, thereby obtaining the combined situation of the area in charge of one core group; performing nuclear group area superposition on three dimensions under the combination condition to obtain a nuclear group area, and enabling the nuclear group area to just cover the model area; under the condition that the core group area is determined, the core group set area is also uniquely determined, the resource allocation mode is also determined, and the minimum time consumption can be calculated.
S23, obtaining a particular resource combination from the node distribution each slave core is responsible for and the distribution of slave cores within the core group;
S24, calculating the total time consumed by one iteration under that resource combination, according to the slave-core calculation speed CalV, the intra-core-group inter-slave-core communication speeds CoreOutComV and CoreInComV, and the cross-core-group inter-slave-core communication speed GrpComV;
and S25, repeating the above steps to total the time consumed by all resource combinations, obtaining the optimal resource allocation mode with minimum time consumption.
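Steps S21–S25 amount to a two-level exhaustive search scored by a cost function. A highly simplified sketch; the candidate generators and the cost function are placeholders, not the patent's data structures:

```python
def find_best_allocation(core_regions, group_arrangements, cost):
    best, best_time = None, float("inf")
    for region in core_regions:                 # S21: slave-core region (outer loop)
        for arrangement in group_arrangements:  # S22: core-group layout (inner loop)
            t = cost(region, arrangement)       # S23/S24: time of one iteration
            if t < best_time:                   # S25: keep the fastest combination
                best, best_time = (region, arrangement), t
    return best, best_time

# Toy usage with a dummy cost function:
best, best_time = find_best_allocation([1, 2, 3], [1, 2], lambda r, a: r * a)
```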
The step of traversing by region segmentation to find the optimal resource allocation mode further comprises eliminating, in advance, combinations that do not meet the requirements, so as to reduce the computation time needed to obtain the optimal resource allocation mode.
The step of eliminating, in advance, combinations that do not meet the requirements includes: judging whether the number of nodes a slave core is responsible for falls below the required minimum; judging whether the number of nodes in some dimension of the area a core group is responsible for is excessive; and judging whether the core-group resources occupied by the areas the core groups are responsible for exceed the core-group resources provided by the system. That is: cases where the number of nodes a slave core is responsible for is too small to meet the minimum processing requirement are removed in advance; cases where an excessive number of nodes in some dimension of a core group's region unbalances the load and increases time consumption are removed in advance; and cases where the number of core groups occupied exceeds the available computing resources are removed.
Further, when the inner loop traverses the core group's responsible regions, some combination conditions are excluded in advance according to the sizes of the core group's region in the three dimensions. Specifically, if the number of nodes of a core group's responsible region in a certain dimension is greater than the number of nodes of the model in that dimension, and the excess is greater than or equal to the number of slave cores of the core group in that dimension, then each slave core in that dimension can calculate at least one node fewer and the region still covers the whole model. In other words, on the basis of this combination, the combination in which each slave core handles 1 node fewer in that dimension also covers the whole model; the larger combination clearly costs more calculation time and communication time than the combination reduced by 1 node, and therefore the combination whose core group exceeds the standard in that dimension is eliminated.
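The dimension-wise pruning rule just described can be sketched as a small predicate. The function name and parameter names are hypothetical; only the inequality (excess of the core-group region over the model being at least one node per slave core in that dimension) follows the patent's description:

```python
def oversized_in_dimension(grp_region_nodes, model_nodes, slave_cores_in_dim):
    """Return True if this core-group combination can be pruned in one dimension.

    If the group's region exceeds the model by at least one node per slave
    core in that dimension, every slave core could drop one node and the
    smaller region would still cover the model, so the larger combination
    can only cost more time and is eliminated in advance.
    """
    excess = grp_region_nodes - model_nodes
    return excess > 0 and excess >= slave_cores_in_dim
```

For example, a group region of 12 nodes against an 8-node model with 4 slave cores in that dimension is pruned (excess 4 ≥ 4), while an excess of only 2 is kept.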
The method specifically comprises the following steps:
a. The TraversalCore function traverses the combinations of slave-core responsible regions, for example starting from {1,1,A} and proceeding in order through {1,1,A+1}, {1,1,A+2}, ..., {1,1,ModZ}, {1,2,B+1}, ..., {1,2,ModZ}, ..., {2,1,N}, ..., {2,1,ModZ}, ...; the values B and N below are calculated in the same way as A, so that the number of traversals can be reduced;
b. For a given slave-core responsible region produced by the TraversalCore function, such as {CoreX, CoreY, CoreZ}, the TraversalGrp function is called and the inner loop traverses the combination cases of the core group's responsible regions. Since the region a core group is responsible for is also a cuboid or a cube, the 64 slave cores must likewise be arranged as a cuboid or a cube, giving the 28 arrangement cases listed in the array CoreCombination. When traversing these 28 cases, two kinds of cases are eliminated to reduce the number of traversals: first, cases where the number of nodes of the core group in a certain dimension is greater than the number of nodes of the model in that dimension and the excess is greater than or equal to the number of slave cores of the core group in that dimension; second, cases where the number of core groups to be occupied exceeds the number of core group resources provided;
c. For each combination traversed by the TraversalCore function, the outer loop determines the region each slave core is responsible for and the inner loop determines the region each core group is responsible for, which together yield one computing resource allocation. For this allocation, the CalTotalTime function is called, and the total time consumed by one iteration under the allocation is calculated using the 4 speed values obtained by pre-testing;
d. The consumed time of every resource allocation mode is calculated to find the resource allocation mode with the shortest consumed time.
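Steps a to d can be sketched as a nested traversal. The loop structure follows the patent's TraversalCore/TraversalGrp/CalTotalTime description, but the body is an illustrative reconstruction, not the patent's code; `grp_combinations` stands in for the 28 arrangements in CoreCombination, and `total_time` for the CalTotalTime cost function:

```python
import itertools
import math

def find_best_allocation(mod, grp_combinations, max_groups, total_time):
    """Traverse slave-core regions (outer) and core-group arrangements (inner),
    pruning invalid combinations, and return the cheapest allocation found.

    mod              - (ModX, ModY, ModZ) model size in nodes
    grp_combinations - candidate 3-D arrangements of the slave cores in a group
    max_groups       - number of core groups the system provides
    total_time       - cost function for one iteration (CalTotalTime)
    """
    best_time, best_alloc = float("inf"), None
    # Outer loop (TraversalCore): region {CoreX, CoreY, CoreZ} per slave core.
    for core in itertools.product(*(range(1, m + 1) for m in mod)):
        # Inner loop (TraversalGrp): arrangement of the slave cores in a group.
        for grp in grp_combinations:
            region = tuple(c * g for c, g in zip(core, grp))
            # Prune 1: the group's region exceeds the model by at least one
            # node per slave core in some dimension.
            if any(r > m and r - m >= g for r, m, g in zip(region, mod, grp)):
                continue
            # Prune 2: more core groups would be needed than the system provides.
            groups = math.prod(math.ceil(m / r) for m, r in zip(mod, region))
            if groups > max_groups:
                continue
            t = total_time(core, grp, mod)
            if t < best_time:
                best_time, best_alloc = t, (core, grp)
    return best_time, best_alloc
```

With a toy cost function that just sums the per-core region sizes, the smallest feasible region wins, mirroring step d's exhaustive minimum search.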
When the region a core group is responsible for is determined, the combination of the core group's region is uniquely determined and the occupied core group resources are determined accordingly; whether the occupied core group resources exceed the core group resources provided by the system is then judged, and if they do, the combination is eliminated.
The resource allocation calculation method further comprises: when the optimal resource allocation mode is judged not to meet the requirements, repeating the step of inputting the model scale and the quantity of computing resources of the system and the step of traversing to find the optimal resource allocation mode, until an optimal resource allocation mode meeting the requirements is obtained.
Further, judging whether the optimal resource allocation mode meets the requirements means judging whether its time consumption meets the user's requirement; this requirement may differ from user to user.
A storage medium having stored thereon at least one computer-executable instruction which, when executed, performs the steps of the above resource allocation calculation method for an LBM-based SW 26010 many-core processor supercomputing platform.
A terminal device comprising at least one memory and at least one processor, said memory having stored thereon at least one computer executable instruction executable on said processor, said processor when executing said computer executable instruction performing the steps of said method of resource allocation computation for an LBM-based SW 26010 multi-core processor supercomputing platform.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A resource allocation calculation method for a multi-core processor, characterized in that the resource allocation calculation method comprises the following steps:
inputting the scale of the application model and the number of available computing resources; the application model is an LBM model;
traversing and finding out the optimal resource allocation mode through a region division mode;
the region segmentation mode traversal is as follows:
setting the LBM model as a cuboid whose lengths in the three dimensions are ModX, ModY and ModZ, the model thus comprising ModX × ModY × ModZ nodes;
each slave core is responsible for n nodes in the LBM model, and the region where the n nodes are located is set to be a cuboid or cube in three-dimensional space; the core group region formed by the node regions of all the slave cores in the same core group is likewise set to be a cuboid or cube; the cases in which each slave core is responsible for different node numbers and regions are traversed, and on this basis the traversal of the core group combination cases is invoked;
when the distribution of the nodes each slave core is responsible for is determined, the distribution cases of each core group are traversed.
2. The method of claim 1, wherein the method further comprises: before inputting the scale of the application model and the quantity of available computing resources, testing and calculating the sizes of several basic speed values of the model; the several basic speed values comprise the average calculation speed CalV of each node, the communication speed CoreOutComV at which each node's information is transferred out of the slave core, the communication speed CoreInComV at which each node's information is transferred into the slave core, and the inter-core-group communication speed GrpComV of each node's information;
the sizes of the several basic speed values of the test calculation model comprise the following contents:
pre-storing n pieces of node information and node information around the n pieces of node information into a cache of a slave core;
calculating the interaction between the n nodes and the surrounding nodes and updating the information of the n nodes to obtain a time value, and dividing the time value by n to obtain the average calculation time of each node, namely the calculation speed CalV;
counting a time value for transmitting n node information in a cache to a main memory by a slave core, and dividing the time value by n to obtain a communication speed CoreOutComV for transmitting each node information to the outside of the slave core;
counting a time value of reading n node information from a main memory to a cache by a slave core, and dividing the time value by n to obtain a communication speed CoreInComV of each node information transmitted to the slave core;
and counting the time value for transmitting n pieces of node information from one main core to other main cores, and dividing the time value by n to obtain the communication speed GrpComV of each node's information between the core groups.
3. The method of claim 2, wherein the method comprises: the step of traversing through the region segmentation mode to find the optimal resource allocation mode specifically comprises the following steps:
traversing the cases in which each slave core is responsible for different node numbers and regions, and on this basis invoking the traversal of the core group combination cases;
traversing the distribution condition of each core group under the condition that the distribution of the slave core responsible node is determined;
obtaining a resource combination condition according to the distribution of nodes responsible by the slave cores and the distribution of the slave cores in the core group;
calculating, according to the calculation speed CalV of the slave cores, the communication speeds CoreOutComV and CoreInComV between the slave cores within a core group, and the communication speed GrpComV between slave cores across core groups, the total time consumed by one iteration under the resource combination;
and repeating the steps to calculate the total time consumed by all the resource combination conditions to obtain the optimal resource allocation mode with the minimum time consumption.
4. The method of claim 3, wherein the method comprises the following steps: the step of traversing the distribution condition of each core group comprises the step of eliminating the combination condition which does not meet the requirement in advance so as to reduce the operation calculation time for obtaining the optimal resource allocation mode.
5. The method of claim 4, wherein the step of eliminating in advance the combination conditions that do not meet the requirements comprises: eliminating in advance the condition in which the number of nodes a slave core is responsible for is too small, so that the computing resources are insufficient;
eliminating in advance the condition in which the number of nodes in a certain dimension of a core group's responsible region is excessive, so that the load is unbalanced and the time consumption increases;
and eliminating the condition that the occupied core group resources exceed the core group resources provided by the system.
6. The method of claim 3, wherein the resource allocation calculation method further comprises: when the optimal resource allocation mode is judged not to meet the requirements, repeating the step of inputting the application model scale and the quantity of computing resources and the step of traversing to find the optimal resource allocation mode, until an optimal resource allocation mode meeting the requirements is obtained.
7. A storage medium having at least one computer-executable instruction stored thereon, characterized in that: the computer-executable instructions when executed perform the steps of a method of multicore processor resource allocation computation of any of claims 1 to 6.
8. A terminal device comprising at least one memory and at least one processor, the memory having stored thereon at least one computer-executable instruction executable on the processor, characterized in that: the processor, when executing the computer executable instructions, performs the steps of a method of multicore processor resource allocation computation of any of claims 1 to 6.
CN201910482008.9A 2019-06-04 2019-06-04 Multi-core processor resource allocation calculation method, storage medium and terminal equipment Active CN110187975B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910482008.9A CN110187975B (en) 2019-06-04 2019-06-04 Multi-core processor resource allocation calculation method, storage medium and terminal equipment


Publications (2)

Publication Number Publication Date
CN110187975A CN110187975A (en) 2019-08-30
CN110187975B true CN110187975B (en) 2020-08-18

Family

ID=67720281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910482008.9A Active CN110187975B (en) 2019-06-04 2019-06-04 Multi-core processor resource allocation calculation method, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN110187975B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100099B (en) * 2020-09-28 2021-06-08 湖南长城银河科技有限公司 Lattice boltzmann optimization method for multi-core vector processor

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102929724A (en) * 2012-11-06 2013-02-13 无锡江南计算技术研究所 Multistage memory access method and discrete memory access method based on heterogeneous multi-core processor
CN103064819A (en) * 2012-10-25 2013-04-24 浪潮电子信息产业股份有限公司 Method for utilizing microwave integrated circuit (MIC) to rapidly achieve lattice Boltzmann parallel acceleration
CN103514043A (en) * 2012-06-29 2014-01-15 华为技术有限公司 Multi-processor system and data processing method thereof
CN104407925A (en) * 2014-12-10 2015-03-11 中国电信集团***集成有限责任公司 Dynamic resource distribution method

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US20060059030A1 (en) * 2004-09-15 2006-03-16 Boris Stavrovski Method and decision support system for optimal allocation of expendable resources in internet marketing with resource-dependant effectiveness and use of a-posteriori information
US9021490B2 (en) * 2008-08-18 2015-04-28 Benoît Marchand Optimizing allocation of computer resources by tracking job status and resource availability profiles
CN102945295B (en) * 2012-10-15 2015-09-02 浪潮(北京)电子信息产业有限公司 A kind of parallel acceleration method of Lattice Boltzmann Method and system
CN103778098A (en) * 2014-02-17 2014-05-07 浪潮(北京)电子信息产业有限公司 Large eddy simulation system and method for realizing cooperative computing based on latticed-Boltzmann theory
US10510001B2 (en) * 2016-03-18 2019-12-17 Mindtrace Limited Neuromorphic training algorithm for a Restricted Boltzmann Machine

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN103514043A (en) * 2012-06-29 2014-01-15 华为技术有限公司 Multi-processor system and data processing method thereof
CN103064819A (en) * 2012-10-25 2013-04-24 浪潮电子信息产业股份有限公司 Method for utilizing microwave integrated circuit (MIC) to rapidly achieve lattice Boltzmann parallel acceleration
CN102929724A (en) * 2012-11-06 2013-02-13 无锡江南计算技术研究所 Multistage memory access method and discrete memory access method based on heterogeneous multi-core processor
CN104407925A (en) * 2014-12-10 2015-03-11 中国电信集团***集成有限责任公司 Dynamic resource distribution method

Non-Patent Citations (2)

Title
Huang Changsheng et al., "CUDA-based lattice Boltzmann method: algorithm design and program optimization" (in Chinese), Chinese Science Bulletin, vol. 56, no. 28-29, 15 October 2011, pp. 2434-2444 *
João V. F. Lima et al., "A Dynamic Task-Based D3Q19 Lattice-Boltzmann Method for Heterogeneous Architectures", 2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2019 *

Also Published As

Publication number Publication date
CN110187975A (en) 2019-08-30

Similar Documents

Publication Publication Date Title
Betkaoui et al. A reconfigurable computing approach for efficient and scalable parallel graph exploration
Cevahir et al. High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning
US9158719B2 (en) Heterogeneous parallel systems for accelerating simulations based on discrete grid numerical methods
CN109726441B (en) Body and surface mixed GPU parallel computing electromagnetism DGTD method
CN115016951B (en) Flow field numerical simulation method and device, computer equipment and storage medium
Motamedi et al. Fast and energy-efficient CNN inference on IoT devices
Du et al. Model parallelism optimization for distributed inference via decoupled CNN structure
Li et al. 3-D parallel fault simulation with GPGPU
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
Barnat et al. Employing multiple CUDA devices to accelerate LTL model checking
Bernaschi et al. A factored sparse approximate inverse preconditioned conjugate gradient solver on graphics processing units
Haghi et al. FP-AMG: FPGA-based acceleration framework for algebraic multigrid solvers
CN112183015A (en) Chip layout planning method for deep neural network
CN110187975B (en) Multi-core processor resource allocation calculation method, storage medium and terminal equipment
Li et al. FSimGP^ 2: An efficient fault simulator with GPGPU
Liu et al. OBFS: OpenCL based BFS optimizations on software programmable FPGAs
CN111079078A (en) Lower triangular equation parallel solving method for structural grid sparse matrix
WO2023216915A1 (en) Helicopter flow field numerical simulation system and method based on graphics processing unit
Meister et al. A software concept for cache-efficient simulation on dynamically adaptive structured triangular grids
CN108021563A (en) The detection method and device that a kind of inter-instruction data relies on
Van der Wijngaart et al. Extending the BT NAS parallel benchmark to exascale computing
Zhao et al. Heterogeneous dual-core overlay processor for light-weight cnns
Pearson et al. Node-aware stencil communication for heterogeneous supercomputers
Liang et al. Design of 16-bit fixed-point CNN coprocessor based on FPGA
Xu et al. Energy-efficient accelerator design for deformable convolution networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230307

Address after: Room 612, floor 6, building 4-7, No. a, Hangfeng Road, Science City, Fengtai District, Beijing 100071

Patentee after: Beijing Jieshi Zhitong Polytron Technologies Inc.

Address before: 610000 Shuangxing Avenue, Gongxing street, Southwest Airport Economic Development Zone, Shuangliu District, Chengdu City, Sichuan Province

Patentee before: CHENGDU SUNWAY TECHNOLOGY CO.,LTD.