CN112631693B - Dynamic expansion method for computing resources in runtime

Info

Publication number: CN112631693B
Application number: CN201910903881.0A
Authority: CN
Inventors: 何王全; 董恩铭; 于康; 宋长明; 方燕飞; 漆锋滨
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2022-10-04
Anticipated expiration: 2039-09-24
Also published as: CN112631693A

Abstract

The invention discloses a method for dynamically expanding computing resources at runtime. A resource expansion upgrade command is sent to the original running job. After receiving the upgrade signal, the original nodes clean up their communication environment and wait for the newly added computing resources, so that both can rebuild the communication environment in a second initialization. The new nodes start the program through job management and enter the dynamic task partitioning region; once a new node determines that it was added by the resource upgrade, it cleans up its communication environment. The original nodes and the new nodes then rebuild the communication environment together. The new nodes partition themselves automatically according to the dynamic task partitioning rule and select a partition master node, which applies to the global master node for tasks. After receiving the task applications from the master nodes of the original and new partitions, the global master node distributes tasks evenly to the original and new computing resource nodes, and the job continues to run normally. The invention solves the problem of rebuilding the communication environment after resource adjustment, allocates idle computing resources to a running job without interrupting it, and dynamically distributes the uncompleted tasks, thereby maximizing utilization of the computing resources.

Description

Dynamic expansion method for computing resources in runtime

Technical Field

The invention belongs to the field of runtime systems and in particular relates to a method for dynamically expanding computing resources at runtime.

Background

High-performance computing systems make it possible to solve massively parallel applications in many fields. The computing resources of such a system are generally shared by multiple application jobs that differ in scale and running time, so it often happens that some jobs have finished and left computing resources idle while other jobs still need to run for a long time. The technical problem to be solved is how to use idle computing resources to accelerate applications that are already running.

A task-parallel application distributes the tasks in a task pool to multiple computing resources to be completed in parallel. If the tasks depend on one another, dynamically adding new computing resources can break the task mapping and affect normal operation of the program. If the tasks are independent, the application is insensitive to the scale and shape of the computing resources, so the remaining tasks can be repartitioned dynamically and newly idle resources can be used to finish the application sooner.
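The insensitivity of independent tasks to the number of resources can be illustrated with a minimal Python sketch. The round-based scheduler, the function name run_task_pool, and its parameters are illustrative assumptions and are not taken from the invention; the sketch only shows that adding workers midway changes how fast the pool drains, not the results.

```python
from collections import deque


def run_task_pool(tasks, work, workers, extra_workers=0, expand_after_round=0):
    """Run independent tasks in rounds; each worker takes one task per round.

    After `expand_after_round` rounds, `extra_workers` idle workers join.
    Returns (results keyed by task, number of rounds used).
    """
    pool = deque(tasks)
    results = {}
    rounds = 0
    while pool:
        if extra_workers and rounds == expand_after_round:
            workers += extra_workers  # idle resources join the running job
            extra_workers = 0
        # Tasks are independent, so any worker may take any remaining task.
        for _ in range(min(workers, len(pool))):
            task = pool.popleft()
            results[task] = work(task)
        rounds += 1
    return results, rounds


def square(x):
    return x * x


fixed_results, fixed_rounds = run_task_pool(range(12), square, workers=2)
grown_results, grown_rounds = run_task_pool(
    range(12), square, workers=2, extra_workers=4, expand_after_round=2
)
```

In this toy run both calls produce identical results, while the expanded run needs fewer rounds because the remaining tasks are shared by more workers.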

Many large-scale task-parallel applications require a large amount of computing resources and a long computing time. When new computing resources become available, the usual approach is to wait for the checkpoint file to be updated, interrupt the running job, and resubmit it after the resources have been merged. This burdens the user and leaves the new computing resources idle for some time (while waiting for the checkpoint file to be updated).

Disclosure of Invention

The invention aims to provide a method for dynamically expanding computing resources at runtime. The method solves the problem of rebuilding the communication environment after resource adjustment, allocates idle computing resources to a running job without interrupting it, and dynamically distributes the uncompleted tasks, so that utilization of the computing resources is maximized while the correctness and completeness of the job's results are preserved.

In order to achieve this purpose, the invention adopts the following technical solution: a method for dynamically expanding computing resources at runtime, comprising the following steps:

S1, sending a resource expansion upgrade command to the original job, which is running normally;

S2, after all the original computing resource nodes of the original job and the new computing resource nodes receive the resource expansion upgrade command at the same time, the following steps are carried out respectively:

a. after all original computing resource nodes of the original job receive the resource expansion upgrade command, they perform the following steps:

a1, cleaning up the communication environment and releasing the related environment variables;

a2, waiting for the newly added computing resources, performing a second initialization of the communication environment together with them, and thereby rebuilding the communication environment;

a3, after the communication environment is initialized, the original computing resource nodes update the dynamic task partitioning information and add the newly joined computing resource nodes to the dynamic task partitioning region according to the dynamic task partitioning rule;

b. the new computing resource nodes receive the resource expansion upgrade command, are started on the job by job management, and perform the following steps:

b1, the new computing resource node starts the program and enters the dynamic task partitioning region;

b2, the new computing resource node cleans up its communication environment;

b3, the new computing resource node performs the second initialization together with the original computing resource nodes and rebuilds the communication environment;

b4, the new computing resource nodes partition themselves automatically according to the dynamic task partitioning rule and become a new process region of the dynamic task partitioning; one new computing resource node is selected as the master node of the new partition, and this partition master node applies for tasks to the global master node located among the original computing resource nodes;

S3, after the global master node of the original computing resources receives the task applications from the original partition master nodes and the new partition master node, it distributes tasks evenly to the original and new computing resource nodes, and the job continues to run normally.
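The even distribution in step S3 can be pictured with a small Python sketch. The text does not specify how "evenly" is computed, so the round-robin dealing, the function name distribute_evenly, and the node names below are assumptions made only for illustration.

```python
from typing import Dict, List, Sequence


def distribute_evenly(tasks: Sequence[int], nodes: Sequence[str]) -> Dict[str, List[int]]:
    """Deal the remaining tasks round-robin so node loads differ by at most one."""
    if not nodes:
        raise ValueError("at least one node is required")
    shares: Dict[str, List[int]] = {node: [] for node in nodes}
    for index, task in enumerate(tasks):
        shares[nodes[index % len(nodes)]].append(task)
    return shares


remaining = list(range(10))                   # tasks not yet completed
nodes = ["old0", "old1", "new0", "new1"]      # original nodes plus new nodes
shares = distribute_evenly(remaining, nodes)
```

With ten remaining tasks and four nodes, each node receives two or three tasks, and original and new nodes are treated alike.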

Further improvements to the above technical solution are as follows:

1. In the above solution, the dynamic task partitioning can be used with a message library of a cross-language message communication standard.

2. In the above solution, the message library of the cross-language message communication standard is MPICH or Open MPI.

Owing to the above technical solution, the invention has the following advantages over the prior art:

1) The method solves the problem of rebuilding the communication environment after resource adjustment, allocates idle computing resources to a running job without interrupting it, and dynamically distributes the uncompleted tasks, so that utilization of the computing resources is maximized while the correctness and completeness of the job's results are preserved.

2) The method is transparent to the user: the user only needs to enter a resource upgrade command to tell the running job of the runtime system that dynamic resource expansion can be carried out, and all other processing is completed automatically by the runtime system.

3) The method makes effective use of newly idle computing resources and speeds up the solution of the running job.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The invention is further described below with reference to the following examples:

Example: as shown in FIG. 1, a method for dynamically expanding computing resources at runtime comprises the following steps:

S1, sending a resource expansion upgrade command to the original job, which is running normally;

b2, the new computing resource node cleans up its communication environment;

The dynamic task partitioning can be used with a message library of a cross-language message communication standard.

The message library of the cross-language message communication standard is MPICH or Open MPI.

The example is further explained below:

In the invention, dynamic expansion of computing resources is defined as follows: for a task-parallel application whose tasks are independent of one another, idle computing resources are added to the running job without changing the running state of the original job; once the expansion is complete, the uncompleted tasks are dynamically distributed to all computing resources, and the new and old resources complete the remaining tasks together, so that the computing resources are used sensibly. The method requires a task-parallel application with independent tasks. Such an application places no requirement on the scale or shape of the computing resources at any point during the run, so computing resources can be added at runtime to improve solving efficiency.

The dynamic expansion and upgrade process is as follows. When idle computing resources are available, the user only needs to enter a resource expansion upgrade command, which signals the running original job. The command is implemented by job management and consists of two parts: (1) starting, on the new computing resources, the same object code that the existing computing resources are running; and (2) sending a signal to both the original and the new computing resources.

After all original computing resource nodes of the original job receive the resource expansion upgrade command signal, they perform the following operations: (1) clean up the communication environment and release the related environment variables; (2) wait for the new computing resources, perform a second initialization of the communication environment together with them, and thereby rebuild the communication environment; (3) after the communication environment is initialized, update the dynamic task partitioning information and add the new computing resource nodes to the dynamic task partitioning region according to the dynamic task partitioning rule.

The new computing resource nodes are started on the job by job management and enter the dynamic task partitioning region. Once an environment variable identifies a node as a new computing resource node, the node cleans up its communication environment and then rebuilds the communication environment together with the original computing resource nodes. After the rebuild, the original and new computing resource nodes can communicate normally. The new computing resource nodes are then automatically placed in a new region of the dynamic task partitioning according to the dynamic task partitioning rule, and one new computing resource node (generally the first node of the partition) is selected as the master node of the new partition. This partition master node applies for tasks to the global master node located among the original computing resource nodes. After receiving the task applications from the original partition master nodes and the new partition master node, the global master node of the original computing resources distributes tasks evenly to the original and new computing resource nodes, and the job continues to run normally.
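The identification of new nodes and the choice of a partition master can be sketched in Python as follows. The environment variable name JOB_EXPANSION_NEW_NODE, the dictionary layout of a partition, and the decision to place all new nodes in a single partition are assumptions for illustration; only the "first node of the partition becomes master" convention comes from the text.

```python
from typing import Dict, List, Mapping

NEW_NODE_FLAG = "JOB_EXPANSION_NEW_NODE"   # hypothetical variable name


def is_new_node(env: Mapping[str, str]) -> bool:
    """A node counts as new when job management set the (hypothetical) flag."""
    return env.get(NEW_NODE_FLAG) == "1"


def add_partition(partitions: List[Dict], new_nodes: List[str]) -> Dict:
    """Append the new nodes as one partition whose first node is its master."""
    if not new_nodes:
        raise ValueError("no new nodes to partition")
    partition = {
        "id": len(partitions),
        "master": new_nodes[0],        # generally the first node of the partition
        "members": list(new_nodes),
    }
    partitions.append(partition)
    return partition


# One original partition, plus the environments of the launched processes.
partitions = [{"id": 0, "master": "old0", "members": ["old0", "old1"]}]
launched = {"old0": {}, "new0": {NEW_NODE_FLAG: "1"}, "new1": {NEW_NODE_FLAG: "1"}}
new_nodes = [name for name, env in launched.items() if is_new_node(env)]
new_partition = add_partition(partitions, new_nodes)
```

The master of the new partition ("new0" here) is the node that would then apply to the global master for tasks.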

The steps performed by the original computing resource nodes and the steps performed by the new computing resource nodes after receiving the resource expansion upgrade command signal are carried out concurrently, not one after the other.
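This concurrency can be mimicked with threads in a short Python sketch. Threads stand in for nodes and a threading.Barrier stands in for the joint re-initialization; both are assumptions chosen to keep the example runnable on one machine, and the function name run_upgrade is hypothetical.

```python
import threading
from typing import List, Tuple


def run_upgrade(old_count: int, new_count: int) -> List[Tuple[str, str]]:
    """Run every node's upgrade path in its own thread and return the event log."""
    barrier = threading.Barrier(old_count + new_count)  # joint re-initialization point
    log: List[Tuple[str, str]] = []
    lock = threading.Lock()

    def node(name: str) -> None:
        with lock:
            log.append((name, "cleanup"))   # each node cleans up independently
        barrier.wait()                      # blocks until all old and new nodes arrive
        with lock:
            log.append((name, "rebuilt"))

    names = [f"old{i}" for i in range(old_count)] + [f"new{i}" for i in range(new_count)]
    threads = [threading.Thread(target=node, args=(name,)) for name in names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


events = run_upgrade(old_count=2, new_count=3)
```

The order of cleanup events among nodes is not fixed, but no node records "rebuilt" until every original and new node has finished its cleanup.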

In the invention, the communication environments of the original and new computing resource nodes must be rebuilt so that they are consistent with each other. After receiving the upgrade command initiated by the user, the original computing resource nodes release the established communication domain and the related global information, and then wait for the new computing resource nodes. After the user initiates the upgrade command, all new computing resource nodes start running the job. When a new node enters the dynamic task partitioning function, the function determines that it is a new computing resource node, and the node then releases its established communication domain and related global information, just as the original nodes do. The new and original computing resource nodes then rebuild the communication domain and update the global information together. After the rebuild, the nodes are partitioned automatically according to the dynamic task partitioning rule, and the partition master node applies to the global master node for tasks. After receiving the application, the global master node distributes tasks evenly to the new and original computing resource nodes, and the original job continues to run normally. In this way the dynamic expansion technique makes effective use of newly available resources without interrupting the original job and greatly improves the solving efficiency of the application job.
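The release-then-rebuild sequence can be modelled with a toy Python data structure. The class CommDomain, the function rebuild, and the rule that original nodes keep the low ranks are assumptions for illustration; the text does not state how ranks are assigned after the rebuild, and this is not the API of any real message library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CommDomain:
    """Toy stand-in for a communication domain: node name -> rank."""
    ranks: Dict[str, int] = field(default_factory=dict)
    released: bool = False

    def release(self) -> None:
        """Drop the rank table, as a node does before the rebuild."""
        self.ranks.clear()
        self.released = True


def rebuild(old: CommDomain, new: CommDomain) -> CommDomain:
    """Release both domains, then build one domain that covers every node."""
    members: List[str] = (
        sorted(old.ranks, key=old.ranks.get) + sorted(new.ranks, key=new.ranks.get)
    )
    old.release()
    new.release()
    # Assumption: original nodes keep the low ranks and new nodes are appended.
    return CommDomain(ranks={name: rank for rank, name in enumerate(members)})


old_domain = CommDomain({"old0": 0, "old1": 1})
new_domain = CommDomain({"new0": 0, "new1": 1})
merged = rebuild(old_domain, new_domain)
```

After the call, both earlier domains are marked released and every node appears exactly once in the merged rank table, which is the consistency the paragraph above asks for.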

The invention can be implemented on top of dynamic task partitioning in a message library (MPICH, Open MPI) of the cross-language message communication standard, the Message Passing Interface (MPI).

The method for dynamically expanding computing resources at runtime answers the question of whether idle computing resources can be used to accelerate a running application. The technique is transparent to the user: the user only needs to enter a resource upgrade command to tell the running job of the runtime system that dynamic resource expansion can be carried out, and all other processing is completed automatically by the runtime system. Newly idle computing resources are used effectively, and the solution of the running job is accelerated.

The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims

1. A method for dynamically extending runtime computing resources, characterized by comprising the following steps:

a. after all original computing resource nodes of the original job receive the resource expansion upgrade command, they perform the following steps:

b. the new computing resource nodes receive the resource expansion upgrade command, are started on the job by job management, and perform the following steps:

b2, the new computing resource node cleans up its communication environment;

2. The method for dynamically extending runtime computing resources of claim 1, wherein: the dynamic task partitioning can be used in a message library of a cross-language messaging standard.

3. The method for dynamically extending runtime computing resources of claim 2, wherein: the message libraries of the cross-language message communication standard are MPICH and Open MPI.