CN112463290A

CN112463290A - Method, system, apparatus and storage medium for dynamically adjusting the number of computing containers

Info

Publication number: CN112463290A
Application number: CN202011247775.0A
Authority: CN
Inventors: 刘昌俊
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2021-03-09

Abstract

The present disclosure provides a method, system, apparatus, and storage medium for dynamically adjusting a number of computing containers, wherein the method comprises: receiving a task submitted by a user; acquiring a computing container cluster for executing the task according to the task; and monitoring the computing container cluster, and adjusting the number of computing containers in the computing container cluster according to a monitoring result. The method provided by the disclosure transforms the original static resource mode of the computing container into the dynamic expansion and contraction capacity mode, so that the computing cluster resources are fully utilized, and the function of computing the dynamic expansion and contraction of the resources according to the actual traffic is realized while the requirements of the service characteristics are met. And the improved computing container cluster can automatically reduce the capacity and recover resources when being idle, so that the pressure of cluster nodes can be reduced.

Description

Method, system, apparatus and storage medium for dynamically adjusting the number of computing containers

Technical Field

The present invention relates to the field of computer technology, and more particularly, to a method, system, apparatus, and storage medium for dynamically adjusting the number of computing containers.

Background

Under the application scene of multiple tenants of a shared cloud computing platform, each user hopes to have an independent and isolated application environment, and the mutual calling interference is reduced. The container technology can construct a completely isolated, independent, and easily maintainable operating environment for different users and applications. Current big data calculations, for example: the model of Hadoop (a distributed system infrastructure developed by the Apache Foundation) and Spark (a fast general-purpose compute engine designed specifically for large-scale data processing) that combines computation and storage is an attempt at distributed architecture construction. However, when the community modifies HDFS (Hadoop Distributed File System) to support erasure codes of Hadoop 3.0, the policy of reading nearby is no longer supported, which represents a new trend. The data plane, instead of HDFS, may be stored with large-scale cloud-based objects. In the computing plane, to be able to start computing as needed, a virtualization technology like kubernets (an open source, for managing containerized applications on multiple hosts in a cloud platform) may be considered instead of binding YARNs (Another Resource coordinator). Therefore, the container-based Spark cluster deployment mode is increasingly applied to the field of large data on the cloud.

At present, container-based Spark clusters do not have a dynamic scaling resource function. The number of computing containers of the Spark cluster needs to be set in a configuration file in advance, the computing containers are certain, when the computing capacity of the existing cluster does not meet the requirements of users, automatic capacity expansion and contraction cannot be achieved, only one Spark cluster can be newly built for processing, therefore, when the computing containers are idle, waste of resources can be caused, and when the computing containers are busy, overstock of tasks can be caused.

Disclosure of Invention

In order to solve the above problems in the prior art, the present invention provides a method, a system, a device and a storage medium for dynamically adjusting the number of computing containers, which can automatically adjust the number of computing containers according to the processed task, so as to make full use of computing cluster resources.

According to a first aspect of the present invention, an embodiment of the present invention provides a method for dynamically adjusting the number of computing containers, the method comprising: receiving a task submitted by a user; acquiring a computing container cluster for executing the task according to the task; and monitoring the computing container cluster, and adjusting the number of computing containers in the computing container cluster according to a monitoring result.

In some embodiments of the invention, the obtaining a cluster of computing containers for performing the task according to the task comprises: acquiring the running mode of the task; acquiring a management container according to the operation mode; and acquiring a computing container cluster for executing the task according to the management container.

In some embodiments of the invention, the obtaining a cluster of computing containers for performing the task from the management container comprises: the management container instantiates a running environment and a resource management component of the task; the resource management component applies for the computing container cluster from a resource server.

In some embodiments of the invention, the method further comprises: and before the number of containers of a calculation container suitable for executing the task is obtained according to the task, acquiring a dynamic flexible container identifier of the task, and judging whether a dynamic flexible container mode is started by the task.

In some embodiments of the present invention, the adjusting the number of computing containers in the computing container cluster according to the monitoring result includes: and when the task with the backlog time exceeding a preset backlog threshold value exists in the computing container cluster, generating an expanded computing container, and registering the address of the expanded computing container to a management container.

In some embodiments of the invention, said adjusting the number of compute containers in the compute container cluster according to the number of containers further comprises: setting a monitoring unit in the expansion computing container; the expansion calculation container acquires data through the monitoring unit for processing, and sends the address of a processing result to the intermediate data monitoring module through the monitoring unit.

In some embodiments of the present invention, the adjusting the number of computing containers in the computing container cluster according to the monitoring result includes: and when monitoring that the idle time of the computing containers in the computing container cluster exceeds a preset idle threshold, releasing the computing containers.

In some embodiments of the invention, said adjusting the number of compute containers in the compute container cluster according to the number of containers further comprises: and before releasing the computing container, sending the address of the data in the computing container to an intermediate data monitoring module.

According to a second aspect of the present invention, an embodiment of the present invention provides a system for dynamically adjusting the number of computing containers, the system comprising: the task receiving module is used for receiving tasks submitted by users; the cluster acquisition module is used for acquiring a computing container cluster for executing the task according to the task; the quantity acquisition module is used for acquiring the container quantity of the calculation container suitable for executing the task according to the task; and the dynamic adjusting module is used for adjusting the number of the computing containers in the computing container cluster according to the number of the containers.

In some embodiments of the invention, the cluster acquisition module is configured to: acquiring the running mode of the task; acquiring a management container according to the operation mode; and acquiring a computing container cluster for executing the task according to the management container.

In some embodiments of the invention, the system further comprises: and the mode judging module is used for acquiring the dynamic flexible capacity identification of the task before acquiring the number of containers of the calculation container suitable for executing the task according to the task, and judging whether the task starts a dynamic flexible capacity mode.

In some embodiments of the invention, the dynamic adjustment module is configured to: and when the task with the backlog time exceeding a preset backlog threshold value exists in the computing container cluster, generating an expanded computing container, and registering the address of the expanded computing container to a management container.

In some embodiments of the invention, the dynamic adjustment module is further configured to: setting a monitoring unit in the expansion computing container; the expansion calculation container acquires data through the monitoring unit for processing, and sends the address of a processing result to the intermediate data monitoring module through the monitoring unit.

In some embodiments of the invention, the dynamic adjustment module is configured to: and when monitoring that the idle time of the computing containers in the computing container cluster exceeds a preset idle threshold, releasing the computing containers.

In some embodiments of the invention, the dynamic adjustment module is further configured to: and before releasing the computing container, sending the address of the data in the computing container to an intermediate data monitoring module.

According to a third aspect of the present invention, an embodiment of the present invention further provides an apparatus for dynamically adjusting the number of computing containers, including a memory for storing computer-readable instructions and a processor for executing the computer-readable instructions to implement the method according to any one of the foregoing embodiments.

According to a fourth aspect of the present invention, embodiments of the present invention further provide a computer storage medium storing a computer program, which when executed by a processor implements the method of any one of the preceding embodiments.

The invention changes the original static resource mode of the computing container into the dynamic expansion and contraction capacity mode, fully utilizes the computing cluster resources, meets the requirements of service characteristics and realizes the function of dynamic expansion and contraction of the computing resources according to the actual service quantity. And the improved computing container cluster can automatically reduce the capacity and recover resources when being idle, so that the pressure of cluster nodes can be reduced. In addition, the problem that the computing efficiency of the whole cluster is low and the user experience is poor due to the fact that intermediate computing data are possibly lost in the dynamic telescopic resource process is solved by the aid of the intermediate data monitoring module. And ensuring that the generated intermediate data can still be used as the upstream data of the next calculation container after the calculation container is recovered.

Drawings

FIG. 1 is a flow diagram of a method of dynamically adjusting the number of computing containers, according to one embodiment of the invention;

FIG. 2 is a schematic flow diagram of process 101 of FIG. 1;

FIG. 3 is a block diagram of a system that dynamically adjusts the number of compute containers in accordance with one embodiment of the present invention.

Detailed Description

Various aspects of the invention are described in detail below with reference to the figures and the detailed description. Well-known modules, units and their interconnections, links, communications or operations with each other are not shown or described in detail. Furthermore, the described features, architectures, or functions can be combined in any manner in one or more implementations. It will be understood by those skilled in the art that the various embodiments described below are illustrative only and are not intended to limit the scope of the present invention. It will also be readily understood that the modules or units or processes of the embodiments described herein and illustrated in the figures can be combined and designed in a wide variety of different configurations.

One embodiment of the present invention provides a method of dynamically adjusting the number of computing containers, as shown in fig. 1, which, in an embodiment of the present invention, includes:

100: receiving a task submitted by a user;

101: acquiring a computing container cluster for executing the task according to the task;

102: and monitoring the computing container cluster, and adjusting the number of computing containers in the computing container cluster according to the monitoring result.

In this embodiment, as shown in FIG. 2, process 101 may be implemented as follows:

103: acquiring a task running mode;

104: acquiring a management container according to the operation mode;

105: a cluster of compute containers for performing the task is obtained from the management container.

Specifically, the management container instantiates the runtime environment and resource management component of the task, and the resource management component applies for the computing container cluster from the resource server.

Thus, a cluster of basic computing containers is obtained that perform the task. In this embodiment, before executing the processing 102, a dynamic scalable content identifier of the task may be obtained, and whether the dynamic scalable content mode is started by the task may be determined according to the identifier.

In this embodiment, if the task starts the dynamic scaling mode, the computing container cluster is monitored, and the number of computing containers in the computing container cluster is adjusted according to the monitoring result, specifically:

and when the task with the backlog time exceeding a preset backlog threshold value exists in the computing container cluster, generating an expanded computing container, and registering the address of the expanded computing container to the management container. Meanwhile, a monitoring unit is arranged in the expansion calculation container, the expansion calculation container can acquire data through the monitoring unit for processing, and the address of a processing result is sent to the intermediate data monitoring module through the monitoring unit.

When monitoring that the idle time of the computing container in the computing container cluster exceeds a preset idle threshold value, releasing the computing container. And simultaneously, before releasing the computing container, sending the address of the data in the computing container to an intermediate data monitoring module.

Therefore, the invention changes the original static resource mode of the computing container into the dynamic expansion and contraction capacity mode, fully utilizes the computing cluster resources, meets the requirements of service characteristics and realizes the function of dynamic expansion and contraction of the computing resources according to the actual service quantity. And the improved computing container cluster can automatically reduce the capacity and recover resources when being idle, so that the pressure of cluster nodes can be reduced. In addition, the problem that the computing efficiency of the whole cluster is low and the user experience is poor due to the fact that intermediate computing data are possibly lost in the dynamic telescopic resource process is solved by the aid of the intermediate data monitoring module. And ensuring that the generated intermediate data can still be used as the upstream data of the next calculation container after the calculation container is recovered.

The automatic operation and maintenance method for computer operation provided by the invention is described in the following with specific examples:

in this embodiment, a Spark cluster will be taken as an example for explanation.

Firstly, a user submits a Spark task, identifies the running mode of the Spark task, and if the running mode of the submitted task is a cluster mode, the task does not have a corresponding management container, so that the container cluster management node can create a corresponding Spark management container on the container cluster computing node by calling a management plug-in of a Spark driver through a kubernets API. If the submitted task has the running mode of the client mode, the Spark cluster management container is indicated to exist.

Secondly, calling the Spark management container to instantiate a Spark task running environment and a Spark cluster resource management component, wherein the resource management component refers to a service for acquiring resources on a container cluster by a Spark task, namely a Spark resource manager, and the Spark management container applies for the resources like the container cluster by calling the resource management service component and acquires computing container resources.

In this embodiment, after obtaining the resources of the computation container, the Spark management container determines whether the cluster starts a dynamic scaling mode, if the dynamic scaling mode is started, the Spark management container starts a Spark computation container management plug-in, the computation container management plug-in creates a monitoring thread to monitor a Spark computation container node, and finally the Spark management container calls the Spark computation container management plug-in through a resource management service according to the complexity of a Spark task, and then calls the container cluster management node, generates a plurality of Spark expansion computation containers on the container computation node, and registers address information and the like in the Spark management container.

After a Spark application is submitted, the succession is divided into a plurality of stages, the stages are executed in sequence, each Stage comprises a group of tasks, and the number of the tasks is the parallelism of Stage. The task is operated in the Spark calculation container, and the volume expansion parameter of the task backlog time is set to be 1s, namely if each Spark calculation container works at present, but the task backlog still exceeds 1s and is not processed temporarily, the Spark management container applies new resources to the container cluster to newly add the Spark calculation container to process the rest tasks, and the Spark calculation container returns task operation state information to the Spark management container in real time.

Specifically, after the Spark application submits the task, the Driver process starts Spark context to record the life cycle of the application, instantiates the Spark execution environment and the resource management component. And then, creating components such as SchedulBackend, Kubernets ExternalshufflManagerImpl and the like by the resource management component to prepare for the subsequent implementation of the dynamic expansion and contraction capacity of the computing resources and the subsequent intermediate data monitoring. The scheduler backup may be associated with the resource management component, allocate computing resources (i.e., Executors) to a task currently waiting for allocation of the computing resources, start the task on the allocated Executors, complete the computing resource scheduling process, and may also be used for requests and Kill Executors. The executorallocation manager is a dynamic resource allocation management component of Spark, and may set a minimum number of executors to be dynamically allocated, a maximum number of executors to be dynamically allocated, and the like.

Spare Driver will determine whether the execution dynamic allocation mode is to be enabled. After the dynamic allocation mode is started, an execution allocation manager dynamic manager is started, configuration information such as the minimum dynamically allocated execution or number, the maximum dynamically allocated execution or number, the number of tasks that each execution or can run, the time-out time of the backlog of the tasks, the idle time-out time of the execution or and the like can be set in the dynamic manager, and the configuration information is verified, wherein the configuration information is shown in the following table. The dynamic manager creates an allocatorrunbal thread to dynamically monitor the information such as the survival state of the executive. The spare Driver calls an interface of the Kubernets Master to start the execution Pod and finally starts the CoorseGraineExecutionBackend process, so that the execution registry is started successfully.

Table 1:

in order to ensure that the intermediate calculation data is not lost, the Spark calculation container interacts with the Spark management container and the intermediate data monitoring module in real time in the Spark task execution process. How many stages a Spark application task is divided into after it is submitted is divided according to a shuffle (a random permutation function) process. The shuffle process is a demarcation point of two phases, all transformations before the shuffle are one phase, and operations after the shuffle are the other phase. Because the data may be stored in different nodes of the cluster, the execution of the next stage first needs to pull the data of the previous stage to be stored in the own node. This involves data transfer and interaction between computing containers.

In this embodiment, the spare cluster calculation process is accompanied by a process of storing the intermediate data of the shuffle in the local disk of the container cluster calculation node. After the dynamic scaling mode is started, the path of the computing container node storing the data needs to be recorded by an intermediate data monitoring module. When the dynamic scalable capacity mode is not started, paths of intermediate data generated by the computation task are stored in Spark computation containers, the computation containers exist forever, when a Spark computation container in a next stage needs to obtain shuffle data in a previous stage, the Spark management container can be directly accessed to obtain an address of the Spark computation container in the previous stage, then the Spark computation container in the previous stage is directly accessed to obtain an intermediate data storage path through an internal network of the container, then the Spark computation container in the stage pulls the intermediate data to a local disk of a container cluster computation node where the Spark computation containers are located, and then computation in the stage is started. After a dynamic scalable capacity mode is newly added, the spare computing container nodes will increase or decrease along with the change of the service, and then the storage path of the intermediate data will disappear along with the deletion of the spare computing container, so that an intermediate data monitoring module is needed to assist in memorizing the intermediate data address. After a dynamic expansion and contraction capacity module is newly added, the intermediate data monitoring management module calls the container cluster management module to install an intermediate data monitoring module on each container cluster computing node, and each Spark computing container stores an intermediate data address generated by a shuffle in real time in the intermediate data monitoring module of the container cluster computing node where the current Spark computing container is located in the computing process and is used for other Spark computing containers to read in the next stage. The intermediate data monitoring module is statically deployed on the container cluster computing node, so that a path of intermediate data generated in the whole life cycle of the whole Spark cluster computing task can be recorded, and the normal operation of dynamic expansion and contraction of the Spark cluster is ensured. Since data is stored in the local disk of the container cluster, and the path is recorded by the intermediate data monitoring module, no additional data processing mechanism is needed in the dynamic expansion or contraction process of the Spark cluster to prevent data loss.

Specifically, in order to support the dynamic change of the number of executors, the function of externalShuffLeService is required to be relied on. In the Spark Shuffle process, one Executor goes to another to fetch data. If an Executor node is abnormal or has been deleted due to the timeout of idle time, it cannot process the shuffle data read requests sent by other Executors, and the previously generated data is meaningless. This results in some operations being recalculated, which is very inefficient. To solve this problem, the necessary link between whether the shuffle data generated by an Executor is available and whether the Executor is still alive needs to be broken. The concept of explicit shuffle service deployed on each compute Node YARN component in a Spark on YARN may therefore be introduced into a container-based Spark cluster service, where an explicit shuffle service pod is deployed on a Node of a container cluster. The executors can store the temporary storage position information of the shuffle data into the ExternalShuffleService, and other executors directly read the required data through the ExternalShuffleService instead of searching the data through the executors which previously generated the data, so that whether the previous executors survive does not affect the execution of the subsequent tasks.

Since the executors need to read data from the ExternalShuffleService, and the process survival time needs to be longer than that of all the executors, we adopt the deployment mode of the container DamonSet, and the Pod created by the container is run with one copy on each Node in the container cluster. If a new Node is dynamically added to the cluster, the Pod in DaemonSet (ensure that the Pod it creates runs a copy on every Node in the cluster) will also be added to run on the newly added Node. In the map (calculating each element in a list), the execution on each Node and the external ShuffleService on the local Node can communicate. In reduce (iterative computation for each element in a list), the executer needs to remotely call the external ShuffLeService pod of the required Node.

Fig. 3 is a block diagram of a system 1 for dynamically adjusting the number of computing containers according to an embodiment of the present invention, and referring to fig. 3, the system 1 may include: the task receiving module 11 is used for receiving tasks submitted by users; the cluster acquisition module 12 is configured to acquire a computing container cluster for executing a task according to the task; and the dynamic adjusting module 14 is configured to monitor the computing container cluster, and adjust the number of computing containers in the computing container cluster according to the monitoring result.

In the embodiment of the present invention, the cluster obtaining module 12 first obtains the operation mode of the task, and then obtains the management container according to the operation mode, and obtains the computing container cluster for executing the task through the management container. Specifically, the management container instantiates the runtime environment and the resource management component of the task, and the resource management component applies for the computing container cluster from the resource server.

In an embodiment of the present invention, the system 1 further includes a mode determination module 13, configured to, before obtaining, according to the task, the number of containers of the computation container suitable for executing the task, obtain a dynamic scaling capacity identifier of the task, and determine whether the task starts a dynamic scaling capacity mode. .

In an embodiment of the present invention, the dynamic adjustment module 14 is configured to: and when the task with the backlog time exceeding a preset backlog threshold value exists in the computing container cluster, generating an expanded computing container, and registering the address of the expanded computing container to the management container. Meanwhile, a monitoring unit is also required to be arranged in the expansion calculation container, the expansion calculation container acquires data through the monitoring unit for processing, and an address of a processing result is sent to the intermediate data monitoring module through the monitoring unit.

In an embodiment of the present invention, the dynamic adjustment module 14 is configured to: when monitoring that the idle time of the computing container in the computing container cluster exceeds a preset idle threshold value, releasing the computing container. And simultaneously, before releasing the computing container, sending the address of the data in the computing container to an intermediate data monitoring module.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by combining software and a hardware platform. With this understanding in mind, all or part of the technical solutions of the present invention that contribute to the background can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments or some parts of the embodiments.

Therefore, the embodiment of the present invention further provides a computer storage medium storing a computer program for implementing the method for dynamically adjusting the number of computing containers provided in the foregoing embodiment or implementation manner of the present invention when executed. For example, the storage medium may include a hard disk, a floppy disk, an optical disk, a magnetic tape, a magnetic disk, a flash memory, and the like.

Embodiments of the present invention also provide an apparatus for dynamically adjusting the number of computing containers, the apparatus comprising a memory for storing computer readable instructions; a processor for executing the computer readable instructions to implement the method for dynamically adjusting the number of computing containers provided by the foregoing embodiments or implementations of the invention. Optionally, in an implementation manner of the embodiment of the present invention, the apparatus may further include an input/output interface for data communication. For example, the device may be a computer, a smart terminal, a server, or the like.

The particular embodiments disclosed herein are illustrative only and should not be taken as limitations upon the scope of the invention, which is to be accorded the full scope consistent with the claims, as defined in the appended claims. Accordingly, the particular illustrative embodiments disclosed above are susceptible to various substitutions, combinations or modifications, all of which are within the scope of the disclosure.

Claims

1. A method of dynamically adjusting a number of computing containers, the method comprising:

receiving a task submitted by a user;

acquiring a computing container cluster for executing the task according to the task;

and monitoring the computing container cluster, and adjusting the number of computing containers in the computing container cluster according to a monitoring result.

2. The method of claim 1, wherein the obtaining a cluster of computing containers for performing the task from the task comprises:

acquiring the running mode of the task;

acquiring a management container according to the operation mode;

and acquiring a computing container cluster for executing the task according to the management container.

3. The method of claim 2, wherein the obtaining a cluster of compute containers for performing the task from the management container comprises:

the management container instantiates a running environment and a resource management component of the task;

the resource management component applies for the computing container cluster from a resource server.

4. The method of claim 1, wherein the method further comprises:

and before the number of containers of a calculation container suitable for executing the task is obtained according to the task, acquiring a dynamic flexible container identifier of the task, and judging whether a dynamic flexible container mode is started by the task.

5. The method of claim 1, wherein the adjusting the number of compute containers in the compute container cluster based on the monitoring comprises:

and when the task with the backlog time exceeding a preset backlog threshold value exists in the computing container cluster, generating an expanded computing container, and registering the address of the expanded computing container to a management container.

6. The method of claim 5, wherein said adjusting the number of compute containers in the compute container cluster based on the monitoring further comprises:

setting a monitoring unit in the expansion computing container;

the expansion calculation container acquires data through the monitoring unit for processing, and sends the address of a processing result to the intermediate data monitoring module through the monitoring unit.

7. The method of claim 1, wherein the adjusting the number of compute containers in the compute container cluster based on the monitoring comprises:

and when monitoring that the idle time of the computing containers in the computing container cluster exceeds a preset idle threshold, releasing the computing containers.

8. The method of claim 7, wherein said adjusting the number of compute containers in the compute container cluster based on the monitoring further comprises:

and before releasing the computing container, sending the address of the data in the computing container to an intermediate data monitoring module.

9. A system for dynamically adjusting the number of computing containers, the system comprising:

the task receiving module is used for receiving tasks submitted by users;

the cluster acquisition module is used for acquiring a computing container cluster for executing the task according to the task;

and the dynamic adjusting module is used for monitoring the computing container cluster and adjusting the number of the computing containers in the computing container cluster according to a monitoring result.

10. The system of claim 9, wherein the cluster acquisition module is to:

acquiring the running mode of the task;

acquiring a management container according to the operation mode;

11. The system of claim 10, wherein the obtaining a cluster of compute containers for performing the task from the management container comprises:

12. The system of claim 9, wherein the system further comprises:

and the mode judging module is used for acquiring the dynamic flexible capacity identification of the task before acquiring the number of containers of the calculation container suitable for executing the task according to the task, and judging whether the task starts a dynamic flexible capacity mode.

13. The system of claim 9, wherein the dynamic adjustment module is to:

14. The system of claim 13, wherein the dynamic adjustment module is further to:

setting a monitoring unit in the expansion computing container;

15. The system of claim 9, wherein the dynamic adjustment module is to:

16. The system of claim 15, wherein the dynamic adjustment module is further to:

17. An apparatus for dynamically adjusting the number of computing containers, comprising a memory and a processor,

the memory is to store computer readable instructions;

the processor is configured to execute the computer-readable instructions to implement the method of any of claims 1-8.

18. A computer storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1-8.