CN113961353A - Task processing method and distributed system for AI task

Task processing method and distributed system for AI task

Info

Publication number: CN113961353A
Application number: CN202111271295.2A
Authority: CN (China)
Prior art keywords: task, container, tasks, containers, managers
Legal status: Withdrawn
Other languages: Chinese (zh)
Inventors: 李宏铭, 杨燕, 常全福
Current assignee: Shenzhen TetrasAI Technology Co Ltd
Original assignee: Shenzhen TetrasAI Technology Co Ltd
Filing: application CN202111271295.2A filed by Shenzhen TetrasAI Technology Co Ltd
Publication: CN113961353A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5038 Allocation of resources considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Embodiments of the present disclosure provide a task processing method for AI tasks and a distributed system. The method is applied to the master node of the distributed system; the distributed system comprises the master node and a plurality of computing nodes, each computing node comprises at least one container manager, and each container manager comprises at least one container for executing AI tasks. The method comprises: acquiring task configuration files of a plurality of AI tasks to be executed; configuring a plurality of containers based on the task configuration files; and controlling the plurality of computing nodes to process the plurality of AI tasks in parallel through the plurality of containers. Because each AI task is executed by its own container, multiple AI tasks can run simultaneously, improving the execution efficiency of algorithm tasks.

Description

Task processing method and distributed system for AI task
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a task processing method and a distributed system for AI (Artificial Intelligence) tasks.
Background
The evaluation and testing of AI (Artificial Intelligence) algorithms usually require large amounts of data, and testing on such data consumes considerable time and hardware resources. Currently, when a single computer executes the algorithms of multiple AI tasks, efficiency is low, and results are only available after a long wait.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a task processing method and a distributed system for AI tasks.
In a first aspect, a task processing method for an AI task is provided, where the method is applied to a master node of a distributed system, the distributed system includes the master node and a plurality of computing nodes, each computing node includes at least one container manager, and each container manager includes at least one container for executing the AI task; the method comprises the following steps:
acquiring task configuration files of a plurality of AI tasks to be executed;
and configuring a plurality of containers based on the task configuration file, and controlling the plurality of computing nodes to process the plurality of AI tasks in parallel through the plurality of containers.
In some optional embodiments, the task configuration file includes: container environment configuration information required during processing of the AI task;
configuring a plurality of the containers based on the task profile, including:
and sending the container environment configuration information to the computing node, wherein the container environment configuration information is used for the computing node to create at least one container which is the same as the configuration in the container environment configuration information in at least one container manager.
In the task processing method for the AI task provided in this embodiment, by sending the container environment configuration information to the computing nodes, the master node can create identically configured containers in batch in the container managers of the computing nodes, so that a plurality of containers can execute in parallel a plurality of AI tasks that have the same execution-environment requirements.
In some optional embodiments, the task configuration file includes: a first number of container managers;
configuring a plurality of the containers based on the task profile, including:
creating a first number of container managers in a plurality of the compute nodes according to the first number in the task profile;
monitoring the running states of the first number of container managers;
in response to determining, based on the running states, that a failed container manager exists among the first number of container managers, performing state recovery processing of the container managers, wherein the number of container managers after the state recovery processing remains at the first number;
in the first number of container managers, a plurality of the containers are configured.
In the task processing method for the AI task provided in this embodiment, the master node can monitor the running states of the container managers and keep a certain number of container managers in working state, so that the computing resources occupied by AI task execution always stay within a certain range, and a container-manager failure cannot leave too few containers and allocated resources for executing AI tasks, which would slow down task processing.
In some optional embodiments, the task configuration file includes: a parameter threshold for a resource usage parameter of the container manager; the method further comprises the following steps:
acquiring resource use parameters of the container manager when the container processes the AI task;
in response to the resource usage parameter not meeting the requirement of the parameter threshold, adjusting the number of the container managers, wherein the adjusted resource usage parameter of the container managers meets the requirement of the parameter threshold;
configuring a plurality of the containers based on the task profile, including:
configuring a plurality of the containers in the container manager based on the task profile and the adjusted container manager.
In the task processing method for the AI task provided in this embodiment, the master node may dynamically adjust the number of the container managers according to the resource usage parameters of the container managers, so as to control the number of the container managers according to the actual needs of the containers when the containers process the AI task, and further control the computing resources that can be used when the containers execute the AI task, thereby achieving a higher resource utilization rate.
In some optional embodiments, said adjusting the number of container managers in response to said resource usage parameter not meeting said parameter threshold requirement comprises:
in response to a ratio between the resource usage parameter and the parameter threshold being below a first ratio, reducing the number of container managers; and/or,
increasing the number of container managers in response to a ratio between the resource usage parameter and the parameter threshold being higher than a second ratio.
In the task processing method for the AI task provided in this embodiment, when the ratio between a container manager's resource usage parameter and the parameter threshold is lower than the first ratio, that is, when the computing resources required to execute the AI tasks are fewer than those allocated to the container managers, the master node reduces the number of container managers to release computing resources; when the ratio is higher than the second ratio, that is, when the computing resources required to execute the AI tasks exceed those allocated to the container managers, it increases the number of container managers to obtain more computing resources, thereby automatically allocating computing resources according to task requirements.
In some optional embodiments, after adjusting the number of container managers in response to the resource usage parameter not meeting the requirement of the parameter threshold, the method further comprises:
allocating computing resources for processing the AI task for the added container manager; and/or
Freeing the computing resources occupied by the stopped container manager for processing the AI task.
In the task processing method for the AI task provided in this embodiment, the master node can allocate computing resources to a newly added container manager so as to increase the computing resources available to containers executing AI tasks, relieving other heavily loaded container managers; and it releases in time the computing resources occupied by a stopped container manager so that they can be reused for other computation.
In some optional embodiments, the resource usage parameter includes at least one of: CPU utilization rate and memory occupancy rate.
In the task processing method for the AI task provided in this embodiment, when calculating a container manager's resource usage parameter, the master node may consider the container manager's occupancy of the computing node's CPU and memory, so as to better evaluate how the computing resources allocated to the container manager are being used.
In some optional embodiments, said configuring a plurality of said containers based on said task profile comprises:
configuring a plurality of said containers in a container manager based on said task profile;
the controlling the plurality of compute nodes to process the plurality of AI tasks in parallel through a plurality of the containers includes:
and the plurality of containers controlling the plurality of computing nodes utilize the computing resources of the container manager to process the plurality of AI tasks in parallel.
In the task processing method for the AI task provided in this embodiment, the master node controls the containers on the computing nodes to process the AI tasks using the computing resources of the container manager where those containers reside; containers in the same container manager can share its computing resources when executing AI tasks, improving resource utilization.
In a second aspect, a task processing method for an AI task is provided, where the method is applied to a computing node of a distributed system, the distributed system includes a master node and a plurality of computing nodes, each computing node includes at least one container manager, and each container manager includes at least one container for executing the AI task; the method comprises the following steps:
acquiring a plurality of pieces of task information respectively corresponding to a plurality of AI tasks to be executed, wherein the task information specifies the AI algorithm to be executed and the data set to be processed by that AI algorithm;
acquiring the AI algorithm and the data set from a task repository according to the task information;
executing, by the plurality of containers configured by the master node, processing of the data set by an AI algorithm in the plurality of AI tasks in parallel; wherein the plurality of containers are configured by the master node according to task profiles of a plurality of AI tasks to be performed.
In the task processing method for the AI task provided in this embodiment, the computing node may obtain task information corresponding to a plurality of AI tasks, so that a plurality of containers may respectively execute different AI tasks in parallel.
In some optional embodiments, the obtaining of a plurality of task information respectively corresponding to a plurality of AI tasks to be executed includes:
monitoring a message queue, wherein the message queue comprises a plurality of task information respectively corresponding to a plurality of AI tasks to be executed;
and acquiring a plurality of task information from the message queue.
In the task processing method for the AI task provided in this embodiment, the computing node may monitor the message queue and obtain the task information of the AI task to be executed in time.
In a third aspect, a distributed system is provided, the distributed system comprising a master node and a plurality of computing nodes, each computing node comprising at least one container manager, each container manager comprising at least one container for performing artificial intelligence AI tasks;
the master node is used for executing a task processing method of an AI task according to any embodiment of the present disclosure;
the computing node is configured to execute a task processing method of an AI task according to any embodiment of the present disclosure.
In a fourth aspect, there is provided a task processing apparatus for an AI task, the apparatus operating on a master node in a distributed system, the apparatus comprising:
the configuration file acquisition module is used for acquiring task configuration files of a plurality of AI tasks to be executed;
and the container configuration module is used for configuring a plurality of containers based on the task configuration file and controlling the plurality of computing nodes to process the plurality of AI tasks in parallel through the plurality of containers.
In a fifth aspect, there is provided a task processing device for an AI task, the device running on a computing node in a distributed system, the device comprising:
the task information acquisition module is used for acquiring a plurality of pieces of task information respectively corresponding to a plurality of AI tasks to be executed, wherein the task information specifies the AI algorithm to be executed and the data set to be processed by that AI algorithm;
the algorithm data acquisition module is used for acquiring the AI algorithm and the data set from a task repository according to the task information;
an AI task execution module, configured to execute, in parallel, processing of the data set by an AI algorithm in the plurality of AI tasks through the plurality of containers configured by the master node; wherein the plurality of containers are configured by the master node according to task profiles of a plurality of AI tasks to be performed.
In a sixth aspect, an electronic device is provided, which includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement a task processing method of an AI task according to any one of the embodiments of the present disclosure when executing the computer instructions.
In a seventh aspect, a computer program product is provided, which includes a computer program/instruction that when executed by a processor implements the task processing method of the AI task according to any one of the embodiments of the present disclosure.
In an eighth aspect, there is provided a computer-readable storage medium on which a computer program is stored, the program implementing a task processing method of an AI task according to any one of the embodiments of the present disclosure when executed by a processor.
The task processing method for AI tasks provided by the embodiments of the present disclosure is executed by the master node of the distributed system: the master node configures containers in a plurality of computing nodes of the distributed system based on the task configuration files, and controls the computing nodes to execute a plurality of different AI tasks in parallel through the containers. Unlike a single computer, which can only begin the next AI task after the current one finishes, in this scheme each AI task is executed by its own container and multiple AI tasks can run simultaneously, improving the execution efficiency of algorithm tasks.
Drawings
In order to illustrate more clearly the technical solutions in one or more embodiments of the present disclosure or in the related art, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some of those in one or more embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart illustrating a task processing method of an AI task according to at least one embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a task processing method of another AI task according to at least one embodiment of the present disclosure;
FIG. 3 is a front-end page shown in at least one embodiment of the present disclosure;
fig. 4 is a block diagram of a task processing device of an AI task according to at least one embodiment of the present disclosure;
fig. 5 is a block diagram of a task processing device of another AI task, shown in at least one embodiment of the present disclosure;
fig. 6 is a block diagram of a task processing device of yet another AI task, shown in at least one embodiment of the present disclosure;
fig. 7 is a schematic diagram illustrating a hardware structure of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present specification. Depending on the context, the word "if" as used herein may be interpreted as "upon", "when", or "in response to determining".
For an AI task such as the evaluation and testing of an AI image quality algorithm, a large number of pictures must be input for testing; the picture sets used by AI image quality algorithms commonly reach the terabyte level, and testing over them consumes a great deal of time and hardware resources. Moreover, to evaluate an algorithm's performance it is often necessary to run different algorithm versions and test different picture sets many times, so running the tests on a single computer consumes an enormous amount of time.
Therefore, the embodiment of the disclosure provides a task processing method for an AI task, so as to improve the execution efficiency of an algorithm task when a plurality of AI tasks are executed.
As shown in FIG. 1, FIG. 1 is a flowchart of a task processing method for an AI task according to at least one embodiment of the present disclosure. The method can be used on the master node of a distributed system. In this embodiment, a Kubernetes system (abbreviated k8s) is used as the distributed system for description; in other embodiments, the distributed system may also be Swarm, Mesos, AWS ECS, or the like.
A distributed system is a cluster of multiple hosts comprising a master node and a plurality of computing nodes; each node, whether the master node or a computing node, is a host. Each computing node includes at least one container manager, and each container manager includes at least one container for executing AI tasks; the container managers in the distributed system may be located on different computing nodes or on the same computing node. Taking k8s as an example: a k8s system is in fact a k8s cluster, generally composed of a master node and a plurality of computing nodes, each node being a host. In k8s, the container manager is the Pod; the Pod is the smallest unit the master node can schedule, each computing node may run one or more Pods, and each Pod may carry one or more containers. In this embodiment a Docker-type container is used as an example; in other embodiments, the container may also be containerd, CRI-O, rkt, or the like.
The method comprises the following steps:
in step 102, task profiles of a plurality of AI tasks to be performed are obtained.
The task configuration file is a file used to create and configure the containers. Different types of AI tasks place different requirements on the running environment, so containers executing different types of AI tasks have different configuration requirements; those skilled in the art can preset the task configuration file according to the actual requirements of the plurality of AI tasks to be executed.
For example, the task configuration file may specify the number of containers to be created in each container manager, the data storage path of a container, the port mapping between a container and the computing node where it resides, and so on. The master node schedules the application container engine on a computing node to create and configure each container on that node according to the content specified in the task configuration file.
The task configuration file can also comprise files of other hierarchical architectures except the container in the distributed system, and the computing nodes, the container manager, the container and the like in the distributed system can be configured according to the task configuration file, so that the distributed system capable of processing the AI task is well deployed.
In this step, the task configuration file may be obtained once or multiple times, and the task configuration file obtained each time may be used to configure a part of the architecture of the distributed system. The task configuration file may be specifically a script file written by a user in advance, or configuration information input by the user.
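As an illustration only (the embodiments do not prescribe a concrete file format, and every field name below is a hypothetical assumption), the information carried by such a task configuration file might be represented as follows:

```python
# A hypothetical task configuration shown as a Python dict; the disclosure does
# not fix a file format, and all field names here are illustrative assumptions.
task_config = {
    "image_tag": "registry.example.com/ai-eval:v1.2",  # container environment image
    "containers_per_manager": 4,       # containers to create in each container manager
    "data_path": "/data/test-sets",    # data storage path of the containers
    "port_mappings": [{"container_port": 8080, "node_port": 30080}],
    "manager_count": 15,               # the "first number" of container managers
    "cpu_threshold": 0.50,             # parameter threshold: average CPU utilization
    "manager_count_limits": (5, 30),   # lower and upper limits on container managers
}
```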
In step 104, a plurality of containers are configured based on the task configuration file, and the plurality of computing nodes are controlled to process the plurality of AI tasks in parallel through the plurality of containers.
According to the acquired task configuration file, the master node can schedule the application container engine on a computing node to create and configure containers, for example the number of containers, their file directories, addresses, and communication methods.
In one example, the task configuration file includes container environment configuration information required when processing an AI task. In this case, the master node's configuring of the containers based on the task configuration file includes sending the container environment configuration information to the computing nodes, where it is used by a computing node to create, in at least one container manager, at least one container configured as specified in the container environment configuration information.
The container environment configuration information is an image file encapsulating the running-environment and initialization requirements of the AI tasks, or an acquisition address or tag of such an image file. The running environments in the containers created by the container managers from the container environment configuration information are identical, and may include the operating system, dynamic libraries, applications, and other environments the AI tasks depend on during execution.
For example, in k8s, the algorithm execution environment of an AI task may be packaged into an image file in advance and stored in a container registry or a local directory. A user may submit a task configuration file to the master node through kubectl, the command-line tool of the k8s cluster; the task configuration file includes the tag of the image file required by the AI task. The master node sends the tag to a computing node, and the application container engine on that node, such as a Docker client, pulls the image file from the container registry or local directory according to the tag and creates and configures a number of new containers from it; the running environments configured in these containers are all identical to the environment built into the image file.
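A minimal sketch of this step with the official Kubernetes Python client is shown below; the image tag, labels, names, and namespace are illustrative assumptions, not values taken from the embodiments:

```python
# Sketch: the master node creates a Deployment whose Pods (container managers)
# all run containers built from the same image tag, so every container gets an
# identical runtime environment. Assumes the official `kubernetes` client;
# the image tag, names, and namespace are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() on the master node

image_tag = "registry.example.com/ai-image-quality:v1.2"  # hypothetical tag

template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "ai-task"}),
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="ai-task", image=image_tag)]
    ),
)
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="ai-task-workers"),
    spec=client.V1DeploymentSpec(
        replicas=15,  # the "first number" of Pods kept running
        selector=client.V1LabelSelector(match_labels={"app": "ai-task"}),
        template=template,
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```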
After a container is configured, it can be used to execute AI tasks. The containers in this embodiment may be located in different container managers or in the same one. When there are AI tasks to be executed, the master node controls idle containers on the computing nodes to execute them; different containers are isolated from and unaffected by each other, and multiple containers can execute different AI tasks in parallel.
When the master node configures a plurality of containers in one container manager based on the task configuration file, that is, when a single container manager holds several containers, those containers can use the computing resources of the container manager to process multiple AI tasks in parallel; containers in the same container manager can share its computing resources when executing AI tasks, improving resource utilization.
The task processing method for AI tasks provided by the embodiments of the present disclosure is executed by the master node of the distributed system: the master node configures the containers of a plurality of computing nodes of the distributed system based on the task configuration files, and controls the computing nodes to execute a plurality of different AI tasks in parallel through the containers. Unlike a single computer, which can only begin the next AI task after the current one finishes, in this scheme each AI task is executed by its own container and multiple AI tasks can run simultaneously, improving the execution efficiency of algorithm tasks.
In particular, when the distributed system is a k8s system, the maintenance-free character of k8s can be exploited: the system maintains itself automatically at runtime, reducing the workload of manual operation and maintenance.
In one embodiment, the acquired task configuration file may include: a first number of container managers. In step 104, configuring a plurality of the containers based on the task profile includes:
the method comprises the steps that a main node creates a first number of container managers in a plurality of computing nodes according to a first number in a task configuration file, monitors the running states of the first number of capacity managers, and executes state recovery processing of the container managers in response to determining that a container manager with a fault exists in the first number of capacity managers according to the running states, wherein the number of the container managers after the state recovery processing is kept at the first number, and the main node configures a plurality of containers in the first number of container managers.
The first number is the number of container managers that normally operate in the distributed system, and may be a specific value or a range of values, and those skilled in the art may set the first number according to actual needs.
The master node may create the first number of container managers in the distributed system according to the first number specified in the task configuration file and assign them to different or the same computing nodes. Each container manager is allocated usable computing resources on its computing node, such as the Central Processing Unit (CPU), memory, and disk, so that the containers inside it can use those resources to execute AI tasks.
While the containers in the container managers on the computing nodes execute AI tasks, the master node monitors the running states of the container managers and determines whether a failed container manager exists.
For example, when it determines that the number of normally running container managers has changed, the master node performs state recovery processing of the container managers. The state recovery processing may be repairing the failed container manager or, when repair fails, creating a copy of the container manager to work in its place, and configuring containers for the successfully repaired or newly created container manager, so as to maintain the first number of normally running container managers.
This method keeps the computing resources occupied by AI task execution within a certain range at all times, and avoids the slow task processing that would result if a container-manager failure left too few containers and allocated resources for executing AI tasks.
For example, the Pod controller (ReplicaSet) on the master node in a k8s system can continuously monitor the running state of each Pod; once a Pod fails, it can restart the Pod or recreate a copy of it, so as to maintain the first number of Pods running in parallel.
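Conceptually, this monitoring-and-recovery loop can be sketched as follows; it only illustrates the control-loop idea (a real ReplicaSet does this natively), and the label, namespace, and polling interval are assumptions:

```python
# Conceptual sketch of the state-recovery loop: delete failed Pods so their
# controller recreates copies, keeping the first number of Pods running.
# Real ReplicaSets do this natively; label, namespace, interval are assumptions.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

while True:
    pods = core.list_namespaced_pod(namespace="default", label_selector="app=ai-task")
    for pod in pods.items:
        if pod.status.phase in ("Failed", "Unknown"):
            # Removing the failed Pod prompts its controller to create a copy
            # that works in its place, so the count stays at the first number.
            core.delete_namespaced_pod(name=pod.metadata.name, namespace="default")
    time.sleep(30)
```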
In another embodiment, the task configuration file includes: a parameter threshold for a resource usage parameter of the container manager; the method further comprises the following steps: acquiring resource use parameters of a container manager when a container processes an AI task; in response to the resource usage parameter not meeting a requirement of a parameter threshold, adjusting a number of container managers, wherein the adjusted resource usage parameter of the container managers meets the requirement of the parameter threshold, configuring a plurality of containers in the container managers based on the task profiles and the adjusted container managers.
All containers in each container manager can use the resources allocated to the container manager by the computing node when executing the AI task, and the ratio of the resources used by all containers in a certain container manager to the resources allocated to the container manager is the resource usage parameter of the container manager.
The resource usage parameters include at least one of: the CPU utilization rate and the memory occupancy rate can also comprise the disk utilization rate or other user-defined indexes.
While the containers execute AI tasks in parallel, the master node may monitor the resource usage parameters of the container managers and, when a resource usage parameter does not meet the requirement of the parameter threshold, adjust the number of container managers so that their resource usage parameters gradually come to meet the requirement.
The requirement of the parameter threshold may be that the ratio of the resource usage parameter to the parameter threshold stays between a preset first ratio and second ratio; an adjustment of the number of container managers is triggered when the ratio falls outside this range.
For example, the first ratio may be set to 0.8 and the second ratio to 1.1. When the ratio of the resource usage parameter to the parameter threshold is lower than the set first ratio, the number of container managers is reduced; conversely, when the ratio is higher than the set second ratio, the number of container managers is increased.
The master node allocates computing resources for processing AI tasks to each added container manager; for a reduced, i.e., stopped, container manager, the master node releases the computing resources it occupied for processing AI tasks.
When the ratio of the resource usage parameter of the container manager to the parameter threshold is lower than the first ratio, which indicates that the computational resources required to execute the AI task are less than the computational resources allocated to the container manager, the master node reduces the number of the container managers to release the computational resources.
When the ratio of the resource usage parameter of the container manager to the parameter threshold is higher than the second ratio, which indicates that the computing resources required for executing the AI tasks exceed those allocated to the container managers, the master node increases the number of container managers to obtain more computing resources, thereby realizing automatic allocation of computing resources according to task requirements.
For example, in k8s, at intervals such as every 30 s, the controller manager on the master node may obtain the resource usage parameters of each Pod from the resource metrics API (Application Programming Interface), and compare the queried resource usage parameters with the set parameter threshold to obtain their ratio, which is then used to expand or shrink the number of existing Pods.
For example, when the resource usage parameter is the CPU utilization, the parameter threshold may be set to have an average CPU utilization of 50%, and the requirement of the parameter threshold is that the ratio of the resource usage parameter to the parameter threshold does not exceed the second ratio 1.1 and is not lower than the first ratio 0.9.
Assume the number of normally running Pods is 15. The master node collects the CPU utilization of each Pod at intervals and computes the average CPU utilization of all Pods. If the computed average CPU utilization is 60%, the ratio of the resource usage parameter to the parameter threshold is 60% to 50%, i.e., 1.2, which exceeds the second ratio 1.1, so the number of Pods should be increased.
The specific amount can be calculated automatically from the scaling ratio: for example, multiplying the 15 running Pods by the ratio 1.2 gives 18, i.e., the running Pod count needs to be adjusted to 18, adding 3 Pods. Of course, other ways of determining the adjusted Pod count are possible, such as adding or removing one or two Pods per adjustment.
After the Pod count is adjusted, the master node may continue to monitor the resource usage parameters of the container managers, and if a resource usage parameter still does not meet the requirement of the parameter threshold, continue to adjust the Pod count.
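A minimal sketch of this decision rule, using the numbers from the example above, is given below; the deadband, Pod-count limits, and the way the live utilization is obtained (in practice, from the metrics API) are illustrative assumptions:

```python
# Sketch of the scaling decision using the example numbers above: target average
# CPU utilization 50%, deadband [0.9, 1.1], illustrative Pod-count limits.
import math

TARGET_CPU = 0.50                      # parameter threshold
FIRST_RATIO, SECOND_RATIO = 0.9, 1.1   # deadband: no adjustment inside it
POD_MIN, POD_MAX = 5, 30               # assumed limits from the config file

def desired_pod_count(current_pods: int, avg_cpu: float) -> int:
    ratio = avg_cpu / TARGET_CPU
    if FIRST_RATIO <= ratio <= SECOND_RATIO:
        return current_pods                     # within the deadband: keep as is
    desired = math.ceil(current_pods * ratio)   # scale in proportion to load
    return max(POD_MIN, min(POD_MAX, desired))  # respect configured limits

# 15 Pods averaging 60% CPU against the 50% threshold: ratio 1.2 > 1.1,
# so the Pod count is raised to ceil(15 * 1.2) = 18.
assert desired_pod_count(15, 0.60) == 18
```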
It should be noted that reducing the number of container managers in this embodiment may mean the master node directly shuts down a container manager whose containers are all idle; or, when no container manager has all of its containers idle, the master node instructs the container manager with the lowest container resource usage parameter to stop accepting new AI tasks once its containers finish their current AI tasks, and shuts that container manager down when all of its containers become idle.
The number of container managers can be increased by binding new container managers to several existing computing nodes and allocating them computing resources on those nodes, or by adding new computing nodes to the distributed system, binding new container managers to them, and allocating computing resources on the new nodes. Once configured, the containers in a newly added container manager can acquire and execute new AI tasks.
When the number of container managers is reduced, the resources originally allocated to them are released; when it is increased, more computing resources are allocated to AI task execution, resolving performance bottlenecks. Resources are thus released automatically once the containers are done, achieving efficient resource utilization, higher than the traditional approach of running algorithm tasks in parallel on multiple computers.
In addition, the task configuration file may further include a lower limit and an upper limit of the number of container managers, so that the adjusted number of container managers does not exceed the upper limit and is not lower than the lower limit after the number of container managers is increased or decreased.
As shown in FIG. 2, FIG. 2 is a flowchart of a task processing method for an AI task according to at least one embodiment of the present disclosure. The method can be used on a computing node of a distributed system comprising a master node and a plurality of computing nodes, where each computing node includes at least one container manager and each container manager includes at least one container for executing AI tasks. Steps identical to those in FIG. 1 are not repeated here. The method includes the following steps:
in step 202, a plurality of task information corresponding to a plurality of AI tasks to be executed are obtained.
The task information specifies the AI algorithm to be executed and the data set to be processed by that AI algorithm.
In step 204, the AI algorithm and the data set are obtained from a task repository based on the task information.
In step 206, the processing of the data set by the AI algorithm in the AI tasks is performed in parallel by the plurality of containers configured by the master node.
Wherein the plurality of containers are configured by the master node according to task profiles of a plurality of AI tasks to be performed. Different AI tasks generally correspond to different task information. For example, different AI tasks often require different AI algorithm models or different versions of the same algorithm model to execute, or different AI tasks specify the same AI algorithm and different data sets to be processed.
In this embodiment, the plurality of pieces of task information corresponding to the AI tasks to be executed may be acquired all at once, or one or several pieces at a time over multiple acquisitions. The task information may be input by a user, read from a script file, or received from another device.
The idle containers, i.e., containers without AI tasks being executed, can download the AI algorithms and data sets required for the AI tasks from the task repository based on the task information.
The task information includes the tags, in the task repository, of the AI algorithm to be executed and of the data set to be processed by that algorithm.
The task repository may be a cloud database or a local database. When the AI algorithm and the data set corresponding to the AI task are obtained, each container can run the AI algorithm in the container to process the data set.
A single container executes a single AI task, and the AI algorithms in the multiple AI tasks execute the processing of the data sets in parallel through the multiple containers.
The task processing method for AI tasks provided by the embodiments of the present disclosure is executed by the computing nodes of the distributed system. A computing node can acquire the task information corresponding to multiple AI tasks and execute different AI tasks in parallel through the containers configured by the master node; each AI task is executed by its own container, and multiple AI tasks can run simultaneously, improving the execution efficiency of algorithm tasks.
In one embodiment, acquiring the task information corresponding to the plurality of AI tasks to be executed in step 202 includes: monitoring a message queue, and acquiring from the message queue the plurality of task information corresponding to the plurality of AI tasks to be executed.
The task information in the message queue may be generated by the front-end page, i.e., the front-end page acts as a producer of the message queue, while the plurality of containers in the distributed system act as a plurality of consumers of the message queue. After a new AI task is created on the front-end page, task information corresponding to the AI task can be issued to the idle containers through the message queue, so that each container of the computing node can acquire the task information of the AI task to be executed in time.
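For instance (the embodiments do not name a broker, so RabbitMQ with the pika client is assumed here, and the queue name, host, and message fields are illustrative), a container-side listener could be sketched as:

```python
# Sketch of a container-side consumer; assumes a RabbitMQ broker and the pika
# client. Queue name, broker host, and message fields are illustrative.
import json
import pika

def handle_task(channel, method, properties, body):
    info = json.loads(body)                 # task information from the front end
    algorithm_tag = info["algorithm_tag"]   # tag of the AI algorithm in the repository
    dataset_tag = info["dataset_tag"]       # tag of the data set to be processed
    # Placeholder: pull the algorithm and data set from the task repository by
    # tag, then run the algorithm over the data set inside this container.
    print(f"executing {algorithm_tag} on {dataset_tag}")
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="mq.example.com"))
channel = connection.channel()
channel.queue_declare(queue="ai-tasks", durable=True)
channel.basic_qos(prefetch_count=1)  # each idle container takes one task at a time
channel.basic_consume(queue="ai-tasks", on_message_callback=handle_task)
channel.start_consuming()
```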
The following describes this embodiment with reference to a scenario of an AI image quality algorithm test.
After the distributed system has been deployed through steps 102 and 104, algorithm testing tasks can be created through a tool based on Web front-end technology. As shown in FIG. 3, FIG. 3 is an exemplary front-end page with the options: select an atlas, select an algorithm model, and select an algorithm version; a button for creating a task is also provided on the page.
A user can select the corresponding atlas, algorithm model, and algorithm version according to the actual requirements of the test task. After the selection is complete, pressing the create-task button generates task information containing the atlas, algorithm model, algorithm version, and so on, and publishes it to the message queue. A computing node in the distributed system monitors the message queue; when the queue contains task information and the node has an idle container, the node acquires from the task repository the atlas and the algorithm model of the selected version corresponding to that task information, and processes the atlas with the algorithm in the container to execute the AI image quality algorithm test task.
The user can create the test task for many times through the front-end page, the computing node can acquire a plurality of task information corresponding to a plurality of to-be-executed AI tasks from the message queue, and when obtaining AI algorithms and data sets corresponding to the AI tasks, the computing node can execute the AI algorithms in the AI tasks to process the data sets in parallel through a plurality of containers.
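The publishing side of this flow, matching the consumer sketch above (again assuming RabbitMQ/pika; the field names, queue name, and host are illustrative), could look like:

```python
# Sketch of the producer side: after "create task" is pressed on the front-end
# page, the backend publishes the task information to the message queue.
# Assumes RabbitMQ/pika; field names, queue name, and host are illustrative.
import json
import pika

def publish_task(algorithm_tag: str, algorithm_version: str, dataset_tag: str) -> None:
    info = {
        "algorithm_tag": algorithm_tag,
        "algorithm_version": algorithm_version,
        "dataset_tag": dataset_tag,
    }
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="mq.example.com"))
    channel = connection.channel()
    channel.queue_declare(queue="ai-tasks", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="ai-tasks",
        body=json.dumps(info),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()

# e.g. test version v2.1 of a hypothetical image-quality algorithm on one atlas
publish_task("image-quality", "v2.1", "night-scenes")
```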
Compared with this scheme, the prior art either runs the image sets with an algorithm on a single PC (Personal Computer), which is inefficient, or runs algorithm tests in parallel on multiple PCs, which leaves resource utilization low; in both cases, tools with a low degree of automation have to be developed to carry out the algorithm tests.
The embodiment of the present disclosure further provides a distributed system, where the distributed system includes a master node and multiple compute nodes, each compute node includes at least one container manager, and each container manager includes at least one container for executing an AI task;
the master node may execute the method according to any embodiment of the present disclosure, and the computing node may execute the method according to any embodiment of the present disclosure.
The disclosed embodiment also provides a task processing apparatus for an AI task, where the apparatus runs on a master node in a distributed system, the distributed system includes the master node and a plurality of computing nodes, each computing node includes at least one container manager, and each container manager includes at least one container for executing the AI task, as shown in fig. 4, the apparatus includes:
a configuration file obtaining module 41, configured to obtain task configuration files of a plurality of AI tasks to be executed;
and a container configuration module 42, configured to configure a plurality of containers based on the task configuration file, and control the plurality of compute nodes to process the plurality of AI tasks in parallel through the plurality of containers.
In some optional embodiments, the task configuration file includes: container environment configuration information required during processing of the AI task;
the container configuration module 42 is specifically configured to: and sending the container environment configuration information to the computing node, wherein the container environment configuration information is used for the computing node to create at least one container which is the same as the configuration in the container environment configuration information in at least one container manager.
In some optional embodiments, the task configuration file includes: a first number of container managers;
the container configuration module 42 is specifically configured to: creating a first number of container managers in a plurality of the compute nodes according to the first number in the task profile; monitoring the running states of the first quantity of capacity managers; in response to determining that there is a failed container manager among the first number of capacity managers based on the operating status, performing state recovery processing of the container managers, wherein the number of container managers after the state recovery processing remains at the first number; in the first number of container managers, a plurality of the containers are configured.
In some optional embodiments, the task configuration file includes: a parameter threshold for a resource usage parameter of the container manager; on the basis of the foregoing embodiment of the apparatus, as shown in fig. 5, the apparatus further includes: a container manager adjustment module 43;
the container manager adjusting module 43 is configured to obtain a resource usage parameter of the container manager when the container processes the AI task; in response to the resource usage parameter not meeting the requirement of the parameter threshold, adjusting the number of the container managers, wherein the adjusted resource usage parameter of the container managers meets the requirement of the parameter threshold;
the container configuration module 42 is specifically configured to: configuring a plurality of the containers in the container manager based on the task profile and the adjusted container manager.
In some optional embodiments, the container manager adjusting module 43, when configured to adjust the number of container managers in response to the resource usage parameter not meeting the requirement of the parameter threshold, is specifically configured to: in response to a ratio between the resource usage parameter and the parameter threshold being below a first ratio, reducing the number of container managers; and/or, in response to the ratio between the resource usage parameter and the parameter threshold being higher than a second ratio, increasing the number of container managers.
In some optional embodiments, the container manager adjusting module 43, after being configured to adjust the number of container managers in response to the resource usage parameter not meeting the requirement of the parameter threshold, is further configured to: allocating computing resources for processing the AI task for the added container manager; and/or freeing computing resources occupied by the stopped container manager for processing the AI task.
In some optional embodiments, the resource usage parameter includes at least one of: CPU utilization rate and memory occupancy rate.
In some alternative embodiments, the container configuration module 42 is specifically configured to: configure a plurality of containers in one container manager based on the task configuration file, and control the plurality of containers of the plurality of computing nodes to process the plurality of AI tasks in parallel using the computing resources of the container manager.
The embodiment of the present disclosure also provides a task processing apparatus for an AI task, where the apparatus runs on a computing node in a distributed system, the distributed system includes a master node and a plurality of computing nodes, each computing node includes at least one container manager, and each container manager includes at least one container for executing the AI task, as shown in fig. 6, the apparatus includes:
a task information obtaining module 61, configured to obtain a plurality of pieces of task information respectively corresponding to a plurality of AI tasks to be executed, where the task information specifies the AI algorithm to be executed and the data set to be processed by that AI algorithm;
an algorithm data obtaining module 62, configured to obtain the AI algorithm and the data set from a task repository according to the task information;
an AI task execution module 63, configured to execute, in parallel, processing of the data set by an AI algorithm in a plurality of AI tasks through the plurality of containers configured by the master node; wherein the plurality of containers are configured by the master node according to task profiles of a plurality of AI tasks to be performed.
In some optional embodiments, the task information obtaining module 61 is specifically configured to: monitoring a message queue, wherein the message queue comprises a plurality of task information respectively corresponding to a plurality of AI tasks to be executed; and acquiring a plurality of task information from the message queue.
The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
The embodiment of the present disclosure further provides an electronic device, as shown in fig. 7, the electronic device includes a memory 71 and a processor 72, where the memory 71 is configured to store computer instructions executable on the processor 72, and the processor 72 is configured to implement a task processing method of an AI task according to any embodiment of the present disclosure when executing the computer instructions.
The embodiments of the present disclosure also provide a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the computer program/instruction implements the task processing method of the AI task according to any embodiment of the present disclosure.
The embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements a task processing method for an AI task according to any one of the embodiments of the present disclosure.
One skilled in the art will appreciate that one or more embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
For the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the relevant parts of the description of the method embodiments. The apparatus embodiments described above are merely illustrative: the modules described as separate parts may or may not be physically separate, and the parts shown as modules may or may not be physical modules; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this specification, which can be understood and implemented by those of ordinary skill in the art without inventive effort.
Specific embodiments of the present disclosure have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this disclosure may be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this disclosure and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this disclosure can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The above description is only of preferred embodiments of the present disclosure and is not intended to limit the scope of the present disclosure, which is defined by the appended claims.
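The claims below recite, among other things, threshold-driven scaling of container managers (claims 4 and 5) and state recovery that holds the manager count at a configured first number (claim 3). The following is a minimal, illustrative Python sketch of that decision logic under assumed semantics; the function names, the default ratios, and the dictionary representation of a manager are all hypothetical and not taken from the disclosure.

def adjust_manager_count(current_count, resource_usage, parameter_threshold,
                         first_ratio=0.3, second_ratio=0.9):
    # Claims 4-5 sketch: compare the ratio of the resource usage parameter to
    # the parameter threshold against a first ratio and a second ratio.
    ratio = resource_usage / parameter_threshold
    if ratio < first_ratio and current_count > 1:
        return current_count - 1  # scale in; per claim 6, release the stopped manager's resources
    if ratio > second_ratio:
        return current_count + 1  # scale out; per claim 6, allocate resources to the added manager
    return current_count

def recover_failed_managers(managers, first_number, spawn_manager):
    # Claim 3 sketch: after detecting failed managers from their running states,
    # restore the number of healthy container managers to first_number.
    healthy = [m for m in managers if m.get("healthy", False)]
    while len(healthy) < first_number:
        healthy.append(spawn_manager())
    return healthy

# A CPU utilization of 95% against a threshold of 80% gives a ratio of about 1.19 > 0.9:
print(adjust_manager_count(current_count=4, resource_usage=95, parameter_threshold=80))  # 5
# A utilization of 10% gives a ratio of 0.125 < 0.3, so one manager is stopped:
print(adjust_manager_count(current_count=4, resource_usage=10, parameter_threshold=80))  # 3
# One healthy manager out of two, with a first number of three, yields two replacements:
print(len(recover_failed_managers([{"healthy": False}, {"healthy": True}], 3,
                                  lambda: {"healthy": True})))  # 3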

Claims (16)

1. A task processing method for AI tasks, applied to a master node of a distributed system, wherein the distributed system comprises the master node and a plurality of computing nodes, each computing node comprises at least one container manager, and each container manager comprises at least one container for executing artificial intelligence (AI) tasks; the method comprises:
acquiring task configuration files of a plurality of AI tasks to be executed;
configuring a plurality of containers based on the task configuration files, and controlling the plurality of computing nodes to process the plurality of AI tasks in parallel through the plurality of containers.
2. The method of claim 1, wherein the task configuration file comprises: container environment configuration information required during processing of the AI task;
wherein the configuring a plurality of the containers based on the task configuration file comprises:
sending the container environment configuration information to the computing nodes, wherein the container environment configuration information is used by a computing node to create, in at least one container manager, at least one container whose configuration matches the container environment configuration information.
3. The method of claim 1, wherein the task configuration file comprises: a first number of container managers;
wherein the configuring a plurality of the containers based on the task configuration file comprises:
creating a first number of container managers in the plurality of computing nodes according to the first number in the task configuration file;
monitoring running states of the first number of container managers;
in response to determining, based on the running states, that a failed container manager exists among the first number of container managers, performing state recovery processing on the container managers, wherein the number of container managers after the state recovery processing remains the first number; and
configuring the plurality of containers in the first number of container managers.
4. The method of claim 1, wherein the task configuration file comprises: a parameter threshold for a resource usage parameter of the container manager;
wherein the method further comprises:
acquiring a resource usage parameter of the container manager when the containers process the AI tasks;
in response to the resource usage parameter not meeting the requirement of the parameter threshold, adjusting the number of container managers, wherein the resource usage parameter of the adjusted container managers meets the requirement of the parameter threshold;
and wherein the configuring a plurality of the containers based on the task configuration file comprises:
configuring the plurality of containers in the adjusted container managers based on the task configuration file.
5. The method of claim 4,
wherein the adjusting the number of container managers in response to the resource usage parameter not meeting the requirement of the parameter threshold comprises:
reducing the number of container managers in response to a ratio between the resource usage parameter and the parameter threshold being below a first ratio; and/or
increasing the number of container managers in response to the ratio between the resource usage parameter and the parameter threshold being above a second ratio.
6. The method according to claim 4 or 5,
wherein after the adjusting the number of container managers in response to the resource usage parameter not meeting the requirement of the parameter threshold, the method further comprises:
allocating, to an added container manager, computing resources for processing the AI tasks; and/or
releasing computing resources that a stopped container manager occupied for processing the AI tasks.
7. The method according to any one of claims 4 to 6, wherein the resource usage parameter comprises at least one of: a CPU utilization rate or a memory occupancy rate.
8. The method according to any one of claims 1 to 7,
wherein the configuring a plurality of the containers based on the task configuration file comprises:
configuring a plurality of the containers in the container manager based on the task configuration file; and
the controlling the plurality of computing nodes to process the plurality of AI tasks in parallel through the plurality of containers comprises:
controlling the plurality of containers of the plurality of computing nodes to process the plurality of AI tasks in parallel by using computing resources of the container manager.
9. A task processing method for AI tasks, applied to a computing node of a distributed system, wherein the distributed system comprises a master node and a plurality of computing nodes, each computing node comprises at least one container manager, and each container manager comprises at least one container for executing AI tasks; the method comprises:
acquiring a plurality of pieces of task information respectively corresponding to a plurality of AI tasks to be executed, wherein the task information specifies an AI algorithm to be executed and a data set to be processed by the AI algorithm;
acquiring the AI algorithm and the data set from a task repository according to the task information; and
executing, in parallel through a plurality of containers configured by the master node, processing of the data sets by the AI algorithms of the plurality of AI tasks, wherein the plurality of containers are configured by the master node according to task configuration files of the plurality of AI tasks to be executed.
10. The method of claim 9,
wherein the acquiring a plurality of pieces of task information respectively corresponding to a plurality of AI tasks to be executed comprises:
monitoring a message queue, wherein the message queue comprises the plurality of pieces of task information respectively corresponding to the plurality of AI tasks to be executed; and
acquiring the plurality of pieces of task information from the message queue.
11. A distributed system, comprising a master node and a plurality of computing nodes, wherein each of the computing nodes comprises at least one container manager, and each of the container managers comprises at least one container for executing AI tasks;
the master node is configured to perform the method of any of claims 1 to 8;
the computing node is configured to perform the method of any of claims 9 to 10.
12. A task processing apparatus for AI tasks, wherein the apparatus runs on a master node in a distributed system, and the apparatus comprises:
a configuration file acquisition module, configured to acquire task configuration files of a plurality of AI tasks to be executed; and
a container configuration module, configured to configure a plurality of containers based on the task configuration files and control a plurality of computing nodes to process the plurality of AI tasks in parallel through the plurality of containers.
13. A task processing apparatus for AI tasks, wherein the apparatus runs on a computing node in a distributed system, and the apparatus comprises:
a task information acquisition module, configured to acquire a plurality of pieces of task information respectively corresponding to a plurality of AI tasks to be executed, wherein the task information specifies an AI algorithm to be executed and a data set to be processed by the AI algorithm;
an algorithm data acquisition module, configured to acquire the AI algorithm and the data set from a task repository according to the task information; and
an AI task execution module, configured to execute, in parallel through a plurality of containers configured by the master node, processing of the data sets by the AI algorithms of the plurality of AI tasks, wherein the plurality of containers are configured by the master node according to task configuration files of the plurality of AI tasks to be executed.
14. An electronic device, characterized in that the device comprises a memory for storing computer instructions executable on a processor, the processor being configured to implement the task processing method of an AI task according to any one of claims 1 to 8 or the task processing method of an AI task according to any one of claims 9 to 10 when executing the computer instructions.
15. A computer program product comprising a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the task processing method of an AI task according to any one of claims 1 to 8, or the task processing method of an AI task according to any one of claims 9 to 10.
16. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the task processing method for AI tasks according to any one of claims 1 to 8 or the task processing method for AI tasks according to any one of claims 9 to 10.
CN202111271295.2A 2021-10-29 2021-10-29 Task processing method and distributed system for AI task Withdrawn CN113961353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111271295.2A CN113961353A (en) 2021-10-29 2021-10-29 Task processing method and distributed system for AI task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111271295.2A CN113961353A (en) 2021-10-29 2021-10-29 Task processing method and distributed system for AI task

Publications (1)

Publication Number Publication Date
CN113961353A true CN113961353A (en) 2022-01-21

Family

ID=79468283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111271295.2A Withdrawn CN113961353A (en) 2021-10-29 2021-10-29 Task processing method and distributed system for AI task

Country Status (1)

Country Link
CN (1) CN113961353A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114881233A (en) * 2022-04-20 2022-08-09 深圳市魔数智擎人工智能有限公司 Distributed model reasoning service method based on container
WO2023231781A1 (en) * 2022-06-02 2023-12-07 华为云计算技术有限公司 Distributed collaborative ai task evaluation method, management apparatus, control apparatus and system

Similar Documents

Publication Publication Date Title
KR102624607B1 (en) Rack-level scheduling for reducing the long tail latency using high performance ssds
CN107431696B (en) Method and cloud management node for application automation deployment
EP3036625B1 (en) Virtual hadoop manager
CN111818159B (en) Management method, device, equipment and storage medium of data processing node
AU2011299337B2 (en) Controlled automatic healing of data-center services
US8943353B2 (en) Assigning nodes to jobs based on reliability factors
CN113961353A (en) Task processing method and distributed system for AI task
EP3054387B1 (en) Data compression method and storage system
CN113867959A (en) Training task resource scheduling method, device, equipment and medium
CN113342477A (en) Container group deployment method, device, equipment and storage medium
CN114661482B (en) GPU (graphics processing Unit) computing power management method, medium, equipment and system
CN104598311A (en) Method and device for real-time operation fair scheduling for Hadoop
US9904344B2 (en) Method of executing an application on a computer system, a resource manager and a high performance computer system
US11561843B2 (en) Automated performance tuning using workload profiling in a distributed computing environment
WO2022026044A1 (en) Sharing of compute resources between the virtualized radio access network (vran) and other workloads
CN110728372B (en) Cluster design method and cluster system for dynamic loading of artificial intelligent model
CN112015326B (en) Cluster data processing method, device, equipment and storage medium
CN116743825A (en) Server management method, device, computer equipment and storage medium
WO2022111466A1 (en) Task scheduling method, control method, electronic device and computer-readable medium
CN115858667A (en) Method, apparatus, device and storage medium for synchronizing data
JP5692355B2 (en) Computer system, control system, control method and control program
CN114443262A (en) Computing resource management method, device, equipment and system
Li et al. Rpbg: Intelligent orchestration strategy of heterogeneous docker cluster based on graph theory
CN116048791B (en) Regulation and control method and device of test node, electronic equipment and storage medium
CN116149798B (en) Virtual machine control method and device of cloud operating system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220121