CN114035937A - Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence


Info

Publication number
CN114035937A
Authority
CN
China
Prior art keywords
task
training
model
tasks
optimizer
Prior art date
Legal status
Pending
Application number
CN202111204831.7A
Other languages
Chinese (zh)
Inventor
卞正达
李永彬
柳泓鑫
Current Assignee
Beijing Luchen Technology Co., Ltd.
Original Assignee
Beijing Luchen Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Luchen Technology Co., Ltd.
Priority to CN202111204831.7A
Publication of CN114035937A
Legal status: Pending

Classifications

    • G06F 9/5027: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06N 20/00: Machine learning
    • G06N 5/04: Inference or reasoning models


Abstract

The application relates to the field of artificial intelligence, and in particular to an artificial intelligence based distributed training and reasoning system and method. The trained model performs reasoning in actual applications, and the resource scheduling and multidimensional parallel technology can likewise be adopted during reasoning. By introducing large-scale distribution into the training and reasoning of AI models, the method and device reduce AI's consumption of computing resources, shorten training and reasoning time, improve AI deployment efficiency to the maximum extent, and minimize deployment cost.

Description

Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
Technical Field
The invention belongs to the field of artificial intelligence deep learning, and particularly relates to a distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence.
Background
In recent years, the AI training market has reached a demand inflection point: demand in the computing-power market is expanding rapidly, and the efficiency with which computing power is used must improve. Large-scale algorithms have begun to make explosive breakthroughs in the last two years, new algorithms and new models keep emerging, and the market's demand for computing power keeps growing. A large model cannot be trained by a single GPU: the model parameters are too large to fit in the video memory of a single GPU, and even when they do fit, the training time is unacceptable. The growth of hardware computing power lags far behind models' demand for it, and more hardware (chips) must be used to compensate for this shortfall.
In enterprise scenarios, large-scale deployment involves a large number of factors, including latency, throughput, cost and load balancing. The main difficulties include communication bottlenecks that make computational efficiency hard to improve: the maximum GPU computing-power utilization in existing training is only 30%; computing, storage and network resources must be shared among different tasks, which raises isolation and scheduling problems; and different tasks need different distributed training solutions and hardware, bringing extra software and hardware costs.
Disclosure of Invention
Aiming at the defects of the prior art introduced above, the invention provides a general-purpose distributed artificial intelligence system that is efficient, low in energy consumption and suitable for large AI models, helping enterprises maximize the efficiency of artificial intelligence deployment and minimize deployment cost.
The embodiment of the application provides a distributed training and reasoning method, system, equipment and medium based on artificial intelligence.
In a first aspect, an embodiment of the present application provides an artificial intelligence-based distributed training and reasoning method for a hardware processor, where the method is executed on a software platform and uses a machine learning library;
characterized in that the method comprises the steps of:
acquiring task parameters of a plurality of AI tasks, acquiring a scheduling decision according to the task parameters of the AI tasks, and distributing the AI tasks to a plurality of hardware processors to obtain computing resources of the AI tasks;
acquiring the computing resources of the AI tasks distributed to the plurality of hardware processors, executing multidimensional parallel processing on the training tasks of the AI tasks on the respective hardware processors, and acquiring the output result of the AI tasks;
acquiring a parallel processing result of the AI task after the parallel processing is executed, calculating a gradient according to the current output result of the model for the training task of the AI task, optimizing the AI task with the optimizer corresponding to the AI task to obtain optimized AI model parameters, and iteratively updating the model parameters until a target number of iterations is reached or the training result meets the requirements;
an optimization algorithm is used in the distribution process to optimize scheduling decisions;
the parallel processing mode comprises data parallel, sequence parallel, pipeline parallel and multidimensional grid parallel processing;
the AI task includes a training task and an inference task.
In a possible implementation of the first aspect, the acquiring of the parallel processing result of the AI task after the parallel processing is executed, calculating a gradient according to the current output result of the model for the training task of the AI task, optimizing the AI task with the corresponding optimizer to obtain optimized AI model parameters, and iteratively updating the model parameters until a target number of iterations is reached or the training result meets the requirements further includes:
fine-tuning and prediction are performed on the AI model parameters of the AI task processed by the optimizer; through fine-tuning the model continues to be trained for the specific application, and finally the trained model is deployed to perform reasoning in the actual application;
for the training task of the AI task, executing multidimensional parallel processing on the respective hardware processors to obtain the output result of the AI task further comprises the following steps:
completing data migration of the AI task between the hardware processors by segmenting and/or offloading optimizer states, gradients and model parameters;
the AI task includes a picture processing task and/or a natural language processing task.
In a possible implementation of the first aspect, the acquiring of the parallel processing result of the AI task after the parallel processing is executed, calculating a gradient according to the current output result of the model for the training task of the AI task, optimizing the AI task with the corresponding optimizer to obtain optimized AI model parameters, and iteratively updating the model parameters until a target number of iterations is reached or the training result meets the requirements specifically includes:
the data parallelism distributes the AI tasks to the hardware processors, yielding the total batch size of data processed by all hardware processors simultaneously and the batch size of data processed by each hardware processor at a time;
the sequence parallelism further segments and/or offloads and distributes the data, placing each AI task on a plurality of processors;
the pipeline parallelism splits the model into a plurality of sections, deploys each section on a different hardware processor, and connects the sections in series in model order, the output of the previous section serving as the input of the next section;
the multi-dimensional grid parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
In a possible implementation of the first aspect, the step of acquiring the parallel processing result of the AI task after the parallel processing is executed, calculating a gradient according to the current output result of the model for the training task of the AI task, optimizing the AI task with the corresponding optimizer to obtain optimized AI model parameters, and iteratively updating the model parameters until a target number of iterations is reached or the training result meets the requirements specifically includes:
the optimizer algorithm corresponding to the AI task comprises but is not limited to a LAMB optimizer and/or a LARS optimizer and/or a ConAdv optimizer and/or a La-Lars optimizer;
the LAMB, LARS, ConAdv optimizers are suitable for large batches of training,
the LARS is used for processing of computer vision-related AI tasks;
the LAMB is used for processing a related AI task in natural language;
the ConAdv is suitable for processing AI tasks with high speed requirements and low precision requirements;
the La-Lars is suitable for processing AI tasks with narrow communication bandwidth and high network communication cost.
In a second aspect, embodiments of the present application provide an artificial intelligence based distributed training and reasoning system for a hardware processor, where the system is executed on a software platform, and uses a machine learning library for processing various application data;
the hardware processor includes but is not limited to: CPU, GPU, FPGA, TPU;
characterized in that the system comprises:
the scheduling module is used for acquiring task parameters of a plurality of AI tasks, acquiring scheduling decisions according to the task parameters of the AI tasks, and distributing the AI tasks to a plurality of hardware processors to obtain computing resources of the AI tasks;
the multidimensional parallel module is used for acquiring the computing resources of the AI tasks distributed to the hardware processors, executing multidimensional parallel processing of the training tasks of the AI tasks on their respective hardware processors, and acquiring the output result of the AI tasks;
the extensible optimization module is used for acquiring the parallel processing result of the AI task after the parallel processing is executed, calculating a gradient according to the current output result of the model for the training task of the AI task, optimizing the AI task with the corresponding optimizer to obtain optimized AI model parameters, and iteratively updating the model parameters until a target number of iterations is reached or the training result meets the requirements;
an optimization algorithm is used in the distribution process to optimize scheduling decisions;
the parallel processing mode comprises data parallel, sequence parallel, pipeline parallel and multidimensional grid parallel processing;
the AI task includes a training task and an inference task.
In a possible implementation of the second aspect, the system further includes:
the fine-tuning and reasoning module is used for performing fine-tuning and prediction on the AI task processed by the optimizer, continuing to train the model for the specific application through fine-tuning, and finally deploying the trained model to perform reasoning in the actual application;
the dynamic memory/disk management module completes data migration of the AI task between the hardware processors by segmenting and/or offloading the optimizer states, gradients and model parameters;
the AI task includes a picture processing task and/or a natural language processing task.
In a possible implementation of the second aspect, the process in which the multidimensional parallel module obtains the computing resources of the AI tasks allocated to the plurality of hardware processors and, for the training task of the AI task, executes multidimensional parallel processing on the respective hardware processors to obtain the output result of the AI task further includes:
the data parallelism distributes the AI tasks to the hardware processors, yielding the total batch size of data processed by all hardware processors simultaneously and the batch size of data processed by each hardware processor at a time;
the sequence parallelism further segments and/or offloads and distributes the data, placing each AI task on a plurality of processors;
the pipeline parallelism splits the model into a plurality of sections, deploys each section on a different hardware processor, and connects the sections in series in model order, the output of the previous section serving as the input of the next section;
the multi-dimensional grid parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
In a possible implementation of the second aspect, the extensible optimization module acquires the parallel processing result of the AI task after the parallel processing is executed, calculates a gradient according to the current output result of the model for the training task of the AI task, optimizes the AI task with the corresponding optimizer to obtain optimized AI model parameters, and iteratively updates the model parameters until a target number of iterations is reached or the training result meets the requirements; the extensible optimization module further includes:
the optimizer algorithm corresponding to the AI task comprises but is not limited to a LAMB optimizer and/or a LARS optimizer and/or a ConAdv optimizer and/or a La-Lars optimizer;
the LAMB, LARS and ConAdv optimizers are suitable for large-batch training;
the LARS is used for processing computer-vision-related AI tasks;
the LAMB is used for processing natural-language-related AI tasks;
the ConAdv is suitable for processing AI tasks with high speed requirements and low precision requirements;
the La-Lars is suitable for processing AI tasks with narrow communication bandwidth and high network communication cost.
In a third aspect, an embodiment of the present application provides a distributed training apparatus based on artificial intelligence, including:
a memory for storing instructions for execution by one or more processors of the system, an
A processor, being one of the processors of the system, for executing the instructions to implement any one of the possible artificial intelligence based distributed training and reasoning methods of the first aspect described above.
In a fourth aspect, the present application provides a computer-readable storage medium encoded with a computer program, where the computer-readable storage medium has instructions stored thereon, and when the instructions are executed on a computer, the instructions cause the computer to perform any one of the possible artificial intelligence based distributed training and reasoning methods of the first aspect.
Compared with the prior art, the application has the following effects:
the scheme adopted by the invention divides the model through multi-dimensional parallelism, improves the efficiency of distributed AI training and reasoning efficiency, realizes 70% response speed improvement, and reduces the response time from the original 30 seconds to 17-18 seconds; through efficient memory partitioning and data movement management, the maximum model supported on each processor is increased from 10 hundred million parameter scale to 120 hundred million parameter scale, the number of processors required by large model inference is reduced, the cost is reduced, and the availability and the product performance of the model are improved; an automatic deployment scheme is provided, the deployment speed is increased by 5-10 times, and the labor, time and cost required by model distributed deployment can be saved in the future.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
FIG. 1 illustrates a workflow diagram of an artificial intelligence based distributed training and reasoning method, according to some embodiments of the present application;
FIG. 2 illustrates an application scenario diagram of an artificial intelligence based distributed training and reasoning approach, according to some embodiments of the present application;
FIG. 3 illustrates a block diagram of a hardware architecture of an artificial intelligence based distributed training and reasoning system, according to some embodiments of the present application;
FIG. 4 illustrates a structural layout diagram of a 2.5-dimensional grid-parallel approach to artificial intelligence based distributed training and reasoning approaches, according to some embodiments of the present application;
FIG. 5 illustrates a block diagram of matrix-vector parameter equalization for an artificial intelligence based distributed training and reasoning approach, according to some embodiments of the present application;
FIG. 6 illustrates a weak-scaling-efficiency comparison schematic of an artificial intelligence based distributed training and reasoning approach, according to some embodiments of the present application;
FIG. 7 illustrates a strong-scaling-efficiency comparison graph of an artificial intelligence based distributed training and reasoning approach, according to some embodiments of the present application;
FIG. 8 illustrates a statistical graph of experimental results of a LAMB algorithm based on an artificial intelligence distributed training and reasoning approach, according to some embodiments of the present application;
FIG. 9 illustrates a workflow diagram of an artificial intelligence based distributed training and reasoning method La-Lars algorithm, according to some embodiments of the present application;
FIG. 10 illustrates a block diagram of the architecture of an artificial intelligence based distributed training and reasoning system, according to some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The illustrative embodiments of the present application include, but are not limited to, an artificial intelligence based distributed training and reasoning method, system, device, and readable storage medium.
It is to be appreciated that the distributed training and reasoning methods provided herein can be implemented on a variety of distributed training and reasoning systems, including, but not limited to, a server, a distributed server cluster of multiple servers, a cell phone, a tablet, a laptop, a desktop computer, a wearable device, a head mounted display, a mobile email device, a portable game console, a portable music player, a reader device, a personal digital assistant, a virtual reality or augmented reality device, a television with one or more processors embedded or coupled thereto, and so forth.
It is to be appreciated that in various embodiments of the present application, the processor may be a microprocessor, a digital signal processor, a microcontroller, or the like, and/or any combination thereof. According to another aspect, the processor may be a single-core processor, a multi-core processor, the like, and/or any combination thereof.
The inventive concepts of the embodiments of the present application are briefly described below.
From the perspective of the computing market, computing power is currently under-supplied. The system hopes to reduce AI's demand for computing resources by accelerating large-scale distributed training and reasoning. The market's demand for AI infrastructure platforms is very urgent, and efficient distributed training is an indispensable function of such a platform, so an efficient training and reasoning scheme like this system will be a necessity for the future AI market. From the perspective of AI model application scenarios, a large number of application scenarios bring huge demand for efficient parallel training: many existing leading-edge models cannot be applied in practice because of computing constraints, and more markets can be developed once computing efficiency improves. For example, the Transformer architecture that appeared in 2018 still has not completely replaced RNNs, since on average it must run across many processors and is relatively difficult to deploy with the prior art; NeRF (an application of deep learning to three-dimensional rendering), which appeared in 2019, has not yet been widely adopted because of the limits of computation speed.
In addition, the thresholds and costs of distributed training and deployment are high. Taking the PyTorch built-in scheme as an example, one must write code for process groups, intra-group collective communication, data sets and parallel models, and adjust the backend interface according to the hardware (CPU/GPU) used. A distributed-training deployment engineer needs to understand algorithms (parallel strategies), systems (training architectures, synchronization methods), AI frameworks, training and reasoning methods, communication programming, resource-scheduling software, big-data platforms, low-level software programming and more, all at once; the talent requirements are extremely high, and so is the corresponding hiring cost for enterprises. Different tasks require different distributed training solutions and hardware, with additional hardware and software costs. Existing training and reasoning schemes are generally based on the vendor's own hardware, as customized solutions directly integrated with that hardware; they cope poorly with newly emerging hardware/model architectures, and a general, standardized parallel training and reasoning scheme is urgently needed. The prior art often seeks breakthroughs in the algorithm alone, but on the one hand algorithmic breakthroughs are difficult, and on the other hand algorithms by themselves can hardly remove all the limits on distributed training efficiency. Fields such as medical care and security, for example, may require data security or models with special structures. Manual parameter tuning and deployment can still serve training in the short term, but in the long term a general, automated parallel training approach is needed, one that can adapt to fast-iterating algorithms to reduce the cost of AI applications and promote their adoption.
In view of this, the inventive concepts of the embodiments of the present application are briefly described below. FIG. 1 illustrates a workflow diagram of an artificial intelligence based distributed training and reasoning method for a hardware processor, the method being implemented on a software platform using a machine learning library, according to a first embodiment of the present application;
characterized in that the method comprises the steps of:
acquiring task parameters of a plurality of AI tasks, acquiring a scheduling decision according to the task parameters of the AI tasks, and distributing the AI tasks to a plurality of hardware processors to obtain computing resources of the AI tasks;
acquiring the computing resources of the AI tasks distributed to the plurality of hardware processors, executing multidimensional parallel processing on the training tasks of the AI tasks on the respective hardware processors, and acquiring the output result of the AI tasks;
acquiring a parallel processing result of the AI task after the parallel processing is executed, calculating a gradient according to the current output result of the model for the training task of the AI task, optimizing the AI task with the optimizer corresponding to the AI task to obtain optimized AI model parameters, and iteratively updating the model parameters until a target number of iterations is reached or the training result meets the requirements;
an optimization algorithm is used in the distribution process to optimize scheduling decisions;
the parallel processing mode comprises data parallel, sequence parallel, pipeline parallel and multidimensional grid parallel processing.
The artificial intelligence based distributed training and reasoning method is executed on a software platform, where the software platform includes but is not limited to CUDA and ROCm;
the artificial intelligence based distributed training and reasoning method uses machine learning libraries including, but not limited to, TensorFlow, Keras and PyTorch.
Meanwhile, a wide variety of future application scenarios will generate great demand for AI model training (generally, the larger the AI model, the stronger its performance).
After the inventive concept of the embodiment of the present application is introduced, some simple descriptions are made below on application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In a specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The technical solution provided by the embodiments of the application is suitable for multimedia content recommendation scenarios such as text, pictures (including static pictures in formats such as JPEG and animated pictures in formats such as GIF) and video; corpus vector training in natural language processing is taken as the main example here. The corpus vectors in natural language processing come from a web corpus, such as Wikipedia. FIG. 2 illustrates a scenario diagram of an artificial intelligence based distributed training and reasoning approach, according to some embodiments of the present application. Specifically, the scenario includes a terminal 101, a server 102, and a network 103.
The terminal 101 may be a desktop terminal or a mobile terminal, and the mobile terminal may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, portable wearable device, and the like. The terminal 101 may have installed an application that can collect training data sets for natural language processing corpora. The application in the embodiments may be a software client, or a client such as a web page or applet; in the latter case the background server is the background server corresponding to the software, web page or applet, and the specific client type is not limited. The user can log in on the application and then carry out data set collection.
The server 102 may be a background server corresponding to an application installed on the terminal 101, for example, an independent physical server or a server cluster or distributed system composed of a plurality of servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The server 102 may include one or more processors 1021, memory 1022, and an I/O interface 1023 to interact with the terminal, among other things. In addition, server 102 may also configure database 1024, and database 1024 may be used to store training data sets of a user-submitted natural language processing corpus. The memory 1022 of the server 102 may further store program instructions such as a machine learning library and an optimizer provided in the embodiment of the present application, and when the program instructions are executed by the processor 1021, the program instructions can be used to implement the steps of determining the distributed training and reasoning method provided in the embodiment of the present application, so as to perform distributed training on data to be trained input by a user, and further push the trained content to a target user, so as to be used in the terminal 101 for subsequent artificial intelligence interactive application.
The terminal 101 and the server 102 are connected via a network 103. The network 103 may comprise one or more networks of various connection types, such as wired or wireless communication links, the cloud, or fiber-optic cables; a specific example is the internet provided by the communication carrier of the terminal 101.
First, the processor 1021 reads, through the I/O interface 1023 interacting with the terminal 101, the training data set of the natural language processing corpus submitted by the user and stored in the database 1024 for the terminal 101; the processor then executes the program instructions of the distributed training and reasoning method stored in the memory 1022, and the result is pushed to the terminal 101 through the I/O interface 1023 and displayed to the user.
FIG. 3 illustrates a block diagram of a hardware architecture of an artificial intelligence based distributed training and reasoning system, according to some embodiments of the present application. Specifically, as shown in fig. 3, it includes one or more processors, system control logic connected to at least one of the processors, system memory connected to the system control logic, non-volatile memory (NVM) connected to the system control logic, and a network interface connected to the system control logic.
In some embodiments, the processor may include one or more single-core or multi-core processors. In some embodiments, the processor may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In embodiments where the distributed training and reasoning system employs an eNB (enhanced Node B) or RAN (Radio Access Network) controller, the processor may be configured to perform the various embodiments described herein.
In some embodiments, the processors include a GPU, a CPU, an FPGA and a TPU. The processors' resources are scheduled based on the data-set conditions of the training and reasoning tasks to be processed; tasks may be migrated from the GPU to other, non-GPU processors, and corresponding control-logic processing is then performed on the training and reasoning tasks based on the computing resources of each processor.
In some embodiments, the system control logic may include any suitable interface controllers to provide any suitable interface to at least one of the processors and/or any suitable device or component in communication with the system control logic.
In some embodiments, the system control logic may include one or more memory controllers to provide an interface to system memory. System memory may be used to load and store data and/or instructions. The memory of the distributed training and reasoning system may in some embodiments comprise any suitable volatile memory, such as suitable Dynamic Random Access Memory (DRAM). In some embodiments, system memory may be used to load or store instructions to implement the distributed training described above, or system memory may be used to load or store instructions to implement an application that utilizes the distributed training and inference methodology described above for distributed training.
The NVM/memory may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM/memory may include any suitable non-volatile memory such as flash memory and/or any suitable non-volatile storage device, such as at least one of a HDD (Hard Disk Drive), CD (Compact Disc) Drive, DVD (Digital Versatile Disc) Drive. The NVM/memory may also be used to store training models used in the distributed training classes described above.
The NVM/memory may include a portion of the storage resources on the device on which the distributed training and reasoning system is installed, or it may be accessible by, but not necessarily a part of, the device. For example, the NVM/memory may be accessed over a network via a network interface.
In particular, the system memory and NVM/storage may each include: a temporary copy and a permanent copy of the instruction. The instructions may include: instructions that when executed by at least one of the processors cause the distributed training and reasoning system to implement the distributed training and reasoning method of the present application. In some embodiments, instructions, hardware, firmware, and/or software components thereof may additionally/alternatively be placed in system control logic, a network interface, and/or a processor.
The network interface may include a transceiver to provide a radio interface for the distributed training and reasoning system to communicate with any other suitable devices (e.g., front end modules, antennas, etc.) over one or more networks. In some embodiments, the network interface may be integrated with other components of the distributed training and reasoning system. For example, the network interface may be integrated into at least one of the processor, the system memory, the NVM/storage, and a firmware device (not shown) having instructions that, when executed by at least one of the processors, cause the distributed training and reasoning system to implement the distributed training and reasoning method of the present application.
The network interface may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, the network interface may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem. The network interface is also used for being in communication connection with the cloud application to achieve data processing of the cloud.
In some embodiments, at least one of the processors may be packaged together with logic for one or more controllers of system control logic to form a System In Package (SiP). In some embodiments, at least one of the processors may be integrated on the same die with logic for one or more controllers of system control logic to form a system on a chip (SoC).
The distributed training and reasoning system may further comprise: input/output (I/O) devices. The I/O device may include a user interface to enable a user to interact with the distributed training and reasoning system; the design of the peripheral component interface enables the peripheral component to also interact with the distributed training and reasoning system.
The scheme adopted by the first embodiment partitions the model through multidimensional parallelism and improves the efficiency of distributed AI training and reasoning, realizing a 70% improvement in response speed and reducing response time from the original 30 seconds to 17-18 seconds. Through efficient memory partitioning and data-movement management, the largest model supported on each processor grows from the 1-billion-parameter scale to the 12-billion-parameter scale, reducing the number of GPUs (graphics processing units) required for large-model reasoning, lowering cost, and improving model availability and product performance. An automatic deployment scheme is provided that speeds up deployment by 5-10 times and can save the labor, time and cost required for distributed model deployment in the future.
In a possible implementation of the first embodiment, the obtaining task parameters of a plurality of AI tasks, obtaining a scheduling decision according to the task parameters of the AI tasks, and allocating the AI tasks to a plurality of hardware processors to obtain computing resources of the AI tasks specifically includes:
each AI task has parameters such as data (pictures/sentences), models (ViT/ResNet/TransFormer, etc.), types (training/fine tuning/reasoning), etc., and appropriate computing resources are allocated by adopting a task scheduling strategy.
Specifically, task scheduling is adjusted according to the batch size and other information, computing resources are fully utilized, and the average waiting time of tasks can be remarkably shortened;
when a user wants to start a training and reasoning task, a starting command is written into a file, the file is submitted to a scheduling system, and the scheduling system helps to queue, plan the training and reasoning task;
the scheduling mode is dynamic, scheduling can be performed according to the property of the task, the average time for completing the task is shortened, and adjustment can be performed according to the priority.
The task scheduling method of this embodiment achieves technical effects such as maximizing computing-power utilization, reducing the idle time of each processor and thread, shortening task waiting time, and shortening the time from task submission to computation completion.
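The embodiments do not give the scheduling algorithm itself; purely as a hedged illustration of a dynamic, priority-aware queue of the kind described (urgent tasks first, ties broken by shortest estimated runtime to cut mean waiting time), consider the following Python sketch, in which every task name and number is invented:

    import heapq

    # Each entry: (-priority, estimated_seconds, batch_size, name).
    # Tuples sort ascending, so negating priority dispatches urgent tasks
    # first, and ties are broken by shortest estimated runtime.
    tasks = [
        (-2, 30.0, 128, "fine-tune-bert"),
        (-1, 5.0, 32, "vit-inference"),
        (-2, 10.0, 256, "resnet-train"),
    ]
    heapq.heapify(tasks)

    free_gpus = ["gpu0", "gpu1"]  # illustrative resource pool
    while tasks and free_gpus:
        neg_prio, est, batch, name = heapq.heappop(tasks)
        gpu = free_gpus.pop()
        print(f"dispatch {name} (batch={batch}, ~{est}s) -> {gpu}")

A production scheduler would additionally re-queue unfinished tasks, track processor topology, and honor the dynamic priority adjustment described above.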
In a possible implementation of the first embodiment, the acquiring of the parallel processing result of the AI task after the parallel processing is executed, calculating a gradient according to the current output result of the model for the training task of the AI task, optimizing the AI task with the corresponding optimizer to obtain optimized AI model parameters, and iteratively updating the model parameters until a target number of iterations is reached or the training result meets the requirements further includes:
fine-tuning and prediction are performed on the AI model parameters of the AI task processed by the optimizer; through fine-tuning the model continues to be trained for the specific application, and finally the trained model is deployed to perform reasoning in the actual application;
the whole fine adjustment is basically the same as training; the specific functions of the inference process include: the speed is improved by at least 30 percent relative to a reference system, GPT-2 deployment can be completed on a single server, the memory is reduced by at least 30 percent, and the precision is ensured
The fine tuning is mainly performed by two methods.
Method one: freeze all the convolutional layers and train the personalized, customized fully connected layers.
Pre-train the model using InceptionV3: when this model was originally trained on the ImageNet dataset, its input image size was 299x299 and the image channel order was RGB. It should be noted that with a pre-training model, the data to be trained must be kept as close as possible to the original dataset, so as to maximize the model's image-recognition capability.
Preprocessing: the data are preprocessed in the pre-training model's original manner, normalized to [-1, 1].
A base model: the pre-trained model is imported (only the convolutional layer part) and all convolutional layer parameters are locked.
Custom model: the convolutional layers are followed first by Global Average Pooling (GAP), then Dropout, then a classifier whose output count is chosen according to the classification task. The model may have only about two thousand trainable parameters.
An optimizer: LARS is used.
Preparing data: the training set is divided into a training set and a validation set.
Callback functions are defined to facilitate training: the model is automatically saved at the end of each epoch, early stopping is performed with val_loss as the monitored metric, and the training history is synchronously updated to TensorBoard for visualization.
Batch Size: the training is performed with a larger batch size, which allows the model to converge faster and better.
Although the convolutional layers are all locked, training is still time-consuming, since samples must still be computed from the model's input to its output. Five epochs of training take more than ten minutes, with a validation-set loss of about 0.05.
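For example, method one can be assembled with Keras roughly as follows. This is a sketch under stated assumptions: a two-class task, Adam standing in for LARS (which Keras does not ship with), and invented file paths; it is not the embodiment's implementation.

    import tensorflow as tf

    # Base model: InceptionV3 convolutional layers pre-trained on ImageNet.
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", input_shape=(299, 299, 3))
    base.trainable = False  # freeze all convolutional layers

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),  # GAP
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(2, activation="softmax"),  # assumed 2 classes
    ])

    # LARS is not built into Keras; Adam stands in for this sketch.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint("ckpt_{epoch}.keras"),
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
        tf.keras.callbacks.TensorBoard(log_dir="./logs"),
    ]
    # x_train, y_train: images preprocessed to [-1, 1] as described above.
    # model.fit(x_train, y_train, validation_split=0.1,
    #           batch_size=256, epochs=5, callbacks=callbacks)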
Method two: derive the feature vectors and train a classifier independently:
Preprocessing: preprocess according to the pre-training model's requirements before deriving the feature vectors of the training and test sets; otherwise the derived features will not reflect the model's best performance.
A base model: the base model consists of the convolutional layer portion of InceptionV3 plus Global Average Pooling (GAP).
Derivation, i.e. prediction: derivation means having the base model directly predict on the training and test sets, except that the predicted result is not the image class but the feature vector (a condensed version of the feature map).
The derivation takes some time, usually one or two minutes, because it requires prediction over all pictures of the data set.
The inputs to the new model are feature vectors: the input of the new model is no longer the training-set images but the image feature vectors after 'digestion' by the pre-training model. The first dimension of the vector array corresponds to the image samples, its length being the number of samples; the second dimension is the average value of the output feature map of each convolution kernel of the base model's last layer, and for InceptionV3 its length is 2048.
Dividing a training set and a verification set: note that here the training set and validation set are partitioned over the input feature vectors.
Customizing the new model: since the feature vectors have already been derived, only one fully connected network with input feature length 2048 needs to be trained next.
The same training is done with the callback function and the larger batch size (4096).
The training speed improves markedly: five epochs take only about ten seconds and reach a loss of about 0.02 on the validation set.
At this point the model can essentially reach the Top 20 of the Kaggle leaderboard. If models such as ResNet50 and Xception are further fused, the Top 10 can be reached.
The model may subsequently be fine-tuned to further improve performance.
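Method two can be sketched in the same hedged style; the random array below merely stands in for a preprocessed dataset, and the two-class head is again an assumption:

    import numpy as np
    import tensorflow as tf

    # Feature extractor: InceptionV3 convolutional layers plus GAP.
    extractor = tf.keras.Sequential([
        tf.keras.applications.InceptionV3(
            include_top=False, weights="imagenet",
            input_shape=(299, 299, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
    ])

    # "Derivation is prediction": run the frozen base once over the data.
    images = np.random.uniform(-1, 1, (16, 299, 299, 3)).astype("float32")
    features = extractor.predict(images)  # shape (num_samples, 2048)

    # New model: a single fully connected network over the 2048-d vectors.
    inputs = tf.keras.Input(shape=(2048,))
    outputs = tf.keras.layers.Dense(2, activation="softmax")(
        tf.keras.layers.Dropout(0.5)(inputs))
    head = tf.keras.Model(inputs, outputs)
    head.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # head.fit(features, labels, validation_split=0.1,
    #          batch_size=4096, epochs=5)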
The technical effects of this fine-tuning and reasoning method on training include: the pre-training model becomes competent at recognizing the new dataset, its original feature-extraction capability is fully released and utilized, and the model can reach a lower loss.
In a possible implementation of the first embodiment, the performing of steps S002 and S003 further includes:
completing data migration of the AI task between the hardware processors by segmenting and/or offloading optimizer states, gradients and model parameters;
the AI task includes a picture processing task and/or a natural language processing task.
By segmenting and/or offloading the optimizer states (stage 1), the gradients (stage 2) and the model parameters (stage 3), the GPU memory stores only the data required by the current computation. This reduces the GPU memory consumed during training and ultimately allows the scheme to train/fine-tune an extremely large AI model with an extremely small amount of GPU resources. When GPU memory is insufficient, the model is offloaded to CPU memory, and further to the hard disk.
For a large model, the enormous number of parameters means the model itself and the corresponding optimizer consume a large amount of space, such as GPU video memory. When computing resources (processor count and capacity) are limited, or a single processor cannot handle its share even after multidimensional parallel partitioning, dynamic memory/disk management is used: model parameters, optimizer states, gradients and other information are placed dynamically with the help of CPU memory or high-speed hard disk capacity, and only the information required for the current computation is kept in GPU video memory.
The picture processing task processes the feature data of pictures; the natural language processing task processes the feature data of sentences.
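A toy sketch of the offloading idea follows: a single SGD momentum tensor is kept in CPU memory so that GPU memory never holds the optimizer state. It assumes one CUDA device and is called only after a backward pass has produced gradients; real systems of the kind described also partition and offload gradients and parameters and overlap the transfers with computation.

    import torch

    layer = torch.nn.Linear(4096, 4096).cuda()  # hypothetical layer on GPU

    # Optimizer state (momentum) lives in pinned CPU memory, not GPU memory.
    momentum_cpu = torch.zeros(layer.weight.shape).pin_memory()

    def offloaded_sgd_step(lr=0.01, beta=0.9):
        # 1. Move the gradient down to CPU, next to the optimizer state.
        grad_cpu = layer.weight.grad.detach().to("cpu")
        # 2. Update the optimizer state entirely in CPU memory.
        momentum_cpu.mul_(beta).add_(grad_cpu)
        # 3. Send only the resulting update back to the GPU parameter.
        layer.weight.data.add_(momentum_cpu.to(layer.weight.device),
                               alpha=-lr)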
In a possible implementation of the first embodiment, the obtaining the computation resources of the AI tasks allocated to the multiple hardware processors, and performing multidimensional parallel processing on the training tasks of the AI tasks on the respective hardware processors to obtain the output result of the AI tasks specifically include:
the data parallelism distributes the AI tasks to the hardware processors, yielding the total batch size of data processed by all hardware processors simultaneously and the batch size of data processed by each hardware processor at a time;
in data parallelism the data are partitioned: each node (or process) holds a copy of the model and takes different data, usually one batch, then completes the forward and backward computation independently to obtain the gradients. These training processes are the workers. Besides the workers there are parameter servers (ps servers for short): the workers send their computed gradients to the ps servers, which perform the update operation and then transmit the updated model back to each node.
Data parallelism enlarges the equivalent batch size, i.e. the number of parallel processors times the per-processor batch size, thereby accelerating computation. For example, with 128,000 data items, a per-processor batch size of 128 and 2 seconds per model update, one pass requires 2 x 128000/128 = 2000 seconds (1000 updates); with 100 processors in parallel, the equivalent batch size is 12,800 and only 2 x 128000/12800 = 20 seconds (10 updates) are needed.
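As one hedged example of this arithmetic in practice, the sketch below uses PyTorch's all-reduce-based DistributedDataParallel, standing in for the parameter-server arrangement described above; the model, sizes and launch command are illustrative:

    # Launch (illustrative): torchrun --nproc_per_node=4 train_dp.py
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="nccl")  # one process per GPU
        rank = dist.get_rank()
        torch.cuda.set_device(rank)

        model = DDP(torch.nn.Linear(512, 10).cuda(rank),
                    device_ids=[rank])
        opt = torch.optim.SGD(model.parameters(), lr=0.1)

        # Each worker takes its own batch of 128; with 4 workers the
        # equivalent (global) batch size is 4 x 128 = 512.
        x = torch.randn(128, 512, device=f"cuda:{rank}")
        y = torch.randint(0, 10, (128,), device=f"cuda:{rank}")

        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()  # gradients are averaged across workers here
        opt.step()

    if __name__ == "__main__":
        main()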
Sequence parallelism extends the length of the data a Transformer-type model can accept, so that long texts in NLP and high-resolution pictures (large images)/video in CV tasks can be processed (a picture can be cut into small patches which, arranged in order, also form a sequence; a video is a sequence of pictures, and each picture can be segmented and/or offloaded). Without sequence parallelism, the data could only be truncated and processed piecewise, degrading performance. The Transformer model is currently at the forefront of deep learning and performs excellently well beyond NLP-related tasks; for CV tasks, Transformer-based models such as ViT have been derived.
The sequence parallelism further segments and/or offloads and distributes the data, placing each AI task on a plurality of processors;
after the computing resources are obtained, the picture processing tasks and/or the feature data of the pictures are distributed to the processors (e.g. GPU/CPU) through data parallelism, and sequence parallelism further segments and/or offloads and distributes the data. When a single data item is too long for a single processor to handle, sequence-parallel segmentation and/or offloading places one data item on multiple processors, whose communication makes the computation equivalent to directly processing the whole complete data item.
The pipeline parallelism splits the model into a plurality of sections, deploys each section on a different hardware processor, and connects the sections in series in model order, the output of the previous section serving as the input of the next section;
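A minimal sketch of this splitting, assuming two CUDA devices, is shown below; production pipelines additionally split each batch into micro-batches so that all sections stay busy at once:

    import torch

    # Two sections of one model, deployed on different processors.
    stage1 = torch.nn.Sequential(
        torch.nn.Linear(512, 1024), torch.nn.ReLU()).to("cuda:0")
    stage2 = torch.nn.Linear(1024, 10).to("cuda:1")

    def forward(x):
        h = stage1(x.to("cuda:0"))     # section 1
        return stage2(h.to("cuda:1"))  # its output feeds section 2

    out = forward(torch.randn(32, 512))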
the multi-dimensional grid parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
The 2.5-dimensional grid parallelism aims to provide a scalable new parallel framework for deep learning models that minimizes expensive transmission losses between graphics processors and offers a flexible, efficient architecture, further improving the speed and efficiency of model parallelism.
The 2.5-dimensional grid parallelism adopts a flexible architecture design: users can flexibly set the parameters of model parallelism to use limited graphics processor resources efficiently. Sitting between 2D and 3D, the architecture combines the simplicity of the 2D design with the efficiency of 3D. Having the characteristics of both, 2.5-dimensional grid model parallelism can be applied to large-scale deep learning models as freely as 2D while retaining the efficiency of 3D, so the design is compatible with the widest range of deep learning models and applications and greatly improves model efficiency.
As a newly proposed model-parallel scheme, 2.5-dimensional grid parallelism (Tesseract) improves running speed by 1.375 times and 1.5293 times over the traditional 1-dimensional and 2-dimensional model-parallel architectures respectively (64x NVIDIA Quadro RTX 5000). By reducing the number of transmissions between graphics processors, 2.5-dimensional grid model parallelism (Tesseract) greatly improves the overall operating efficiency of model parallelism and further reduces the training cost of deep learning models, including the required number of graphics processors and the waiting time. In addition, tests on a ViT model show that 2.5-dimensional grid model parallelism (Tesseract) reaches the same training accuracy as non-parallel training.
FIG. 4 is a structural layout diagram of the 2.5-dimensional grid-parallel scheme, in which the p processors are arranged in a 2.5-dimensional layout [q, q, d], where d is the depth.
The 2.5-dimensional grid parallelism takes a matrix A of size [a, b] and a matrix B of size [b, c], and after block-wise computation and merging obtains the matrix C of size [a, c]. The algorithm is implemented as follows, where q denotes the grid dimension, b the batch size, h the hidden size and s the sequence length:
Input: matrix A of size [a, b]; matrix B of size [b, c]
Output: matrix C of size [a, c] = A x B
1. Partition A into block matrices of size [a/(q*d), b/q] and B into block matrices of size [b/q, c/q];
2. for i in {0, ..., q*d-1} and j in {0, ..., q-1}: compute h = i mod q and k = i div q, store A_{ij} on processor p_{kjh}, set C_{ij} = 0 and store C_{ij} on p_{kjh};
3. for i in {0, ..., q-1}, j in {0, ..., q-1} and k in {0, ..., d-1}: store B_{ij} on processor p_{ijk};
4. for i, j in {0, ..., q-1} and k in {0, ..., d-1}, executing in parallel for each t in {0, ..., q-1}: broadcast A_{itk} from p_{itk} to p_{ijk}, broadcast B_{tjk} from p_{tjk} to p_{ijk}, and accumulate C_{ijk} = C_{ijk} + A_{itk} * B_{tjk};
5. merge all blocks C_{ijk} to obtain the matrix C.
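The block decomposition at the heart of this algorithm can be emulated serially. The sketch below ignores the depth dimension d and the broadcasts (every block lives in one address space here); it shows only the block-wise accumulation pattern and checks it against the full product:

    import numpy as np

    def grid_matmul(A, B, q):
        # Block C[i,j] would live on its own processor; the loop over t
        # mirrors the broadcast-and-accumulate steps described above.
        a, b = A.shape
        _, c = B.shape
        ar, br, bc = a // q, b // q, c // q  # block sizes
        C = np.zeros((a, c))
        for i in range(q):
            for j in range(q):
                for t in range(q):
                    C[i*ar:(i+1)*ar, j*bc:(j+1)*bc] += (
                        A[i*ar:(i+1)*ar, t*br:(t+1)*br]
                        @ B[t*br:(t+1)*br, j*bc:(j+1)*bc])
        return C

    A = np.random.rand(8, 6)
    B = np.random.rand(6, 4)
    assert np.allclose(grid_matmul(A, B, q=2), A @ B)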
The 3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into blocks by rows and columns, so the multiplication of a large matrix is decomposed into multiplications of many small matrices.
In the original version of three-dimensional matrix multiplication, each matrix is stored on only one face (a subset of the GPUs), wasting storage resources; this scheme optimizes the storage and communication algorithm so that matrix storage is spread across all processors.
Fig. 5 is a diagram of a matrix-vector parameter balancing structure according to an embodiment of the present invention, where a vector B is uniformly stored on a diagonal (i, l, j) of a B plane by using load balancing optimization and matrix-vector operation, and C is calculated as a + B;
when the parameter scale is fixed on each GPU, the step time of the 3D method is the minimum compared with 1D and 2D (the 3D is 0.672 seconds, the 1D is 1.560 and the 2D is 1.052); with the total parameter scale fixed, the 3D method is accelerated by 2.3 and 1.6 times than the 1D and 2D methods, respectively. Fig. 6 and 7 are schematic diagrams of weak-expansion-efficiency and strong-expansion-efficiency comparisons, respectively, in which the problem size (calculation amount) is increased as the number of processors increases, that is, the parameter size per gpu is fixed, and the number of gpus is increased, and in which the problem size is kept unchanged, and the number of processors is increased to find the number of processors most suitable for solving the problem. I.e. the time taken is as short as possible without incurring too much overhead, the average time taken to derive 3-dimensional model parallelism is less, accelerated by 2.32 and 1.57 times compared to 1 and 2 dimensions, respectively.
Data parallelism and sequence parallelism combined with 2/2.5/3-dimensional grid parallelism (2/2.5/3-dimensional model parallelism) constitute 4/4.5/5-dimensional parallelism, which can be further combined with pipeline parallelism into 5/5.5/6-dimensional parallelism.
The specific dimension of the 2/2.5/3-dimensional model parallelism in multi-dimensional grid parallelism is determined by the processor count. Specifically, 2-dimensional model parallelism requires a × a processors, e.g. 2 × 2 = 4, 3 × 3 = 9, 4 × 4 = 16; 2.5-dimensional model parallelism requires a × a × b processors, e.g. 2 × 2 × 1 = 4, 2 × 2 × 2 = 8, 2 × 2 × 3 = 12; 3-dimensional model parallelism requires a × a × a processors, e.g. 2 × 2 × 2 = 8, 3 × 3 × 3 = 27.
Even with the same processor count of 8, the specific operation of 2.5-dimensional model parallelism differs from that of 3-dimensional model parallelism; with 4 processors, 2.5-dimensional model parallelism likewise differs from 2-dimensional model parallelism.
When the processor count is compatible with several model-parallel configurations (for example 64, which fits all three), the specific choice must be tuned further against actual running performance (speed), because different operating environments differ in processor performance, memory, communication bandwidth and processor network topology, and the models and data used by different tasks also vary widely.
Model parallelism in 2/2.5/3 dimensions decomposes the model parameters of the AI task across the processors; since the capacity of a single machine is limited, the capacities of all machines are combined to hold the decomposed model, which on the one hand allows a larger model to be accommodated overall, and on the other reduces parameter communication during computation.
The data of the AI task, such as pictures or sentences, are fed into the model, and the processors communicate with each other during the forward computation, which is equivalent to computing directly on the complete long-sequence data. The forward computation produces an output result, which is compared with the training data label to obtain the loss function value; the gradient is then computed backwards and used to update the model parameters in the next step. Both the forward and the backward computation can run in parallel through the 2/2.5/3-dimensional model parallelism, accelerating the computation.
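As an illustration of this forward, loss, backward, update cycle, a minimal single-process PyTorch sketch follows; the stand-in linear model, data shapes and SGD optimizer are assumptions, and the patented scheme runs these same steps under multi-dimensional parallelism:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                     # stand-in for the partitioned model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128)                       # a batch of input data
label = torch.randint(0, 10, (32,))            # the training data labels

output = model(x)                              # forward computation
loss = loss_fn(output, label)                  # compare with labels -> loss value
loss.backward()                                # backward computation of the gradient
optimizer.step()                               # gradient updates the model parameters
optimizer.zero_grad()
```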
In a possible implementation of the first embodiment, obtaining a parallel processing result of the AI task after the parallel processing is performed, calculating a gradient from the current output result of the model for the training task of the AI task, optimizing the AI task with the optimizer corresponding to the AI task to obtain optimized AI model parameters, and iteratively updating the model parameters until a target iteration count is reached or the training result meets the requirement specifically includes:
the optimizer algorithm corresponding to the AI task comprises but is not limited to a LAMB optimizer and/or a LARS optimizer and/or a ConAdv optimizer and/or a La-Lars optimizer;
the LAMB, LARS, ConAdv optimizers are suitable for large batches of training,
the LARS is used for processing of computer vision-related AI tasks;
the LAMB is used for processing natural language-related AI tasks;
the ConAdv is suitable for processing AI tasks with high speed requirements and low precision requirements;
the La-Lars is suitable for processing AI tasks with narrow communication bandwidth and high network communication cost.
Although data parallelism can increase training speed by enlarging the (equivalent) batch size, large batches make optimization difficult, and an optimizer designed specifically for large batches must be used to ensure good convergence. LAMB/LARS/ConAdv are all suited to large-batch training: LARS is best suited to computer vision tasks (extending the batch size of CV tasks to 32K); LAMB is best suited to natural language processing tasks (extending the batch size of NLP tasks to 64K); and ConAdv suits CV tasks that pursue extreme speed with slightly relaxed precision requirements (extending the CV batch size to 96K with a slight loss of accuracy).
In addition, data parallelism requires gradients to be transmitted over the network so that model parameters are updated synchronously, and the communication volume is extremely large (proportional to the model size, i.e. the number of model parameters), especially for today's ever-larger models. If the communication bandwidth of the system (the amount of data that can be transmitted simultaneously) is small, the running speed is severely slowed, so a large-batch optimizer with a small communication volume must be chosen.
The LAMB and/or LARS and/or ConAdv and/or La-Lars optimizers are the extensible large-scale optimizers required for training large AI models, and different optimizers can be selected as needed: LAMB/LARS/ConAdv all suit large-batch training, with LARS best for computer vision tasks, LAMB best for natural language processing tasks, and ConAdv further extending the maximum batch size of computer vision training. APS and La-Lars suit situations where the communication bandwidth is narrow and network communication cost becomes the bottleneck: APS mainly uses low-precision gradients, while La-Lars mainly uses gradient compression. APS may require only about 1/4 of the communication volume with almost no loss of accuracy; La-Lars further compresses the communication volume to roughly one thousandth to accommodate narrow communication bandwidth, at a slight cost in accuracy.
FIG. 8 is a statistical chart of the experimental effect of the LAMB algorithm: under mixed-batch-size training (64k/32k), ADAMW cannot converge, while LAMB reaches an acceleration efficiency of 101.8% (with 64 times the computing resources, computation speed increases 65.2 times).
La-Lars is a gradient sparsification algorithm (see fig. 9): only the important gradients are sent in each gradient exchange, while the remaining gradients accumulate locally and are transmitted in the future.
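The send-important-gradients idea can be sketched as follows in PyTorch; the 0.1% ratio matches the compression figure cited later in this section, while the function name and plain top-k selection are assumptions of this sketch rather than the full La-Lars algorithm (which additionally applies layer-wise rate scaling):

```python
import torch

def sparsify_with_residual(grad, residual, ratio=0.001):
    # Fold in gradient left over from earlier rounds, send only the largest
    # |values|, and keep the rest locally for a future exchange.
    g = grad + residual
    k = max(1, int(g.numel() * ratio))
    _, idx = torch.topk(g.abs().flatten(), k)
    sent = torch.zeros_like(g).flatten()
    sent[idx] = g.flatten()[idx]            # the "important" gradients to transmit
    sent = sent.view_as(g)
    return sent, g - sent                   # (transmitted part, new local residual)

grad = torch.randn(1000, 100)
residual = torch.zeros_like(grad)
sent, residual = sparsify_with_residual(grad, residual)   # ~0.1% of entries sent
```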
To speed up training, one of the simplest methods is to increase the number of compute nodes. However, when the number of nodes is large, the network communication cost becomes a bottleneck. Meanwhile, when the batch size exceeds a certain size, the generalization performance of the neural network may deteriorate.
LARS addresses the performance degradation caused by large-batch deep learning training. It is a layer-wise adaptive rate scaling optimizer that can extend the batch size to 32K without loss of performance. However, because of the sparse representation of gradients and local gradient accumulation, this scheme cannot simply use DGC and LARS together, as doing so leads to the gradient staleness problem.
This scheme therefore proposes the LA-LARS algorithm, which converges faster and loses less performance than directly using DGC and LARS together. On the MNIST and CIFAR-10 datasets, LA-LARS outperforms other baseline optimizers while guaranteeing a 0.1% compression ratio; on the ImageNet dataset, it requires only 60%-70% of the training time to reach performance similar to the baseline optimizers.
In a second embodiment, referring to fig. 10, the present application provides an artificial intelligence based distributed training and reasoning system for hardware processors, the system executing on a software platform and using a machine learning library to process various application data;
the hardware processor includes but is not limited to: CPU, GPU, FPGA, TPU;
characterized in that the system comprises:
the scheduling module is used for acquiring task parameters of a plurality of AI tasks, acquiring scheduling decisions according to the task parameters of the AI tasks, and distributing the AI tasks to a plurality of hardware processors to obtain computing resources of the AI tasks;
the multidimensional parallel module is used for acquiring the computing resources of the AI tasks distributed to the hardware processors, executing multidimensional parallel processing on the hardware processors of the training tasks of the AI tasks and acquiring the output result of the AI tasks;
the extensible optimization module is used for acquiring a parallel processing result of the AI task after parallel processing is executed, calculating a gradient according to a current output result of a model aiming at a training task of the AI task, optimizing the AI task by adopting an optimizer corresponding to the AI task to obtain an optimized AI model parameter, and continuously updating the iteration model parameter until a target iteration number is reached or the training result meets the requirement;
an optimization algorithm is used in the distribution process to optimize scheduling decisions;
the parallel processing mode comprises data parallel, sequence parallel, pipeline parallel and multidimensional grid parallel processing.
The distributed training and reasoning system based on artificial intelligence operates in a cloud end and is in communication interaction with local data;
the artificial intelligence based distributed training and reasoning system is executed on a software platform, wherein the software platform comprises but is not limited to CUDA and ROCM;
the artificial intelligence based distributed training and reasoning system uses machine learning libraries including, but not limited to, TensorFlow, Keras, PyTorch.
The scheme adopted in the second embodiment partitions the model through multi-dimensional parallelism and improves the efficiency of distributed AI training and inference, achieving a 70% improvement in response speed and reducing response time from the original 30 seconds to 17-18 seconds; through efficient memory partitioning and data-movement management, the largest model supported on each processor grows from the 1-billion-parameter scale to the 12-billion-parameter scale, reducing the number of processors required for large-model inference, lowering cost, and improving model availability and product performance; an automatic deployment scheme is provided that speeds up deployment by 5-10 times and can save the labor, time and cost required for distributed model deployment in the future.
In a possible implementation of the second embodiment, the scheduling module automatically manages multiple AI tasks according to their batch sizes, maximizes hardware processor utilization based on the batch size of each AI task, and continuously optimizes the scheduling decision through an optimization algorithm, specifically including:
each AI task carries parameters such as its data (pictures/sentences), model (ViT/ResNet/Transformer, etc.) and type (training/fine-tuning/inference), and a task scheduling strategy allocates appropriate computing resources;
specifically, task scheduling is adjusted according to the batch size and other information so as to make full use of the computing resources, which can significantly shorten the average waiting time of tasks;
when a user wants to start a training or inference task, the start command is written into a file and the file is submitted to the scheduling system, which queues and plans the training and inference tasks;
the scheduling is dynamic: tasks can be scheduled according to their properties, shortening the average task completion time, and adjusted according to priority.
The task scheduling method of this embodiment achieves good technical effects such as maximizing compute utilization, reducing the idle time of each processor and thread, shortening task waiting time, and shortening the time from task submission to completion of computation.
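As a purely illustrative sketch of the queueing and packing behaviour described above (the Task fields, the priority rule and the capacity measured in summed batch sizes are all assumptions, not the patented scheduler):

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                          # lower value is served earlier
    batch_size: int = field(compare=False)
    name: str = field(compare=False)

def schedule(tasks, capacity):
    # Greedily pack queued tasks so that their summed batch sizes approach
    # `capacity` (a stand-in for processor utilization); a real scheduler
    # would re-queue tasks that do not fit instead of dropping them.
    heap = list(tasks)
    heapq.heapify(heap)
    placed, used = [], 0
    while heap:
        task = heapq.heappop(heap)
        if used + task.batch_size <= capacity:
            placed.append(task.name)
            used += task.batch_size
    return placed, used

jobs = [Task(0, 64, "finetune-bert"), Task(1, 512, "train-vit"), Task(2, 128, "infer-gpt2")]
print(schedule(jobs, capacity=600))        # (['finetune-bert', 'train-vit'], 576)
```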
In a possible implementation of the second embodiment, the system further includes:
the fine-tuning and reasoning module is used for fine-tuning and prediction of the AI task processed by the optimizer: fine-tuning continues training the model for the specific application, and the trained model is finally deployed to perform inference for the actual application;
fine-tuning as a whole is essentially the same as training; the specific capabilities of the inference process include: a speed improvement of at least 30% over the reference system, GPT-2 deployment completed on a single server, memory reduced by at least 30%, and precision preserved.
Fine-tuning is mainly performed by one of two methods.
Method one: freeze all convolutional layers and train a personalized, custom fully-connected layer.
The model is pre-trained using InceptionV3: when this model was originally trained on the ImageNet dataset, its input images were 299x299 with RGB channel order. Note that with a pre-trained model, the data to be trained should be kept as close as possible to the original dataset, so as to draw out the model's image recognition capability to the fullest.
Preprocessing: the data are preprocessed in the pre-trained model's original manner, normalized to [-1, 1].
Base model: the pre-trained model is imported (convolutional part only) and all convolutional-layer parameters are locked.
Custom model: the convolutional stack is followed first by Global Average Pooling (GAP), then Dropout, then the classifier, whose output count is chosen according to the classification task. The model may have only about two thousand trainable parameters.
Optimizer: LARS is used.
Data preparation: the training set is split into a training set and a validation set.
Callback functions are defined to facilitate training: the model is saved automatically at the end of each epoch, early stopping is performed with val_loss as the monitored metric, and the training history is synchronized to TensorBoard for visualization.
Batch size: training uses a larger batch size, which lets the model converge faster and better.
Although the convolutional layers are all locked, training is still time-consuming, since each sample must be computed from the input of the model through to the output. Five epochs of training take more than ten minutes and reach a validation loss of about 0.05.
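A minimal Keras sketch of method one follows, under stated assumptions: a binary classification task (consistent with the roughly two thousand trainable parameters noted above) and Adam standing in for LARS, which is not a stock Keras optimizer:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

# Base: InceptionV3 convolutional part with ImageNet weights; 299x299 RGB inputs.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
base.trainable = False                        # lock all convolutional layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),          # GAP after the conv stack
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),    # ~2,049 trainable parameters
])

# Inputs are assumed already scaled to [-1, 1], e.g. with
# tf.keras.applications.inception_v3.preprocess_input.
model.compile(optimizer="adam",               # the scheme uses LARS; Adam stands in
              loss="binary_crossentropy", metrics=["accuracy"])

cbs = [
    callbacks.ModelCheckpoint("model-{epoch:02d}.h5"),        # save each epoch
    callbacks.EarlyStopping(monitor="val_loss", patience=2),  # early stop on val_loss
    callbacks.TensorBoard(log_dir="./logs"),                  # history to TensorBoard
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=1024, epochs=5, callbacks=cbs)
```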
Method two: derive feature vectors and train a classifier separately:
Preprocessing: preprocess according to the pre-trained model's requirements before deriving the feature vectors of the training and test sets; otherwise the derived features will not reflect the model's best performance.
Base model: the base model consists of the convolutional part of InceptionV3 plus Global Average Pooling (GAP).
Derivation is prediction: derivation actually means letting the base model predict directly on the training and test sets, except that the predicted result is not an image class but a feature vector (a condensed version of the feature map).
Derivation takes a certain amount of time, usually one or two minutes, because it requires a prediction for every picture in the dataset.
The inputs to the new model are feature vectors: the input of the new model is no longer the training-set images but the image feature vectors "digested" by the pre-trained model. The first dimension indexes the image samples, with length equal to the number of samples; the second dimension is the mean of the output feature map of each convolution kernel in the last layer of the base model, and for InceptionV3 its length is 2048.
Splitting training and validation sets: note that here the split is performed on the input feature vectors.
Custom new model: since the feature vectors are already derived, only a fully-connected network with an input feature length of 2048 needs to be trained next.
Training uses the same callback functions and the larger batch size (4096).
Training speed improves markedly: five epochs take only about ten seconds and reach a validation loss of about 0.02.
At this point the model can essentially reach the Top 20 of the Kaggle leaderboard; further fusing models such as ResNet50 and Xception can reach the Top 10.
The model may be subsequently fine-tuned to further improve model performance.
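A minimal Keras sketch of method two follows; the randomly generated arrays are placeholders for a real preprocessed dataset:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Base: InceptionV3 conv stack + GAP; "predicting" with it yields 2048-d vectors.
base = models.Sequential([
    tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                      input_shape=(299, 299, 3)),
    layers.GlobalAveragePooling2D(),
])

# Placeholder images already preprocessed to [-1, 1]; derivation is one predict pass.
x_train = np.random.uniform(-1, 1, (16, 299, 299, 3)).astype("float32")
features = base.predict(x_train)              # shape: (num_samples, 2048)

# The new model trains on the feature vectors only: a single dense head.
head = models.Sequential([
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
head.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
y_train = np.random.randint(0, 2, (16, 1))    # placeholder labels
head.fit(features, y_train, batch_size=4096, epochs=5, validation_split=0.2)
```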
The technical effects of this fine-tuning and reasoning method on training include: the pre-trained model becomes competent at recognizing the new dataset, the original feature-extraction capability of the pre-trained model is fully released and utilized, and the model can reach a lower loss.
The dynamic memory/disk management module completes data migration of the AI task between the hardware processors by partitioning and/or offloading the optimizer states, gradients and model parameters;
the AI task includes a picture processing task and/or a natural language processing task.
By partitioning and/or offloading the optimizer states (stage 1), the gradients (stage 2) and the model parameters (stage 3), GPU memory holds only the data required by the current computation, reducing the GPU memory consumed during training and ultimately allowing this scheme to train or fine-tune extremely large AI models with a very small amount of GPU resources. When GPU memory is insufficient, the model is offloaded to CPU memory, and further to the hard disk.
For a large model, the enormous number of parameters means that the model itself and the corresponding optimizer consume a great deal of space, such as GPU memory. When computing resources (processor count and capacity) are limited, or when a single processor still cannot hold its share even after multi-dimensional parallel partitioning, dynamic memory/disk management is needed: model parameters, the corresponding optimizer states, gradients and other information are placed dynamically with the help of CPU memory or a high-speed hard disk, and only the information required by the current computation is kept in GPU memory.
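A minimal sketch of this stage-wise placement idea, assuming PyTorch tensors; the function name, the GPU budget and the CPU-spill cap are illustrative assumptions, not the patented manager:

```python
import os
import torch

def place(tensor, name, gpu_budget_bytes, used, offload_dir="./offload"):
    # Keep hot data on the GPU while the budget allows; spill to CPU RAM,
    # and from CPU RAM to disk, mirroring the stage-wise offload above.
    size = tensor.numel() * tensor.element_size()
    if torch.cuda.is_available() and used + size <= gpu_budget_bytes:
        return tensor.cuda(), used + size        # stays in GPU memory
    if size <= (1 << 30):                        # assumed 1 GiB cap for a CPU spill
        return tensor.cpu(), used                # moved to CPU memory
    os.makedirs(offload_dir, exist_ok=True)
    path = os.path.join(offload_dir, name + ".pt")
    torch.save(tensor.cpu(), path)               # moved to the hard disk
    return path, used                            # caller reloads with torch.load(path)

used = 0
params = torch.randn(1024, 1024)                  # stand-ins for model parameters
grads = torch.randn(1024, 1024)                   # ... and their gradients
params, used = place(params, "params", gpu_budget_bytes=2 << 20, used=used)
grads, used = place(grads, "grads", gpu_budget_bytes=2 << 20, used=used)
```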
The picture processing task processes the feature data of pictures; the natural language processing task processes the feature data of sentences.
In a possible implementation of the second embodiment, the multidimensional parallel module performs parallel processing of the AI task on the hardware processors using data parallelism, sequence parallelism, pipeline parallelism and multi-dimensional grid parallelism, further including:
data parallelism distributes the AI tasks across the hardware processors, yielding the total batch size processed simultaneously by all hardware processors and the batch size processed by each hardware processor at a time;
in data parallelism the data are partitioned: each node (or process) holds a copy of the model, takes different data (usually one batch), and completes the forward and backward computation to obtain its gradient. These training processes are the workers; besides the workers there are parameter servers (ps servers for short). The workers send their computed gradients to the ps servers, the ps servers perform the update operation, and the updated model is then transmitted back to each node.
Data parallelism enlarges the equivalent batch size, computed as the parallel processor count times the per-processor batch size, and thereby accelerates computation. For example, with 128,000 data items, a per-processor batch size of 128 and 2 seconds per model update, one pass takes 2 × 128000/128 = 2000 seconds (1000 updates); with 100 processors in parallel, the equivalent batch size is 12,800 and only about 2 × 128000/12800 = 20 seconds (10 updates) are needed.
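The arithmetic of this example can be checked in a few lines:

```python
total_items, per_proc_batch, secs_per_update = 128_000, 128, 2
for processors in (1, 100):
    equivalent_batch = per_proc_batch * processors       # 128 or 12,800
    updates = total_items // equivalent_batch            # 1000 or 10 updates
    print(processors, equivalent_batch, updates * secs_per_update)  # 2000 s vs 20 s
```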
Sequence parallelism extends the data length that a Transformer-type model can accept, enabling long texts in NLP and high-resolution pictures (large pictures) or video in CV tasks to be processed (a picture can be cut into small patches, which, arranged in order, also form a sequence; a video is a sequence of pictures, and each picture can be cut and/or offloaded). Without sequence parallelism, such data could only be truncated and processed piecewise, degrading performance. The Transformer model is currently at the forefront of deep learning, with excellent performance beyond NLP-related tasks; for CV tasks, Transformer-based models such as ViT have been derived.
Sequence parallelism further partitions and/or offloads and distributes the data, placing each AI task on multiple processors;
after computing resources are obtained, the picture processing tasks and/or the feature data of pictures are processed and distributed to the processors (e.g. GPU/CPU) through data parallelism, and sequence parallelism further partitions and/or offloads and distributes the data. When a single data item is too long for one processor to handle, sequence-parallel partitioning and/or offloading places that one data item on multiple processors, and through communication the computation is equivalent to processing the whole complete data item directly.
Pipeline parallelism splits the model into several segments, deploys each segment on a different hardware processor, and chains the segments in series in model order, the output of one segment serving as the input of the next;
in pipeline parallelism, each device is responsible for the forward and the corresponding backward computation of a subset of layers. In a training scenario, because the next step can only proceed after the backward pass of the current step finishes, each device incurs bubble waiting, which keeps pipeline device utilization low; enlarging the batch size of each training step and splitting the batch into many micro-batches improves device utilization.
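For illustration, a minimal PyTorch sketch of micro-batching follows; the two stages here run serially in one process, whereas a real pipeline places them on different devices and overlaps their execution:

```python
import torch
import torch.nn as nn

# Two pipeline stages; in a real deployment each stage would live on a
# different device and the stages would run concurrently.
stage1 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
stage2 = nn.Linear(64, 10)

batch = torch.randn(32, 64)
micro_batches = batch.chunk(4)      # split one large batch into 4 micro-batches

outputs = []
for mb in micro_batches:
    h = stage1(mb)                  # once stage1 finishes a micro-batch, stage2 can
    outputs.append(stage2(h))       # start on it while stage1 takes the next one
result = torch.cat(outputs)         # same result as running the full batch at once
```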
The multi-dimensional grid parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
The 2.5-dimensional grid parallelism aims to design a quantifiable, novel deep learning model-parallel framework that minimizes the expensive transmission overhead between graphics processors and provides a flexible, efficient architecture, further improving the speed and efficiency of model parallelism.
The 2.5-dimensional grid parallelism adopts a flexible architecture design: a user can flexibly set the various model-parallel parameters to make efficient use of limited graphics processor resources, and the architecture sits between 2D and 3D, combining the simplicity of the 2D design with the efficiency of 3D. Combining the characteristics of 2D and 3D, 2.5-dimensional grid model parallelism can be applied to large-scale deep learning models as freely as the 2D approach while retaining the efficiency of 3D; this design is compatible with the widest variety of deep learning models and applications and substantially improves model efficiency.
As a newly proposed model-parallel scheme, 2.5-dimensional grid parallelism (Tesseract) improves running speed by 1.375 times and 1.5293 times (on 64x NVIDIA Quadro RTX 5000) over the traditional 1-dimensional and 2-dimensional model-parallel architectures, respectively. By reducing the number of transmissions between graphics processors, 2.5-dimensional grid model parallelism (Tesseract) greatly improves the overall operating efficiency of model parallelism and further reduces the training cost of deep learning models, including the number of graphics processors required and the waiting time. In addition, tests of this scheme on a ViT model show that 2.5-dimensional grid parallelism (Tesseract) reaches the same training accuracy as non-parallel training.
The 2.5-dimensional grid-parallel scheme arranges the p processors in a 2.5-dimensional layout [q, q, d], where d is the depth and p = q × q × d.
The 2.5-dimensional grid-parallel operation splits a matrix A of size [a, b] and a matrix B of size [b, c] and then merges the partial results into a matrix C of size [a, c], according to the following algorithm, wherein q denotes the grid dimension, b the batch size, h the hidden size, and s the sequence length:
Input: matrix A [a, b], matrix B [b, c]
Output: matrix C [a, c] = A × B
The matrices A and B are divided into block matrices of shapes [a/qd, b/q] and [b/q, c/q], respectively;
for i ∈ {0, …, qd−1}, j ∈ {0, …, q−1}: compute h = i % p and k = i // p, store A_ij in p_kjh, set C_ij = 0 and store C_ij in p_kjh;
for i ∈ {0, …, p−1}, j ∈ {0, …, p−1}, k ∈ {0, …, d−1}: store B_ij in p_kjh;
for i, j ∈ {0, …, p−1}, k ∈ {0, …, d−1}, and concurrently for each t ∈ {0, …, p−1}: broadcast A_itk from p_itk to p_ijk, broadcast B_tjk from p_tjk to p_ijk, and accumulate C_ijk = C_ijk + A_itk × B_tjk;
merge all C_ijk to obtain the matrix C.
3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into small blocks by rows and columns, so that the multiplication of one large matrix is decomposed into the multiplication of many small matrices;
in the original version of three-dimensional matrix multiplication, each matrix is stored on only one face (a subset of the GPUs), which wastes storage resources; this scheme optimizes the storage and communication algorithm so that matrix storage is spread across all the processors.
In the matrix-vector parameter balancing structure provided by this embodiment of the invention, the algorithm adopts load-balancing optimization for matrix-vector operations: the vector B is stored evenly on the diagonal (i, l, j) of the B plane, and C = A + B is computed;
when the parameter scale per GPU is fixed, the step time of the 3D method is the smallest of the three (3D: 0.672 s; 1D: 1.560 s; 2D: 1.052 s); with the total parameter scale fixed, the 3D method is 2.3 and 1.6 times faster than the 1D and 2D methods, respectively. This embodiment compares weak and strong scaling efficiency: in weak scaling, the problem size (amount of computation) grows with the number of processors, i.e. the parameter scale per GPU is fixed while the number of GPUs increases; in strong scaling, the problem size is kept constant while the number of processors increases, in order to find the processor count best suited to the problem, i.e. the shortest running time without excessive overhead. The average time of 3-dimensional model parallelism is the lowest, 2.32 and 1.57 times faster than the 1- and 2-dimensional methods, respectively.
Data parallelism and sequence parallelism combined with 2/2.5/3-dimensional grid parallelism (2/2.5/3-dimensional model parallelism) constitute 4/4.5/5-dimensional parallelism, which can be further combined with pipeline parallelism into 5/5.5/6-dimensional parallelism.
The specific dimension of the 2/2.5/3-dimensional model parallelism in multi-dimensional grid parallelism is determined by the processor count. Specifically, 2-dimensional model parallelism requires a × a processors, e.g. 2 × 2 = 4, 3 × 3 = 9, 4 × 4 = 16; 2.5-dimensional model parallelism requires a × a × b processors, e.g. 2 × 2 × 1 = 4, 2 × 2 × 2 = 8, 2 × 2 × 3 = 12; 3-dimensional model parallelism requires a × a × a processors, e.g. 2 × 2 × 2 = 8, 3 × 3 × 3 = 27.
Even with the same processor count of 8, the specific operation of 2.5-dimensional model parallelism differs from that of 3-dimensional model parallelism; with 4 processors, 2.5-dimensional model parallelism likewise differs from 2-dimensional model parallelism.
When the processor count is compatible with several model-parallel configurations (for example 64, which fits all three), the specific choice must be tuned further against actual running performance (speed), because different operating environments differ in processor performance, memory, communication bandwidth and processor network topology, and the models and data used by different tasks also vary widely.
Model parallelism in 2/2.5/3 dimensions decomposes the model parameters of the AI task across the processors; since the capacity of a single machine is limited, the capacities of all machines are combined to hold the decomposed model, which on the one hand allows a larger model to be accommodated overall, and on the other reduces parameter communication during computation.
The data of the AI task, such as pictures or sentences, are fed into the model, and the processors communicate with each other during the forward computation, which is equivalent to computing directly on the complete long-sequence data. The forward computation produces an output result, which is compared with the training data label to obtain the loss function value; the gradient is then computed backwards and used to update the model parameters in the next step. Both the forward and the backward computation can run in parallel through the 2/2.5/3-dimensional model parallelism, accelerating the computation.
In a possible implementation of the second embodiment, the extensible optimization module, which uses the optimizer corresponding to the AI task and selects the optimizer according to the attribute of the AI task, further includes:
the optimizer algorithm corresponding to the AI task comprises but is not limited to a LAMB optimizer and/or a LARS optimizer and/or a ConAdv optimizer and/or a La-Lars optimizer;
the LAMB, LARS, ConAdv optimizers are suitable for large batches of training,
the LARS is used for processing of computer vision-related AI tasks;
the LAMB is used for processing a related AI task in natural language;
the ConAdv is suitable for processing AI tasks with high speed requirements and low precision requirements;
the La-Lars is suitable for processing AI tasks with narrow communication bandwidth and high network communication cost.
Although data parallelism can increase training speed by enlarging the (equivalent) batch size, large batches make optimization difficult, and an optimizer designed specifically for large batches must be used to ensure good convergence. LAMB/LARS/ConAdv are all suited to large-batch training: LARS is best suited to computer vision tasks (extending the batch size of CV tasks to 32K); LAMB is best suited to natural language processing tasks (extending the batch size of NLP tasks to 64K); and ConAdv suits CV tasks that pursue extreme speed with slightly relaxed precision requirements (extending the CV batch size to 96K with a slight loss of accuracy).
In addition, data parallelism requires gradients to be transmitted over the network so that model parameters are updated synchronously, and the communication volume is extremely large (proportional to the model size, i.e. the number of model parameters), especially for today's ever-larger models. If the communication bandwidth of the system (the amount of data that can be transmitted simultaneously) is small, the running speed is severely slowed, so a large-batch optimizer with a small communication volume must be chosen.
The LAMB and/or LARS and/or ConAdv and/or La-Lars optimizers are the extensible large-scale optimizers required for training large AI models, and different optimizers can be selected as needed: LAMB/LARS/ConAdv all suit large-batch training, with LARS best for computer vision tasks, LAMB best for natural language processing tasks, and ConAdv further extending the maximum batch size of computer vision training. APS and La-Lars suit situations where the communication bandwidth is narrow and network communication cost becomes the bottleneck: APS mainly uses low-precision gradients, while La-Lars mainly uses gradient compression. APS may require only about 1/4 of the communication volume with almost no loss of accuracy; La-Lars further compresses the communication volume to roughly one thousandth to accommodate narrow communication bandwidth, at a slight cost in accuracy.
Experimental statistics for the LAMB algorithm show that under mixed-batch-size training (64k/32k) ADAMW cannot converge, while LAMB reaches an acceleration efficiency of 101.8% (with 64 times the computing resources, computation speed increases 65.2 times).
La-Lars is a gradient sparsification algorithm (see fig. 7): only the important gradients are sent in each gradient exchange, while the remaining gradients accumulate locally and are transmitted in the future.
To speed up training, one of the simplest methods is to increase the number of compute nodes; however, when the number of nodes is large, network communication cost becomes a bottleneck, and when the batch size exceeds a certain size, the generalization performance of the neural network may deteriorate.
LARS addresses the performance degradation caused by large-batch deep learning training. It is a layer-wise adaptive rate scaling optimizer that can extend the batch size to 32K without loss of performance. However, because of the sparse representation of gradients and local gradient accumulation, this scheme cannot simply use DGC and LARS together, as doing so leads to the gradient staleness problem.
This scheme therefore proposes the LA-LARS algorithm, which converges faster and loses less performance than directly using DGC and LARS together. On the MNIST and CIFAR-10 datasets, LA-LARS outperforms other baseline optimizers while guaranteeing a 0.1% compression ratio; on the ImageNet dataset, it requires only 60%-70% of the training time to reach performance similar to the baseline optimizers.
The second embodiment is a system embodiment corresponding to the first embodiment, and the two can be implemented in cooperation with each other. The related technical details mentioned in the first embodiment remain valid in the second embodiment and, to reduce repetition, are not described again here; correspondingly, the related technical details mentioned in the second embodiment also apply to the first embodiment.
In a third embodiment, the present application provides an artificial intelligence based distributed training device, which comprises:
a memory for storing instructions for execution by one or more processors of the system, and
A processor, being one of the processors of the system, for executing the instructions to implement any one of the possible artificial intelligence based distributed training and reasoning methods of the first aspect described above.
In a fourth embodiment, the present application provides a computer-readable storage medium encoded with a computer program, wherein the computer-readable storage medium has instructions stored thereon, and when the instructions are executed on a computer, the instructions cause the computer to perform any one of the possible artificial intelligence based distributed training and reasoning methods of the first aspect.
It should be noted that the method embodiments of the present application can be implemented in software, hardware, firmware, and the like. Whether implemented in software, hardware, or firmware, the instruction code may be stored in any type of computer-accessible memory (e.g., permanent or modifiable, volatile or non-volatile, solid or non-solid, fixed or removable media, etc.). Also, the Memory may be, for example, Programmable Array Logic (PAL), Random Access Memory (RAM), Programmable Read Only Memory (PROM), Read-Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic disk, an optical disk, a Digital Versatile Disk (DVD), or the like.
It should be noted that, all units/modules mentioned in the embodiments of the apparatuses in this application are logic units/modules, and physically, a logic unit may be a physical unit, or a part of a physical unit, or may be implemented by a combination of multiple physical units, where the physical implementation manner of the logic unit itself is not the most important, and the combination of the functions implemented by the logic units is the key to solve the technical problem provided by this application. In addition, in order to highlight the innovative part of the present application, the above-mentioned embodiments of the apparatus of the present application do not introduce elements that are not so closely related to solve the technical problems proposed by the present application, which does not indicate that there are no other elements in the above-mentioned embodiments of the apparatus.
It is to be noted that in the claims and the description of the present patent, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the use of the verb "comprise a" to define an element does not exclude the presence of another, same element in a process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (10)

1. An artificial intelligence based distributed training and reasoning method for a hardware processor, the method being implemented on a software platform using a machine learning library;
characterized in that the method comprises the steps of:
acquiring task parameters of a plurality of AI tasks, acquiring a scheduling decision according to the task parameters of the AI tasks, and distributing the AI tasks to a plurality of hardware processors to obtain computing resources of the AI tasks;
acquiring the computing resources of the AI tasks distributed to the plurality of hardware processors, executing multidimensional parallel processing on the training tasks of the AI tasks on the respective hardware processors, and acquiring the output result of the AI tasks;
acquiring a parallel processing result of the AI task after executing the parallel processing, calculating a gradient according to a current output result of a model for a training task of the AI task, optimizing the AI task by adopting an optimizer corresponding to the AI task to obtain an optimized AI model parameter, and continuously updating the iteration model parameter until a target iteration number is reached or the training result meets the requirement;
an optimization algorithm is used in the distribution process to optimize scheduling decisions;
the parallel processing mode comprises data parallel, sequence parallel, pipeline parallel and multidimensional grid parallel processing;
the AI task includes a training task and an inference task.
2. The distributed training and reasoning method based on artificial intelligence of claim 1, wherein the obtaining of the parallel processing result of the AI task after performing the parallel processing, calculating a gradient according to a current output result of a model for the training task of the AI task, performing optimization processing on the AI task by using an optimizer corresponding to the AI task to obtain an optimized AI model parameter, and continuously updating the iteration model parameter until a target iteration number is reached or the training result meets a requirement, further comprises:
fine adjustment and prediction are carried out on AI model parameters of the AI task processed by the optimizer, the model is continuously trained aiming at specific application through fine adjustment, and finally the trained model is deployed to carry out inference of actual application;
the training task of the AI task is executed with multidimensional parallel processing on respective hardware processors, and the process of obtaining the output result of the AI task further comprises the following steps:
completing data migration of the AI task between the hardware processors by partitioning and/or offloading optimizer states, gradients and model parameters;
the AI task includes a picture processing task and/or a natural language processing task.
3. The distributed training and reasoning method based on artificial intelligence according to claim 1, wherein the obtaining of the parallel processing result of the AI task after performing the parallel processing, calculating a gradient according to a current output result of a model for the training task of the AI task, performing optimization processing on the AI task by using an optimizer corresponding to the AI task to obtain an optimized AI model parameter, and continuously updating the iteration model parameter until a target iteration number is reached or the training result meets a requirement specifically comprises:
the data parallelly distributes the AI tasks to the hardware processors to obtain the total batch size of data which are processed by all the hardware processors at the same time and the batch size of data processed by each hardware processor each time;
the sequence parallelism further partitions and/or offloads and distributes the data, and each AI task is placed on a plurality of processors;
the pipeline parallelism is realized by splitting the model into a plurality of sections, deploying each section in different hardware processors, connecting the sections in series according to the model sequence, and taking the output of the previous section as the input of the next section;
the multi-dimensional grid parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
4. The distributed training and reasoning method based on artificial intelligence according to claim 1, wherein the step of obtaining a parallel processing result of the AI task after performing the parallel processing, calculating a gradient according to a current output result of a model for the training task of the AI task, performing optimization processing on the AI task by using an optimizer corresponding to the AI task to obtain an optimized AI model parameter, and continuously updating the model parameter until a target iteration number is reached or the training result meets a requirement specifically comprises:
the optimizer algorithm corresponding to the AI task comprises but is not limited to a LAMB optimizer and/or a LARS optimizer and/or a ConAdv optimizer and/or a La-Lars optimizer;
the LAMB, LARS, ConAdv optimizers are suitable for large batches of training,
the LARS is used for processing of computer vision-related AI tasks;
the LAMB is used for processing natural language-related AI tasks;
the ConAdv is suitable for processing AI tasks with high speed requirements and low precision requirements;
the La-Lars is suitable for processing AI tasks with narrow communication bandwidth and high network communication cost.
5. An artificial intelligence based distributed training and reasoning system for a hardware processor, said system executing on a software platform using a machine learning library for processing a plurality of application data;
the hardware processor includes but is not limited to: CPU, GPU, FPGA, TPU;
characterized in that the system comprises:
the scheduling module is used for acquiring task parameters of a plurality of AI tasks, acquiring scheduling decisions according to the task parameters of the AI tasks, and distributing the AI tasks to a plurality of hardware processors to obtain computing resources of the AI tasks;
the multidimensional parallel module is used for acquiring the computing resources of the AI tasks distributed to the hardware processors, executing multidimensional parallel processing on the hardware processors of the training tasks of the AI tasks and acquiring the output result of the AI tasks;
the extensible optimization module is used for acquiring a parallel processing result of the AI task after parallel processing is executed, calculating a gradient according to a current output result of a model aiming at a training task of the AI task, optimizing the AI task by adopting an optimizer corresponding to the AI task to obtain an optimized AI model parameter, and continuously updating the iteration model parameter until a target iteration number is reached or the training result meets the requirement;
an optimization algorithm is used in the distribution process to optimize scheduling decisions;
the parallel processing mode comprises data parallel, sequence parallel, pipeline parallel and multidimensional grid parallel processing;
the AI task includes a training task and an inference task.
6. An artificial intelligence based distributed training and reasoning system as claimed in claim 5, further comprising:
the fine-tuning and reasoning module is used for fine-tuning and predicting the AI model parameters of the AI task processed by the optimizer, continuing to train the model aiming at specific application through fine tuning, and finally deploying the trained model to carry out reasoning on actual application;
the dynamic memory disk management module completes data migration of the AI task between the hardware processors by partitioning and/or offloading the optimizer states, gradients and model parameters;
the AI task includes a picture processing task and/or a natural language processing task.
7. The distributed artificial intelligence-based training and reasoning system of claim 5, wherein said multidimensional parallel module obtains said computation resources assigned to the AI tasks on the plurality of hardware processors, performs multidimensional parallel processing on the respective hardware processors for the training tasks of the AI tasks, and obtains the output results of the AI tasks, further comprising:
the data parallelly distributes the AI tasks to the hardware processors to obtain the total batch size of data which are processed by all the hardware processors at the same time and the batch size of data processed by each hardware processor each time;
the sequence parallelism further partitions and/or offloads and distributes the data, and each AI task is placed on a plurality of processors;
the pipeline parallelism is realized by splitting the model into a plurality of sections, deploying each section in different hardware processors, connecting the sections in series according to the model sequence, and taking the output of the previous section as the input of the next section;
the multi-dimensional grid parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
8. The distributed training and reasoning system based on artificial intelligence of claim 5, wherein the extensible optimization module obtains a parallel processing result of the AI task after performing the parallel processing, calculates a gradient according to a current output result of the model for the training task of the AI task, optimizes the AI task by using an optimizer corresponding to the AI task to obtain an optimized AI model parameter, and continuously updates the model parameter until a target iteration number is reached or the training result meets a requirement, further comprising:
the optimizer algorithm corresponding to the AI task comprises but is not limited to a LAMB optimizer and/or a LARS optimizer and/or a ConAdv optimizer and/or a La-Lars optimizer;
the LAMB, LARS, ConAdv optimizers are suitable for large batches of training,
the LARS is used for processing of computer vision-related AI tasks;
the LAMB is used for processing natural language-related AI tasks;
the ConAdv is suitable for processing AI tasks with high speed requirements and low precision requirements;
the La-Lars is suitable for processing AI tasks with narrow communication bandwidth and high network communication cost.
9. An artificial intelligence based distributed training apparatus, comprising:
a memory for storing instructions for execution by one or more processors of the system, an
A processor, being one of the processors of the system, for executing the instructions to implement the artificial intelligence based distributed training and reasoning method of any one of claims 1-4.
10. A computer-readable storage medium encoded with a computer program, having instructions stored thereon, which, when executed on a computer, cause the computer to perform the artificial intelligence based distributed training and reasoning method of any of claims 1-4.
CN202111204831.7A 2021-10-15 2021-10-15 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence Pending CN114035937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111204831.7A CN114035937A (en) 2021-10-15 2021-10-15 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111204831.7A CN114035937A (en) 2021-10-15 2021-10-15 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN114035937A true CN114035937A (en) 2022-02-11

Family

ID=80135039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111204831.7A Pending CN114035937A (en) 2021-10-15 2021-10-15 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN114035937A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134636A (en) * 2018-02-09 2019-08-16 中兴通讯股份有限公司 Model training method, server and computer readable storage medium
CN112154462A (en) * 2018-05-23 2020-12-29 微软技术许可有限责任公司 High performance pipeline parallel deep neural network training
CN110795228A (en) * 2018-08-03 2020-02-14 伊姆西Ip控股有限责任公司 Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets
CN109902818A (en) * 2019-01-15 2019-06-18 中国科学院信息工程研究所 A kind of distributed accelerated method and system towards deep learning training mission
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
CN111882060A (en) * 2020-07-20 2020-11-03 中国人民解放军国防科技大学 Single-step delay stochastic gradient descent training method for machine learning
CN111858058A (en) * 2020-07-24 2020-10-30 成都成信高科信息技术有限公司 SGD load balancing method and device based on parallel computing and storage medium
CN111858072A (en) * 2020-08-06 2020-10-30 华中科技大学 Resource management method and system for large-scale distributed deep learning
CN112784968A (en) * 2021-01-29 2021-05-11 东南大学 Hybrid pipeline parallel method for accelerating distributed deep neural network training

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QIFAN XU ET AL.: "An Efficient 2D Method for Training Super-Large Deep Learning Models", arXiv, 12 April 2021 (2021-04-12), pages 1-11 *
ZHENGDA BIAN ET AL.: "Maximizing Parallelism in Distributed Training for Huge Neural Networks", arXiv, 30 May 2021 (2021-05-30), pages 1-11 *
ZHENGDA BIAN ET AL.: "Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters", arXiv, 8 August 2021 (2021-08-08), pages 1-12 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676761A (en) * 2022-03-10 2022-06-28 北京智源人工智能研究院 Pre-training model training processing method and device, electronic equipment and storage medium
CN114676761B (en) * 2022-03-10 2024-03-19 北京智源人工智能研究院 Pre-training model training processing method and device, electronic equipment and storage medium
CN114780225A (en) * 2022-06-14 2022-07-22 支付宝(杭州)信息技术有限公司 Distributed model training system, method and device
CN114780225B (en) * 2022-06-14 2022-09-23 支付宝(杭州)信息技术有限公司 Distributed model training system, method and device
WO2023240845A1 (en) * 2022-06-15 2023-12-21 苏州元脑智能科技有限公司 Distributed computation method, system and device, and storage medium
CN115248728A (en) * 2022-09-21 2022-10-28 之江实验室 Distributed training task scheduling method, system and device for intelligent computing
WO2024060788A1 (en) * 2022-09-21 2024-03-28 之江实验室 Intelligent-computing-oriented adaptive adjustment system and method for pipeline-parallel training
CN115660034A (en) * 2022-10-28 2023-01-31 北京百度网讯科技有限公司 Distributed model training method, device and system
CN115660034B (en) * 2022-10-28 2023-08-15 北京百度网讯科技有限公司 Distributed model training method, device and system
CN115511086A (en) * 2022-11-03 2022-12-23 上海人工智能创新中心 Distributed reasoning deployment system for super large model
CN115511086B (en) * 2022-11-03 2024-05-24 上海人工智能创新中心 Distributed reasoning deployment system for oversized model
CN116739090B (en) * 2023-05-12 2023-11-28 北京大学 Deep neural network reasoning measurement method and device based on Web browser
CN116739090A (en) * 2023-05-12 2023-09-12 北京大学 Deep neural network reasoning measurement method and device based on Web browser
CN116521380A (en) * 2023-07-05 2023-08-01 之江实验室 Resource self-adaptive collaborative model training acceleration method, device and equipment
CN117333067A (en) * 2023-10-12 2024-01-02 苏州市职业大学(苏州开放大学) Intelligent physical education data management method and system
CN117333067B (en) * 2023-10-12 2024-04-05 苏州市职业大学(苏州开放大学) Intelligent physical education data management method and system

Similar Documents

Publication Publication Date Title
CN114035937A (en) Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
US20220391665A1 (en) Method for splitting neural network model by using multi-core processor, and related product
WO2022037337A1 (en) Distributed training method and apparatus for machine learning model, and computer device
CN114035936B (en) Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence
US20220121903A1 (en) Method of performing splitting in neural network model by means of multi-core processor, and related product
CN111819580A (en) Neural architecture search for dense image prediction tasks
Han et al. Signal processing and networking for big data applications
CN111709493B (en) Object classification method, training device, object classification equipment and storage medium
CN114402293A (en) Pipelined neural network processing with continuous and asynchronous updates
CN114503125A (en) Structured pruning method, system and computer readable medium
CN113449839A (en) Distributed training method, gradient communication device and computing equipment
US20230206083A1 (en) Optimizing gradient boosting feature selection
CN116057518A (en) Automatic query predicate selective prediction using machine learning model
US11893691B2 (en) Point cloud geometry upsampling
CN115080248A (en) Scheduling optimization method for scheduling device, and storage medium
WO2023160290A1 (en) Neural network inference acceleration method, target detection method, device, and storage medium
CN115412401B (en) Method and device for training virtual network embedding model and virtual network embedding
CN115082840B (en) Action video classification method and device based on data combination and channel correlation
WO2022223052A1 (en) Accelerator, computer system, and method
CN116957041A (en) Method, device and computing equipment for compressing neural network model
CN114817845B (en) Data processing method, device, electronic equipment and storage medium
CN114138484A (en) Resource allocation method, device and medium
CN112506652B (en) Dynamic resource partitioning method
CN114661936B (en) Image retrieval method applied to industrial vision and electronic equipment
CN116489678A (en) Communication optimization method and device of deep learning model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination