WO2021142614A1 - Chip state determining method and device, and cluster resource scheduling method and device - Google Patents

Chip state determining method and device, and cluster resource scheduling method and device

Info

Publication number
WO2021142614A1
Authority
WO
WIPO (PCT)
Prior art keywords
target chip
frequency
chip
information
availability information
Prior art date
Application number
PCT/CN2020/071989
Other languages
French (fr)
Chinese (zh)
Inventor
任峰
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN202080093867.XA priority Critical patent/CN114981778A/en
Priority to PCT/CN2020/071989 priority patent/WO2021142614A1/en
Publication of WO2021142614A1 publication Critical patent/WO2021142614A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • This application relates to cluster scheduling technology in the field of artificial intelligence, and more specifically, to a method for determining the state of a chip, a method for scheduling cluster resources, and an apparatus thereof.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
  • Graphics processing units (GPUs) and neural network processing units (NPUs) have excellent data-level parallel computing capabilities, which gives GPU/NPU clusters very large-scale data-level parallel computing capabilities. GPU/NPU clusters can therefore be used to train models in machine learning.
  • GPU/NPU clusters can be used to train neural network (NN) models.
  • The training data of the neural network model can be divided, and multiple chips in the GPU/NPU cluster (i.e., multiple GPU/NPU chips) can be used to train the neural network model at the same time.
  • This application provides a method for determining the state of a chip, a method for scheduling cluster resources and a device thereof, which can improve the performance of GPU/NPU cluster training.
  • A method for determining the state of a chip includes: acquiring status information of a target chip in a cluster; determining availability information of the target chip according to the status information, where the availability information is used to indicate whether the target chip is available; and outputting the availability information.
  • The availability information of the target chip is determined directly from the status information of the target chip, so the availability information can accurately reflect the working status of the target chip. Allocating training tasks (or training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
  • Acquiring the status information of the target chip in the cluster may refer to acquiring one or more items of status information of the target chip at the same time, or in other words, at approximately the same time.
  • the state information (one or more of the above) of the target chip in the cluster is acquired within a certain time interval.
  • In this way, the acquired status information of the target chip is more consistent, that is, all states in the status information are acquired at the same time or at approximately the same time, so that the status information of the target chip more accurately reflects the working status of the target chip.
  • the target chip may be an ASIC, for example, the ASIC chip may be a GPU or an NPU.
  • the target chip may also be another chip that can be used to accelerate neural network operations, which is not limited in the embodiment of the present application.
  • The status information includes at least one of the model of the target chip, the utilization rate of the target chip, the bandwidth of the target chip, the memory utilization rate of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
  • the availability information of the target chip can be determined more accurately based on the state of the above multiple dimensions.
  • Allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • the availability information includes at least one of normal, general alarm, major alarm, and emergency alarm.
  • the availability of the target chip can be more intuitively indicated through status information such as normal, general alarm, important alarm, and emergency alarm.
  • the availability information can be configured in advance, and the status item in the availability information can be defined as: 0 means normal, 1 means general alarm, 2 means important alarm, and 3 means urgent alarm.
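  • The numeric coding described above can be sketched as a small enumeration. This is only an illustration: the text gives the four codes (0 to 3) but no type names, so `Availability` and `is_schedulable` are hypothetical names, and the policy of accepting chips up to a configurable alarm level is an assumption rather than something the application specifies.

```python
from enum import IntEnum


class Availability(IntEnum):
    """Availability codes as listed in the text: 0 normal .. 3 urgent alarm."""
    NORMAL = 0
    GENERAL_ALARM = 1
    IMPORTANT_ALARM = 2
    URGENT_ALARM = 3


def is_schedulable(code: int, max_level: Availability = Availability.NORMAL) -> bool:
    """Hypothetical policy: accept a chip whose alarm level is at most max_level."""
    return Availability(code) <= max_level


if __name__ == "__main__":
    for raw in (0, 1, 3):
        print(raw, Availability(raw).name, is_schedulable(raw))
```

  • Using an integer enumeration keeps the pre-configured codes comparable, so a scheduler could tolerate general alarms for low-priority jobs by passing a higher `max_level`.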
  • The method further includes: acquiring frequency information of the target chip, where the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  • the method further includes: outputting frequency information of the target chip.
  • outputting the frequency information of the target chip can enable a server (for example, a cluster scheduling server) to conveniently obtain the frequency status of the target chip.
  • Determining the availability information of the target chip according to the status information includes: determining the availability information of the target chip according to the status information and the frequency information.
  • the availability information of the target chip can be determined more accurately based on the status information and the frequency information.
  • Allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • The method further includes: determining frequency availability information of the target chip according to the frequency information, where the frequency availability information is used to indicate whether the frequency of the target chip meets a preset condition; and outputting the frequency availability information.
  • outputting the frequency availability information enables a server (for example, a cluster scheduling server) to directly obtain the frequency availability of the target chip.
  • The server can then directly allocate training tasks (or training data) according to the state of the target chip, without having to determine the frequency availability itself from the frequency information of the target chip, so the burden on the server can be reduced.
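  • A minimal sketch of deriving frequency availability on the chip side follows. The text only says that the frequency must meet "a preset condition" without defining it, so the condition used here (real-time and average frequency both at least a fraction `min_ratio` of the rated frequency) is an assumption, and `FrequencyInfo`, `frequency_available`, and the 0.9 default are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrequencyInfo:
    rated_mhz: float     # rated frequency of the target chip
    realtime_mhz: float  # real-time frequency of the target chip
    average_mhz: float   # average frequency over the preset time interval


def frequency_available(info: FrequencyInfo, min_ratio: float = 0.9) -> bool:
    """Assumed preset condition: real-time and average frequency are both
    at least min_ratio of the rated frequency."""
    if info.rated_mhz <= 0:
        raise ValueError("rated frequency must be positive")
    floor = min_ratio * info.rated_mhz
    return info.realtime_mhz >= floor and info.average_mhz >= floor


if __name__ == "__main__":
    print(frequency_available(FrequencyInfo(1000.0, 950.0, 940.0)))
    print(frequency_available(FrequencyInfo(1000.0, 700.0, 940.0)))
```

  • Reporting only the resulting boolean matches the stated goal: the scheduling server consumes a ready-made verdict instead of recomputing it from raw frequencies.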
  • A method for scheduling cluster resources includes: obtaining availability information of a target chip in a cluster, where the availability information is used to indicate whether the target chip is available; and allocating the target chip according to the availability information.
  • the availability information of the target chip in the cluster is directly obtained, and the working status of the chip can be determined more accurately according to the availability information of the target chip.
  • Allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • the availability information of the target chip may be determined according to at least one of the following items:
  • the availability information of the target chip can be determined more accurately based on the state of the above multiple dimensions.
  • Allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • the availability information includes at least one of normal, general alarm, major alarm, and emergency alarm.
  • the availability of the target chip can be determined more accurately according to the normal, general, important, and emergency status information included in the availability information.
  • the availability information can be configured in advance, and the status item in the availability information can be defined as: 0 means normal, 1 means general alarm, 2 means important alarm, and 3 means urgent alarm.
  • The method further includes: acquiring frequency information of the target chip, where the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  • Allocating the target chip according to the availability information includes: determining, according to the availability information, the frequency information, and the allocated chips in the cluster, the influence of the target chip on the computing performance of the cluster; and allocating the target chip if it is determined that the influence is that the computing performance of the cluster does not decrease.
  • Allocating training tasks (or training data) according to the state of the target chip (for example, its availability information) can improve the performance of GPU/NPU cluster training.
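  • The allocation rule above can be sketched as follows. The text does not say how the influence on cluster computing performance is estimated, so this sketch assumes that the slowest allocated chip bounds the pace of a synchronous training job and treats "the target would not become the slowest chip" as a proxy for "performance does not decrease". `ChipReport` and `should_allocate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ChipReport:
    chip_id: str
    availability: int   # 0 = normal, 1..3 = alarm levels, as coded in the text
    average_mhz: float  # average frequency over the preset time interval


def should_allocate(target: ChipReport, allocated: Sequence[ChipReport]) -> bool:
    """Allocate the target only if it is normal and would not become the
    slowest chip of the job (assumed proxy for 'performance does not decrease')."""
    if target.availability != 0:
        return False
    if not allocated:
        return True
    slowest = min(chip.average_mhz for chip in allocated)
    return target.average_mhz >= slowest


if __name__ == "__main__":
    job = [ChipReport("a", 0, 1000.0), ChipReport("b", 0, 980.0)]
    print(should_allocate(ChipReport("c", 0, 990.0), job))
    print(should_allocate(ChipReport("d", 0, 800.0), job))
```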
  • The method further includes: obtaining frequency availability information of the target chip; and allocating the target chip according to the availability information includes: allocating the target chip according to the availability information and the frequency availability information.
  • The working status of the chip can be determined more accurately. Therefore, allocating training tasks (or training data) according to the availability information and the frequency availability information can improve the performance of GPU/NPU cluster training.
  • a device for determining the status of a chip including: an acquiring unit, configured to acquire status information of a target chip in a cluster; and a determining unit, configured to determine availability information of the target chip according to the status information, The availability information is used to indicate whether the target chip is available; an output unit is used to output the availability information.
  • The availability information of the target chip is determined directly from the status information of the target chip, so the availability information can accurately reflect the working status of the target chip. Allocating training tasks (or training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
  • Acquiring the status information of the target chip in the cluster may refer to acquiring one or more items of status information of the target chip at the same time, or in other words, at approximately the same time.
  • the state information (one or more of the above) of the target chip in the cluster is acquired within a certain time interval.
  • In this way, the acquired status information of the target chip is more consistent, that is, all states in the status information are acquired at the same time or at approximately the same time, so that the status information of the target chip more accurately reflects the working status of the target chip.
  • the target chip may be an ASIC, for example, the ASIC chip may be a GPU or an NPU.
  • the target chip may also be another chip that can be used to accelerate neural network operations, which is not limited in the embodiment of the present application.
  • The status information includes at least one of the model of the target chip, the utilization rate of the target chip, the bandwidth of the target chip, the memory utilization rate of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
  • the availability information of the target chip can be determined more accurately based on the state of the above multiple dimensions.
  • Allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • the availability information includes at least one of normal, general alarm, major alarm, and emergency alarm.
  • the availability of the target chip can be more intuitively indicated through status information such as normal, general alarm, important alarm, and emergency alarm.
  • the availability information can be configured in advance, and the status item in the availability information can be defined as: 0 means normal, 1 means general alarm, 2 means important alarm, and 3 means urgent alarm.
  • The acquiring unit is further configured to acquire frequency information of the target chip, where the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  • the output unit is further configured to output frequency information of the target chip.
  • outputting the frequency information of the target chip can enable a server (for example, a cluster scheduling server) to conveniently obtain the frequency status of the target chip.
  • the determining unit is specifically configured to determine the availability information of the target chip according to the state information and the frequency information.
  • the availability information of the target chip can be determined more accurately based on the status information and the frequency information.
  • Allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • The determining unit is further configured to determine frequency availability information of the target chip according to the frequency information, where the frequency availability information is used to indicate whether the frequency of the target chip meets a preset condition; and the output unit is further configured to output the frequency availability information.
  • outputting the frequency availability information enables a server (for example, a cluster scheduling server) to directly obtain the frequency availability of the target chip.
  • The server can then directly allocate training tasks (or training data) according to the state of the target chip, without having to determine the frequency availability itself from the frequency information of the target chip, so the burden on the server can be reduced.
  • An apparatus for scheduling cluster resources includes: an obtaining unit, configured to obtain availability information of a target chip in the cluster, where the availability information is used to indicate whether the target chip is available; and an allocation unit, configured to allocate the target chip according to the availability information.
  • the availability information of the target chip in the cluster is directly obtained, and the working status of the chip can be determined more accurately according to the availability information of the target chip.
  • Allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • the availability information of the target chip may be determined according to at least one of the following items:
  • the availability information of the target chip can be determined more accurately based on the state of the above multiple dimensions.
  • Allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • the availability information includes at least one of normal, general alarm, major alarm, and emergency alarm.
  • the availability of the target chip can be determined more accurately according to the normal, general, important, and emergency status information included in the availability information.
  • the availability information can be configured in advance, and the status item in the availability information can be defined as: 0 means normal, 1 means general alarm, 2 means important alarm, and 3 means urgent alarm.
  • The acquiring unit is further configured to acquire frequency information of the target chip, where the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  • The allocating unit is specifically configured to: determine, according to the availability information, the frequency information, and the allocated chips in the cluster, the influence of the target chip on the computing performance of the cluster; and allocate the target chip when it is determined that the influence is that the computing performance of the cluster does not decrease.
  • Allocating training tasks (or training data) according to the state of the target chip (for example, its availability information) can improve the performance of GPU/NPU cluster training.
  • The acquiring unit is further configured to acquire frequency availability information of the target chip; and the allocating unit is specifically configured to allocate the target chip according to the availability information and the frequency availability information.
  • The working status of the chip can be determined more accurately. Therefore, allocating training tasks (or training data) according to the availability information and the frequency availability information can improve the performance of GPU/NPU cluster training.
  • A device for determining the state of a chip includes: a memory for storing a program; and a processor for executing the program stored in the memory. When the program stored in the memory is executed, the processor is used to execute the method in any one of the implementations of the foregoing first aspect.
  • the processor in the fifth aspect mentioned above can be either a central processing unit (CPU), or a combination of a CPU and a neural network processing unit.
  • The neural network processing unit here can include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on.
  • TPU is an artificial intelligence accelerator application specific integrated circuit fully customized by Google for machine learning.
  • An apparatus for scheduling cluster resources includes: a memory for storing a program; and a processor for executing the program stored in the memory. When the program stored in the memory is executed, the processor is used to execute the method in any one of the implementations of the foregoing second aspect.
  • the processor in the sixth aspect mentioned above can be either a central processing unit (CPU), or a combination of a CPU and a neural network computing processor.
  • The neural network computing processor here can include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on.
  • TPU is an artificial intelligence accelerator application specific integrated circuit fully customized by Google for machine learning.
  • A computer-readable medium stores program code for device execution, and the program code is used to execute the method in any one of the implementations of the first aspect or the second aspect.
  • A computer program product containing instructions is provided.
  • When the computer program product runs on a computer, the computer executes the method in any one of the implementations of the first aspect or the second aspect.
  • In a ninth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory and executes the method in any one of the implementations of the first aspect or the second aspect.
  • The chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory.
  • The processor is configured to execute the method in any one of the implementations of the first aspect or the second aspect.
  • the aforementioned chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • the foregoing chip may be the device for determining the state of the chip in any implementation manner of the foregoing third aspect.
  • In a tenth aspect, an electronic device is provided. The electronic device includes the chip in the ninth aspect.
  • the above-mentioned electronic device may be a terminal device or a server.
  • The availability information of the target chip is determined directly from the status information of the target chip, so the availability information can accurately reflect the working status of the target chip. Allocating training tasks (or training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
  • FIG. 1 is a schematic structural diagram of an application scenario provided by an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a method for determining a chip state provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a method for scheduling cluster resources provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for scheduling cluster resources provided by an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of a device for determining the state of a chip provided by an embodiment of the present application
  • Fig. 6 is a schematic structural diagram of an apparatus for scheduling cluster resources provided by an embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of an application scenario provided by an embodiment of the present application.
  • The method and device (for example, a chip or a server) in the embodiment of the present application can be applied to the application scenario shown in FIG. 1.
  • the application scenario shown in Fig. 1 is only an example and not a limitation.
  • The application scenario in FIG. 1 may also include more or fewer nodes (for example, cluster scheduling server master control nodes or cluster scheduling agent nodes), servers, chips, or device drivers, which is not limited in the embodiment of the present application.
  • the application scenario shown in Figure 1 is an example of a chip cluster.
  • The chip cluster may include a cluster scheduling server master control node (cluster scheduler management server, CSM) and p servers. A cluster scheduling agent (cluster scheduler agent, CSA), a device driver, and m chips can be deployed on server i (that is, the i-th server among the p servers), where i, m, and p are all positive integers.
  • CSM can be responsible for receiving and issuing training tasks, and overall management and scheduling of chip resources in the chip cluster.
  • The CSM can be deployed on a separate server in the chip cluster. The CSM can communicate with the CSA and, by controlling the CSA, implement the scheduling of chip resources and the issuance of training tasks.
  • The CSA can be deployed on a server (for example, server 1 to server p shown in FIG. 1). The CSA can communicate with the chips on the server (for example, chip 1 to chip m shown in FIG. 1), so that information such as the working status and frequency of the chips can be obtained.
  • chip cluster in FIG. 1 may include one server or multiple servers (p is greater than or equal to 2), which is not limited in the embodiment of the present application.
  • the device driver may be as shown in FIG. 1, one device driver is deployed on a server, and all chips on the server communicate with CSA through the device driver; or, multiple device drivers are deployed on one server, for example, There are m device drivers deployed on the server, and the m chips on the server are respectively connected to the m device drivers (that is, the m chips correspond to the m device drivers one-to-one). Each chip can communicate with the CSA through a device driver connected to the chip, which is not limited in the embodiment of the present application.
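  • The topology described above (one CSM, p servers, and m chips per server reached through a CSA and device driver) can be modeled with a few data classes. This is a sketch under assumptions: the class and function names are hypothetical, every server is given the same number of chips for simplicity, and the CSA and device driver are not modeled as separate objects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Chip:
    chip_id: int


@dataclass
class Server:
    server_id: int
    chips: List[Chip] = field(default_factory=list)  # m chips behind the CSA/device driver


@dataclass
class Cluster:
    servers: List[Server] = field(default_factory=list)  # p servers managed by the CSM

    def all_chips(self) -> List[Tuple[int, int]]:
        """Return (server_id, chip_id) pairs for every chip in the cluster."""
        return [(s.server_id, c.chip_id) for s in self.servers for c in s.chips]


def build_cluster(p: int, m: int) -> Cluster:
    """Build p servers with m chips each; both must be positive integers."""
    if p < 1 or m < 1:
        raise ValueError("p and m must be positive integers")
    return Cluster([Server(i, [Chip(j) for j in range(1, m + 1)])
                    for i in range(1, p + 1)])


if __name__ == "__main__":
    print(build_cluster(2, 3).all_chips())
```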
  • The chip cluster may include application specific integrated circuit (ASIC) chips.
  • The ASIC chip in the chip cluster shown in FIG. 1 may be a graphics processing unit (GPU) or a neural network processing unit (NPU).
  • the chip cluster in the embodiment of the present application may also include other chips that can be used to accelerate neural network operations, which is not limited in the present application.
  • the chip clusters are collectively referred to as GPU/NPU clusters, and the chips in the chip clusters are collectively referred to as GPU/NPU chips.
  • the training data of the neural network model can be divided, and multiple GPU/NPU chips in the GPU/NPU cluster can be used to train the neural network model at the same time.
  • The states of each chip in the GPU/NPU cluster (that is, each GPU/NPU chip) in dimensions such as model, utilization, bandwidth, memory utilization, power, and temperature need to be obtained separately, and training tasks (or training data) are allocated to each chip in the GPU/NPU cluster based on a preset execution strategy of the GPU/NPU cluster.
  • The embodiment of the present application proposes a method for determining the state of a chip and a method for scheduling cluster resources. The integrated state of the chip (that is, the availability information of the chip) is determined based on the states of multiple dimensions of the chip, and allocating training tasks (or training data) according to this integrated state can improve the performance of GPU/NPU cluster training.
  • FIG. 2 shows a schematic flowchart of a method 200 for determining a chip state provided by an embodiment of the present application.
  • the method may include step 210, step 220, and step 230.
  • the method can be executed by any chip or any device driver in the chip cluster shown in FIG. 1, or the method can also be executed by other chips, devices, or software (for example, the driver interface of the chip), This is not limited in the embodiments of the present application.
  • S210 Acquire status information of the target chip in the cluster.
  • The status information may include one or more of the model of the target chip, the utilization rate of the target chip, the bandwidth of the target chip, the memory utilization rate of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
  • the availability information of the target chip can be determined more accurately based on the state of the above multiple dimensions.
  • Allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • Acquiring the status information of the target chip in the cluster may refer to acquiring one or more items of status information of the target chip at the same time, or in other words, at approximately the same time.
  • the state information (one or more of the above) of the target chip in the cluster is acquired within a certain time interval.
  • In this way, the acquired status information of the target chip is more consistent, that is, all states in the status information are acquired at the same time or at approximately the same time, so that the status information of the target chip more accurately reflects the working status of the target chip.
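  • One way to picture "acquired at approximately the same time" is to read every status item in a single pass and check how long the pass took. The sketch below assumes each status item is exposed as a zero-argument reader function; `take_snapshot`, the `readers` mapping, and the 50 ms default are hypothetical and do not correspond to any real driver interface.

```python
import time
from typing import Callable, Dict, Tuple


def take_snapshot(
    readers: Dict[str, Callable[[], float]],
    max_spread_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Dict[str, float], bool]:
    """Read every status item once and report whether all reads fell
    within max_spread_s of each other."""
    start = clock()
    values = {name: read() for name, read in readers.items()}
    spread = clock() - start
    return values, spread <= max_spread_s


if __name__ == "__main__":
    snapshot, consistent = take_snapshot(
        {"temperature_c": lambda: 61.0, "power_w": lambda: 250.0},
        max_spread_s=1.0,
    )
    print(snapshot, consistent)
```

  • A caller could discard or retry a snapshot flagged as inconsistent, so that availability is never derived from a mix of stale and fresh readings.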
  • the state information of the target chip may be obtained through a state drive interface of the target chip.
  • the state drive interface may be a device driver (device driver) in any server shown in FIG. 1.
  • the state drive interface can be connected to multiple chips, or the state drive interface can also be connected to one chip.
  • the target chip may be an ASIC, for example, the ASIC chip may be a GPU or an NPU.
  • the target chip may also be another chip that can be used to accelerate neural network operations, which is not limited in the embodiment of the present application.
  • the target chip may also output the status information.
  • the target chip may output the status information to the CSA or CSM through a device driver.
  • the method 200 may further include step 212, which is specifically as follows.
  • the frequency information may include one or more of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  • the real-time frequency of the target chip may refer to the instantaneous operating frequency of the target chip at a certain moment, or to the average value of the instantaneous operating frequency of the target chip within a certain (possibly short) time interval; the definition of the real-time frequency of the target chip may also follow the prior art, which is not limited in the embodiment of the present application.
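  • the two frequency readings described above (an instantaneous value and an average over a time interval) might be tracked as in the sketch below. The class name `FrequencyTracker`, the MHz unit, and the sliding-window policy are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Tuple


class FrequencyTracker:
    """Keeps recent frequency samples inside a preset time interval."""

    def __init__(self, window_s: float) -> None:
        if window_s <= 0:
            raise ValueError("window_s must be positive")
        self.window_s = window_s
        self._samples: Deque[Tuple[float, float]] = deque()

    def add_sample(self, timestamp_s: float, freq_mhz: float) -> None:
        self._samples.append((timestamp_s, freq_mhz))
        # Drop samples that fell out of the preset time interval.
        while timestamp_s - self._samples[0][0] > self.window_s:
            self._samples.popleft()

    def realtime(self) -> float:
        """Instantaneous reading: the most recent sample."""
        if not self._samples:
            raise ValueError("no samples recorded")
        return self._samples[-1][1]

    def average(self) -> float:
        """Average frequency over the samples still inside the interval."""
        if not self._samples:
            raise ValueError("no samples recorded")
        return sum(freq for _, freq in self._samples) / len(self._samples)
```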
  • the influence of the frequency range fluctuation after the energy-saving feature of the target chip is turned on may not be considered.
  • the frequency information of the target chip may be obtained through a frequency drive interface of the target chip.
  • the frequency drive interface may be a device driver in any server shown in FIG. 1, that is, the state drive interface and the frequency drive interface may be the same drive interface (ie, a device driver). This is not limited in the embodiment of the present application.
  • the frequency drive interface can be connected to multiple chips, or the frequency drive interface can also be connected to one chip.
  • S220 Determine availability information of the target chip according to the status information.
  • the availability information may be used to indicate whether the target chip is available.
  • the availability information may include at least one of normal, general alarm, important alarm, and emergency alarm.
  • the availability of the target chip can be more intuitively indicated through status information such as normal, general alarm, important alarm, and emergency alarm.
  • the status items included in the availability information are only examples and not limiting.
  • the availability information may also include other similar status items that can indicate the working status of the chip, which is not limited in the embodiment of the present application.
  • the availability information can be configured in advance, and the status items in the availability information can be defined as: 0 means normal, 1 means general alarm, 2 means important alarm, and 3 means emergency alarm.
  • the definition here is only an example and not a limitation.
  • the status items in the availability information may also be defined in other ways, or other status items included in the availability information may also be defined in the embodiments of the present application.
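  • the example encoding above (0 to 3) maps naturally onto an integer enumeration. The sketch below is illustrative only: the names `Availability` and `is_available` are hypothetical, and treating only the normal state as "available" is an assumption, since the embodiments do not state which alarm levels still permit allocation.

```python
from enum import IntEnum


class Availability(IntEnum):
    """Example status items, using the 0..3 encoding given as an example."""
    NORMAL = 0
    GENERAL_ALARM = 1
    IMPORTANT_ALARM = 2
    EMERGENCY_ALARM = 3


def is_available(level: Availability) -> bool:
    # Assumption: only NORMAL counts as available. A deployment could
    # equally choose to tolerate GENERAL_ALARM.
    return level == Availability.NORMAL
```

  • because the members are integers, a value reported by a driver can be converted back with `Availability(value)`, and levels can be compared by severity.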
  • the availability information of the target chip may also be determined according to the status information and the frequency information.
  • the availability information of the target chip can be determined more accurately based on the status information and the frequency information.
  • allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • if the status information of the target chip is normal, and the frequency information of the target chip is also normal (that is, the rated frequency of the target chip, the real-time frequency of the target chip and/or the average frequency of the target chip is within the preset range), it is determined that the availability information of the target chip is normal.
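  • the rule above (normal status plus normal frequency gives normal availability) could be sketched as follows. All limits, key names, and the choice to report a general alarm in every other case are assumptions for illustration; the embodiments only specify the all-normal case.

```python
from typing import Mapping, Tuple

NORMAL = 0
GENERAL_ALARM = 1

# Hypothetical limits; real values would depend on the chip model.
STATUS_LIMITS: Mapping[str, Tuple[float, float]] = {
    "temperature_c": (0.0, 85.0),
    "power_w": (0.0, 300.0),
}
FREQUENCY_RANGE_MHZ: Tuple[float, float] = (900.0, 1500.0)


def _within(value: float, bounds: Tuple[float, float]) -> bool:
    low, high = bounds
    return low <= value <= high


def determine_availability(
    status: Mapping[str, float], frequency: Mapping[str, float]
) -> int:
    """Return NORMAL only if every checked status and frequency item is in range."""
    status_normal = all(
        _within(status[name], bounds)
        for name, bounds in STATUS_LIMITS.items()
        if name in status
    )
    frequency_normal = all(
        _within(value, FREQUENCY_RANGE_MHZ) for value in frequency.values()
    )
    if status_normal and frequency_normal:
        return NORMAL
    # Assumption: any deviation is reported as a general alarm here.
    return GENERAL_ALARM
```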
  • the target chip may output the availability information.
  • the target chip may output the availability information to the CSA or CSM through a device driver.
  • the target chip may also output frequency information of the target chip.
  • the target chip may output the frequency information to the CSA or CSM through a device driver.
  • outputting the frequency information of the target chip can enable the server (for example, CSA or CSM) to conveniently obtain the frequency status of the target chip.
  • the target chip may also determine frequency availability information of the target chip according to the frequency information, and the frequency availability information is used to indicate whether the frequency of the target chip meets a preset condition.
  • the target chip may output the frequency availability information.
  • the target chip may output the frequency availability information to the CSA or CSM through a device driver.
  • outputting the frequency availability information enables the server (for example, CSA or CSM) to directly obtain the frequency availability of the target chip.
  • the server can directly allocate training tasks (or in other words, training data) according to the state of the target chip, without itself determining the frequency availability from the frequency information of the target chip. Therefore, the burden on the server can be reduced.
  • the availability information of the target chip is determined directly based on the status information of the target chip. Therefore, the availability information of the target chip can accurately reflect the working status of the target chip.
  • allocating training tasks (or training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • FIG. 3 shows a schematic flowchart of a method 300 for scheduling cluster resources provided by an embodiment of the present application.
  • the method may include step 310 and step 320.
  • the method may be executed by the CSM or CSA in FIG. 1, or the method may also be executed by other chips, devices, or software (for example, the driver interface of the chip), which is not limited in the embodiment of the present application.
  • the availability information may be used to indicate whether the target chip is available.
  • the availability information may include at least one of normal, general alarm, important alarm, and emergency alarm.
  • the availability of the target chip can be determined more accurately according to the normal, general alarm, important alarm, and emergency alarm status items included in the availability information.
  • the status items included in the availability information are only examples and not limiting, and the availability information may also include other similar status items that can indicate the working status of the chip, which is not limited in the embodiment of the present application.
  • the availability information can be configured in advance, and the status items in the availability information can be defined as: 0 means normal, 1 means general alarm, 2 means important alarm, and 3 means emergency alarm.
  • the definition here is only an example and not a limitation.
  • the status items in the availability information may also be defined in other ways, or other status items included in the availability information may also be defined in the embodiments of the present application.
  • the frequency information of the target chip can be obtained.
  • the frequency information may include at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  • the real-time frequency of the target chip may refer to the instantaneous operating frequency of the target chip at a certain moment, or to the average value of the instantaneous operating frequency of the target chip within a certain (possibly short) time interval; the definition of the real-time frequency of the target chip may also follow the prior art, which is not limited in the embodiment of the present application.
  • S320 Allocate the target chip according to the availability information.
  • the foregoing allocating the target chip may refer to: determining whether the target chip is available according to the availability information, and when the target chip is available, allocating a training task to the target chip.
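  • the allocation step described above can be illustrated with a short sketch. The function name `allocate_task`, the dictionary layout, and the convention that level 0 (normal) means "available" are assumptions for illustration only.

```python
from typing import Dict, Optional


def allocate_task(
    task_id: str,
    availability: Dict[str, int],
    assignments: Dict[str, str],
) -> Optional[str]:
    """Assign task_id to the first available, unassigned chip.

    availability maps chip id -> availability level (0 assumed to mean normal).
    assignments maps chip id -> task id and is updated in place.
    Returns the chosen chip id, or None if no chip is available.
    """
    for chip_id, level in availability.items():
        if level == 0 and chip_id not in assignments:
            assignments[chip_id] = task_id
            return chip_id
    return None
```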
  • the allocating the target chip according to the availability information may include: allocating the target chip according to the availability information and the frequency information.
  • it may be determined whether the frequency information of the target chip meets a preset condition. Subsequently, the target chip may be allocated according to whether the target chip is available and whether the frequency information meets the preset condition.
  • the preset condition may be: the frequency information of the target chip (the rated frequency of the target chip, the real-time frequency of the target chip and/or the average frequency of the target chip) is within a preset range.
  • the preset condition may be: the frequency information of the target chip is greater than or equal to a preset frequency threshold.
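  • both forms of the preset condition (a preset range, or a preset frequency threshold) can be expressed in one small check. The function name, parameter names, and the MHz unit below are hypothetical.

```python
from typing import Optional, Tuple


def meets_preset_condition(
    freq_mhz: float,
    *,
    allowed_range: Optional[Tuple[float, float]] = None,
    min_threshold: Optional[float] = None,
) -> bool:
    """Check a frequency value against a preset range and/or a preset threshold."""
    if allowed_range is None and min_threshold is None:
        raise ValueError("at least one preset condition must be configured")
    ok = True
    if allowed_range is not None:
        low, high = allowed_range
        ok = ok and (low <= freq_mhz <= high)
    if min_threshold is not None:
        # "greater than or equal to a preset frequency threshold"
        ok = ok and (freq_mhz >= min_threshold)
    return ok
```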
  • the frequency availability information of the target chip can also be obtained.
  • the frequency availability information may be used to indicate whether the frequency information of the target chip meets a preset condition.
  • the frequency availability information of the target chip may be obtained from the target chip, or may be obtained from a device driver.
  • the target chip may also be allocated according to the availability information and the frequency availability information.
  • the working status of the chip can be determined more accurately. Therefore, allocating training tasks (or in other words, training data) according to the availability information and the frequency availability information can improve the performance of GPU/NPU cluster training.
  • the allocating the target chip according to the availability information may further include: determining the operation of the target chip on the cluster according to the availability information, the frequency information, and the allocated chips in the cluster Performance impact; when it is determined that the impact is that the computing performance of the cluster does not decrease, the target chip is allocated.
  • the training task (or in other words, training data) is allocated according to the state of the target chip (for example, the availability information), which can improve the performance of GPU/NPU cluster training.
  • the following formula may be used to determine the influence of the target chip on the computing performance of the cluster:
  • Fc is the real-time frequency of the target chip
  • Fr is the rated frequency of the target chip
  • N is the number of chips currently allocated in the chip cluster
  • K is the adjustment coefficient
  • K can be determined based on the actual training task and the purpose of training.
  • the computing performance of the chip cluster will not decrease, that is, the target chip can be allocated; otherwise, it can be considered that after the target chip is added to the chip cluster, the computing performance of the chip cluster may decrease, and at this time the target chip may not be allocated.
  • the availability information of the target chip in the cluster is directly obtained, and the working status of the chip can be determined more accurately according to the availability information of the target chip.
  • allocating the training task (or in other words, training data) according to the availability information of the target chip can improve the performance of GPU/NPU cluster training.
  • the target chip (or the driver interface of the target chip) in the cluster may also send the status information of the target chip to the server (for example, CSA or CSM), so that the server can allocate the training task according to the status information of the target chip. The specific method is shown in FIG. 4 below.
  • FIG. 4 shows a schematic flowchart of a method 400 for scheduling cluster resources provided by another embodiment of the present application.
  • the method may include step 410, step 420, step 430, and step 440.
  • the method may be executed by the CSM or CSA in FIG. 1, or the method may also be executed by other chips, devices, or software (for example, the driver interface of the chip), which is not limited in the embodiment of the present application.
  • the status information may include one or more of the model of the target chip, the utilization rate of the target chip, the bandwidth of the target chip, the video memory utilization rate of the target chip, the processes running in the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
  • the status information may be sent by the target chip.
  • it is also possible to determine whether the target chip is available according to the status information of the target chip and the frequency information of the target chip.
  • the frequency information of the target chip may also be sent by the target chip.
  • the preset condition may be: the frequency information of the target chip is greater than or equal to a preset frequency threshold.
  • the frequency information may include at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  • S430 Determine whether the target chip will affect the overall performance of the chip cluster.
  • the following formula may be used to determine the influence of the target chip on the computing performance of the cluster:
  • Fc is the real-time frequency of the target chip
  • Fr is the rated frequency of the target chip
  • N is the number of chips currently allocated in the chip cluster
  • K is the adjustment coefficient
  • K can be determined based on the actual training task and the purpose of training.
  • the target chip may be allocated when adding the target chip to the chip cluster does not reduce the computing performance of the chip cluster.
  • the target chip can be allocated; otherwise, it can be considered that after the target chip is added to the chip cluster, the computing performance of the chip cluster may decrease, and at this time the target chip may not be allocated.
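  • the overall decision flow of method 400 might be sketched as below. The formula for the cluster impact is not reproduced in this text, so the sketch deliberately takes the impact criterion as a caller-supplied function of Fc, Fr, and N (with K captured inside that function); the name `schedule_chip` and all parameter names are hypothetical.

```python
from typing import Callable, List


def schedule_chip(
    chip_id: str,
    status_ok: bool,
    fc_mhz: float,
    fr_mhz: float,
    freq_threshold_mhz: float,
    allocated: List[str],
    keeps_performance: Callable[[float, float, int], bool],
) -> bool:
    """Allocate chip_id only if every check passes; return whether it was allocated.

    keeps_performance(Fc, Fr, N) stands in for the impact criterion; it must
    return True when adding the chip would not reduce cluster performance.
    """
    # Is the chip usable according to its status information?
    if not status_ok:
        return False
    # Does the frequency meet the preset condition (threshold form)?
    if fc_mhz < freq_threshold_mhz:
        return False
    # S430: would the chip affect the overall performance of the cluster?
    if not keeps_performance(fc_mhz, fr_mhz, len(allocated)):
        return False
    allocated.append(chip_id)
    return True
```

  • keeping the criterion as a parameter lets the adjustment coefficient K be tuned per training task without changing the scheduling flow.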
  • FIG. 5 is a schematic block diagram of an apparatus 500 for determining a chip state according to an embodiment of the present application.
  • the device 500 may be equivalent to the device driver or chip in FIG. 1, and may also be equivalent to other chips, devices, or software (for example, a driver interface of a chip), which is not limited in the embodiment of the present application.
  • the device 500 is only an example.
  • the device in the embodiment of the present application may also include other modules or units, or include modules with similar functions to the modules in FIG. 5, or not necessarily include all the modules in FIG. 5.
  • the acquiring unit 510 is configured to acquire status information of the target chip in the cluster, the status information including at least one of the model of the target chip, the utilization rate of the target chip, the bandwidth of the target chip, the video memory utilization rate of the target chip, the process running in the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip;
  • the determining unit 520 is configured to determine availability information of the target chip according to the status information, where the availability information is used to indicate whether the target chip is available;
  • the output unit 530 is configured to output the availability information.
  • the availability information includes at least one of normal, general alarm, important alarm, and emergency alarm.
  • the acquiring unit 510 is further configured to: acquire frequency information of the target chip, the frequency information including at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  • the output unit 530 is further configured to output frequency information of the target chip.
  • the determining unit 520 is specifically configured to determine the availability information of the target chip according to the status information and the frequency information.
  • the determining unit 520 is further configured to determine frequency availability information of the target chip according to the frequency information, where the frequency availability information is used to indicate whether the frequency of the target chip meets a preset condition;
  • the output unit 530 is further configured to output the frequency availability information.
  • the above-mentioned device 500 for determining the state of a chip may be either a device driver or a chip in a chip cluster.
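  • the three units of the device 500 can be pictured as three pluggable callables wired together. This is a structural sketch only; the class name `ChipStateDevice` and its field names are assumptions, and a real device driver would implement the units quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass
class ChipStateDevice:
    """Acquire status, determine availability, output the result."""
    acquire_status: Callable[[], Mapping[str, float]]              # acquiring unit 510
    determine_availability: Callable[[Mapping[str, float]], int]   # determining unit 520
    output: Callable[[int], None]                                   # output unit 530

    def run_once(self) -> int:
        status = self.acquire_status()
        availability = self.determine_availability(status)
        self.output(availability)
        return availability
```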
  • FIG. 6 is a schematic structural diagram of an apparatus 600 for scheduling cluster resources according to an embodiment of the present application.
  • the device 600 may be equivalent to the server (for example, CSM or CSA) in FIG. 1, and may also be equivalent to other chips, devices, or software (for example, chip drive interfaces), which is not limited in the embodiment of the present application.
  • the device 600 is only an example.
  • the device in the embodiment of the present application may also include other modules or units, or include modules with similar functions to the modules in FIG. 6, or not necessarily include all the modules in FIG. 6.
  • the obtaining unit 610 is configured to obtain availability information of the target chip in the cluster, where the availability information is used to indicate whether the target chip is available;
  • the allocation unit 620 is configured to allocate the target chip according to the availability information.
  • the availability information includes at least one of normal, general alarm, important alarm, and emergency alarm.
  • the acquiring unit 610 is further configured to: acquire frequency information of the target chip, where the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  • the allocating unit 620 is specifically configured to: determine the impact of the target chip on the computing performance of the cluster according to the availability information, the frequency information, and the allocated chips in the cluster; and allocate the target chip when it is determined that the impact is that the computing performance of the cluster does not decrease.
  • the obtaining unit 610 is further configured to: obtain frequency availability information of the target chip;
  • the allocation unit 620 is specifically configured to allocate the target chip according to the availability information and the frequency availability information.
  • the above-mentioned device 600 for scheduling cluster resources can be either a CSM or a CSM-deployed server, or a CSA or a CSA-deployed server.
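  • the variant of the device 600 that uses both the availability information and the frequency availability information could be sketched as follows. The class and method names are hypothetical, and treating level 0 (normal) as "available" is an assumption.

```python
from typing import Callable, Iterable, List


class ClusterResourceScheduler:
    """Obtains per-chip information and allocates the chips that qualify."""

    def __init__(
        self,
        get_availability: Callable[[str], int],
        get_frequency_available: Callable[[str], bool],
    ) -> None:
        # Together these play the role of the obtaining unit 610.
        self._get_availability = get_availability
        self._get_frequency_available = get_frequency_available

    def allocate(self, chip_ids: Iterable[str]) -> List[str]:
        """Role of the allocation unit 620: keep chips that pass both checks."""
        selected: List[str] = []
        for chip_id in chip_ids:
            available = self._get_availability(chip_id) == 0
            frequency_ok = self._get_frequency_available(chip_id)
            if available and frequency_ok:
                selected.append(chip_id)
        return selected
```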
  • the processor in the embodiment of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • many forms of RAM are available, for example, static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
  • the above-mentioned embodiments may be implemented in the form of a computer program product in whole or in part.
  • the computer program product includes one or more computer instructions or computer programs.
  • when the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (such as infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium.
  • the semiconductor medium may be a solid state drive.
  • "at least one" refers to one or more, and "multiple" refers to two or more.
  • "at least one of the following items" or similar expressions refers to any combination of these items, including a single item or any combination of a plurality of items.
  • at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can each be single or multiple.
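  • the seven cases listed above are exactly the non-empty combinations of the three items, which can be checked with a short enumeration. The function name `at_least_one_of` is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def at_least_one_of(items: Sequence[str]) -> List[str]:
    """List every non-empty combination of the given items, smallest first."""
    return [
        "".join(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]
```

  • for n items this yields 2^n - 1 combinations, so three items give the seven cases named above.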
  • the size of the sequence numbers of the above-mentioned processes does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a mobile hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Abstract

A chip state determining method and device, and a cluster resource scheduling method and device, relating to the cluster scheduling technology in the field of artificial intelligence. The chip state determining method comprises: acquiring state information of a target chip in a cluster (210); determining availability information of the target chip according to the state information (220), the availability information being used for indicating whether the target chip is available; and outputting the availability information (230). The present method can improve the performance of GPU/NPU cluster training.

Description

Method for determining chip state, and method and apparatus for scheduling cluster resources
Technical field
This application relates to cluster scheduling technology in the field of artificial intelligence, and more specifically, to a method for determining the state of a chip, a method for scheduling cluster resources, and an apparatus thereof.
Background
Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
A graphics processing unit (GPU) or neural network processing unit (NPU) has excellent data-level parallel computing capabilities, which gives GPU/NPU clusters very strong large-scale data-level parallel computing capabilities. Therefore, GPU/NPU clusters can be used to train models in machine learning. For example, GPU/NPU clusters can be used to train neural network (NN) models. When training a neural network model, the training data of the neural network model can be divided, and multiple chips in the GPU/NPU cluster (ie, multiple GPU/NPU chips) are used to train the neural network model at the same time.
At present, when using GPU/NPU clusters for training, it is necessary to separately obtain the state of each chip in the GPU/NPU cluster (that is, each GPU/NPU chip) in multiple dimensions, such as model, utilization, bandwidth, video memory utilization, power, and temperature, and to allocate training tasks (or training data) to each chip in the GPU/NPU cluster based on a preset execution strategy. However, allocating training tasks (or training data) in this way results in lower performance of GPU/NPU cluster training.
发明内容Summary of the invention
本申请提供一种确定芯片状态的方法、调度集群资源的方法及其装置,能够提升GPU/NPU集群训练的性能。This application provides a method for determining the state of a chip, a method for scheduling cluster resources and a device thereof, which can improve the performance of GPU/NPU cluster training.
第一方面,提供了一种确定芯片状态的方法,该方法包括:获取集群中的目标芯片的状态信息;根据所述状态信息确定所述目标芯片的可用性信息,所述可用性信息用于指示所述目标芯片是否可用;输出所述可用性信息。In a first aspect, a method for determining the state of a chip is provided. The method includes: acquiring state information of a target chip in a cluster; determining availability information of the target chip according to the state information, and the availability information is used to indicate all Whether the target chip is available; output the availability information.
在本申请实施例中,直接基于目标芯片的状态信息确定该目标芯片的可用性信息,因此,该目标芯片的可用性信息可以准确地反映该目标芯片的工作状态,此时,根据该目标芯片的可用性信息分配训练任务(或者说,训练数据),能够提升GPU/NPU集群训练的性能。In the embodiment of the present application, the availability information of the target chip is determined directly based on the status information of the target chip. Therefore, the availability information of the target chip can accurately reflect the working status of the target chip. At this time, according to the availability of the target chip Information distribution training tasks (or training data) can improve the performance of GPU/NPU cluster training.
需要说明的是,上述获取集群中的目标芯片的状态信息可以是指:同时获取集群中的目标芯片的(上述一项或多项)状态信息,或者说,近似同时获取集群中的目标芯片的(上述一项或多项)状态信息。例如,在一定的时间间隔内获取集群中的目标芯片的(上述一项或多项)状态信息。It should be noted that acquiring the status information of the target chips in the cluster may refer to acquiring the status information of the target chips in the cluster (one or more of the above) at the same time, or in other words, acquiring the target chips in the cluster at approximately the same time. (One or more of the above) status information. For example, the state information (one or more of the above) of the target chip in the cluster is acquired within a certain time interval.
这样,可以使获取到的所述目标芯片的状态信息具有较好的一致性,即所述状态信息中的各项状态都是同一时刻获取的,或近似同一时刻获取的,从而,根据所述目标芯片的状态信息,可以更准确地反映出所述目标芯片的工作状态。In this way, the acquired state information of the target chip can have better consistency, that is, all states in the state information are acquired at the same time, or acquired at approximately the same time, so that according to the The status information of the target chip can more accurately reflect the working status of the target chip.
Optionally, the target chip may be an ASIC; for example, the ASIC chip may be a GPU or an NPU. Alternatively, the target chip may be another chip that can be used to accelerate neural network operations, which is not limited in the embodiments of this application.
With reference to the first aspect, in some implementations of the first aspect, the state information includes at least one of: the model of the target chip, the utilization of the target chip, the bandwidth of the target chip, the video memory utilization of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
In the embodiments of this application, the availability information of the target chip can be determined more accurately based on states in the above multiple dimensions. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
With reference to the first aspect, in some implementations of the first aspect, the availability information includes at least one of normal, general alarm, major alarm, and emergency alarm.
In the embodiments of this application, status information such as normal, general alarm, major alarm, and emergency alarm indicates the availability of the target chip more intuitively.
For example, the availability information may be configured in advance, and the status item in the availability information may be defined as follows: 0 indicates normal, 1 indicates a general alarm, 2 indicates a major alarm, and 3 indicates an emergency alarm.
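The 0 to 3 status coding above, together with the idea of deriving availability from several state dimensions, might be sketched as follows. The numeric coding is from the text; the threshold values, the `ChipState` fields, and the rule "report the worst level across dimensions" are illustrative assumptions, since the application does not specify how the levels are derived.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class Availability(IntEnum):
    NORMAL = 0
    GENERAL_ALARM = 1
    MAJOR_ALARM = 2
    EMERGENCY_ALARM = 3


@dataclass
class ChipState:
    temperature_c: float
    power_w: float
    memory_utilization: float  # fraction between 0.0 and 1.0


# Hypothetical limits: (general, major, emergency) for each dimension.
TEMP_LIMITS = (75.0, 85.0, 95.0)
POWER_LIMITS = (280.0, 300.0, 320.0)
MEM_LIMITS = (0.90, 0.95, 0.99)


def level_for(value: float, limits: Tuple[float, float, float]) -> Availability:
    """Return the highest alarm level whose limit the value has reached."""
    level = Availability.NORMAL
    alarms = (Availability.GENERAL_ALARM,
              Availability.MAJOR_ALARM,
              Availability.EMERGENCY_ALARM)
    for candidate, limit in zip(alarms, limits):
        if value >= limit:
            level = candidate
    return level


def determine_availability(state: ChipState) -> Availability:
    """Assumed rule: the chip is as unavailable as its worst dimension."""
    return max(
        level_for(state.temperature_c, TEMP_LIMITS),
        level_for(state.power_w, POWER_LIMITS),
        level_for(state.memory_utilization, MEM_LIMITS),
    )
```

Using an `IntEnum` keeps the levels comparable and lets the value be output directly as the integer code that the text describes.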
With reference to the first aspect, in some implementations of the first aspect, the method further includes: acquiring frequency information of the target chip, where the frequency information includes the rated frequency of the target chip, and/or at least one of the real-time frequency of the target chip and the average frequency of the target chip within a preset time interval.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: outputting the frequency information of the target chip.
In the embodiments of this application, outputting the frequency information of the target chip allows a server (for example, a cluster scheduling server) to conveniently obtain the frequency state of the target chip.
With reference to the first aspect, in some implementations of the first aspect, determining the availability information of the target chip according to the state information includes: determining the availability information of the target chip according to the state information and the frequency information.
In the embodiments of this application, the availability information of the target chip can be determined more accurately according to the state information and the frequency information. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: determining frequency availability information of the target chip according to the frequency information, where the frequency availability information is used to indicate whether the frequency of the target chip satisfies a preset condition; and outputting the frequency availability information.
In the embodiments of this application, outputting the frequency availability information allows a server (for example, a cluster scheduling server) to obtain the frequency availability of the target chip directly. The server can then allocate training tasks (or, in other words, training data) directly according to the state of the target chip, without having to determine the frequency availability itself from the frequency information of the target chip, which reduces the burden on the server.
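A minimal sketch of deriving frequency availability from the frequency information follows. The application only says that the frequency must satisfy "a preset condition"; the condition used here (the observed frequency is at least a fixed fraction of the rated frequency) and the names `FrequencyInfo`, `frequency_available`, and `min_ratio` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FrequencyInfo:
    rated_mhz: float
    realtime_mhz: Optional[float] = None
    average_mhz: Optional[float] = None


def average_frequency(samples_mhz: Sequence[float]) -> float:
    """Average of the samples collected within the preset time interval."""
    if not samples_mhz:
        raise ValueError("at least one frequency sample is required")
    return sum(samples_mhz) / len(samples_mhz)


def frequency_available(info: FrequencyInfo, min_ratio: float = 0.9) -> bool:
    """Assumed preset condition: observed frequency >= min_ratio * rated.

    The interval average is preferred over the instantaneous reading
    because it is less sensitive to short-lived frequency changes.
    """
    observed = info.average_mhz if info.average_mhz is not None else info.realtime_mhz
    if observed is None:
        return False  # no observation available: treat conservatively
    return observed >= min_ratio * info.rated_mhz
```

Computing this flag next to the chip and reporting only the result matches the stated motivation: the scheduling server consumes one boolean instead of re-evaluating raw frequency data.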
In a second aspect, a method for scheduling cluster resources is provided. The method includes: acquiring availability information of a target chip in a cluster, where the availability information is used to indicate whether the target chip is available; and allocating the target chip according to the availability information.
In the embodiments of this application, the availability information of the target chip in the cluster is acquired directly, and the working state of the chip can be determined more accurately from that availability information. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
The availability information of the target chip may be determined according to at least one of the following items:
the model of the target chip, the utilization of the target chip, the bandwidth of the target chip, the video memory utilization of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
In the embodiments of this application, the availability information of the target chip can be determined more accurately based on states in the above multiple dimensions. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
With reference to the second aspect, in some implementations of the second aspect, the availability information includes at least one of normal, general alarm, major alarm, and emergency alarm.
In the embodiments of this application, the availability of the target chip can be determined more accurately according to status information such as normal, general alarm, major alarm, and emergency alarm included in the availability information.
For example, the availability information may be configured in advance, and the status item in the availability information may be defined as follows: 0 indicates normal, 1 indicates a general alarm, 2 indicates a major alarm, and 3 indicates an emergency alarm.
With reference to the second aspect, in some implementations of the second aspect, the method further includes: acquiring frequency information of the target chip, where the frequency information includes the rated frequency of the target chip, and/or at least one of the real-time frequency of the target chip and the average frequency of the target chip within a preset time interval.
With reference to the second aspect, in some implementations of the second aspect, allocating the target chip according to the availability information includes: determining the impact of the target chip on the computing performance of the cluster according to the availability information, the frequency information, and the chips already allocated in the cluster; and allocating the target chip when the impact is determined to be that the computing performance of the cluster does not decrease.
In the embodiments of this application, the state of the chip (for example, the availability information) can be determined more accurately according to the state information and the frequency information. Allocating training tasks (or, in other words, training data) according to the state of the target chip can then improve the performance of GPU/NPU cluster training.
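The check that allocating the target chip does not reduce cluster computing performance could look like the sketch below. It rests on a simplifying assumption that is not stated in the application: in synchronous training every step waits for the slowest chip, so the cluster's effective pace is the minimum frequency among allocated chips. The function name `should_allocate` and the use of a single frequency value per chip are likewise hypothetical.

```python
from typing import Sequence

NORMAL = 0  # status code for "normal" as defined in the text


def should_allocate(target_availability: int,
                    target_mhz: float,
                    allocated_mhz: Sequence[float]) -> bool:
    """Allocate only an available chip that would not slow the cluster down.

    Assumption: cluster pace equals the slowest allocated chip's frequency,
    so adding a chip is harmless when it is at least as fast as that chip.
    """
    if target_availability != NORMAL:
        return False
    if not allocated_mhz:
        return True  # nothing allocated yet, so nothing can be slowed down
    return target_mhz >= min(allocated_mhz)
```

Under this model a chip running below the current slowest allocated chip would become the new bottleneck, which is why it is rejected even if it reports a normal status.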
With reference to the second aspect, in some implementations of the second aspect, the method further includes: acquiring frequency availability information of the target chip; and allocating the target chip according to the availability information includes: allocating the target chip according to the availability information and the frequency availability information.
In the embodiments of this application, the working state of the chip can be determined more accurately according to the availability information and the frequency availability information. Allocating training tasks (or, in other words, training data) according to the availability information and the frequency availability information can therefore improve the performance of GPU/NPU cluster training.
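When the scheduler receives both the availability information and a precomputed frequency availability flag, allocation reduces to a simple filter. The sketch below is one possible reading; the `ChipReport` structure, the `pick_chips` function, and the all-or-nothing policy when too few chips qualify are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

NORMAL = 0  # status code for "normal" as defined in the text


@dataclass(frozen=True)
class ChipReport:
    chip_id: str
    availability: int   # 0 normal, 1 general, 2 major, 3 emergency alarm
    frequency_ok: bool  # frequency availability information


def pick_chips(reports: Sequence[ChipReport], needed: int) -> List[str]:
    """Return `needed` chip ids that are both available and frequency-ok.

    Assumed policy: if too few chips qualify, allocate none rather than
    start a training task on a partial set.
    """
    eligible = [r.chip_id for r in reports
                if r.availability == NORMAL and r.frequency_ok]
    if len(eligible) < needed:
        return []
    return eligible[:needed]
```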
In a third aspect, an apparatus for determining the state of a chip is provided, including: an acquiring unit, configured to acquire state information of a target chip in a cluster; a determining unit, configured to determine availability information of the target chip according to the state information, where the availability information is used to indicate whether the target chip is available; and an output unit, configured to output the availability information.
In the embodiments of this application, the availability information of the target chip is determined directly from the state information of the target chip, so the availability information can accurately reflect the working state of the target chip. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
It should be noted that acquiring the state information of the target chip in the cluster may mean acquiring the one or more items of state information of the target chip at the same time, or at approximately the same time. For example, the one or more items of state information of the target chip in the cluster are all acquired within a certain time interval.
In this way, the acquired state information of the target chip is more consistent, that is, all items in the state information are acquired at the same moment or at approximately the same moment, so that the state information reflects the working state of the target chip more accurately.
Optionally, the target chip may be an ASIC; for example, the ASIC chip may be a GPU or an NPU. Alternatively, the target chip may be another chip that can be used to accelerate neural network operations, which is not limited in the embodiments of this application.
With reference to the third aspect, in some implementations of the third aspect, the state information includes at least one of: the model of the target chip, the utilization of the target chip, the bandwidth of the target chip, the video memory utilization of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
In the embodiments of this application, the availability information of the target chip can be determined more accurately based on states in the above multiple dimensions. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
With reference to the third aspect, in some implementations of the third aspect, the availability information includes at least one of normal, general alarm, major alarm, and emergency alarm.
In the embodiments of this application, status information such as normal, general alarm, major alarm, and emergency alarm indicates the availability of the target chip more intuitively.
For example, the availability information may be configured in advance, and the status item in the availability information may be defined as follows: 0 indicates normal, 1 indicates a general alarm, 2 indicates a major alarm, and 3 indicates an emergency alarm.
With reference to the third aspect, in some implementations of the third aspect, the acquiring unit is further configured to: acquire frequency information of the target chip, where the frequency information includes the rated frequency of the target chip, and/or at least one of the real-time frequency of the target chip and the average frequency of the target chip within a preset time interval.
With reference to the third aspect, in some implementations of the third aspect, the output unit is further configured to: output the frequency information of the target chip.
In the embodiments of this application, outputting the frequency information of the target chip allows a server (for example, a cluster scheduling server) to conveniently obtain the frequency state of the target chip.
With reference to the third aspect, in some implementations of the third aspect, the determining unit is specifically configured to: determine the availability information of the target chip according to the state information and the frequency information.
In the embodiments of this application, the availability information of the target chip can be determined more accurately according to the state information and the frequency information. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
With reference to the third aspect, in some implementations of the third aspect, the determining unit is further configured to: determine frequency availability information of the target chip according to the frequency information, where the frequency availability information is used to indicate whether the frequency of the target chip satisfies a preset condition; and the output unit is further configured to: output the frequency availability information.
In the embodiments of this application, outputting the frequency availability information allows a server (for example, a cluster scheduling server) to obtain the frequency availability of the target chip directly. The server can then allocate training tasks (or, in other words, training data) directly according to the state of the target chip, without having to determine the frequency availability itself from the frequency information of the target chip, which reduces the burden that working from the frequency information would place on the server.
In a fourth aspect, an apparatus for scheduling cluster resources is provided, including: an acquiring unit, configured to acquire availability information of a target chip in a cluster, where the availability information is used to indicate whether the target chip is available; and an allocating unit, configured to allocate the target chip according to the availability information.
In the embodiments of this application, the availability information of the target chip in the cluster is acquired directly, and the working state of the chip can be determined more accurately from that availability information. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
The availability information of the target chip may be determined according to at least one of the following items:
the model of the target chip, the utilization of the target chip, the bandwidth of the target chip, the video memory utilization of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
In the embodiments of this application, the availability information of the target chip can be determined more accurately based on states in the above multiple dimensions. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
With reference to the fourth aspect, in some implementations of the fourth aspect, the availability information includes at least one of normal, general alarm, major alarm, and emergency alarm.
In the embodiments of this application, the availability of the target chip can be determined more accurately according to status information such as normal, general alarm, major alarm, and emergency alarm included in the availability information.
For example, the availability information may be configured in advance, and the status item in the availability information may be defined as follows: 0 indicates normal, 1 indicates a general alarm, 2 indicates a major alarm, and 3 indicates an emergency alarm.
With reference to the fourth aspect, in some implementations of the fourth aspect, the acquiring unit is further configured to: acquire frequency information of the target chip, where the frequency information includes the rated frequency of the target chip, and/or at least one of the real-time frequency of the target chip and the average frequency of the target chip within a preset time interval.
With reference to the fourth aspect, in some implementations of the fourth aspect, the allocating unit is specifically configured to: determine the impact of the target chip on the computing performance of the cluster according to the availability information, the frequency information, and the chips already allocated in the cluster; and allocate the target chip when the impact is determined to be that the computing performance of the cluster does not decrease.
In the embodiments of this application, the state of the chip (for example, the availability information) can be determined more accurately according to the state information and the frequency information. Allocating training tasks (or, in other words, training data) according to the state of the target chip can then improve the performance of GPU/NPU cluster training.
With reference to the fourth aspect, in some implementations of the fourth aspect, the acquiring unit is further configured to: acquire frequency availability information of the target chip; and the allocating unit is specifically configured to: allocate the target chip according to the availability information and the frequency availability information.
In the embodiments of this application, the working state of the chip can be determined more accurately according to the availability information and the frequency availability information. Allocating training tasks (or, in other words, training data) according to the availability information and the frequency availability information can therefore improve the performance of GPU/NPU cluster training.
In a fifth aspect, an apparatus for determining the state of a chip is provided. The apparatus includes: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory. When the program stored in the memory is executed, the processor is configured to perform the method in any implementation of the first aspect.
The processor in the fifth aspect may be a central processing unit (CPU), or a combination of a CPU and a neural network computing processor. The neural network computing processor here may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.
In a sixth aspect, an apparatus for scheduling cluster resources is provided. The apparatus includes: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory. When the program stored in the memory is executed, the processor is configured to perform the method in any implementation of the second aspect.
The processor in the sixth aspect may be a central processing unit (CPU), or a combination of a CPU and a neural network computing processor. The neural network computing processor here may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.
In a seventh aspect, a computer-readable medium is provided. The computer-readable medium stores program code to be executed by a device, and the program code includes instructions for performing the method in any implementation of the first aspect or any implementation of the second aspect.
In an eighth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, the computer is caused to perform the method in any implementation of the first aspect or any implementation of the second aspect.
In a ninth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method in any implementation of the first aspect or any implementation of the second aspect.
Optionally, as an implementation, the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory. When the instructions are executed, the processor is configured to perform the method in any implementation of the first aspect or any implementation of the second aspect.
The chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
Optionally, the chip may be the apparatus for determining the state of a chip in any implementation of the third aspect.
In a tenth aspect, an electronic device is provided. The electronic device includes the chip in the ninth aspect.
The electronic device may be a terminal device or a server.
In the embodiments of this application, the availability information of the target chip is determined directly from the state information of the target chip, so the availability information can accurately reflect the working state of the target chip. Allocating training tasks (or, in other words, training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of an application scenario provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for determining the state of a chip provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of a method for scheduling cluster resources provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of a method for scheduling cluster resources provided by an embodiment of this application;
FIG. 5 is a schematic structural diagram of an apparatus for determining the state of a chip provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of an apparatus for scheduling cluster resources provided by an embodiment of this application.
Detailed Description
The technical solutions in this application are described below with reference to the accompanying drawings.
FIG. 1 is a schematic structural diagram of an application scenario provided by an embodiment of the present invention. The methods and apparatuses (for example, a chip or a server) in the embodiments of this application can be applied to the application scenario shown in FIG. 1. It should be understood that the application scenario shown in FIG. 1 is only an example and not a limitation. The application scenario in FIG. 1 may include more or fewer nodes (for example, cluster scheduling server master nodes or cluster scheduling agent nodes), servers, chips, or device drivers, which is not limited in the embodiments of this application.
The application scenario shown in FIG. 1 is an example of a chip cluster. The chip cluster may include a cluster scheduling server master node (cluster scheduler management server, CSM) and p servers. A cluster scheduling agent node (cluster scheduler agent, CSA), a device driver, and m chips may be deployed on server i (that is, the i-th of the p servers), where i, m, and p are all positive integers.
The CSM may be responsible for receiving and issuing training tasks and for managing and scheduling the chip resources of the chip cluster as a whole, and may be deployed on a separate server in the chip cluster. The CSM can communicate with the CSAs, and can schedule chip resources and issue training tasks by controlling the CSAs. A CSA may be deployed on a server (for example, server 1 to server p shown in FIG. 1). A CSA can communicate with the chips on its server (for example, chip 1 to chip m shown in FIG. 1) through the device driver, and can thereby obtain information such as the working state and frequency of the chips.
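The division of roles among the CSM, the CSAs, and the device drivers can be sketched as three small classes. This is only an illustration of the reporting path described above; the class and method names (`FakeDeviceDriver`, `ClusterSchedulerAgent.report`, `ClusterSchedulerManager.collect`) and the dictionary-based chip records are hypothetical, and a real driver interface would differ.

```python
from typing import Dict, List


class FakeDeviceDriver:
    """Stand-in for a device driver that exposes per-chip readings."""

    def __init__(self, chips: Dict[str, dict]):
        self._chips = chips

    def chip_ids(self) -> List[str]:
        return list(self._chips)

    def query(self, chip_id: str) -> dict:
        return dict(self._chips[chip_id])  # copy, so callers cannot mutate


class ClusterSchedulerAgent:
    """CSA: runs on one server and reads its chips through the driver."""

    def __init__(self, server_name: str, driver: FakeDeviceDriver):
        self.server_name = server_name
        self.driver = driver

    def report(self) -> Dict[str, dict]:
        return {f"{self.server_name}/{chip_id}": self.driver.query(chip_id)
                for chip_id in self.driver.chip_ids()}


class ClusterSchedulerManager:
    """CSM: gathers reports from every CSA to get a cluster-wide view."""

    def __init__(self, agents: List[ClusterSchedulerAgent]):
        self.agents = agents

    def collect(self) -> Dict[str, dict]:
        merged: Dict[str, dict] = {}
        for agent in self.agents:
            merged.update(agent.report())
        return merged
```

Prefixing each chip identifier with its server name keeps identifiers unique across the p servers when the CSM merges the reports.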
It should be noted that the chip cluster in FIG. 1 may include one server or multiple servers (p greater than or equal to 2), which is not limited in the embodiments of the present application.
As shown in FIG. 1, one device driver may be deployed on a server, in which case all chips on that server communicate with the CSA through that device driver. Alternatively, multiple device drivers may be deployed on one server. For example, m device drivers are deployed on the server and the m chips on the server are respectively connected to the m device drivers (that is, the m chips correspond one-to-one to the m device drivers), so that each of the m chips communicates with the CSA through the device driver connected to it. This is not limited in the embodiments of the present application.
In the embodiments of the present application, the chip cluster may include application-specific integrated circuit (ASIC) chips. For example, an ASIC chip in the chip cluster shown in FIG. 1 may be a graphics processing unit (GPU) or a neural network processing unit (NPU). Alternatively, the chip cluster may also include other chips that can be used to accelerate neural network operations, which is not limited in the present application.
For ease of description, in the following embodiments the chip cluster is referred to as a GPU/NPU cluster and the chips in it as GPU/NPU chips. Those skilled in the art will understand that these descriptions are exemplary only and not limiting.
At present, when a neural network model is trained, its training data can be partitioned so that multiple GPU/NPU chips in the GPU/NPU cluster train the model simultaneously. During training, the states of each chip in the GPU/NPU cluster (that is, each GPU/NPU chip) in multiple dimensions, such as model, utilization, bandwidth, video memory utilization, power, and temperature, need to be obtained separately, and training tasks (or training data) are allocated to the chips in the cluster based on a preset execution policy.
However, no single one of the states obtained above can accurately determine the real working state of a chip. Allocating training tasks (or training data) to the chips in the GPU/NPU cluster in this way therefore results in poor cluster training performance.
The embodiments of the present application propose a method for determining the state of a chip and a method for scheduling cluster resources. A comprehensive state of a chip (that is, the availability information of the chip) is determined from the states of the chip in multiple dimensions, and training tasks (or training data) are allocated according to that comprehensive state, which can improve the performance of GPU/NPU cluster training.
It should be noted that the methods in the embodiments of the present application can be applied not only to GPU/NPU clusters but also to other types of chip clusters or to other cluster scenarios, which is not limited in the embodiments of the present application.
FIG. 2 shows a schematic flowchart of a method 200 for determining a chip state provided by an embodiment of the present application. The method may include step 210, step 220, and step 230.
Optionally, the method may be executed by any chip or any device driver in the chip cluster shown in FIG. 1, or by another chip, apparatus, or piece of software (for example, the driver interface of a chip), which is not limited in the embodiments of the present application.
S210: Acquire status information of a target chip in the cluster.
The status information may include one or more of the following: the model of the target chip, the utilization of the target chip, the bandwidth of the target chip, the video memory utilization of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
In the embodiments of the present application, the availability information of the target chip can be determined more accurately based on the states in these multiple dimensions. Allocating training tasks (or training data) according to that availability information can then improve the performance of GPU/NPU cluster training.
It should be noted that acquiring the status information of the target chip in the cluster may mean acquiring the one or more items of status information simultaneously or approximately simultaneously, for example within a certain time interval.
In this way, the acquired status information is consistent, that is, all items in it are acquired at the same moment or at approximately the same moment, so the status information reflects the working state of the target chip more accurately.
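The idea of reading all status items within one short interval can be sketched as follows. This is an illustration under stated assumptions: `StatusSnapshot`, `take_snapshot`, the reader callables, and the `max_skew_s` tolerance are hypothetical names, and real readings would come from the device driver rather than from plain Python callables.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class StatusSnapshot:
    values: Dict[str, float]   # e.g. utilization, temperature, power
    started: float             # clock value before the first reading
    finished: float            # clock value after the last reading

    def is_consistent(self, max_skew_s: float) -> bool:
        # True if every item was read within max_skew_s seconds of the others.
        return (self.finished - self.started) <= max_skew_s


def take_snapshot(readers: Dict[str, Callable[[], float]],
                  clock: Callable[[], float] = time.monotonic) -> StatusSnapshot:
    # Read every status item in a single pass so the items describe
    # (approximately) the same moment.
    started = clock()
    values = {name: read() for name, read in readers.items()}
    return StatusSnapshot(values, started, clock())
```

A caller could discard or retake a snapshot whose `is_consistent` check fails, rather than mixing readings from different moments.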
Optionally, the status information of the target chip may be obtained through a status driver interface of the target chip. The status driver interface may be the device driver in any server shown in FIG. 1.
Optionally, the status driver interface may be connected to multiple chips or to a single chip. For details, refer to the description of FIG. 1 above, which is not repeated here.
Optionally, the target chip may be an ASIC; for example, the ASIC chip may be a GPU or an NPU. Alternatively, the target chip may be another chip that can be used to accelerate neural network operations, which is not limited in the embodiments of the present application.
Optionally, the target chip may also output the status information. For example, the target chip may output the status information to the CSA or CSM through the device driver.
In the embodiments of the present application, the method 200 may further include step 212, as follows.
S212: Acquire frequency information of the target chip.
The frequency information may include one or more of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
Optionally, the real-time frequency of the target chip may refer to the instantaneous operating frequency of the target chip at a given moment, or to the average of its instantaneous operating frequencies over a certain (possibly short) time interval. The definition of the real-time frequency may also follow the prior art, which is not limited in the embodiments of the present application.
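The three frequency quantities can be illustrated with a short sketch. It assumes the second reading of "real-time frequency" given above (an average over a short recent window); `FrequencyInfo`, `summarize_frequency`, the MHz unit, and the `short_window` parameter are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class FrequencyInfo:
    rated_mhz: float     # rated frequency
    realtime_mhz: float  # mean of the most recent samples
    average_mhz: float   # mean over the whole preset interval


def summarize_frequency(rated_mhz: float, samples_mhz: Sequence[float],
                        short_window: int = 3) -> FrequencyInfo:
    """samples_mhz: instantaneous readings over the preset interval, oldest first."""
    if not samples_mhz:
        raise ValueError("at least one frequency sample is required")
    if short_window < 1:
        raise ValueError("short_window must be at least 1")
    recent = samples_mhz[-short_window:]
    return FrequencyInfo(rated_mhz, fmean(recent), fmean(samples_mhz))
```

With samples of 1000, 900, 800, and 700 MHz and `short_window=2`, the real-time value is 750 MHz and the interval average is 850 MHz, which shows how a recent slowdown appears in the real-time figure before it dominates the average.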
Optionally, when the frequency information of the target chip is acquired, the fluctuation of the frequency range that occurs after the energy-saving feature of the target chip is enabled may be disregarded.
Optionally, the frequency information of the target chip may be obtained through a frequency driver interface of the target chip. The frequency driver interface may be the device driver in any server shown in FIG. 1. In other words, the status driver interface and the frequency driver interface may be one and the same driver interface (that is, the device driver), which is not limited in the embodiments of the present application.
Optionally, the frequency driver interface may be connected to multiple chips or to a single chip. For details, refer to the description of FIG. 1 above, which is not repeated here.
S220: Determine availability information of the target chip according to the status information.
The availability information may be used to indicate whether the target chip is available.
Optionally, the availability information may include at least one of normal, general alarm, important alarm, and emergency alarm.
In the embodiments of the present application, states such as normal, general alarm, important alarm, and emergency alarm indicate the availability of the target chip more intuitively.
These status items are only examples and not limitations. The availability information may also include other similar status items that can indicate the working state of the chip, which is not limited in the embodiments of the present application.
For example, the availability information may be configured in advance, with its status items defined as follows: 0 indicates normal, 1 indicates general alarm, 2 indicates important alarm, and 3 indicates emergency alarm.
It should be understood that this definition is only an example and not a limitation. In the embodiments of the present application, the status items in the availability information may be defined in other ways, and other status items included in the availability information may also be defined.
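The example encoding above maps naturally onto an integer enumeration. The sketch below is illustrative: the names `Availability` and `is_available` are hypothetical, and treating only the normal state as available is an assumption, since the text leaves that policy open.

```python
from enum import IntEnum


class Availability(IntEnum):
    NORMAL = 0
    GENERAL_ALARM = 1
    IMPORTANT_ALARM = 2
    EMERGENCY_ALARM = 3


def is_available(level: Availability) -> bool:
    # Assumption: only NORMAL counts as available. A deployment might also
    # accept GENERAL_ALARM, depending on the training task.
    return level == Availability.NORMAL
```

Because `IntEnum` members compare as integers, severity levels can be ordered directly, for example `Availability.GENERAL_ALARM < Availability.IMPORTANT_ALARM`.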
Optionally, in S220, the availability information of the target chip may also be determined according to both the status information and the frequency information.
In the embodiments of the present application, the availability information of the target chip can be determined more accurately from the status information together with the frequency information. Allocating training tasks (or training data) according to that availability information can then improve the performance of GPU/NPU cluster training.
For example, the availability information of the target chip may be determined to be normal when all of the status information of the target chip is normal and the frequency information of the target chip is also normal (that is, the rated frequency, the real-time frequency, and/or the average frequency of the target chip is within a preset range).
Those skilled in the art will understand that the foregoing is only an example and not a limitation. The specific method for determining the availability information may depend on the actual training task, the training purpose, and so on, which is not limited in the embodiments of the present application.
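The example rule (every status item normal and every frequency item within its preset range) can be sketched as a range check. The item names, the `limits` table, and the choice to treat an item without a configured range as not normal are all assumptions made for illustration; the application does not specify them.

```python
from typing import Mapping, Tuple

Range = Tuple[float, float]  # inclusive (low, high) bounds


def in_range(value: float, bounds: Range) -> bool:
    low, high = bounds
    return low <= value <= high


def is_chip_normal(status: Mapping[str, float],
                   frequency: Mapping[str, float],
                   limits: Mapping[str, Range]) -> bool:
    """True only if every status item and every frequency item is within its range."""
    readings = {**status, **frequency}
    # Conservative assumption: a reading with no configured range is not normal.
    return all(name in limits and in_range(value, limits[name])
               for name, value in readings.items())
```

A single out-of-range item, such as a temperature above its upper bound, is enough to make the result false, which matches the "all items normal" condition in the example.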
S230: Output the availability information.
Optionally, the target chip may output the availability information. For example, the target chip may output the availability information to the CSA or CSM through the device driver.
Optionally, the target chip may also output its frequency information. For example, the target chip may output the frequency information to the CSA or CSM through the device driver.
In the embodiments of the present application, outputting the frequency information of the target chip allows the server (for example, the CSA or CSM) to obtain the frequency state of the target chip conveniently.
In the embodiments of the present application, the target chip may also determine frequency availability information of the target chip according to the frequency information. The frequency availability information is used to indicate whether the frequency of the target chip meets a preset condition.
In this case, the target chip may output the frequency availability information. For example, the target chip may output the frequency availability information to the CSA or CSM through the device driver.
In the embodiments of the present application, outputting the frequency availability information lets the server (for example, the CSA or CSM) obtain the frequency availability of the target chip directly. The server can then allocate training tasks (or training data) directly according to the state of the target chip, without having to determine the frequency availability itself from the frequency information of the target chip, which reduces the load on the server.
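The division of work described here, where the chip or driver evaluates the preset condition and the server only reads the result, can be sketched as follows. The sketch assumes the preset condition is a minimum-frequency threshold on the real-time frequency; `FrequencyReport`, `make_report`, and `server_can_use` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrequencyReport:
    realtime_mhz: float
    frequency_ok: bool  # the frequency availability information


def make_report(realtime_mhz: float, threshold_mhz: float) -> FrequencyReport:
    # Chip or driver side: evaluate the preset condition once, near the hardware.
    return FrequencyReport(realtime_mhz, realtime_mhz >= threshold_mhz)


def server_can_use(report: FrequencyReport) -> bool:
    # Server side (CSA or CSM): no threshold logic, just read the flag.
    return report.frequency_ok
```

The server-side function contains no knowledge of thresholds, which is the reduction in server load that the paragraph above describes.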
In the embodiments of the present application, the availability information of the target chip is determined directly from the status information of the target chip, so the availability information reflects the working state of the target chip more accurately. Allocating training tasks (or training data) according to that availability information can then improve the performance of GPU/NPU cluster training.
FIG. 3 shows a schematic flowchart of a method 300 for scheduling cluster resources provided by an embodiment of the present application. The method may include step 310 and step 320.
Optionally, the method may be executed by the CSM or CSA in FIG. 1, or by another chip, apparatus, or piece of software (for example, the driver interface of a chip), which is not limited in the embodiments of the present application.
S310: Obtain availability information of a target chip in the cluster.
The availability information may be used to indicate whether the target chip is available.
Optionally, the availability information may include at least one of normal, general alarm, important alarm, and emergency alarm.
In the embodiments of the present application, the availability of the target chip can be determined more accurately from states such as normal, general alarm, important alarm, and emergency alarm included in the availability information.
These status items are only examples and not limitations. The availability information may also include other similar status items that can indicate the working state of the chip, which is not limited in the embodiments of the present application.
For example, the availability information may be configured in advance, with its status items defined as follows: 0 indicates normal, 1 indicates general alarm, 2 indicates important alarm, and 3 indicates emergency alarm.
It should be understood that this definition is only an example and not a limitation. In the embodiments of the present application, the status items in the availability information may be defined in other ways, and other status items included in the availability information may also be defined.
In the embodiments of the present application, the frequency information of the target chip may be obtained.
The frequency information may include at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
Optionally, the real-time frequency of the target chip may refer to the instantaneous operating frequency of the target chip at a given moment, or to the average of its instantaneous operating frequencies over a certain (possibly short) time interval. The definition of the real-time frequency may also follow the prior art, which is not limited in the embodiments of the present application.
S320: Allocate the target chip according to the availability information.
Optionally, allocating the target chip may mean determining, according to the availability information, whether the target chip is available and, if it is available, allocating a training task to the target chip.
Optionally, allocating the target chip according to the availability information may include allocating the target chip according to the availability information and the frequency information.
For example, as shown in FIG. 4 below, whether the target chip is available may first be determined according to the availability information. If the target chip is available, whether its frequency information meets a preset condition may then be determined. The target chip may then be allocated according to whether it is available and whether the frequency information meets the preset condition.
The preset condition may be that the frequency information of the target chip (the rated frequency, the real-time frequency, and/or the average frequency of the target chip) is within a preset range. For example, the preset condition may be that the frequency information of the target chip is greater than or equal to a preset frequency threshold.
In the embodiments of the present application, frequency availability information of the target chip may also be obtained. The frequency availability information may be used to indicate whether the frequency information of the target chip meets a preset condition.
Optionally, the frequency availability information of the target chip may be obtained from the target chip or from the device driver.
Correspondingly, in S320, the target chip may also be allocated according to the availability information and the frequency availability information.
In the embodiments of the present application, the working state of the chip can be determined more accurately from the availability information together with the frequency availability information. Allocating training tasks (or training data) according to the availability information and the frequency availability information can therefore improve the performance of GPU/NPU cluster training.
Optionally, allocating the target chip according to the availability information may further include: determining, according to the availability information, the frequency information, and the chips already allocated in the cluster, the influence of the target chip on the computing performance of the cluster; and allocating the target chip if it is determined that the computing performance of the cluster will not decrease.
In the embodiments of the present application, the state of the chip (for example, the availability information) can be determined more accurately from the status information and the frequency information. Allocating training tasks (or training data) according to the state of the target chip can then improve the performance of GPU/NPU cluster training.
For example, the following formula may be used to determine the influence of the target chip on the computing performance of the cluster:
(Fr - Fc) * N * K < Fc
Here, Fc is the real-time frequency of the target chip, Fr is the rated frequency of the target chip, N is the number of chips currently allocated in the chip cluster, and K is an adjustment coefficient that may be determined according to the actual training task, the training purpose, and so on.
Optionally, if the above inequality holds, it can be considered that adding the target chip to the chip cluster will not reduce the computing performance of the chip cluster, so the target chip may be allocated. Otherwise, adding the target chip may reduce the computing performance of the chip cluster, and the target chip may be left unallocated.
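The inequality can be evaluated directly. The sketch below is a plain transcription of the formula with the symbols defined above; the function name `keeps_cluster_performance` and the numeric values in the usage note are hypothetical and chosen only to illustrate both outcomes.

```python
def keeps_cluster_performance(fr: float, fc: float, n: int, k: float) -> bool:
    """Evaluate (Fr - Fc) * N * K < Fc.

    fr: rated frequency of the target chip
    fc: real-time frequency of the target chip
    n:  number of chips currently allocated in the cluster
    k:  adjustment coefficient chosen for the training task
    """
    return (fr - fc) * n * k < fc
```

As a worked example with Fr = 1000, N = 8, and K = 1: a chip running at Fc = 950 gives (1000 - 950) * 8 * 1 = 400, which is less than 950, so the chip may be allocated. A chip running at Fc = 800 gives (1000 - 800) * 8 * 1 = 1600, which is not less than 800, so the chip may be left unallocated. The left-hand side grows with N, so the same frequency shortfall is tolerated less as more chips are already allocated.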
In the embodiments of the present application, the availability information of the target chip in the cluster is obtained directly, and the working state of the chip can be determined more accurately from that availability information. Allocating training tasks (or training data) according to the availability information of the target chip can then improve the performance of GPU/NPU cluster training.
In a possible implementation of the embodiments of the present application, the target chip in the cluster (or the driver interface of the target chip) may also send the status information of the target chip to the server (for example, the CSA or CSM), so that the server can allocate training tasks according to that status information. The specific method is shown in FIG. 4 below.
FIG. 4 shows a schematic flowchart of a method 400 for scheduling cluster resources provided by another embodiment of the present application. The method may include step 410, step 420, step 430, and step 440.
Optionally, the method may be executed by the CSM or CSA in FIG. 1, or by another chip, apparatus, or piece of software (for example, the driver interface of a chip), which is not limited in the embodiments of the present application.
S410: Determine whether the target chip is available.
Optionally, whether the target chip is available may be determined according to the status information of the target chip.
The status information may include one or more of the following: the model of the target chip, the utilization of the target chip, the bandwidth of the target chip, the video memory utilization of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
The status information may be sent by the target chip.
Optionally, whether the target chip is available may also be determined according to the status information of the target chip and the frequency information of the target chip. The frequency information of the target chip may also be sent by the target chip.
If the target chip is available, S420 below may be performed.
S420: Determine whether the frequency information of the target chip meets a preset condition.
For example, the preset condition may be that the frequency information of the target chip is greater than or equal to a preset frequency threshold.
The frequency information may include at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
S430: Determine whether the target chip will affect the overall performance of the chip cluster.
The influence of the target chip on the computing performance of the cluster is determined according to the availability information, the frequency information, and the chips already allocated in the cluster.
For example, the following formula may be used to determine the influence of the target chip on the computing performance of the cluster:
(Fr - Fc) * N * K < Fc
Here, Fc is the real-time frequency of the target chip, Fr is the rated frequency of the target chip, N is the number of chips currently allocated in the chip cluster, and K is an adjustment coefficient that may be determined according to the actual training task, the training purpose, and so on.
S440: Determine whether to select the target chip.
Optionally, the target chip may be allocated if adding it to the chip cluster will not reduce the computing performance of the chip cluster.
For example, if the above inequality holds, it can be considered that adding the target chip to the chip cluster will not reduce the computing performance of the chip cluster, and the target chip may be allocated. Otherwise, adding the target chip may reduce the computing performance of the chip cluster, and the target chip may be left unallocated.
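Steps S410 to S440 can be read as a chain of checks. The sketch below is one possible reading under stated assumptions: the availability result is passed in as a boolean, the preset condition in S420 is a minimum threshold on the real-time frequency, and a chip that fails S420 is rejected (the text does not say what happens in that case). The names `Decision` and `select_chip` are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ALLOCATE = "allocate"
    REJECT_UNAVAILABLE = "unavailable"
    REJECT_FREQUENCY = "frequency below threshold"
    REJECT_PERFORMANCE = "would reduce cluster performance"


def select_chip(available: bool, fc: float, fr: float,
                threshold: float, n_allocated: int, k: float) -> Decision:
    # S410: is the target chip available at all?
    if not available:
        return Decision.REJECT_UNAVAILABLE
    # S420: does the real-time frequency meet the preset condition?
    if fc < threshold:
        return Decision.REJECT_FREQUENCY
    # S430: would adding the chip reduce the cluster's computing performance?
    if not (fr - fc) * n_allocated * k < fc:
        return Decision.REJECT_PERFORMANCE
    # S440: every check passed, so the chip is selected.
    return Decision.ALLOCATE
```

Returning a reason rather than a bare boolean is a design choice of this sketch; it lets a scheduler log why a chip was skipped.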
图5是本申请实施例的确定芯片状态的装置500的示意性框图。该装置500可以相当于图1中的设备驱动或芯片,还可以相当于其他芯片、装置或软件(例如,芯片的驱动接口),本申请实施例对此并不限定。FIG. 5 is a schematic block diagram of an apparatus 500 for determining a chip state according to an embodiment of the present application. The device 500 may be equivalent to the device driver or chip in FIG. 1, and may also be equivalent to other chips, devices, or software (for example, a driver interface of a chip), which is not limited in the embodiment of the present application.
应理解,装置500仅是一种示例。本申请实施例的装置还可包括其他模块或单元,或者包括与图5中的各个模块的功能相似的模块,或者并非要包括图5中的所有模块。It should be understood that the device 500 is only an example. The device in the embodiment of the present application may also include other modules or units, or include modules with similar functions to the modules in FIG. 5, or not necessarily include all the modules in FIG. 5.
The acquiring unit 510 is configured to acquire status information of a target chip in a cluster, where the status information includes at least one of the following: the model of the target chip, the utilization of the target chip, the bandwidth of the target chip, the video memory utilization of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip;
The determining unit 520 is configured to determine availability information of the target chip according to the status information, where the availability information indicates whether the target chip is available;
The output unit 530 is configured to output the availability information.
Optionally, the availability information includes at least one of normal, general alarm, major alarm, and critical alarm.
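The relationship between the status information and the four availability levels can be sketched in Python as follows. This is only an illustration under stated assumptions: the `ChipStatus` fields are a subset of the status items listed for the acquiring unit 510, and the temperature and memory thresholds, the `determine_availability` rule, and the `is_available` mapping are hypothetical values chosen for the example, not rules given in this application.

```python
from dataclasses import dataclass
from enum import IntEnum


class Availability(IntEnum):
    """The four availability levels named in the text, ordered by severity."""
    NORMAL = 0
    GENERAL_ALARM = 1
    MAJOR_ALARM = 2
    CRITICAL_ALARM = 3


@dataclass
class ChipStatus:
    """A subset of the status information acquired for a target chip."""
    model: str
    utilization: float         # fraction in [0.0, 1.0]
    memory_utilization: float  # fraction in [0.0, 1.0]
    temperature_c: float       # degrees Celsius


def determine_availability(status: ChipStatus) -> Availability:
    """Map status information to a level (all thresholds are assumptions)."""
    level = Availability.NORMAL
    # Hypothetical temperature bands: the hotter the chip, the more severe.
    if status.temperature_c >= 95:
        level = max(level, Availability.CRITICAL_ALARM)
    elif status.temperature_c >= 85:
        level = max(level, Availability.MAJOR_ALARM)
    elif status.temperature_c >= 75:
        level = max(level, Availability.GENERAL_ALARM)
    # Hypothetical rule: nearly exhausted video memory is at least a major alarm.
    if status.memory_utilization >= 0.98:
        level = max(level, Availability.MAJOR_ALARM)
    return level


def is_available(level: Availability) -> bool:
    """Assumption: only a critical alarm marks the chip as unavailable."""
    return level < Availability.CRITICAL_ALARM
```

Keeping the levels in an ordered enumeration lets several independent checks be combined by taking the most severe result, which is one plausible way to derive a single availability value from several status items.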
Optionally, the acquiring unit 510 is further configured to acquire frequency information of the target chip, where the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
Optionally, the output unit 530 is further configured to output the frequency information of the target chip.
Optionally, the determining unit 520 is specifically configured to determine the availability information of the target chip according to the status information and the frequency information.
Optionally, the determining unit 520 is further configured to determine frequency availability information of the target chip according to the frequency information, where the frequency availability information indicates whether the frequency of the target chip meets a preset condition;
The output unit 530 is further configured to output the frequency availability information.
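As an illustration of how frequency availability information might be derived from the frequency information, the Python sketch below compares the observed frequency with the rated frequency. The "preset condition" is not specified in this passage, so the ratio test, the `min_ratio` default of 0.9, and the names `FrequencyInfo` and `frequency_available` are all assumptions made for the example. The sketch also simplifies by treating the rated frequency as always present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrequencyInfo:
    """Frequency information of a target chip, in MHz."""
    rated_mhz: float
    realtime_mhz: Optional[float] = None
    average_mhz: Optional[float] = None  # mean over a preset time interval


def frequency_available(info: FrequencyInfo, min_ratio: float = 0.9) -> bool:
    """Return True if the observed frequency meets the assumed condition.

    Assumed condition: observed frequency >= min_ratio * rated frequency.
    The average frequency is preferred; the real-time value is a fallback.
    """
    observed = info.average_mhz if info.average_mhz is not None else info.realtime_mhz
    if observed is None:
        # No measurement at all: conservatively report the condition as unmet.
        return False
    return observed >= min_ratio * info.rated_mhz
```

Preferring the interval average over the instantaneous reading is a design choice of this sketch: it avoids flagging a chip because of a momentary frequency dip.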
It should be noted that the apparatus 500 for determining a chip state may be either a device driver or a chip in the chip cluster.
For the specific actions performed by the units in FIG. 5, refer to the foregoing method embodiments; details are not repeated here.
FIG. 6 is a schematic structural diagram of an apparatus 600 for scheduling cluster resources according to an embodiment of this application.
The apparatus 600 may correspond to the server (for example, the CSM or the CSA) in FIG. 1, or to another chip, apparatus, or piece of software (for example, a driver interface of a chip); this is not limited in the embodiments of this application.
It should be understood that the apparatus 600 is only an example. The apparatus in the embodiments of this application may further include other modules or units, may include modules with functions similar to those of the modules in FIG. 6, or need not include all the modules in FIG. 6.
The acquiring unit 610 is configured to acquire availability information of a target chip in a cluster, where the availability information indicates whether the target chip is available;
The allocation unit 620 is configured to allocate the target chip according to the availability information.
Optionally, the availability information includes at least one of normal, general alarm, major alarm, and critical alarm.
Optionally, the acquiring unit 610 is further configured to acquire frequency information of the target chip, where the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
Optionally, the allocation unit 620 is specifically configured to: determine the impact of the target chip on the computing performance of the cluster according to the availability information, the frequency information, and the chips already allocated in the cluster; and allocate the target chip when the computing performance of the cluster does not decrease.
Optionally, the acquiring unit 610 is further configured to acquire frequency availability information of the target chip;
The allocation unit 620 is specifically configured to allocate the target chip according to the availability information and the frequency availability information.
It should be noted that the apparatus 600 described above may be either a CSM or a server on which the CSM is deployed, or a CSA or a server on which the CSA is deployed.
For the specific actions performed by the units in FIG. 6, refer to the foregoing method embodiments; details are not repeated here.
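The allocation decision made by the allocation unit 620 can be sketched in Python as follows. The performance test used here is an assumption and is not the formula of this application: the sketch supposes that the cluster runs in lockstep, so its computing performance is bounded by its slowest allocated chip, and a target chip is accepted only if its average frequency is not below that of the slowest chip already allocated. The names `Chip` and `should_allocate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chip:
    chip_id: str
    available: bool     # availability information reported for the chip
    average_mhz: float  # average frequency within a preset time interval


def should_allocate(target: Chip, allocated: List[Chip]) -> bool:
    """Decide whether to add the target chip to the allocated set.

    Assumption (not from the application): cluster performance is limited
    by the slowest allocated chip, so the target must be at least as fast.
    """
    if not target.available:
        return False  # an unavailable chip is never allocated
    if not allocated:
        return True   # nothing allocated yet, so nothing can be slowed down
    slowest = min(chip.average_mhz for chip in allocated)
    return target.average_mhz >= slowest
```

For example, with allocated chips averaging 1000 MHz and 900 MHz, an available target chip averaging 950 MHz would be accepted under this assumed rule, while one averaging 850 MHz would be rejected.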
It should be understood that the processor in the embodiments of this application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
It should also be understood that the memory in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or a data center, that contains one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
It should be understood that the term "and/or" in this document only describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may represent three cases: only A exists, both A and B exist, or only B exists, where A and B may each be singular or plural. In addition, the character "/" in this document generally indicates an "or" relationship between the objects before and after it, but may also indicate an "and/or" relationship; the specific meaning can be understood from the context.
In this application, "at least one" means one or more, and "a plurality of" means two or more. "At least one of the following items" or a similar expression means any combination of these items, including a single item or any combination of several items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple.
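The enumeration for "at least one of a, b, or c" is simply the set of non-empty combinations of the listed items. A short Python sketch (the function name `at_least_one` is introduced here only for illustration) reproduces the seven cases:

```python
from itertools import combinations


def at_least_one(items):
    """List every non-empty combination of items, joined with '-'."""
    return [
        "-".join(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


print(at_least_one(["a", "b", "c"]))
# ['a', 'b', 'c', 'a-b', 'a-c', 'b-c', 'a-b-c']
```

In general, n listed items yield 2^n - 1 such combinations.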
It should be understood that, in the various embodiments of this application, the sequence numbers of the foregoing processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such an implementation should not be considered beyond the scope of this application.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing describes only specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (28)

  1. A method for determining a chip state, characterized by comprising:
    acquiring status information of a target chip in a cluster;
    determining availability information of the target chip according to the status information, wherein the availability information indicates whether the target chip is available;
    outputting the availability information.
  2. The method according to claim 1, wherein the status information includes at least one of: the model of the target chip, the utilization of the target chip, the bandwidth of the target chip, the video memory utilization of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
  3. The method according to claim 1 or 2, wherein the availability information includes at least one of normal, general alarm, major alarm, and critical alarm.
  4. The method according to any one of claims 1 to 3, wherein the method further comprises:
    acquiring frequency information of the target chip, wherein the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  5. The method according to claim 4, wherein the method further comprises:
    outputting the frequency information of the target chip.
  6. The method according to claim 4 or 5, wherein the determining availability information of the target chip according to the status information comprises:
    determining the availability information of the target chip according to the status information and the frequency information.
  7. The method according to claim 4 or 5, wherein the method further comprises:
    determining frequency availability information of the target chip according to the frequency information, wherein the frequency availability information indicates whether the frequency of the target chip meets a preset condition;
    outputting the frequency availability information.
  8. A method for scheduling cluster resources, characterized by comprising:
    acquiring availability information of a target chip in a cluster, wherein the availability information indicates whether the target chip is available;
    allocating the target chip according to the availability information.
  9. The method according to claim 8, wherein the availability information includes at least one of normal, general alarm, major alarm, and critical alarm.
  10. The method according to claim 8 or 9, wherein the method further comprises:
    acquiring frequency information of the target chip, wherein the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  11. The method according to claim 10, wherein the allocating the target chip according to the availability information comprises:
    determining the impact of the target chip on the computing performance of the cluster according to the availability information, the frequency information, and the chips already allocated in the cluster;
    allocating the target chip when it is determined that the impact is that the computing performance of the cluster does not decrease.
  12. The method according to any one of claims 8 to 11, wherein the method further comprises:
    acquiring frequency availability information of the target chip;
    wherein the allocating the target chip according to the availability information comprises:
    allocating the target chip according to the availability information and the frequency availability information.
  13. An apparatus for determining a chip state, characterized by comprising:
    an acquiring unit, configured to acquire status information of a target chip in a cluster;
    a determining unit, configured to determine availability information of the target chip according to the status information, wherein the availability information indicates whether the target chip is available;
    an output unit, configured to output the availability information.
  14. The apparatus according to claim 13, wherein the status information includes at least one of: the model of the target chip, the utilization of the target chip, the bandwidth of the target chip, the video memory utilization of the target chip, the processes running on the target chip, the power of the target chip, the voltage of the target chip, or the temperature of the target chip.
  15. The apparatus according to claim 13 or 14, wherein the availability information includes at least one of normal, general alarm, major alarm, and critical alarm.
  16. The apparatus according to any one of claims 13 to 15, wherein the acquiring unit is further configured to:
    acquire frequency information of the target chip, wherein the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  17. The apparatus according to claim 16, wherein the output unit is further configured to:
    output the frequency information of the target chip.
  18. The apparatus according to claim 16 or 17, wherein the determining unit is specifically configured to:
    determine the availability information of the target chip according to the status information and the frequency information.
  19. The apparatus according to claim 16 or 17, wherein the determining unit is further configured to:
    determine frequency availability information of the target chip according to the frequency information, wherein the frequency availability information indicates whether the frequency of the target chip meets a preset condition;
    and the output unit is further configured to output the frequency availability information.
  20. An apparatus for scheduling cluster resources, characterized by comprising:
    an acquiring unit, configured to acquire availability information of a target chip in a cluster, wherein the availability information indicates whether the target chip is available;
    an allocation unit, configured to allocate the target chip according to the availability information.
  21. The apparatus according to claim 20, wherein the availability information includes at least one of normal, general alarm, major alarm, and critical alarm.
  22. The apparatus according to claim 20 or 21, wherein the acquiring unit is further configured to:
    acquire frequency information of the target chip, wherein the frequency information includes at least one of the rated frequency of the target chip, the real-time frequency of the target chip, and the average frequency of the target chip within a preset time interval.
  23. The apparatus according to claim 22, wherein the allocation unit is specifically configured to:
    determine the impact of the target chip on the computing performance of the cluster according to the availability information, the frequency information, and the chips already allocated in the cluster;
    allocate the target chip when it is determined that the impact is that the computing performance of the cluster does not decrease.
  24. The apparatus according to any one of claims 20 to 23, wherein the acquiring unit is further configured to:
    acquire frequency availability information of the target chip;
    and the allocation unit is specifically configured to:
    allocate the target chip according to the availability information and the frequency availability information.
  25. A chip, characterized in that the chip comprises a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method according to any one of claims 1 to 7.
  26. An apparatus for scheduling cluster resources, characterized by comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 8 to 12.
  27. A server, comprising the apparatus for scheduling cluster resources according to any one of claims 20 to 24 or claim 26.
  28. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code to be executed by a device, and the program code includes instructions for performing the method according to any one of claims 1 to 7 or claims 8 to 12.
PCT/CN2020/071989 2020-01-14 2020-01-14 Chip state determining method and device, and cluster resource scheduling method and device WO2021142614A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080093867.XA CN114981778A (en) 2020-01-14 2020-01-14 Method for determining chip state, method for scheduling cluster resources and device thereof
PCT/CN2020/071989 WO2021142614A1 (en) 2020-01-14 2020-01-14 Chip state determining method and device, and cluster resource scheduling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/071989 WO2021142614A1 (en) 2020-01-14 2020-01-14 Chip state determining method and device, and cluster resource scheduling method and device

Publications (1)

Publication Number Publication Date
WO2021142614A1 true WO2021142614A1 (en) 2021-07-22

Family

ID=76863452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/071989 WO2021142614A1 (en) 2020-01-14 2020-01-14 Chip state determining method and device, and cluster resource scheduling method and device

Country Status (2)

Country Link
CN (1) CN114981778A (en)
WO (1) WO2021142614A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060053326A1 (en) * 2004-09-03 2006-03-09 Intel Corporation Coordinating idle state transitions in multi-core processors
CN101309208A (en) * 2008-06-21 2008-11-19 华中科技大学 Job scheduling system suitable for grid environment and based on reliable expense
CN101403982A (en) * 2008-11-03 2009-04-08 华为技术有限公司 Task distribution method, system and equipment for multi-core processor
CN104253850A (en) * 2014-01-07 2014-12-31 深圳市华傲数据技术有限公司 Distributed task scheduling method and system
CN106155811A (en) * 2015-04-28 2016-11-23 阿里巴巴集团控股有限公司 Graphic processing facility, resource service device, resource regulating method and device
CN106407013A (en) * 2016-09-30 2017-02-15 郑州云海信息技术有限公司 Resource dynamic dispatching method, apparatus and system, and resource dispatching server

Also Published As

Publication number Publication date
CN114981778A (en) 2022-08-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20914468

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20914468

Country of ref document: EP

Kind code of ref document: A1