WO2023040203A1 - Data acquisition method, apparatus, device, and medium for an artificial intelligence platform

Info

Publication number
WO2023040203A1
Authority
WIPO (PCT)
Application number
PCT/CN2022/078400
Inventor
姬贵阳
Original Assignee
苏州浪潮智能科技有限公司
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023040203A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • H04L67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G06F16/1824: Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183: Provision of network file services by network file servers, e.g. by using NFS, CIFS

Definitions

  • the present application relates to the technical field of data acquisition, and in particular to a data acquisition method, device, equipment, and medium for an artificial intelligence platform.
  • With the continuous development of AI (Artificial Intelligence) technology, a variety of artificial intelligence clusters are constantly emerging on the market.
  • An important basic function of artificial intelligence clusters is file operations, including local download and caching of data sets, file reading during training, and writing of training task logs.
  • In view of this, the purpose of this application is to provide a data acquisition method, device, equipment, and medium for an artificial intelligence platform, which can reduce the network and disk pressure of the main storage node in the artificial intelligence cluster, make full use of the network resources between computing nodes, and enhance the utilization rate of artificial intelligence cluster resources.
  • the specific plan is as follows:
  • The present application discloses a data acquisition method of an artificial intelligence platform, which is applied to an artificial intelligence cluster including a main storage node and multiple computing nodes, including: obtaining a data operation request initiated by a target node in the artificial intelligence cluster for target data, the target node being any one of the computing nodes; counting the current data operation task pressure of each of the other computing nodes; traversing all the other computing nodes in ascending order of the current data operation task pressure and, in each traversal, judging whether the currently traversed computing node has saved the target data; and, if the target data has been saved in the currently traversed computing node, transmitting the target data in that computing node to the target node through a shared storage network pre-built between different nodes of the artificial intelligence cluster based on remote direct data access technology, so that the target node performs a corresponding operation on the target data according to the data operation request.
  • The data acquisition method of the artificial intelligence platform further includes: if the target data is not saved in any of the traversed computing nodes, transmitting the target data pre-stored in the main storage node to the target node through the shared storage network pre-built between different nodes of the artificial intelligence cluster based on remote direct data access technology, so that the target node performs the corresponding operation on the target data according to the data operation request.
  • The process of building the shared storage network includes: building, between different nodes of the artificial intelligence cluster, a network file system shared storage network with a fully connected network structure.
  • The counting of the current data operation task pressure of each of the other computing nodes includes: monitoring whether any of the other computing nodes fails, and, if a computing node fails, setting the current data operation task pressure of the faulty computing node to infinity.
  • The counting of the current data operation task pressure of each of the other computing nodes includes: monitoring the current task quantity of each of the other computing nodes to obtain the current data operation task pressure of that computing node.
  • The process of sequentially traversing all the other computing nodes in ascending order of the current data operation task pressure further includes: if multiple computing nodes have the same current task quantity, traversing those computing nodes sequentially in descending order of their current data processing capabilities.
  • The counting of the current data operation task pressure of each of the other computing nodes includes: determining the total amount of data to be processed and the current data processing capability of all current data operation tasks of each of the other computing nodes, and determining the current data operation task pressure of each of the other computing nodes on that basis.
  • the present application discloses a data acquisition device for an artificial intelligence platform, which is applied to an artificial intelligence cluster including a main storage node and multiple computing nodes, including:
  • a request acquisition module configured to acquire a data operation request initiated by a target node in the artificial intelligence cluster for target data; the target node is any one of the computing nodes in the artificial intelligence cluster;
  • a statistical module configured to count the current data operation task pressure of each of the other computing nodes;
  • a traversal module configured to traverse all the other computing nodes in ascending order of the current data operation task pressure and, in each traversal, judge whether the currently traversed computing node has saved the target data;
  • a data transmission module configured to, if the target data is already stored in the currently traversed computing node, transmit the target data in that computing node to the target node through a shared storage network pre-built between different nodes of the artificial intelligence cluster based on remote direct data access technology, so that the target node performs a corresponding operation on the target data according to the data operation request.
  • The present application discloses an electronic device, including a processor and a memory; when the processor executes the computer program stored in the memory, the aforementioned data acquisition method of the artificial intelligence platform is implemented.
  • the present application discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, the aforementioned data acquisition method of the artificial intelligence platform is implemented.
  • It can be seen that this application first obtains the data operation request initiated by the target node in the artificial intelligence cluster for the target data, then counts the current data operation task pressure of each of the other computing nodes, traverses all the other computing nodes in ascending order of that pressure and, in each traversal, judges whether the currently traversed computing node has already saved the target data; if it has, the shared storage network transmits the target data in that computing node to the target node, so that the target node performs the corresponding operation on the target data according to the data operation request.
  • In this way, the present application uses a shared storage network built on remote direct data access technology to interconnect the nodes, enabling data transmission between different nodes in the artificial intelligence cluster; this makes full use of the network and disk capacity of the artificial intelligence cluster, reduces the network and disk pressure on the main storage node, and ensures the stable operation of the platforms related to the artificial intelligence cluster.
  • FIG. 1 is a flowchart of a data acquisition method of an artificial intelligence platform disclosed in the present application;
  • FIG. 2 is a schematic structural diagram of a shared storage network with a faulty node disclosed in the present application;
  • FIG. 3 is a schematic diagram of a shared storage network with a fully connected network structure disclosed in the present application;
  • FIG. 4 is a flowchart of a specific data acquisition method of an artificial intelligence platform disclosed in the present application;
  • FIG. 5 is a schematic diagram of a specific characterization of data operation task pressure disclosed in the present application;
  • FIG. 6 is a flowchart of a specific data acquisition method of an artificial intelligence platform disclosed in the present application;
  • FIG. 7 is a flowchart of a specific data acquisition method of an artificial intelligence platform disclosed in the present application;
  • FIG. 8 is a schematic structural diagram of a data acquisition device of an artificial intelligence platform disclosed in the present application;
  • FIG. 9 is a structural diagram of an electronic device disclosed in the present application.
  • the embodiment of the present application discloses a data acquisition method of an artificial intelligence platform, as shown in FIG. 1, the method includes:
  • Step S11 Obtain the data operation request initiated by the target node in the artificial intelligence cluster for the target data; the target node is any one of the computing nodes in the artificial intelligence cluster.
  • the artificial intelligence cluster mainly includes a main storage node and all other nodes except the main storage node, that is, the computing nodes.
  • The target data includes, but is not limited to, information such as training scripts, training model files, training log information, and data set files; this information is pre-stored in the main storage node and can also be stored in any one or more of the computing nodes.
  • the main storage node is the entry node of the entire artificial intelligence cluster, that is, any target data required by the computing node can be obtained from the main storage node.
  • a data operation request initiated by a pre-specified target node in the artificial intelligence cluster for target data is obtained.
  • the target node is any one of the computing nodes in the artificial intelligence cluster; the data operation request is initiated for the target data, and the data operation requests corresponding to different target data are also different.
  • the corresponding data operation request may be a transfer operation request for the dataset file.
  • Step S12 Count the current data operation task pressure of each of the other computing nodes.
  • the current data operation task pressure mainly includes, but is not limited to, the computing node's processing capability for the current data operation task and the total amount of data required to be processed.
  • The counting of the current data operation task pressure of each of the other computing nodes may specifically include: monitoring whether any of the other computing nodes fails; and, if any of the other computing nodes fails, setting the current data operation task pressure of the faulty computing node to infinity.
  • It can be understood that when a computing node fails, the current data operation task pressure of that node can be set to infinity, meaning that the target data will not be queried from the faulty computing node. For example, referring to FIG. 2, if a software and/or hardware failure occurs in computing node 2, the current data operation task pressure of computing node 2 is set to infinity.
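The fault-handling rule above can be sketched in Python; the function name and dictionary shapes are illustrative assumptions, not part of the patent:

```python
import math

def apply_fault_pressure(pressures, node_status):
    """Set the task pressure of any failed node to infinity so that the
    ascending-pressure traversal never selects it as a data source.

    pressures:   dict mapping node name -> current data operation task pressure
    node_status: dict mapping node name -> True if the node is healthy
    (Both structures are hypothetical; the patent does not fix an API.)
    """
    for node, healthy in node_status.items():
        if not healthy:
            pressures[node] = math.inf  # faulty node: never queried
    return pressures
```

Because `math.inf` compares greater than any finite pressure, faulty nodes naturally sort last without needing a separate exclusion list.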
  • Step S13 Traverse all the other computing nodes in ascending order of the current data operation task pressure, and during each traversal process, determine whether the currently traversed computing node has saved the target data.
  • It should be pointed out that all the other computing nodes can be arranged in ascending order of the current data operation task pressure and traversed sequentially, starting from the computing node with the lowest pressure; during each traversal, it is judged whether data identical to the target data already exists in the currently traversed computing node.
  • Step S14 If the target data has been saved in the currently traversed computing node, the target data in that computing node is transmitted to the target node through a shared storage network pre-built between different nodes of the artificial intelligence cluster based on remote direct data access (RDMA, Remote Direct Memory Access) technology, so that the target node performs corresponding operations on the target data according to the data operation request.
  • It should be noted that all the other computing nodes are traversed sequentially in ascending order of the current data operation task pressure; once a traversed computing node is found to hold the target data, the target data in that computing node is sent to the corresponding target node, and the target node, after acquiring the target data, performs the corresponding data operations on it according to the data operation request.
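Steps S12 to S14 amount to a sorted traversal with an early exit. The following Python sketch illustrates the idea under assumed data structures; the node names, `pressures`, and `holdings` mappings are hypothetical:

```python
import math

def fetch_target_data(target_node, nodes, pressures, holdings, target):
    """Minimal sketch of steps S12-S14: traverse the other computing nodes
    in ascending order of task pressure and return the first node that
    already holds the target data, i.e. the node that would serve it over
    the shared RDMA storage network.

    nodes:     list of all computing-node names in the cluster
    pressures: node -> current data operation task pressure
    holdings:  node -> set of data items cached on that node
    """
    candidates = [n for n in nodes if n != target_node]
    for node in sorted(candidates, key=lambda n: pressures.get(n, math.inf)):
        if pressures.get(node, math.inf) == math.inf:
            continue  # faulty node (pressure set to infinity), skip it
        if target in holdings.get(node, set()):
            return node  # this node transmits the target data
    return None  # no compute node holds the data (main-storage fallback case)
```

A `None` result corresponds to the fallback described later, where the main storage node serves the data.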
  • The building process of the shared storage network may specifically include: building, between different nodes of the artificial intelligence cluster and based on remote direct data access technology and InfiniBand, a network file system shared storage network with a fully connected network structure. That is, the network is built as a fully connected network file system (NFS, Network File System) shared storage network in which any two nodes can communicate with each other. As shown in FIG. 3, any two nodes among the main storage node, computing node 1, computing node 2, and computing node 3 are interconnected and mutually reachable.
  • In this way, the designated target node can communicate with all the other computing nodes in the artificial intelligence cluster through the shared storage network, so that during the acquisition of the target data, data identical to the target data can be queried on all the other computing nodes and the queried data can be sent to the target node. Further, as shown in FIG. 2, if a software and/or hardware failure is detected in computing node 2, computing node 1 and computing node 3 can still be queried through the fully connected shared storage network for data identical to the target data.
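The fully connected structure of FIG. 3 can be pictured as every node mounting every other node's storage export. A tiny illustrative sketch follows; the function name and the (client, server) mount-pair representation are assumptions for illustration only:

```python
from itertools import permutations

def full_mesh_mounts(nodes):
    """Sketch of the fully connected NFS-over-RDMA layout of FIG. 3:
    every node exports its storage and mounts every other node's export,
    so any two nodes can reach each other directly.
    Returns directed (client, server) mount pairs.
    """
    return [(client, server) for client, server in permutations(nodes, 2)]
```

For the four nodes of FIG. 3 this yields 4 x 3 = 12 directed mounts; a single faulty node removes only the links that touch it, leaving the remaining nodes mutually reachable.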
  • In this embodiment, the data operation request initiated by the target node in the artificial intelligence cluster for the target data is obtained first, the target node being any one of the computing nodes; the current data operation task pressure of each of the other computing nodes is then counted, and the nodes are traversed in ascending order of that pressure; if the currently traversed computing node holds the target data, the shared storage network transmits the target data in that node to the target node, so that the target node performs a corresponding operation on the target data according to the data operation request. By interconnecting the nodes through the shared storage network pre-built on remote direct data access technology, this application realizes data transmission between different nodes in the artificial intelligence cluster and efficiently utilizes the network resources of the nodes other than the main storage node, thereby reducing the network and disk pressure on the main storage node and ensuring the stable operation of the platforms related to the artificial intelligence cluster.
  • the embodiment of the present application discloses a specific data acquisition method of an artificial intelligence platform, as shown in FIG. 4 , the method includes:
  • Step S21 Obtain the data operation request initiated by the target node in the artificial intelligence cluster for the target data; the target node is any one of the computing nodes in the artificial intelligence cluster.
  • Step S22 Monitor the current number of tasks of each of the other computing nodes to obtain the current data operation task pressure of the computing nodes.
  • It should be pointed out that each of the other computing nodes can be monitored, the current task quantity of each node counted in real time, and that quantity used as the current data operation task pressure of the corresponding computing node. The current tasks include, but are not limited to, operations such as data set caching, data set reading, and training task log write-back.
  • The detected current task quantity can be used as a transmission "distance" of the target data for each computing node. For example, as shown in FIG. 5, when it is monitored that computing node 1 has no data set caching task, computing node 1 is an available node and the distance corresponding to its current data operation task pressure is set to 0; when it is monitored that computing node 2 currently has two data set caching tasks, computing node 2 is an available node and its corresponding distance is set to 2; when a failure of computing node 3 is detected, computing node 3 is unavailable and its corresponding distance is set to infinity. For the main storage node, the corresponding distance can be set to S. It can be understood that the specific value of the distance S set for the main storage node needs to ensure that, when none of the computing nodes holds the target data, the main storage node serves as the node that finally transmits the target data stored in it to the target node.
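The distance assignment of FIG. 5 can be sketched as follows. The concrete value chosen for S (here 100) is an assumption: the patent only requires S to be large enough that the main storage node is selected last, after every available computing node:

```python
import math

def node_distance(status, pending_tasks, is_main_storage=False, main_distance=100):
    """Illustrative distance assignment following FIG. 5:
    - an available computing node's distance equals its pending task count;
    - a failed node gets infinity (never selected);
    - the main storage node gets a fixed value S (main_distance), assumed
      here to be 100, chosen larger than any realistic task count so it is
      only selected when no computing node holds the data.
    """
    if is_main_storage:
        return main_distance
    if status != "available":
        return math.inf
    return pending_tasks
```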
  • Step S23 Traverse all the other computing nodes in ascending order of current data operation task pressure, and during each traversal process, determine whether the currently traversed computing node has saved the target data.
  • It should be pointed out that all the other computing nodes can be arranged in ascending order of the current data operation task pressure and traversed sequentially; during each traversal, it is judged whether the currently traversed computing node has stored data identical to the target data. For example, computing node 1, computing node 4, and computing node 2 are traversed sequentially in ascending order of the current data set caching task pressure, and in each traversal it is judged whether the currently traversed computing node has already saved the target data.
  • The process of traversing all the other computing nodes in ascending order of the current data operation task pressure may specifically include: if multiple computing nodes have the same current task quantity, traversing those computing nodes sequentially in descending order of their current data processing capabilities.
  • the data processing capability includes but not limited to factors such as storage resources, network resources, and performance of the computing node.
  • For example, if two computing nodes have the same current task quantity during the traversal, their current data processing capabilities can be compared; the computing node with more idle storage resources and/or higher CPU (central processing unit) processing performance is traversed first.
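The tie-break described above reduces to a two-part sort key. A minimal sketch follows, assuming a single numeric `capability` score stands in for the storage, network, and CPU factors the patent mentions:

```python
def traversal_order(nodes, task_count, capability):
    """Sketch of the traversal order with this embodiment's tie-break:
    ascending by current task count, and among nodes with equal task
    counts, descending by current data processing capability.
    Both dict arguments are illustrative (node -> numeric value).
    """
    # Negating capability turns the descending secondary sort into an
    # ascending one, so a single sorted() call handles both criteria.
    return sorted(nodes, key=lambda n: (task_count[n], -capability[n]))
```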
  • Step S24 If the target data has been stored in the currently traversed computing node, the target data in that computing node is transmitted to the target node through the shared storage network pre-built between different nodes of the artificial intelligence cluster based on remote direct data access technology, so that the target node performs a corresponding operation on the target data according to the data operation request.
  • It can be seen that in this embodiment, the current data operation task pressure of each computing node is obtained by monitoring its current task quantity; when multiple computing nodes have the same current task quantity, they are traversed in descending order of their current data processing capabilities. Refining the traversal order by data processing capability makes full use of the network resources among the computing nodes, enhances the utilization rate of artificial intelligence cluster resources, improves the overall computing resource utilization efficiency of the cluster, and balances the pressure load.
  • the embodiment of the present application discloses a specific data acquisition method of an artificial intelligence platform, as shown in FIG. 6, the method includes:
  • Step S31 Obtain the data operation request initiated by the target node in the artificial intelligence cluster for the target data; the target node is any one of the computing nodes in the artificial intelligence cluster.
  • Step S32 Determine the total amount of data to be processed and the current data processing capacity of all current data operation tasks of each of the other computing nodes.
  • It should be pointed out that statistics can be collected on the total amount of data to be processed and the current data processing capability of all current data operation tasks of each of the other computing nodes, so as to determine these two quantities for each node.
  • The total amount of data to be processed is the sum of the data amounts of the tasks currently to be processed by the computing node.
  • Step S33 Based on the total amount of data to be processed and the current data processing capability of all the current data operation tasks of the computing node, determine the current data operation task pressure of each of the other computing nodes.
  • It should be pointed out that the pressure value corresponding to the current data operation task pressure of each of the other computing nodes can be determined from the total amount of data to be processed and the current data processing capability. For example, the smaller the total amount of data to be processed of all current data operation tasks of a computing node and the stronger its current data processing capability, the lower the current data operation task pressure of that computing node.
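One plausible concrete reading of this pressure definition is a simple ratio. The formula below is an assumption for illustration; the patent states only the qualitative relationship (pressure rises with pending data and falls with capability):

```python
def task_pressure(total_pending_bytes, processing_capability):
    """Assumed concrete pressure formula: pressure grows with the total
    amount of data awaiting processing and shrinks as the node's current
    processing capability grows. The ratio is one plausible choice, not
    the patent's definition.
    """
    if processing_capability <= 0:
        raise ValueError("processing capability must be positive")
    return total_pending_bytes / processing_capability
```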
  • Step S34 Traverse all the other computing nodes in ascending order of the current data operation task pressure, and in each traversal judge whether the currently traversed computing node has saved the target data.
  • Step S35 If the target data has been saved in the currently traversed computing node, the target data in that computing node is transmitted to the target node through the shared storage network pre-built between different nodes of the artificial intelligence cluster based on remote direct data access technology, so that the target node performs a corresponding operation on the target data according to the data operation request.
  • It can be seen that the embodiment of the present application first determines the total amount of data to be processed and the current data processing capability of all current data operation tasks of each of the other computing nodes, and on that basis determines the current data operation task pressure of each node. In this way, the network and storage resources between nodes are fully utilized, improving performance and speed. This is well suited to business scenarios with massive files in AI clusters: it enhances the resource utilization of AI clusters, improves the efficiency of model training, and improves the overall computing resource utilization efficiency of AI clusters.
  • the embodiment of the present application discloses a specific data acquisition method of an artificial intelligence platform, as shown in FIG. 7 , the method includes:
  • Step S41 Obtain the data operation request initiated by the target node in the artificial intelligence cluster for the target data; the target node is any one of the computing nodes in the artificial intelligence cluster.
  • Step S42 Count the current data operation task pressure of each of the other computing nodes.
  • Step S43 Traverse all the other computing nodes in ascending order of the current data operation task pressure, and in each traversal judge whether the currently traversed computing node has saved the target data.
  • Step S44 If the target data is not saved in any of the traversed computing nodes, the target data pre-stored in the main storage node is transmitted to the target node through the shared storage network built in advance between different nodes of the artificial intelligence cluster based on remote direct data access technology, so that the target node performs a corresponding operation on the target data according to the data operation request.
  • It should be pointed out that if, in the process of sequentially traversing all the other computing nodes, none of the traversed computing nodes holds data identical to the target data, the target data pre-saved in the main storage node is sent to the target node through the shared storage network built on remote direct data access technology between different nodes of the artificial intelligence cluster; that is, the main storage node is finally used as the node from which the target data is obtained. After the target node acquires the target data, it may perform the corresponding data operation on it according to the data operation request.
  • In this way, when none of the computing nodes holds the target data, the shared storage network built in advance between different nodes of the artificial intelligence cluster based on remote direct data access technology transmits the target data pre-stored in the main storage node to the target node, ensuring that the target data is still acquired, via the main storage node, in the absence of the target data on the computing nodes.
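Putting the sorted traversal and the main-storage fallback of FIG. 7 together gives the following sketch; all names and data shapes are illustrative assumptions:

```python
import math

def select_source_node(target_node, compute_nodes, pressures, holdings,
                       target, main_storage="main"):
    """Sketch of the complete selection logic of FIG. 7: try the other
    computing nodes in ascending order of task pressure, and fall back to
    the main storage node (which always holds the data, per the patent)
    when no computing node has a cached copy.
    """
    others = [n for n in compute_nodes if n != target_node]
    for node in sorted(others, key=lambda n: pressures.get(n, math.inf)):
        if pressures.get(node, math.inf) == math.inf:
            continue  # faulty or unknown node, skip
        if target in holdings.get(node, set()):
            return node
    return main_storage  # Step S44: the main storage node serves the data
```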
  • the embodiment of the present application also discloses a data acquisition device for an artificial intelligence platform, which is applied to an artificial intelligence cluster including a main storage node and multiple computing nodes, as shown in FIG. 8 , the device includes:
  • the request acquisition module 11 is configured to acquire a data operation request initiated by a target node in the artificial intelligence cluster for target data; the target node is any one of the computing nodes in the artificial intelligence cluster;
  • a statistical module 12 configured to count the current data operation task pressure of each of the other computing nodes;
  • a traversal module 13 configured to traverse all the other computing nodes in ascending order of the current data operation task pressure and, in each traversal, determine whether the currently traversed computing node has saved the target data;
  • a data transmission module 14 configured to, if the target data has already been stored in the currently traversed computing node, transmit the target data in that computing node to the target node through the shared storage network pre-built between different nodes of the artificial intelligence cluster based on remote direct data access technology, so that the target node performs corresponding operations on the target data according to the data operation request.
  • In this embodiment, the data operation request initiated by the target node in the artificial intelligence cluster for the target data is obtained first, the target node being any one of the computing nodes; the current data operation task pressure of each of the other computing nodes is then counted and the nodes are traversed in ascending order of that pressure; if the currently traversed computing node holds the target data, the shared storage network transmits the target data in that node to the target node, so that the target node performs a corresponding operation on the target data according to the data operation request. By interconnecting the nodes through the shared storage network built on remote direct data access technology, this application realizes mutual data transmission between different nodes in the artificial intelligence cluster, makes full use of the network and disk capacity of the cluster, reduces the network and storage pressure on the main storage node, and ensures the stable operation of the platforms related to the artificial intelligence cluster.
  • the data acquisition device of the artificial intelligence platform may also include:
  • a first data transmission unit configured to, when the target data is not saved in any of the traversed computing nodes, transmit the target data pre-stored in the main storage node to the target node through the shared storage network built in advance between different nodes of the artificial intelligence cluster based on remote direct data access technology, so that the target node performs a corresponding operation on the target data according to the data operation request.
  • the process of building the shared storage network may specifically include:
  • a network building unit configured to build, between different nodes of the artificial intelligence cluster and based on remote direct data access technology and InfiniBand, a network file system shared storage network with a fully connected network structure.
  • the statistics module 12 may specifically include:
  • a first monitoring unit configured to monitor whether any of the other computing nodes fails;
  • a setting unit configured to set the current data operation task pressure of the faulty computing node to infinity when any of the other computing nodes fails.
  • the statistics module 12 may specifically include:
  • the second monitoring unit is configured to monitor the current number of tasks of each of the other computing nodes to obtain the current data operation task pressure of the computing nodes.
  • The traversal module 13 may further include:
  • the first traversal unit is configured to, if there are multiple computing nodes all having the same current number of tasks, sequentially traverse the multiple computing nodes according to the descending order of the current data processing capabilities of the computing nodes.
  • the statistics module 12 may specifically include:
  • the first determination unit is configured to determine the total amount of data to be processed and the current data processing capacity of all current data operation tasks of each of the other computing nodes;
  • the second determining unit is configured to determine the current data operation task pressure of each of the other computing nodes based on the total amount of data to be processed and the current data processing capability of all current data operation tasks of the computing node.
  • FIG. 9 is a schematic structural diagram of an electronic device 20 provided by an embodiment of the present application, shown according to an exemplary embodiment; the content of the figure should not be regarded as any limitation on the scope of this application.
  • the electronic device 20 may specifically include: at least one processor 21 , at least one memory 22 , a power supply 23 , a communication interface 24 , an input/output interface 25 and a communication bus 26 .
  • the memory 22 is used to store a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps in the data acquisition method of the artificial intelligence platform disclosed in any of the foregoing embodiments.
  • the electronic device 20 in this embodiment may specifically be an electronic computer.
  • the power supply 23 is used to provide working voltage for each hardware device on the electronic device 20;
  • the communication interface 24 can create a data transmission channel between the electronic device 20 and external devices; the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application and is not specifically limited here;
  • the input/output interface 25 is used to obtain data input from the outside or to output data to the outside; its specific interface type can be selected according to the needs of the specific application and is not specifically limited here.
  • the memory 22, as the carrier of resource storage, can be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like; the resources stored on it can include an operating system 221, a computer program 222, etc., and the storage can be transient or permanent.
  • the operating system 221 is used to manage and control each hardware device on the electronic device 20 and the computer program 222, which may be Windows Server, Netware, Unix, Linux, etc.
  • the computer program 222 may further include computer programs that can be used to complete other specific tasks.
  • the present application also discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, the aforementioned data acquisition method of the artificial intelligence platform is implemented.
  • a computer program when executed by a processor, the aforementioned data acquisition method of the artificial intelligence platform is implemented.
  • the specific steps of the method reference may be made to the corresponding content disclosed in the foregoing embodiments, and details are not repeated here.
  • each embodiment in this specification is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same or similar parts of each embodiment can be referred to each other.
  • the description is relatively simple, and for the related information, please refer to the description of the method part.
  • RAM (random access memory)
  • ROM (read-only memory)
  • EPROM (electrically programmable ROM)
  • EEPROM (electrically erasable programmable ROM)
  • registers, hard disk, removable disk, CD-ROM, or any other known form of storage medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a data acquisition method, apparatus, device, and medium for an artificial intelligence platform, including: acquiring a data operation request initiated by a target node in an artificial intelligence cluster for target data; counting the current data-operation task pressure of each of the other compute nodes; traversing all of the other compute nodes in ascending order of current task pressure and, during the traversal, determining whether the currently visited compute node already stores the target data; and, if it does, transmitting the target data from that node to the target node through a shared storage network built in advance between the different nodes of the cluster based on remote direct memory access (RDMA) technology. The pre-built RDMA shared storage network enables data transfer between all nodes in the cluster, which reduces the disk and network pressure on the primary storage node and keeps the cluster stable and efficient.

Description

Data Acquisition Method, Apparatus, Device, and Medium for an Artificial Intelligence Platform
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on September 18, 2021, with application number 202111096227.7 and invention title "Data acquisition method, apparatus, device, and medium for an artificial intelligence platform", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of data acquisition, and in particular to a data acquisition method, apparatus, device, and medium for an artificial intelligence platform.
Background Art
With the rapid development of artificial-intelligence-related industries, researchers in enterprises and universities demand ever more computing power, and the construction of artificial intelligence (AI) clusters effectively meets this demand. A wide variety of AI clusters keep emerging on the market. An important basic function of an AI cluster is file operation, including local download and caching of datasets, reading files during training, writing back training-task logs, moving files, and so on. All of these depend on the cluster's storage resources, and large-scale AI clusters place very high demands on storage and network, with frequent I/O (Input/Output) operations. How to perform massive file operations in an AI cluster without degrading its performance has become a primary problem to solve, one that directly affects the working efficiency of users running training tasks on the cluster.
However, most current AI clusters use a single node for storage, or use external storage, in a one-to-many storage design: a single shared store is mounted on every compute node in the cluster. The disadvantage is obvious: all the network pressure and disk I/O pressure fall on a single node, which makes cluster resource usage inefficient and wastes cluster resources. As the number of nodes grows, the pressure on the primary storage node grows with it, making this design entirely unsuited to the ever-increasing scale of AI clusters. Moreover, an AI cluster contains massive dataset files that are non-critical, backed-up files; placing these datasets on the primary storage node, whether for transfers within user directories or for local caching, wastes cluster resources and leaves storage and network resources underutilized.
Summary of the Invention
In view of this, the purpose of this application is to provide a data acquisition method, apparatus, device, and medium for an artificial intelligence platform that can reduce the network and disk pressure on the primary storage node of an AI cluster, make full use of the network resources between the compute nodes, and improve the utilization of cluster resources. The specific scheme is as follows:
In a first aspect, this application discloses a data acquisition method for an artificial intelligence platform, applied to an AI cluster comprising a primary storage node and multiple compute nodes, including:
acquiring a data operation request initiated by a target node in the AI cluster for target data, the target node being any one of the compute nodes in the cluster;
counting the current data-operation task pressure of each of the other compute nodes;
traversing all of the other compute nodes in ascending order of current data-operation task pressure and, during each step of the traversal, determining whether the currently visited compute node already stores the target data;
if the currently visited compute node already stores the target data, transmitting the target data from that node to the target node through a shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation on the target data according to the data operation request.
Optionally, the data acquisition method for an artificial intelligence platform further includes:
if none of the traversed compute nodes stores the target data, transmitting the target data pre-stored on the primary storage node to the target node through the shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation on the target data according to the data operation request.
Optionally, the process of building the shared storage network includes:
building, between the different nodes of the AI cluster and based on remote direct memory access technology and InfiniBand technology, a Network File System shared storage network whose topology is fully connected.
Optionally, counting the current data-operation task pressure of each of the other compute nodes includes:
monitoring whether any of the other compute nodes has failed;
if any of the other compute nodes has failed, setting the current data-operation task pressure of the failed compute node to infinity.
Optionally, counting the current data-operation task pressure of each of the other compute nodes includes:
monitoring the current number of tasks on each of the other compute nodes to obtain its current data-operation task pressure.
Optionally, the process of traversing all of the other compute nodes in ascending order of current data-operation task pressure further includes:
if multiple compute nodes have the same current number of tasks, traversing those compute nodes in descending order of their current data processing capability.
Optionally, counting the current data-operation task pressure of each of the other compute nodes includes:
determining, for each of the other compute nodes, the total amount of pending data of all its current data-operation tasks and its current data processing capability;
determining the current data-operation task pressure of each of the other compute nodes based on the total amount of pending data of all its current data-operation tasks and its current data processing capability.
In a second aspect, this application discloses a data acquisition apparatus for an artificial intelligence platform, applied to an AI cluster comprising a primary storage node and multiple compute nodes, including:
a request acquisition module configured to acquire a data operation request initiated by a target node in the AI cluster for target data, the target node being any one of the compute nodes in the cluster;
a statistics module configured to count the current data-operation task pressure of each of the other compute nodes;
a traversal module configured to traverse all of the other compute nodes in ascending order of current data-operation task pressure and, during each step of the traversal, determine whether the currently visited compute node already stores the target data;
a data transmission module configured to, if the currently visited compute node already stores the target data, transmit the target data from that node to the target node through a shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation on the target data according to the data operation request.
In a third aspect, this application discloses an electronic device comprising a processor and a memory, wherein the processor implements the aforementioned data acquisition method for an artificial intelligence platform when executing the computer program stored in the memory.
In a fourth aspect, this application discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the aforementioned data acquisition method for an artificial intelligence platform.
As can be seen, this application first acquires the data operation request initiated by a target node in the AI cluster for target data, the target node being any compute node in the cluster; then counts the current data-operation task pressure of each of the other compute nodes; then traverses all of the other compute nodes in ascending order of current task pressure, determining during each step whether the currently visited node already stores the target data; and, if it does, transmits the target data from the currently visited node to the target node through the shared storage network built in advance between the different nodes of the cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation on the target data according to the request. Because the pre-built RDMA shared storage network interconnects all the nodes, data can be transferred between the different nodes of the AI cluster, which makes full use of the cluster's network and disk resources, reduces the network and disk pressure on the primary storage in the cluster, and keeps the services of the related AI platforms running stably.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a data acquisition method for an artificial intelligence platform disclosed in this application;
FIG. 2 is a schematic diagram of a shared-storage network structure containing a failed node disclosed in this application;
FIG. 3 is a schematic diagram of a fully connected shared-storage network structure disclosed in this application;
FIG. 4 is a flowchart of a specific data acquisition method for an artificial intelligence platform disclosed in this application;
FIG. 5 is a schematic diagram of a specific representation of data-operation task pressure disclosed in this application;
FIG. 6 is a flowchart of a specific data acquisition method for an artificial intelligence platform disclosed in this application;
FIG. 7 is a flowchart of a specific data acquisition method for an artificial intelligence platform disclosed in this application;
FIG. 8 is a schematic structural diagram of a data acquisition apparatus for an artificial intelligence platform disclosed in this application;
FIG. 9 is a structural diagram of an electronic device disclosed in this application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
An embodiment of this application discloses a data acquisition method for an artificial intelligence platform. As shown in FIG. 1, the method includes:
Step S11: acquire a data operation request initiated by a target node in the AI cluster for target data; the target node is any one of the compute nodes in the cluster.
In this embodiment, the AI cluster mainly comprises a primary storage node and all the other nodes besides it, namely the compute nodes. The target data includes, but is not limited to, training scripts, trained model files, training log records, dataset files, and similar information. This information is pre-stored on the primary storage node and may also be stored on any one or more compute nodes. Meanwhile, the primary storage node is the entry node of the whole AI cluster, i.e. the target data needed by any compute node can always be obtained from the primary storage node.
Specifically, in this embodiment, a data operation request initiated for target data by a pre-designated target node in the AI cluster is first acquired. The target node is any one of the compute nodes in the cluster; the data operation request is issued for particular target data, and different target data correspond to different data operation requests. For example, when the target data is a dataset file, the corresponding request may be a dataset-file transfer request.
Step S12: count the current data-operation task pressure of each of the other compute nodes.
In this embodiment, after the data operation request initiated by the target node for target data has been acquired, the current data-operation task pressure of all compute nodes other than the target node must be counted. It should be understood that the current task pressure mainly includes, but is not limited to, a node's capability to handle its current data-operation tasks and the total amount of data it has to process. Specifically, the stronger a node's capability to handle its current tasks, the lower its current task pressure; the weaker that capability, the higher the pressure; the larger the total amount of data the node currently has to process, the higher the pressure; and the smaller that total, the lower the pressure.
In this embodiment, counting the current data-operation task pressure of each of the other compute nodes may specifically include: monitoring whether any of the other compute nodes has failed; and, if so, setting the current task pressure of the failed node to infinity. Specifically, while counting the task pressure of each of the other compute nodes, when any compute node is observed to fail, for example because it is damaged or has been shut down and taken offline, its current task pressure can be set to infinity, so that the target data will never be looked up on the failed node. For example, as shown in FIG. 2, if compute node 2 is observed to suffer a software and/or hardware failure, the current task pressure of compute node 2 is set to infinity.
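The fault-handling rule above — a failed node is given infinite pressure so the traversal can never select it — can be sketched as follows. This is a minimal illustration, not the patented implementation; the dictionary field names (`healthy`, `task_count`) are hypothetical.

```python
import math

def task_pressure(node):
    """Pressure of one compute node: infinite if the node has failed
    (damaged, shut down, or offline), otherwise its current number of
    data-operation tasks. An infinite value guarantees the node is
    never chosen as a data source during the traversal."""
    if not node.get("healthy", True):
        return math.inf  # failed node: target data is never fetched from it
    return node["task_count"]
```

With this convention, sorting nodes by `task_pressure` automatically pushes failed nodes past every healthy one.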
Step S13: traverse all of the other compute nodes in ascending order of current data-operation task pressure and, during each step of the traversal, determine whether the currently visited compute node already stores the target data.
In this embodiment, after the current data-operation task pressure of each of the other compute nodes has been counted, the nodes can be traversed in ascending order of pressure, starting from the compute node with the smallest current task pressure, and during each step of the traversal it is determined whether the currently visited compute node already holds data identical to the target data.
Step S14: if the currently visited compute node already stores the target data, transmit the target data from that node to the target node through a shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access (RDMA, Remote Direct Memory Access) technology, so that the target node can perform the corresponding operation on the target data according to the data operation request.
In this embodiment, all of the other compute nodes are traversed in ascending order of current task pressure. During each step of the traversal, once the visited compute node is found to already hold data identical to the target data, that target data is sent from the currently visited node to the corresponding target node through the pre-built RDMA shared storage network; after obtaining the target data, the target node performs the corresponding data operation according to the data operation request.
It should be pointed out that in this embodiment the process of building the shared storage network may specifically include: building, between the different nodes of the AI cluster and based on RDMA and InfiniBand technology, a Network File System (NFS) shared storage network whose topology is fully connected. Specifically, in this embodiment, to enable data files to be transferred between all nodes of the AI cluster, an NFS shared storage network in which any two nodes can reach each other is first built between the different nodes of the cluster on top of RDMA and an InfiniBand network. For example, as shown in FIG. 3, RDMA and InfiniBand make the primary storage node, compute node 1, compute node 2, and compute node 3 a fully connected network in which any two nodes are mutually reachable.
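As a concrete illustration of how such a full-mesh NFS shared storage network might be assembled on Linux, each node can export a local directory and mount every peer's export using the kernel's NFS-over-RDMA transport. This is a configuration sketch only; the hostnames, paths, and export directory are illustrative, while `-o rdma,port=20049` is the standard NFS-over-RDMA mount option and default port.

```shell
# On every node: export a local cache directory (add to /etc/exports and
# reload), then mount each peer's export over the RDMA transport so the
# cluster forms a fully connected NFS shared-storage network.
# Hostnames (compute-node-1, ...) and paths (/export/ai-data) are examples.
echo '/export/ai-data *(rw,async,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# Mount a peer's export over RDMA (repeat for every other node in the mesh):
sudo mkdir -p /mnt/peers/compute-node-1
sudo mount -t nfs -o rdma,port=20049 \
    compute-node-1:/export/ai-data /mnt/peers/compute-node-1
```

Repeating the mount step for every peer on every node yields the fully connected topology of FIG. 3, where any node can read any other node's cached datasets directly.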
It should be understood that after the NFS shared storage network has been built, the designated target node can reach all the other compute nodes in the AI cluster through the shared storage network, so that during the acquisition of the target data every other compute node can be queried for data identical to the target data, and the data found can be sent to the target node. Further, as shown in FIG. 2, if compute node 2 is observed to suffer a software and/or hardware failure, compute node 1 and compute node 3 can still be queried through the fully connected shared storage network for data identical to the target data.
As can be seen, this embodiment first acquires the data operation request initiated by a target node in the AI cluster for target data, the target node being any compute node in the cluster; then counts the current data-operation task pressure of each of the other compute nodes; then traverses all of the other compute nodes in ascending order of current task pressure, determining during each step whether the currently visited node already stores the target data; and, if the currently visited compute node already stores the target data, transmits it to the target node through the shared storage network built in advance between the different nodes of the cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation according to the data operation request. The pre-built RDMA shared storage network interconnects all the nodes and enables data transfer between the different nodes of the AI cluster, efficiently using the network resources between the nodes other than the primary storage node, while reducing the network and disk pressure on the primary storage in the cluster and keeping the services of the related AI platforms running stably.
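The flow of steps S11 to S14 above, together with the primary-storage fallback of the later embodiment, can be sketched as a single lookup routine. This is a minimal illustration under assumed data structures (each node is a dict with hypothetical `name`, `pressure`, and `data` fields), not the patented implementation; the actual transfer would go over the RDMA shared storage network.

```python
import math

def locate_target_data(target, compute_nodes, primary_storage):
    """Traverse the peer compute nodes in ascending order of current
    data-operation task pressure and return the first node that already
    holds `target`; fall back to the primary storage node if no peer
    has it. Failed peers carry infinite pressure and are skipped."""
    for node in sorted(compute_nodes, key=lambda n: n["pressure"]):
        if node["pressure"] == math.inf:
            continue  # failed node: never used as a data source
        if target in node["data"]:
            return node  # serve target data from this lightly loaded peer
    # No peer holds the data: the primary storage node always does.
    return primary_storage
```

The target node would then fetch the data from the returned node over the shared storage network and perform the requested operation on it.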
An embodiment of this application discloses a specific data acquisition method for an artificial intelligence platform. As shown in FIG. 4, the method includes:
Step S21: acquire a data operation request initiated by a target node in the AI cluster for target data; the target node is any one of the compute nodes in the cluster.
Step S22: monitor the current number of tasks on each of the other compute nodes to obtain its current data-operation task pressure.
In this embodiment, after the data operation request initiated by the target node for target data has been acquired, each of the other compute nodes can be monitored and its current number of tasks counted in real time, and this number of tasks is taken as the node's current data-operation task pressure. Current tasks include, but are not limited to, dataset caching, dataset reading, and training-log write-back operations. While monitoring the current number of tasks on each node, the detected number can be treated as a "distance" for transferring the target data from that node: the smaller the distance value, the lower the corresponding node's current task pressure; the larger the distance value, the higher the pressure. For example, after a cache operation request initiated by the target node for target data is acquired, the nodes in the cluster other than the target node are monitored first. As shown in FIG. 5, when compute node 1 is observed to have no current dataset-caching task, it is an available node and its distance corresponding to the current task pressure is set to 0; when compute node 2 is observed to have two current dataset-caching tasks, it is available and its distance is set to 2; when compute node 3 is observed to have failed, it is unavailable and its distance is set to I, where I is infinity, so that the target data cannot be obtained from compute node 3; when compute node 4 is observed to have one current dataset-caching task, it is available and its distance is set to 1. In addition, for the primary storage node acting as the management node, the corresponding distance can be set to S; it should be understood that the specific value of S must ensure that, when none of the other compute nodes holds the target data, the primary storage node serves as the last node from which the target data it stores is transmitted to the target node.
Step S23: traverse all of the other compute nodes in ascending order of current data-operation task pressure and, during each step of the traversal, determine whether the currently visited compute node already stores the target data.
In this embodiment, after each node's current data-operation task pressure has been obtained by monitoring its current number of tasks, all of the other compute nodes can be traversed in ascending order of pressure, determining at each step whether the currently visited node already holds data identical to the target data. Specifically, as shown in FIG. 5, compute node 1, compute node 4, and compute node 2 are traversed in ascending order of current dataset-cache task pressure, checking during each step whether the currently visited node already stores the target data.
It should further be pointed out that the process of traversing all of the other compute nodes in ascending order of current task pressure may also include: if multiple compute nodes have the same current number of tasks, traversing those nodes in descending order of their current data processing capability. It should be understood that, if several compute nodes have the same current number of tasks during the traversal, they can be visited in descending order of their current data processing capability, which includes, but is not limited to, the node's storage resources, network resources, and its own performance. For example, when two compute nodes both have three current tasks, their current data processing capabilities can be compared; if one of them has more free storage resources and/or higher CPU (central processing unit) performance, that node is traversed first.
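The tie-breaking rule above — task count ascending, and among equal task counts, processing capability descending — maps directly onto a two-part sort key. A minimal sketch, assuming each node is a dict with hypothetical `task_count` and `capability` fields:

```python
def traversal_order(nodes):
    """Order peers for traversal: fewest current tasks first; among
    peers with the same task count, prefer the one with the higher
    current data processing capability (negated so it sorts descending)."""
    return sorted(nodes, key=lambda n: (n["task_count"], -n["capability"]))
```

Here `capability` stands in for whatever composite score the platform derives from free storage, network resources, and CPU performance; any real scoring of those factors would slot into the same key.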
Step S24: if the currently visited compute node already stores the target data, transmit the target data from that node to the target node through the shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation on the target data according to the data operation request.
For more specific processing in steps S21 and S24 above, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
As can be seen, this embodiment obtains each compute node's current data-operation task pressure by monitoring its current number of tasks, and, when several compute nodes have the same current number of tasks, traverses them in descending order of their current data processing capability. Refining the choice of traversal order by processing capability allows the network resources between the compute nodes to be fully utilized, improves the utilization of AI cluster resources and the overall computing-resource efficiency of the cluster, and balances the load.
An embodiment of this application discloses a specific data acquisition method for an artificial intelligence platform. As shown in FIG. 6, the method includes:
Step S31: acquire a data operation request initiated by a target node in the AI cluster for target data; the target node is any one of the compute nodes in the cluster.
Step S32: determine, for each of the other compute nodes, the total amount of pending data of all its current data-operation tasks and its current data processing capability.
In this embodiment, after the data operation request initiated by the target node for target data has been acquired, the total amount of pending data of all current data-operation tasks and the current data processing capability of each of the other compute nodes can be tallied, thereby determining these two quantities for each node. The total amount of pending data is the sum of the data volumes of all tasks the node currently has pending.
Step S33: determine the current data-operation task pressure of each of the other compute nodes based on the total amount of pending data of all its current data-operation tasks and its current data processing capability.
In this embodiment, once the total pending data and the current processing capability have been determined for each of the other compute nodes, a pressure value corresponding to each node's current data-operation task pressure can be derived from the two. For example, the smaller a node's total pending data and the stronger its current data processing capability, the lower its current data-operation task pressure.
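One plausible way to combine the two quantities — pressure rising with total pending data and falling with processing capability — is a simple ratio. This scoring is an illustrative assumption, not a formula given in the source; the source only states the monotonic relationship.

```python
def pressure(pending_bytes_total, capability):
    """Illustrative pressure score: total pending data of all current
    data-operation tasks divided by the node's current processing
    capability (e.g. bytes it can process per unit time). Smaller
    pending totals and stronger capability both lower the score."""
    return pending_bytes_total / capability
```

Any scoring with the same monotonic behavior (e.g. a weighted combination) would serve equally well for ordering the traversal.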
Step S34: traverse all of the other compute nodes in ascending order of current data-operation task pressure and, during each step of the traversal, determine whether the currently visited compute node already stores the target data.
Step S35: if the currently visited compute node already stores the target data, transmit the target data from that node to the target node through the shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation on the target data according to the data operation request.
For more specific processing in steps S31, S34 and S35 above, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
As can be seen, this embodiment first determines, for each of the other compute nodes, the total amount of pending data of all current data-operation tasks and the current data processing capability, and then determines each node's current data-operation task pressure from these two quantities. Basing the pressure on the total pending data combined with the processing capability makes full use of the network and storage resources between the nodes, improves performance and speed, suits business scenarios with massive files in AI clusters, raises the utilization of cluster resources, speeds up model training, and improves the overall computing-resource efficiency of the AI cluster.
An embodiment of this application discloses a specific data acquisition method for an artificial intelligence platform. As shown in FIG. 7, the method includes:
Step S41: acquire a data operation request initiated by a target node in the AI cluster for target data; the target node is any one of the compute nodes in the cluster.
Step S42: count the current data-operation task pressure of each of the other compute nodes.
Step S43: traverse all of the other compute nodes in ascending order of current data-operation task pressure and, during each step of the traversal, determine whether the currently visited compute node already stores the target data.
Step S44: if none of the traversed compute nodes stores the target data, transmit the target data pre-stored on the primary storage node to the target node through the shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation on the target data according to the data operation request.
In this embodiment, if during the traversal of all the other compute nodes none of the visited nodes holds data identical to the target data, the target data pre-stored on the primary storage node can be sent to the target node through the pre-built RDMA shared storage network; that is, the primary storage node is the last node from which the target data is fetched. After the target node obtains the target data, it can perform the corresponding data operation according to the data operation request.
It should also be pointed out that in this embodiment the primary storage node of the AI cluster must be free of faults and must hold the information corresponding to the target data in advance, thereby ensuring that the whole AI cluster remains a fully connected network structure.
For more specific processing in steps S41, S42 and S43 above, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
As can be seen, in this embodiment, if none of the traversed compute nodes stores the target data, the target data pre-stored on the primary storage node is transmitted to the target node through the shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access technology, ensuring that the target data can still be obtained from the primary storage node when no compute node holds it.
Correspondingly, an embodiment of this application also discloses a data acquisition apparatus for an artificial intelligence platform, applied to an AI cluster comprising a primary storage node and multiple compute nodes. As shown in FIG. 8, the apparatus includes:
a request acquisition module 11 configured to acquire a data operation request initiated by a target node in the AI cluster for target data, the target node being any one of the compute nodes in the cluster;
a statistics module 12 configured to count the current data-operation task pressure of each of the other compute nodes;
a traversal module 13 configured to traverse all of the other compute nodes in ascending order of current data-operation task pressure and, during each step of the traversal, determine whether the currently visited compute node already stores the target data;
a data transmission module 14 configured to, if the currently visited compute node already stores the target data, transmit the target data from that node to the target node through a shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation on the target data according to the data operation request.
For the specific workflow of each of the modules above, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
As can be seen, this embodiment first acquires the data operation request initiated by a target node in the AI cluster for target data, the target node being any compute node in the cluster; then counts the current data-operation task pressure of each of the other compute nodes; then traverses all of the other compute nodes in ascending order of current task pressure, determining during each step whether the currently visited node already stores the target data; and, if it does, transmits the target data from the currently visited node to the target node through the shared storage network built in advance between the different nodes of the cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation according to the request. The RDMA-based shared storage network interconnects all the nodes and enables mutual data transfer between the different nodes of the AI cluster, which makes full use of the cluster's network and disk resources, reduces the network and storage pressure on the primary storage in the cluster, and keeps the services of the related AI platforms running stably.
In some specific embodiments, the data acquisition apparatus for an artificial intelligence platform may further include:
a first data transmission unit configured to, when none of the traversed compute nodes stores the target data, transmit the target data pre-stored on the primary storage node to the target node through the shared storage network built in advance between the different nodes of the AI cluster based on remote direct memory access technology, so that the target node can perform the corresponding operation on the target data according to the data operation request.
In some specific embodiments, the process of building the shared storage network may specifically involve:
a network building unit configured to build, between the different nodes of the AI cluster and based on remote direct memory access technology and InfiniBand technology, a Network File System shared storage network whose topology is fully connected.
In some specific embodiments, the statistics module 12 may specifically include:
a first monitoring unit configured to monitor whether any of the other compute nodes has failed;
a setting unit configured to set the current data-operation task pressure of a failed compute node to infinity when any of the other compute nodes fails.
In some specific embodiments, the statistics module 12 may specifically include:
a second monitoring unit configured to monitor the current number of tasks on each of the other compute nodes to obtain its current data-operation task pressure.
In some specific embodiments, the traversal performed by the traversal module 13 may further involve:
a first traversal unit configured to, if multiple compute nodes have the same current number of tasks, traverse those compute nodes in descending order of their current data processing capability.
In some specific embodiments, the statistics module 12 may specifically include:
a first determination unit configured to determine, for each of the other compute nodes, the total amount of pending data of all its current data-operation tasks and its current data processing capability;
a second determination unit configured to determine the current data-operation task pressure of each of the other compute nodes based on the total amount of pending data of all its current data-operation tasks and its current data processing capability.
Further, an embodiment of this application also discloses an electronic device. FIG. 9 is a structural diagram of an electronic device 20 according to an exemplary embodiment; the content of the figure should not be regarded as any limitation on the scope of this application.
FIG. 9 is a schematic structural diagram of an electronic device 20 provided by an embodiment of this application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 stores a computer program that is loaded and executed by the processor 21 to implement the relevant steps of the data acquisition method for an artificial intelligence platform disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be an electronic computer.
In this embodiment, the power supply 23 provides the working voltage for the hardware devices on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows may be any communication protocol applicable to the technical solution of this application, which is not specifically limited here; the input/output interface 25 is used to obtain data input from the outside or to output data to the outside, and its specific interface type may be chosen according to the needs of the specific application, which is likewise not specifically limited here.
In addition, the memory 22, as the carrier of resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like; the resources stored on it may include an operating system 221, a computer program 222, and so on, and the storage may be transient or persistent.
The operating system 221 manages and controls the hardware devices and the computer program 222 on the electronic device 20 and may be Windows Server, Netware, Unix, Linux, etc. Besides the computer program capable of completing the data acquisition method for an artificial intelligence platform performed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for completing other specific tasks.
Further, this application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the data acquisition method for an artificial intelligence platform disclosed above. For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments may be referred to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and for the relevant points reference may be made to the description of the method.
Those skilled in the art may further realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of this application.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.
The data acquisition method, apparatus, device, and medium for an artificial intelligence platform provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the descriptions of the above embodiments are only intended to help understand the method of this application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application according to the idea of this application. In summary, the content of this specification should not be construed as limiting this application.

Claims (10)

  1. A data acquisition method for an artificial intelligence platform, applied to an artificial intelligence cluster comprising a primary storage node and multiple compute nodes, comprising:
    acquiring a data operation request initiated by a target node in the artificial intelligence cluster for target data, the target node being any one of the compute nodes in the artificial intelligence cluster;
    counting the current data-operation task pressure of each of the other compute nodes;
    traversing all of the other compute nodes in ascending order of current data-operation task pressure and, during each step of the traversal, determining whether the currently visited compute node already stores the target data;
    if the currently visited compute node already stores the target data, transmitting the target data from the currently visited compute node to the target node through a shared storage network built in advance between the different nodes of the artificial intelligence cluster based on remote direct memory access technology, so that the target node performs the corresponding operation on the target data according to the data operation request.
  2. The data acquisition method for an artificial intelligence platform according to claim 1, further comprising:
    if none of the traversed compute nodes stores the target data, transmitting the target data pre-stored on the primary storage node to the target node through the shared storage network built in advance between the different nodes of the artificial intelligence cluster based on remote direct memory access technology, so that the target node performs the corresponding operation on the target data according to the data operation request.
  3. The data acquisition method for an artificial intelligence platform according to claim 1, wherein the process of building the shared storage network comprises:
    building, between the different nodes of the artificial intelligence cluster and based on remote direct memory access technology and InfiniBand technology, a Network File System shared storage network whose topology is fully connected.
  4. The data acquisition method for an artificial intelligence platform according to claim 1, wherein counting the current data-operation task pressure of each of the other compute nodes comprises:
    monitoring whether any of the other compute nodes has failed;
    if any of the other compute nodes has failed, setting the current data-operation task pressure of the failed compute node to infinity.
  5. The data acquisition method for an artificial intelligence platform according to any one of claims 1 to 4, wherein counting the current data-operation task pressure of each of the other compute nodes comprises:
    monitoring the current number of tasks on each of the other compute nodes to obtain its current data-operation task pressure.
  6. The data acquisition method for an artificial intelligence platform according to claim 5, wherein the process of traversing all of the other compute nodes in ascending order of current data-operation task pressure further comprises:
    if multiple compute nodes have the same current number of tasks, traversing those compute nodes in descending order of their current data processing capability.
  7. The data acquisition method for an artificial intelligence platform according to any one of claims 1 to 4, wherein counting the current data-operation task pressure of each of the other compute nodes comprises:
    determining, for each of the other compute nodes, the total amount of pending data of all its current data-operation tasks and its current data processing capability;
    determining the current data-operation task pressure of each of the other compute nodes based on the total amount of pending data of all its current data-operation tasks and its current data processing capability.
  8. A data acquisition apparatus for an artificial intelligence platform, applied to an artificial intelligence cluster comprising a primary storage node and multiple compute nodes, comprising:
    a request acquisition module configured to acquire a data operation request initiated by a target node in the artificial intelligence cluster for target data, the target node being any one of the compute nodes in the artificial intelligence cluster;
    a statistics module configured to count the current data-operation task pressure of each of the other compute nodes;
    a traversal module configured to traverse all of the other compute nodes in ascending order of current data-operation task pressure and, during each step of the traversal, determine whether the currently visited compute node already stores the target data;
    a data transmission module configured to, if the currently visited compute node already stores the target data, transmit the target data from the currently visited compute node to the target node through a shared storage network built in advance between the different nodes of the artificial intelligence cluster based on remote direct memory access technology, so that the target node performs the corresponding operation on the target data according to the data operation request.
  9. An electronic device, comprising a processor and a memory, wherein the processor, when executing the computer program stored in the memory, implements the data acquisition method for an artificial intelligence platform according to any one of claims 1 to 7.
  10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the data acquisition method for an artificial intelligence platform according to any one of claims 1 to 7.
PCT/CN2022/078400 2021-09-18 2022-02-28 一种人工智能平台的数据获取方法、装置、设备、介质 WO2023040203A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111096227.7 2021-09-18
CN202111096227.7A CN113965587B (zh) 2021-09-18 2021-09-18 一种人工智能平台的数据获取方法、装置、设备、介质

Publications (1)

Publication Number Publication Date
WO2023040203A1 true WO2023040203A1 (zh) 2023-03-23

Family

ID=79462001

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078400 WO2023040203A1 (zh) 2021-09-18 2022-02-28 一种人工智能平台的数据获取方法、装置、设备、介质

Country Status (2)

Country Link
CN (1) CN113965587B (zh)
WO (1) WO2023040203A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116032669A (zh) * 2023-03-30 2023-04-28 联一信息技术(北京)有限公司 一种结合人工智能的共享数据隐私处理方法及服务器

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113965587B (zh) * 2021-09-18 2022-04-22 苏州浪潮智能科技有限公司 一种人工智能平台的数据获取方法、装置、设备、介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110333937A (zh) * 2019-05-30 2019-10-15 平安科技(深圳)有限公司 任务分发方法、装置、计算机设备和存储介质
CN110764708A (zh) * 2019-10-25 2020-02-07 北京浪潮数据技术有限公司 一种数据读取方法、装置、设备及存储介质
US20200104385A1 (en) * 2018-09-28 2020-04-02 International Business Machines Corporation Sharing container images utilizing a distributed file system
CN113326155A (zh) * 2021-06-28 2021-08-31 深信服科技股份有限公司 一种信息处理方法、装置、***和存储介质
CN113965587A (zh) * 2021-09-18 2022-01-21 苏州浪潮智能科技有限公司 一种人工智能平台的数据获取方法、装置、设备、介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598738A (zh) * 2016-12-13 2017-04-26 郑州云海信息技术有限公司 一种计算机集群***及其并行计算方法
CN107395708B (zh) * 2017-07-14 2021-04-02 郑州云海信息技术有限公司 一种处理下载请求的方法和装置
CN107783731B (zh) * 2017-08-07 2020-12-01 荣科科技股份有限公司 一种大数据实时处理方法及处理***
CN107562385B (zh) * 2017-09-13 2020-08-04 郑州云海信息技术有限公司 分布式存储客户端读取数据的方法、装置和设备
CN110865989A (zh) * 2019-11-22 2020-03-06 浪潮电子信息产业股份有限公司 一种大规模计算集群的业务处理方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200104385A1 (en) * 2018-09-28 2020-04-02 International Business Machines Corporation Sharing container images utilizing a distributed file system
CN110333937A (zh) * 2019-05-30 2019-10-15 平安科技(深圳)有限公司 任务分发方法、装置、计算机设备和存储介质
CN110764708A (zh) * 2019-10-25 2020-02-07 北京浪潮数据技术有限公司 一种数据读取方法、装置、设备及存储介质
CN113326155A (zh) * 2021-06-28 2021-08-31 深信服科技股份有限公司 一种信息处理方法、装置、***和存储介质
CN113965587A (zh) * 2021-09-18 2022-01-21 苏州浪潮智能科技有限公司 一种人工智能平台的数据获取方法、装置、设备、介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116032669A (zh) * 2023-03-30 2023-04-28 联一信息技术(北京)有限公司 一种结合人工智能的共享数据隐私处理方法及服务器
CN116032669B (zh) * 2023-03-30 2023-07-25 联一信息技术(北京)有限公司 一种结合人工智能的共享数据隐私处理方法及服务器

Also Published As

Publication number Publication date
CN113965587A (zh) 2022-01-21
CN113965587B (zh) 2022-04-22

Similar Documents

Publication Publication Date Title
WO2023040203A1 (zh) 一种人工智能平台的数据获取方法、装置、设备、介质
Lin et al. QoS-aware data replication for data-intensive applications in cloud computing systems
US10657108B2 (en) Parallel I/O read processing for use in clustered file systems having cache storage
CN102523234B (zh) 一种应用服务器集群实现方法及***
Zhang et al. BitVault: A highly reliable distributed data retention platform
US11743333B2 (en) Tiered queuing system
WO2021142971A1 (zh) 传输速率控制方法、装置、计算机***及可读存储介质
US10387214B1 (en) Managing data processing in a distributed computing environment
WO2022222579A1 (zh) 一种基于数据库中间件集群的高可用客户端负载均衡方法
US10061621B1 (en) Managing resources in a configurable computing environment
CN105183470A (zh) 一种自然语言处理***化服务平台
US20230164088A1 (en) Low Latency Queuing System
CN104063501A (zh) 基于hdfs的副本平衡方法
US9811544B1 (en) Management of real-time and historical streaming data
US20080270483A1 (en) Storage Management System
Chen Design of computer big data processing system based on genetic algorithm
CN107908713A (zh) 一种基于Redis集群的分布式动态杜鹃过滤***及其过滤方法
Wu et al. Optimization design and realization of ceph storage system based on software defined network
US9806956B1 (en) Managing computing resources by predicting resource usage
TWI766387B (zh) 一種具延遲感知負載平衡的反向代理方法和存儲裝置
EP3709173B1 (en) Distributed information memory system, method, and program
Marcu et al. Storage and Ingestion Systems in Support of Stream Processing: A Survey
TWI735520B (zh) 調整元件邏輯執行緒數量的方法及裝置
Meng et al. A network load sensitive block placement strategy of HDFS
Xu et al. A Load-balancing method for high performance cluster computing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22868583

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18281749

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE