CN111104387A - Method and device for acquiring data set on server

Info

Publication number: CN111104387A
Application number: CN201911156049.5A
Authority: CN
Inventors: 王继玉
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2019-11-22
Filing date: 2019-11-22
Publication date: 2020-05-05

Abstract

The embodiment of the application discloses a method and a device for acquiring a data set on a server. The method comprises the following steps: after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path; configuring a corresponding thread for each directory entry in the directory information; and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.

Description

Method and device for acquiring data set on server

Technical Field

The present disclosure relates to the field of information processing, and more particularly, to a method and an apparatus for acquiring a data set on a server.

Background

In an artificial intelligence system, model training requires a large number of data sets as input sources. A data set is acquired by copying or downloading it to the computing node that executes the model training operation. The larger the data set, the longer the copy or download takes, so caching the data set reduces training efficiency and increases the total model training time. In a distributed cluster, every computing node needs the data set; if a shared data set is not used, each computing node must cache its own copy, which further reduces the efficiency of distributed training. How to increase the acquisition speed of the data set is therefore an urgent problem to be solved.

Disclosure of Invention

In order to solve any of the above technical problems, embodiments of the present application provide a method and an apparatus for acquiring a data set on a server.

To achieve the object of the embodiment of the present application, an embodiment of the present application provides a method for acquiring a data set on a server, including:

after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;

configuring a corresponding thread for each directory entry in the directory information;

and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.

In an exemplary embodiment, the controlling the thread corresponding to each directory entry to perform an obtaining operation on the data under the respective directory entry includes:

controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:

acquiring a file format of data under a directory entry;

when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;

and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.

In an exemplary embodiment, the batch-wise acquiring different types of files according to a preset data acquisition policy includes:

acquiring file size information of each file in the same type of files;

sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;

and according to the sorting information, acquiring the files in the same type of files.

in the process of acquiring the files under the directory items corresponding to the threads, if the interruption of the file acquisition operation is detected, recording the information of the files subjected to the interruption operation;

and when the acquisition operation of the file in the directory corresponding to the thread is detected to be restarted, continuously executing the acquisition operation according to the information of the file with the interrupted operation.

In an exemplary embodiment, before the obtaining the directory information of the data set under the storage path, the method further includes:

obtaining the type of a storage system used for storing the data set on the computing node which initiates the obtaining request;

the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:

acquiring a storage format corresponding to the type of the storage system;

and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.

An apparatus for obtaining a data set on a server, comprising a processor and a memory, wherein the memory stores a computer program, the processor being configured to invoke the computer program in the memory to implement operations comprising:

In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:

acquiring a file format of data under a directory entry;

In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of batch-wise acquiring different types of files according to a preset data acquisition policy, and includes:

acquiring file size information of each file in the same type of files;

In an exemplary embodiment, before implementing the operation of obtaining the directory information of the data set under the storage path, the processor is further configured to call the computer program in the memory to implement the following operations:

the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:

acquiring a storage format corresponding to the type of the storage system;

According to the scheme provided by the embodiment of the application, after an acquisition request for a data set is received, directory information of the data set under a storage path is acquired, a corresponding thread is configured for each directory entry in the directory information, and the thread corresponding to each directory entry is controlled to acquire the data under its respective directory entry. At least two threads are thus set up according to the directory information without altering the original design of the server, and the data of the data set are downloaded by the at least two threads respectively, so that the acquisition time of the data set is shortened and the acquisition efficiency of the data set is improved.

Additional features and advantages of the embodiments of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the embodiments of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings provide a further understanding of the embodiments of the present application and are incorporated in and constitute a part of this specification. Together with the examples, they illustrate the embodiments of the present application and do not constitute a limitation of those embodiments.

Fig. 1 is a flowchart of a method for acquiring a data set on a server according to an embodiment of the present application;

Fig. 2 is a flowchart of a data set acquisition method according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that, in the embodiments of the present application, features in the embodiments and the examples may be arbitrarily combined with each other without conflict.

Fig. 1 is a flowchart of a method for acquiring a data set on a server according to an embodiment of the present application. As shown in fig. 1, the method includes:

Step 101, after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;

In an exemplary embodiment, the acquisition request may be initiated by a computing node that performs a model training operation. The acquisition request may include identification information of the data set, which may be file name information or storage location information. The storage path of the data set is determined according to the identification information, and the directory information of the data under the storage path is read.

Step 102, configuring a corresponding thread for each directory entry in the directory information;

In an exemplary embodiment, the directory information may include at least two directory entries, and a thread for acquiring data is created for each directory entry. Because the threads are created from the directory information, the correspondence between each thread and the data it is to download is established at the same time as the data acquisition task is divided, which shortens the preparation time of the data set acquisition operation and improves data acquisition efficiency.

In an exemplary embodiment, each thread may also be set to stop running once the data of its corresponding directory entry has been downloaded, so that thread resources are used reasonably.

Step 103, controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.

In an exemplary embodiment, data of each directory entry can be obtained and downloaded in a multi-thread manner by using the thread corresponding to each directory entry, so that the data obtaining efficiency is improved.
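Steps 101 to 103 can be illustrated with a minimal Python sketch. It assumes the data set sits on a locally reachable path (for example a mounted file system); the function names `fetch_entry` and `fetch_dataset` are hypothetical and do not come from the application.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_entry(src_entry: Path, dst_root: Path) -> int:
    """Copy one top-level directory entry; return the number of files it holds."""
    target = dst_root / src_entry.name
    if src_entry.is_dir():
        shutil.copytree(src_entry, target, dirs_exist_ok=True)
        return sum(1 for p in src_entry.rglob("*") if p.is_file())
    shutil.copy2(src_entry, target)
    return 1


def fetch_dataset(storage_path: str, cache_dir: str) -> int:
    """List the directory entries, give each its own thread, copy in parallel."""
    src_root, dst_root = Path(storage_path), Path(cache_dir)
    dst_root.mkdir(parents=True, exist_ok=True)
    entries = sorted(src_root.iterdir())  # step 101: directory information
    if not entries:
        return 0
    # Step 102: one worker thread per directory entry.
    with ThreadPoolExecutor(max_workers=len(entries)) as pool:
        # Step 103: each thread fetches the data under its own entry.
        return sum(pool.map(lambda e: fetch_entry(e, dst_root), entries))
```

Because the pool is sized to the number of entries, the division of work follows the directory layout directly; a real tool would probably cap the pool size for data sets with very many top-level entries.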

According to the method provided by the embodiment of the application, after an acquisition request for a data set is received, directory information of the data set under a storage path is acquired, a corresponding thread is configured for each directory entry in the directory information, and the thread corresponding to each directory entry is controlled to acquire the data under its respective directory entry. At least two threads are thus set up according to the directory information without altering the original design of the server, and the data of the data set are downloaded by the at least two threads respectively, so that the acquisition time of the data set is shortened and the acquisition efficiency of the data set is improved.

The method provided by the embodiments of the present application is explained as follows:

acquiring a file format of data under a directory entry;

In the above exemplary embodiment, either every thread or only some of the threads may be selected to perform the above operations. The selection may be determined according to the size of the data under the directory entry; for example, when the data size under a directory entry meets a preset condition indicating that the data volume is large, the thread of that directory entry may be configured to perform the above operations.

In the above exemplary embodiment, a thread pool is used to open a separate thread for each directory, and that thread is responsible for caching the data under the directory. Before caching, all files in the directory are scanned and divided into four categories: pictures, videos, compressed files, and other files. According to the configured caching priority, the specified file types can be cached first, which accommodates how urgently a user needs each type of data.
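The classification and priority ordering described above might look like the following sketch. The source names only the four categories; the extension table, the category labels, and the functions `classify` and `order_by_priority` are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import Path

# Assumed extension table: the source lists only the four categories.
CATEGORY_BY_SUFFIX = {
    ".jpg": "picture", ".jpeg": "picture", ".png": "picture",
    ".mp4": "video", ".avi": "video",
    ".zip": "compressed", ".tar": "compressed", ".gz": "compressed",
}


def classify(files):
    """Group file names into picture / video / compressed / other."""
    groups = defaultdict(list)
    for name in files:
        category = CATEGORY_BY_SUFFIX.get(Path(name).suffix.lower(), "other")
        groups[category].append(name)
    return dict(groups)


def order_by_priority(groups, priority):
    """Return (category, files) pairs in configured order; unlisted categories go last."""
    rank = {category: i for i, category in enumerate(priority)}
    ordered = sorted(groups, key=lambda c: rank.get(c, len(priority)))
    return [(c, groups[c]) for c in ordered]
```

A caller would pass the configured caching priority, for example `["video", "picture"]`, and fetch the resulting batches in order.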

acquiring file size information of each file in the same type of files;

In the above exemplary embodiment, the files of each preferentially acquired file type are sorted by file size and the small files of that type are cached first, so that, at whatever point the download stops, as many files as possible have been downloaded completely.
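A small sketch of the small-files-first ordering follows. The helper `mean_completion_time` is not part of the application; it is included only to show, under the simplifying assumption of sequential transfer at a constant rate, why ascending order lets more files finish early.

```python
import os
from itertools import accumulate


def smallest_first(paths):
    """Order the files of one type by ascending size on disk."""
    return sorted(paths, key=os.path.getsize)


def mean_completion_time(sizes):
    """Average finish time when files are fetched one after another at unit speed."""
    finish_times = list(accumulate(sizes))
    return sum(finish_times) / len(finish_times)
```

For sizes 1, 2, and 10, fetching in ascending order finishes the files at times 1, 3, and 13, whereas descending order finishes them at 10, 12, and 13, so the ascending order has the lower average finish time.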

In the above exemplary embodiment, the data set to be cached is checked first: the specified cache directory location is inspected to determine whether the data set already exists there. If it does not exist, the latest data set is cached. If it exists, it is determined whether the operation is a resumed (breakpoint) transfer. If it is, the breakpoint information record is read and caching continues from the breakpoint. If it is not, whether the data set has changed is judged from the size, number, and date of the files in the directory; if it has changed, the data set is cached again, and if it has not, prompt information is output.
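The decision sequence above can be written as a short sketch. The `Action` names, the `fingerprint` summary (file count, total size, newest modification time), and the `decide` signature are hypothetical; the source states only that size, number, and date are compared.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple


class Action(Enum):
    FULL_CACHE = "cache the latest data set"
    RESUME = "continue from the breakpoint"
    RECACHE = "data set changed, cache again"
    UP_TO_DATE = "nothing to do, output a prompt"


def fingerprint(root: Path) -> Tuple[int, int, int]:
    """Summarise a directory as (file count, total bytes, newest mtime in ns)."""
    stats = [p.stat() for p in root.rglob("*") if p.is_file()]
    newest = max((s.st_mtime_ns for s in stats), default=0)
    return (len(stats), sum(s.st_size for s in stats), newest)


def decide(cache_exists: bool, breakpoint_record: Optional[dict],
           source_fp: tuple, cache_fp: tuple) -> Action:
    """Apply the checks in the order the description gives them."""
    if not cache_exists:
        return Action.FULL_CACHE
    if breakpoint_record is not None:
        return Action.RESUME
    if source_fp != cache_fp:
        return Action.RECACHE
    return Action.UP_TO_DATE
```

Keeping `decide` free of file system access makes the branching easy to check in isolation; whether modification times survive the copy depends on how the files are transferred, so a real comparison may need a looser date rule.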

acquiring a storage format corresponding to the type of the storage system;

In the above exemplary embodiment, different users may store their data sets in different storage systems, such as BeeGFS, NFS (Network File System), HDFS (Hadoop Distributed File System), or cloud storage systems, so the different storage systems need to be connected when caching the data sets. By acquiring the type of the storage system and controlling the thread corresponding to each directory entry to acquire data according to the storage format corresponding to that type, storage format mismatches when the data set is used are reduced, and the training task can enter the running state quickly.

The method examples provided in the examples of the present application are further illustrated below:

The method provided by the embodiment of the application can be applied to a server running a Linux system, and can be embedded in an artificial intelligence (AI) platform as a tool module. For example, the execution steps of the method can be encapsulated into a Linux command and used directly as such, in which case it is 6 to 7 times faster than the Linux cp copy command; the tool can also be embedded in an AI platform as a module for caching data sets, or exposed as an API and used directly through that interface.

In addition, the method can cache the data set required by AI model training and can interface with the BeeGFS, HDFS, NFS, and Dahua cloud storage systems. The data set in the storage system is cached to the server, the time for caching the data set is shortened, and the training task can enter the running state quickly.

Fig. 2 is a flowchart of a data set acquisition method according to an embodiment of the present application. As shown in fig. 2, the method includes:

Step 201, reading configuration parameters of the acquisition operation of the data set;

Step 202, receiving an operation command and parameters for executing the acquisition operation;

The operation command is to cache the data set to the local server, and the parameter may be the designated location on the local server at which the data set is to be stored;

In addition, instead of caching the data set locally from the interfaced storage system, a file or directory may be copied directly on the server, similar to the Linux cp copy command.

Step 203, judging the type of the storage system;

Storage parameters for interfacing with the BeeGFS, NFS, HDFS, and Dahua cloud storage systems are configured in advance, and the corresponding storage system can be connected automatically according to the configured or input storage type parameter;

Dedicated wrappers are provided for the mount-type storage systems (BeeGFS and NFS), the non-mount-type storage systems (HDFS and Dahua cloud storage), and the Linux local storage system. When the tool is run to cache a data set, it automatically matches the corresponding storage method according to the parameters passed in.

When the data set is cached, the caching speed is affected by the network and the storage device: the better the network conditions and the higher the bandwidth, the higher the caching speed. Caching to memory is faster than caching to a solid state disk (SSD), and caching to an SSD is faster than caching to a mechanical hard disk (HDD). Even on an ordinary HDD, the tool caches 7 times faster than the cp copy command of the Linux system itself, which greatly shortens the copy time of a large data set.

If the corresponding storage system is matched, steps 204 to 208 are executed; otherwise, step 209 is executed and an exception prompt message is output;
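The matching in step 203 can be pictured as a lookup from the storage type parameter to a fetch routine. The sketch below is an assumption about structure only: the registry keys and function names are hypothetical, mount-type and local storage are treated as ordinary file system paths, and the non-mount-type branch is left as a stub because it would need a storage-specific client that the source does not describe.

```python
import shutil
from typing import Callable, Dict

Fetcher = Callable[[str, str], None]


def copy_from_mount(src: str, dst: str) -> None:
    """Mount-type and local storage: the data set is reachable as a normal path."""
    shutil.copytree(src, dst, dirs_exist_ok=True)


def fetch_via_client(src: str, dst: str) -> None:
    """Non-mount-type storage: placeholder for a storage-specific client."""
    raise NotImplementedError("requires a client for the target storage system")


# Hypothetical registry keyed by the configured or input storage type parameter.
FETCHERS: Dict[str, Fetcher] = {
    "beegfs": copy_from_mount,
    "nfs": copy_from_mount,
    "local": copy_from_mount,
    "hdfs": fetch_via_client,
    "cloud": fetch_via_client,
}


def match_storage(storage_type: str) -> Fetcher:
    """Return the fetch routine for a storage type, or fail as in step 209."""
    try:
        return FETCHERS[storage_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported storage type: {storage_type!r}") from None
```

An unmatched type raises `ValueError`, which corresponds to the exception prompt branch; a matched type yields the routine that steps 204 to 208 would then drive.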

Step 204, after the corresponding storage system is matched, checking whether the data set to be cached already exists at the specified cache directory location. If it does not exist, the latest data set is cached. If it exists, it is determined whether the operation is a resumed (breakpoint) transfer; if so, the cached data set is verified, and after the verification passes the breakpoint information record is read and caching of the data set continues from the breakpoint. If it is not a resumed transfer, whether the data set has changed is judged according to the size, number, and date of the files in the directory; if it has changed, the data set is cached again, and if not, prompt information is output.

Step 205, when caching of the data set starts, a thread pool is used to open a separate thread for each directory, and that thread is responsible for caching the data under the directory. Before caching, all files under the directory are scanned and divided into four categories, namely pictures, videos, compressed files, and other files, and the specified file types can be cached first according to the configured caching priority.

Step 206, when the data set is cached according to the priority, the files of each preferentially cached file type are sorted by file size, and small files are cached first.

Step 207, if an abnormal interruption occurs during caching, the tool stores the breakpoint information in a breakpoint file or a database, according to the configuration information.
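One way to keep the breakpoint information of step 207 in a breakpoint file is sketched below. The JSON layout (`done`, `interrupted`, `offset`) and the function names are assumptions; the source does not specify the record format.

```python
import json
from pathlib import Path
from typing import List, Optional


def save_breakpoint(record_path: Path, done: List[str],
                    interrupted: str, offset: int) -> None:
    """Write the record to a temporary file, then rename it into place."""
    record = {"done": done, "interrupted": interrupted, "offset": offset}
    tmp_path = record_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(record))
    tmp_path.replace(record_path)  # avoids leaving a half-written record


def load_breakpoint(record_path: Path) -> Optional[dict]:
    """Return the stored record, or None when no breakpoint file exists."""
    if not record_path.exists():
        return None
    return json.loads(record_path.read_text())


def remaining(all_files: List[str], record: Optional[dict]) -> List[str]:
    """Files still to fetch after a restart, keeping their original order."""
    done = set(record["done"]) if record else set()
    return [name for name in all_files if name not in done]
```

On restart, the interrupted file stays in the remaining list, so the caller can either refetch it or continue from the recorded `offset`.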

Step 208, if the data set has been cached completely, outputting a completion prompt message.

According to the method provided by the embodiment of the application, resumable (breakpoint) transfer is supported when the data set is cached; preferential caching is selected by setting the cache priority configuration; recursive directory caching is supported, with the thread pool opening a thread for each directory to cache all files under it; files are sorted by size, small files are cached first in the thread pool, and large files are cut into slices and cached through a pipeline. Caching the data set required by AI model training in this way shortens the caching time. Different storage systems can also be connected, so that the data set is cached from those storage systems to the server where the training task is located, the training task can enter the running state quickly, and the total model training time is shortened.
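The slicing of large files is not detailed in the source. As one possible reading, the sketch below copies a file in fixed-size slices and can restart from a byte offset, which also fits the breakpoint mechanism; the function name `copy_in_chunks`, the default slice size, and the offset handling are all assumptions.

```python
import os


def copy_in_chunks(src: str, dst: str, chunk_size: int = 1 << 20,
                   offset: int = 0) -> int:
    """Copy src to dst slice by slice, starting at offset; return total bytes in dst."""
    if offset and not os.path.exists(dst):
        offset = 0  # nothing to resume from, start over
    mode = "r+b" if offset else "wb"
    copied = offset
    with open(src, "rb") as fin, open(dst, mode) as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.truncate()  # drop any partial data past the resume point
        while chunk := fin.read(chunk_size):
            fout.write(chunk)
            copied += len(chunk)
    return copied
```

Reading one slice at a time keeps memory use bounded regardless of file size, and the returned byte count is the value a breakpoint record would store.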

An apparatus for acquiring a data set on a server is provided in an embodiment of the present application, and includes a processor and a memory, where the memory stores a computer program, and the processor is configured to call the computer program in the memory to implement operations including:

acquiring a file format of data under a directory entry;

acquiring file size information of each file in the same type of files;

acquiring a storage format corresponding to the type of the storage system;

According to the device provided by the embodiment of the application, after an acquisition request for a data set is received, directory information of the data set under a storage path is acquired, a corresponding thread is configured for each directory entry in the directory information, and the thread corresponding to each directory entry is controlled to acquire the data under its respective directory entry. At least two threads are thus set up according to the directory information without altering the original design of the server, and the data of the data set are downloaded by the at least two threads respectively, so that the acquisition time of the data set is shortened and the acquisition efficiency of the data set is improved.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims

1. A method of obtaining a data set on a server, comprising:

2. The method according to claim 1, wherein the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:

acquiring a file format of data under a directory entry;

3. The method according to claim 2, wherein the batch-wise acquiring different kinds of files according to the preset data acquisition policy comprises:

acquiring file size information of each file in the same type of files;

4. The method according to claim 1, wherein the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:

5. The method of claim 1, wherein:

before the obtaining the directory information of the data set under the storage path, the method further includes:

acquiring a storage format corresponding to the type of the storage system;

6. An apparatus for obtaining a data set on a server, comprising a processor and a memory, wherein the memory stores a computer program, and the processor is configured to invoke the computer program in the memory to perform operations comprising:

7. The apparatus of claim 6, wherein the processor is configured to invoke a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry, and the operation includes:

acquiring a file format of data under a directory entry;

8. The apparatus of claim 7, wherein the processor is configured to invoke a computer program in the memory to implement the operations of batch-wise fetching different types of files according to a preset data fetching policy, comprising:

acquiring file size information of each file in the same type of files;

9. The apparatus of claim 6, wherein the processor is configured to invoke a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry, and the operation includes:

10. The apparatus of claim 6, wherein:

before the processor is configured to call the computer program in the memory to implement the operation of obtaining the directory information of the data set under the storage path, the processor is configured to call the computer program in the memory to implement the following operations, including:

acquiring a storage format corresponding to the type of the storage system;