CN111104387A - Method and device for acquiring data set on server - Google Patents

Method and device for acquiring data set on server

Publication number
CN111104387A
Authority
CN
China
Prior art keywords
files
directory entry
thread
directory
data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911156049.5A
Other languages
Chinese (zh)
Inventor
王继玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN201911156049.5A
Publication of CN111104387A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183 Provision of network file services by network file servers, e.g. by using NFS, CIFS
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44505 Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451 User profiles; Roaming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a method and a device for acquiring a data set on a server. The method comprises the following steps: after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path; configuring a corresponding thread for each directory entry in the directory information; and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.

Description

Method and device for acquiring data set on server
Technical Field
The present disclosure relates to the field of information processing, and more particularly, to a method and an apparatus for acquiring a data set on a server.
Background
In an artificial intelligence system, model training requires a large number of data sets as input. Before training, the data set must be copied or downloaded to the computing node that executes the model training operation; the larger the data set, the longer the copy or download takes. Caching the data set therefore affects training efficiency and increases the total model training time. In a distributed cluster, every computing node needs the data set, and if a shared data set is not used, every computing node must cache its own copy, which further reduces the efficiency of distributed training. How to increase the acquisition speed of the data set is therefore an urgent problem to be solved.
Disclosure of Invention
In order to solve any of the above technical problems, embodiments of the present application provide a method and an apparatus for acquiring a data set on a server.
To achieve the object of the embodiment of the present application, an embodiment of the present application provides a method for acquiring a data set on a server, including:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
In an exemplary embodiment, the controlling the thread corresponding to each directory entry to perform an obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
In an exemplary embodiment, the batch-wise acquiring different types of files according to a preset data acquisition policy includes:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
In an exemplary embodiment, the controlling the thread corresponding to each directory entry to perform an obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
in the process of acquiring the files under the directory items corresponding to the threads, if the interruption of the file acquisition operation is detected, recording the information of the files subjected to the interruption operation;
and when the acquisition operation of the file in the directory corresponding to the thread is detected to be restarted, continuously executing the acquisition operation according to the information of the file with the interrupted operation.
In an exemplary embodiment, before the obtaining the directory information of the data set under the storage path, the method further includes:
obtaining the type of a storage system used for storing the data set on the computing node which initiates the obtaining request;
the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
An apparatus for obtaining a data set on a server, comprising a processor and a memory, wherein the memory stores a computer program, the processor being configured to invoke the computer program in the memory to implement operations comprising:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of batch-wise acquiring different types of files according to a preset data acquisition policy, and includes:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
in the process of acquiring the files under the directory items corresponding to the threads, if the interruption of the file acquisition operation is detected, recording the information of the files subjected to the interruption operation;
and when the acquisition operation of the file in the directory corresponding to the thread is detected to be restarted, continuously executing the acquisition operation according to the information of the file with the interrupted operation.
In an exemplary embodiment, the processor is configured to call the computer program in the memory to implement the following operations before the operation of obtaining the directory information of the data set under the storage path is implemented, and the processor is configured to call the computer program in the memory to implement the following operations further including:
obtaining the type of a storage system used for storing the data set on the computing node which initiates the obtaining request;
the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
According to the scheme provided by the embodiments of the application, after the acquisition request of the data set is received, the directory information of the data set under the storage path is acquired, a corresponding thread is configured for each directory entry in the directory information, and the thread corresponding to each directory entry is controlled to acquire the data under its respective directory entry. At least two threads are thus set up according to the directory information without altering the original design of the server, and the data of the data set is downloaded by the at least two threads respectively, which shortens the acquisition time of the data set and improves acquisition efficiency.
Additional features and advantages of the embodiments of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the embodiments of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the present application and are incorporated in and constitute a part of this specification. Together with the examples of the present application, they serve to explain the embodiments of the present application and do not constitute a limitation thereof.
Fig. 1 is a flowchart of a method for acquiring a data set on a server according to an embodiment of the present application;
Fig. 2 is a flowchart of a data set acquisition method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that, in the embodiments of the present application, features in the embodiments and the examples may be arbitrarily combined with each other without conflict.
Fig. 1 is a flowchart of a method for acquiring a data set on a server according to an embodiment of the present application. As shown in fig. 1, the method includes:
Step 101, after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path.
In an exemplary embodiment, the acquisition request may be initiated by a computing node performing a model training operation. The acquisition request may include identification information of the data set, where the identification information may be file name information or storage location information. A storage path of the data set is determined according to the identification information, and the directory information of the data under the storage path is read.
Step 102, configuring a corresponding thread for each directory entry in the directory information.
In an exemplary embodiment, the directory information may include at least two directory entries, and a thread for acquiring data is created for each directory entry. Because the threads are created from the directory information, dividing the data acquisition task also establishes the correspondence between each thread and the data it is to download, which shortens the preparation time of the data set acquisition operation and improves data acquisition efficiency.
In an exemplary embodiment, a thread may also be configured to stop running once the data of its directory entry has been downloaded, so that thread resources are used reasonably.
Step 103, controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
In an exemplary embodiment, the data of each directory entry can be downloaded in a multi-threaded manner by using the thread corresponding to each directory entry, which improves data acquisition efficiency.
According to the method provided by the embodiments of the application, after an acquisition request of a data set is received, directory information of the data set under a storage path is acquired, a corresponding thread is configured for each directory entry in the directory information, and the thread corresponding to each directory entry is controlled to acquire the data under its respective directory entry. At least two threads are thus set up according to the directory information without altering the original design of the server, and the data of the data set is downloaded by the at least two threads respectively, which shortens the acquisition time of the data set and improves acquisition efficiency.
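Steps 101 to 103 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `acquire_dataset` and `copy_entry` are hypothetical, and it assumes the storage path is reachable as an ordinary file system path (for example a mounted share).

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict


def copy_entry(src: Path, dst_root: Path) -> int:
    """Copy one directory entry (a file or a subdirectory); return the number of files it holds."""
    dst = dst_root / src.name
    if src.is_dir():
        shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+
        return sum(1 for p in src.rglob("*") if p.is_file())
    shutil.copy2(src, dst)
    return 1


def acquire_dataset(storage_path: str, cache_dir: str) -> Dict[str, int]:
    """Hypothetical helper: fetch a data set using one thread per directory entry."""
    src_root, dst_root = Path(storage_path), Path(cache_dir)
    dst_root.mkdir(parents=True, exist_ok=True)
    # Step 101: read the directory information under the storage path.
    entries = sorted(src_root.iterdir())
    if not entries:
        return {}
    # Step 102: configure one worker thread per directory entry.
    with ThreadPoolExecutor(max_workers=len(entries)) as pool:
        futures = {e.name: pool.submit(copy_entry, e, dst_root) for e in entries}
        # Step 103: each thread acquires the data under its own entry.
        return {name: fut.result() for name, fut in futures.items()}
```

Because each worker returns as soon as its entry is copied, the thread is released once its directory entry is done, which matches the resource-use remark in step 102.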
The method provided by the embodiments of the present application is explained as follows:
in an exemplary embodiment, the controlling the thread corresponding to each directory entry to perform an obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
In the above exemplary embodiment, either every thread or only some of the threads may perform the above operations. Which threads are selected may be determined according to the size of the data under the directory entry; for example, when the data size under a directory entry meets a preset condition indicating a large amount of data, the thread of that directory entry may be configured to perform the above operations.
In the above exemplary embodiment, a thread pool opens a separate thread for each directory, and that thread is responsible for caching the data set under the directory. Before caching, all files in the directory are scanned and divided into four categories: pictures, videos, compressed files, and other files. According to the configured caching priority, the specified file types can be cached first, so that data the user needs more urgently is obtained earlier.
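The classification and priority ordering described above might look like the following sketch. The extension lists, the category names, and the functions `categorize` and `batches_by_priority` are assumptions made for illustration; the source only names the four categories.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical extension lists; a real deployment would configure its own.
EXTENSIONS = {
    "picture": {".jpg", ".jpeg", ".png", ".bmp", ".gif"},
    "video": {".mp4", ".avi", ".mkv", ".mov"},
    "compressed": {".zip", ".tar", ".gz", ".rar", ".7z"},
}


def categorize(file_name: str) -> str:
    """Map a file name to one of: picture, video, compressed, other."""
    suffix = Path(file_name).suffix.lower()
    for category, suffixes in EXTENSIONS.items():
        if suffix in suffixes:
            return category
    return "other"


def batches_by_priority(
    file_names: Iterable[str], priority: Sequence[str]
) -> List[Tuple[str, List[str]]]:
    """Group files by category and order the groups by the configured priority."""
    groups: Dict[str, List[str]] = {}
    for name in file_names:
        groups.setdefault(categorize(name), []).append(name)
    ordered = [c for c in priority if c in groups]
    # Categories not mentioned in the priority list are fetched last.
    ordered += [c for c in groups if c not in priority]
    return [(c, groups[c]) for c in ordered]
```

A thread would then fetch the returned batches one after another, so the category configured first is cached first.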
In an exemplary embodiment, the batch-wise acquiring different types of files according to a preset data acquisition policy includes:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
In the above exemplary embodiment, the files of each preferentially acquired file type are sorted by file size and the small files of that type are cached first, so that as many files as possible have been downloaded whenever the download operation ends.
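A minimal sketch of the small-files-first ordering within one file type follows. The function name `smallest_first` is hypothetical, and breaking ties by file name is an added assumption to keep the order deterministic.

```python
from pathlib import Path
from typing import Iterable, List


def smallest_first(paths: Iterable[Path]) -> List[Path]:
    """Return the files of one type ordered from smallest to largest."""
    # Ties are broken by name (an assumption, not stated in the source).
    return sorted(paths, key=lambda p: (p.stat().st_size, p.name))
```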
In an exemplary embodiment, the controlling the thread corresponding to each directory entry to perform an obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
in the process of acquiring the files under the directory items corresponding to the threads, if the interruption of the file acquisition operation is detected, recording the information of the files subjected to the interruption operation;
and when the acquisition operation of the file in the directory corresponding to the thread is detected to be restarted, continuously executing the acquisition operation according to the information of the file with the interrupted operation.
In the above exemplary embodiment, the data set to be cached is checked first: whether the data set already exists at the specified cache directory location is determined. If it does not exist, the latest data set is cached. If it exists, whether this is a resumed (breakpoint) transfer is determined. For a resumed transfer, the breakpoint information record is read and caching of the data set continues from the breakpoint. Otherwise, whether the data set has changed is judged according to the size, number, and date of the files in the directory; if it has changed, the data set is cached again, and if it has not changed, prompt information is output.
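The decision flow above can be sketched as a small function. The names `fingerprint` and `decide_action`, the returned action strings, and the use of a separate breakpoint file to signal a resumed transfer are all assumptions; the source does not say how the number, size, and date of files are compared.

```python
from pathlib import Path
from typing import Tuple


def fingerprint(root: Path) -> Tuple[int, int, int]:
    """Summarise a directory by file count, total size, and newest modification time."""
    stats = [p.stat() for p in root.rglob("*") if p.is_file()]
    newest = max((int(s.st_mtime) for s in stats), default=0)
    return (len(stats), sum(s.st_size for s in stats), newest)


def decide_action(source_dir: Path, cache_dir: Path, breakpoint_file: Path) -> str:
    """Hypothetical helper returning which caching path to take."""
    if not cache_dir.exists():
        return "cache_fresh"              # no cached copy yet
    if breakpoint_file.exists():
        return "resume_from_breakpoint"   # an earlier run was interrupted
    if fingerprint(source_dir) != fingerprint(cache_dir):
        return "recache"                  # the data set changed at the source
    return "unchanged"                    # only a prompt message is needed
```

The comparison only works if the copy preserves modification times, as `shutil.copy2` and `shutil.copytree` do by default.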
In an exemplary embodiment, before the obtaining the directory information of the data set under the storage path, the method further includes:
obtaining the type of a storage system used for storing the data set on the computing node which initiates the obtaining request;
the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
In the above exemplary embodiment, different users may store the data sets in different storage systems, such as BeeGFS, NFS (Network File System), HDFS (Hadoop Distributed File System), or cloud storage systems, and different storage systems need to be connected when caching the data sets. By acquiring the type of the storage system and controlling the thread corresponding to each directory entry to acquire data according to the storage format corresponding to that type, mismatches in storage format when the data set is used are reduced, and the training task can enter the running state quickly.
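One way to picture the storage-type matching is a lookup table from the storage type parameter to a fetch routine. This is a sketch under assumptions: the keys, `select_fetcher`, `fetch_from_mount`, and `fetch_via_client` are hypothetical names, and the non-mount branch is left as a placeholder rather than guessing at any particular client library.

```python
import shutil
from typing import Callable, Dict


def fetch_from_mount(src: str, dst: str) -> None:
    """Mount-type storage (e.g. BeeGFS, NFS) and local disks appear as ordinary paths."""
    shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+


def fetch_via_client(src: str, dst: str) -> None:
    """Non-mount storage (e.g. HDFS, cloud storage) needs that system's own client."""
    raise NotImplementedError("plug in the storage system's client here")


# Hypothetical registry keyed by the configured storage type parameter.
FETCHERS: Dict[str, Callable[[str, str], None]] = {
    "beegfs": fetch_from_mount,
    "nfs": fetch_from_mount,
    "local": fetch_from_mount,
    "hdfs": fetch_via_client,
    "cloud": fetch_via_client,
}


def select_fetcher(storage_type: str) -> Callable[[str, str], None]:
    """Return the fetch routine for a storage type, or raise if it is unsupported."""
    try:
        return FETCHERS[storage_type.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported storage type: {storage_type!r}") from None
```

An unsupported type raises an error, which corresponds to outputting a prompt when no storage system is matched.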
The method provided in the embodiments of the present application is further illustrated below:
The method provided by the embodiments of the application can be applied to a server running a Linux system, and can be embedded as a tool module in an AI (artificial intelligence) platform. For example, the execution steps of the method can be encapsulated into a Linux command and used directly as such; this is 6-7 times faster than copying with the Linux cp command. The tool can also be embedded in an AI platform as a module for caching data sets, or exposed as an API and used directly through that interface.
In addition, the method can cache the data sets required for AI model training and can interface with the BeeGFS, HDFS, NFS, and Dahua cloud storage systems to cache the data sets in those storage systems to the server, which shortens the time for caching a data set and allows the training task to enter the running state quickly.
Fig. 2 is a flowchart of a data set acquisition method according to an embodiment of the present application. As shown in fig. 2, the method includes:
Step 201, reading configuration parameters of the acquisition operation of the data set.
Step 202, receiving an operation command and parameters for executing the acquisition operation.
The operation command is to cache the data set to the local server, and the parameter may specify the location of the data set storage system on the local server.
In addition, instead of caching the data set locally from the interfaced storage system, a file or directory can be copied directly on the server, similar to the Linux cp command.
Step 203, determining the type of the storage system.
Storage parameters for interfacing with the BeeGFS, NFS, HDFS, and Dahua cloud storage systems are configured in advance, and the corresponding storage system can be connected automatically according to the configured or input storage type parameter.
The mount-type storage systems (BeeGFS and NFS), the non-mount-type storage systems (HDFS and Dahua cloud storage), and the Linux local storage system are each given their own encapsulation, so that when the tool is run to cache a data set, it automatically selects the corresponding storage method according to the parameters passed in.
When a data set is cached, the caching speed is affected by the network and the storage device: the better the network conditions and the higher the bandwidth, the higher the caching speed. For storage devices, memory is faster than an SSD (solid state disk), and an SSD is faster than an HDD (mechanical hard disk). Even on an ordinary HDD, the tool is 7 times faster than the cp copy command of the Linux system itself, which greatly shortens the copy time of a large data set.
If a corresponding storage system is matched, steps 204 to 208 are executed; otherwise, step 209 is executed and an exception prompt message is output.
Step 204, after the corresponding storage system is matched, checking whether the data set to be cached exists at the specified cache directory location. If it does not exist, the latest data set is cached. If it exists, whether this is a resumed (breakpoint) transfer is determined; if so, the cached data set is verified, and after the verification passes, the breakpoint information record is read and caching of the data set starts from the breakpoint. If it is not a resumed transfer, whether the data set has changed is judged according to the size, number, and date of the files in the directory; if it has changed, the data set is cached again, and if not, prompt information is output.
Step 205, when caching of the data set starts, a thread pool opens a separate thread for each directory, which is responsible for caching the data set under that directory. Before caching, all files under the directory are scanned and divided into four categories: pictures, videos, compressed files, and other files. The specified file types can be cached first according to the configured caching priority.
Step 206, when the data set is cached according to priority, the files of each preferentially cached file type are sorted by file size, and small files are cached first.
Step 207, if an abnormal interruption occurs during caching, the tool stores the breakpoint information in a breakpoint file or a database according to the configuration information.
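Storing breakpoint information in a breakpoint file, as in step 207, could be sketched as follows. The JSON layout (`completed`, `interrupted`, `offset`) and the function names are assumptions; the source does not specify what the breakpoint record contains.

```python
import json
import os
from pathlib import Path
from typing import List, Optional


def save_breakpoint(record_path: Path, completed: List[str],
                    interrupted: str, offset: int) -> None:
    """Persist which files are done and where the interrupted file stopped."""
    record = {"completed": completed, "interrupted": interrupted, "offset": offset}
    tmp = record_path.with_name(record_path.name + ".tmp")
    tmp.write_text(json.dumps(record), encoding="utf-8")
    os.replace(tmp, record_path)  # swap in the new record in one step


def load_breakpoint(record_path: Path) -> Optional[dict]:
    """Return the saved record, or None when no interrupted run is recorded."""
    if not record_path.exists():
        return None
    return json.loads(record_path.read_text(encoding="utf-8"))
```

Writing to a temporary file and renaming it means a crash during the write does not leave a half-written record behind.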
Step 208, if the data set has been cached completely, outputting a completion prompt message.
The method provided by the embodiments of the application supports resumable (breakpoint) transfer when caching a data set; allows the types to be cached first to be selected through the cache priority configuration; supports recursive directory caching, in which a thread pool opens a thread for each directory and caches all files under that directory; and sorts files by size, caching small files first in the thread pool while cutting large files into slices and caching them through a pipeline. By using the method to cache the data sets required for AI model training, the time for caching a data set can be shortened. Different storage systems can also be connected, so that data sets are cached from those storage systems to the server where the training task is located, the training task can enter the running state quickly, and the total model training time is shortened.
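The source does not detail how large files are sliced or what the pipeline looks like. As one possible reading, the sketch below copies a file in fixed-size slices and can restart from a recorded byte offset; `copy_in_chunks`, the default slice size, and the offset-based resume are all assumptions.

```python
from pathlib import Path


def copy_in_chunks(src: Path, dst: Path,
                   chunk_size: int = 4 * 1024 * 1024, offset: int = 0) -> int:
    """Copy src to dst slice by slice, optionally resuming at a byte offset.

    Returns the total number of bytes present in dst afterwards.
    """
    resuming = offset > 0 and dst.exists()
    if not resuming:
        offset = 0  # nothing to resume from, start over
    with open(src, "rb") as fin, open(dst, "r+b" if resuming else "wb") as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.truncate()  # discard any partial data past the resume point
        copied = offset
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
    return copied
```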
An apparatus for acquiring a data set on a server is provided in an embodiment of the present application, and includes a processor and a memory, where the memory stores a computer program, and the processor is configured to call the computer program in the memory to implement operations including:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of batch-wise acquiring different types of files according to a preset data acquisition policy, and includes:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
In an exemplary embodiment, the processor is configured to call the computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to acquire the data under its respective directory entry, the operation including:
controlling at least one thread to perform the following operations on the files under the directory entry corresponding to that thread:
while acquiring the files under the directory entry corresponding to the thread, if an interruption of the file acquisition operation is detected, recording information about the file whose acquisition was interrupted;
and when a restart of the acquisition operation on the files under the directory entry corresponding to the thread is detected, continuing the acquisition operation according to the recorded information about the interrupted file.
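One possible way to record and resume an interrupted acquisition is sketched below in Python. Everything here is an assumption made for illustration: the application does not say what information is recorded or where, so this sketch stores the name of the interrupted file in a small JSON state file (`state_path`), treats any exception raised by the caller-supplied `fetch` function as the interruption, and restarts from that file on the next call.

```python
import json
import os
from typing import Callable, List


def fetch_with_resume(files: List[str],
                      fetch: Callable[[str], None],
                      state_path: str) -> None:
    """Fetch files in order; on interruption record the file and re-raise.

    On the next call with the same state file, fetching continues from the
    recorded file instead of starting over.
    """
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as fh:
            interrupted = json.load(fh)["interrupted"]
        if interrupted in files:
            start = files.index(interrupted)

    for name in files[start:]:
        try:
            fetch(name)
        except Exception:
            # Record which file was being acquired when the interruption hit.
            with open(state_path, "w") as fh:
                json.dump({"interrupted": name}, fh)
            raise

    # Everything was acquired, so the recorded state is no longer needed.
    if os.path.exists(state_path):
        os.remove(state_path)
```

In this sketch the interrupted file is fetched again from its beginning; resuming in the middle of a file would need byte offsets in the recorded state as well.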
In an exemplary embodiment, before implementing the operation of acquiring the directory information of the data set under the storage path, the processor is further configured to call the computer program in the memory to implement the following operation:
obtaining the type of the storage system used for storing the data set on the computing node that initiated the acquisition request;
and the operation of controlling the thread corresponding to each directory entry to acquire the data under its respective directory entry includes:
acquiring the storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under its respective directory entry according to the storage format.
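The lookup from storage system type to storage format could be as simple as a table, as in the Python sketch below. The type names and format names in `STORAGE_FORMAT_BY_TYPE` are purely hypothetical placeholders; this passage does not name any concrete storage systems or formats.

```python
from typing import Dict

# Hypothetical mapping from storage system type to storage format.
STORAGE_FORMAT_BY_TYPE: Dict[str, str] = {
    "local_disk": "raw_files",
    "object_store": "objects",
}


def storage_format_for(storage_type: str) -> str:
    """Return the storage format for a storage system type (case-insensitive)."""
    try:
        return STORAGE_FORMAT_BY_TYPE[storage_type.lower()]
    except KeyError:
        raise ValueError(
            f"unsupported storage system type: {storage_type!r}"
        ) from None
```

Each acquisition thread would receive the resolved format and write the data it acquires accordingly.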
With the device provided by the embodiments of the application, after an acquisition request for a data set is received, the directory information of the data set under the storage path is acquired, a corresponding thread is configured for each directory entry in the directory information, and the thread corresponding to each directory entry is controlled to acquire the data under its respective directory entry. Without altering the original design of the server, at least two threads are set up according to the directory information and each downloads part of the data of the data set, which shortens the acquisition time of the data set and improves the acquisition efficiency.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims (10)

1. A method of obtaining a data set on a server, comprising:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
2. The method according to claim 1, wherein the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
3. The method according to claim 2, wherein the batch-wise acquiring different kinds of files according to the preset data acquisition policy comprises:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
4. The method according to claim 1, wherein the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
in the process of acquiring the files under the directory items corresponding to the threads, if the interruption of the file acquisition operation is detected, recording the information of the files subjected to the interruption operation;
and when the acquisition operation of the file in the directory corresponding to the thread is detected to be restarted, continuously executing the acquisition operation according to the information of the file with the interrupted operation.
5. The method of claim 1, wherein:
before the obtaining the directory information of the data set under the storage path, the method further includes:
obtaining the type of a storage system used for storing the data set on the computing node which initiates the obtaining request;
the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
6. An apparatus for obtaining a data set on a server, comprising a processor and a memory, wherein the memory stores a computer program, and the processor is configured to invoke the computer program in the memory to perform operations comprising:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
7. The apparatus of claim 6, wherein the processor is configured to invoke a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry, and the operation includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
8. The apparatus of claim 7, wherein the processor is configured to invoke a computer program in the memory to implement the operations of batch-wise fetching different types of files according to a preset data fetching policy, comprising:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
9. The apparatus of claim 6, wherein the processor is configured to invoke a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry, and the operation includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
in the process of acquiring the files under the directory items corresponding to the threads, if the interruption of the file acquisition operation is detected, recording the information of the files subjected to the interruption operation;
and when the acquisition operation of the file in the directory corresponding to the thread is detected to be restarted, continuously executing the acquisition operation according to the information of the file with the interrupted operation.
10. The apparatus of claim 6, wherein:
before the processor is configured to call the computer program in the memory to implement the operation of obtaining the directory information of the data set under the storage path, the processor is configured to call the computer program in the memory to implement the following operations, including:
obtaining the type of a storage system used for storing the data set on the computing node which initiates the obtaining request;
the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911156049.5A CN111104387A (en) 2019-11-22 2019-11-22 Method and device for acquiring data set on server

Publications (1)

Publication Number Publication Date
CN111104387A true CN111104387A (en) 2020-05-05

Family

ID=70421137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911156049.5A Pending CN111104387A (en) 2019-11-22 2019-11-22 Method and device for acquiring data set on server

Country Status (1)

Country Link
CN (1) CN111104387A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101977231A (en) * 2010-10-21 2011-02-16 英业达集团(天津)电子技术有限公司 Method for downloading mapping file
CN103858407A (en) * 2013-12-02 2014-06-11 华为技术有限公司 File processing method, device and system
CN104780120A (en) * 2015-04-15 2015-07-15 天脉聚源(北京)教育科技有限公司 Method and device for transmitting files in local area network
CN110019024A (en) * 2019-04-11 2019-07-16 苏州浪潮智能科技有限公司 A kind of directory method, system and electronic equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111858538A (en) * 2020-06-19 2020-10-30 苏州浪潮智能科技有限公司 Method, device, equipment and medium for configuring BeeGFS quota by cluster
CN111858538B (en) * 2020-06-19 2022-05-24 苏州浪潮智能科技有限公司 Method, device, equipment and medium for configuring BeeGFS quota by cluster
CN111966283A (en) * 2020-07-06 2020-11-20 云知声智能科技股份有限公司 Client multi-level caching method and system based on enterprise-level super-computation scene
CN112214310A (en) * 2020-09-09 2021-01-12 苏州浪潮智能科技有限公司 Data set cache queuing method and device
CN112214310B (en) * 2020-09-09 2022-08-02 苏州浪潮智能科技有限公司 Data set cache queuing method and device
CN112600913A (en) * 2020-12-07 2021-04-02 焦点科技股份有限公司 Download management method based on Android
CN113468119A (en) * 2021-05-31 2021-10-01 北京明朝万达科技股份有限公司 File scanning method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200505