CN111104387A - Method and device for acquiring data set on server - Google Patents
- Publication number
- CN111104387A (application number CN201911156049.5A)
- Authority
- CN
- China
- Prior art keywords
- files
- directory entry
- thread
- directory
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
- G06F16/183—Provision of network file services by network file servers, e.g. by using NFS, CIFS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44505—Configuring for program initiating, e.g. using registry, configuration files
- G06F9/4451—User profiles; Roaming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The embodiment of the application discloses a method and a device for acquiring a data set on a server. The method comprises the following steps: after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path; configuring a corresponding thread for each directory entry in the directory information; and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
Description
Technical Field
The present disclosure relates to the field of information processing, and more particularly, to a method and an apparatus for acquiring a data set on a server.
Background
In an artificial intelligence system, model training requires a large number of data sets as input. A data set is acquired by copying or downloading it to the computing node that executes the model training operation. The larger the data set, the longer the copy or download takes, so caching the data set reduces training efficiency and increases the total model training time. In a distributed cluster, every computing node needs the data set, and if a shared data set is not used, every computing node must cache it as well, which further reduces the efficiency of distributed training. Therefore, how to increase the acquisition speed of the data set is an urgent problem to be solved.
Disclosure of Invention
In order to solve any of the above technical problems, embodiments of the present application provide a method and an apparatus for acquiring a data set on a server.
To achieve the object of the embodiment of the present application, an embodiment of the present application provides a method for acquiring a data set on a server, including:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
In an exemplary embodiment, the controlling the thread corresponding to each directory entry to perform an obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
In an exemplary embodiment, the batch-wise acquiring different types of files according to a preset data acquisition policy includes:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
In an exemplary embodiment, the controlling the thread corresponding to each directory entry to perform an obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
in the process of acquiring the files under the directory items corresponding to the threads, if the interruption of the file acquisition operation is detected, recording the information of the files subjected to the interruption operation;
and when the acquisition operation of the file in the directory corresponding to the thread is detected to be restarted, continuously executing the acquisition operation according to the information of the file with the interrupted operation.
In an exemplary embodiment, before the obtaining the directory information of the data set under the storage path, the method further includes:
obtaining the type of a storage system used for storing the data set on the computing node which initiates the obtaining request;
the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
An apparatus for obtaining a data set on a server, comprising a processor and a memory, wherein the memory stores a computer program, the processor being configured to invoke the computer program in the memory to implement operations comprising:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of batch-wise acquiring different types of files according to a preset data acquisition policy, and includes:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
in the process of acquiring the files under the directory items corresponding to the threads, if the interruption of the file acquisition operation is detected, recording the information of the files subjected to the interruption operation;
and when the acquisition operation of the file in the directory corresponding to the thread is detected to be restarted, continuously executing the acquisition operation according to the information of the file with the interrupted operation.
In an exemplary embodiment, before implementing the operation of obtaining the directory information of the data set under the storage path, the processor is further configured to call the computer program in the memory to implement the following operation:
obtaining the type of a storage system used for storing the data set on the computing node which initiates the obtaining request;
the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
According to the scheme provided by the embodiment of the application, after an acquisition request of a data set is received, the directory information of the data set under the storage path is acquired, a corresponding thread is configured for each directory entry in the directory information, and the thread corresponding to each directory entry is controlled to acquire the data under its own directory entry. At least two threads are thus set up according to the directory information without altering the original design of the server, and each of them downloads part of the data set, which shortens the acquisition time of the data set and improves acquisition efficiency.
Additional features and advantages of the embodiments of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the embodiments of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings provide a further understanding of the embodiments of the present application and are incorporated in and constitute a part of this specification. Together with the embodiments of the present application, they serve to explain the technical solutions of the present application and do not constitute a limitation of the embodiments of the present application.
Fig. 1 is a flowchart of a method for acquiring a data set on a server according to an embodiment of the present application;
Fig. 2 is a flowchart of a data set acquisition method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that, in the embodiments of the present application, features in the embodiments and the examples may be arbitrarily combined with each other without conflict.
Fig. 1 is a flowchart of a method for acquiring a data set on a server according to an embodiment of the present application. As shown in fig. 1, the method includes:
Step 101, after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
in one exemplary embodiment, the acquisition request may be initiated by a computing node performing a model training operation; the acquisition request may include identification information of the data set, where the identification information may be file name information or storage location information; the storage path of the data set is determined according to the identification information of the data set, and the directory information of the data under the storage path is read;
Step 102, configuring a corresponding thread for each directory entry in the directory information;
in an exemplary embodiment, the directory information may include at least two directory entries, and a thread for acquiring data is created for each directory entry; and establishing a corresponding thread based on the directory information, and establishing a corresponding relation between the thread and the data to be downloaded while finishing the division of the data acquisition task, so that the preparation time of the data set acquisition operation is shortened, and the data acquisition efficiency is improved.
In an exemplary embodiment, a thread may also be configured to terminate once the data of its directory entry has been downloaded, so that thread resources are used reasonably.
Step 103, controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
In an exemplary embodiment, data of each directory entry can be obtained and downloaded in a multi-thread manner by using the thread corresponding to each directory entry, so that the data obtaining efficiency is improved.
According to the method provided by the embodiment of the application, after an acquisition request of a data set is received, the directory information of the data set under the storage path is acquired, a corresponding thread is configured for each directory entry in the directory information, and the thread corresponding to each directory entry is controlled to acquire the data under its own directory entry. At least two threads are thus set up according to the directory information without altering the original design of the server, and each of them downloads part of the data set, which shortens the acquisition time of the data set and improves acquisition efficiency.
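Steps 101 to 103 can be illustrated with a minimal Python sketch. It is not the implementation described in the application: it assumes the storage path is a locally visible directory, treats each top-level item under it as one directory entry, and uses the hypothetical names `fetch_entry` and `fetch_dataset`.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def fetch_entry(src_entry: Path, dst_root: Path) -> str:
    """Copy one directory entry (a sub-directory or a single file) into dst_root."""
    target = dst_root / src_entry.name
    if src_entry.is_dir():
        shutil.copytree(src_entry, target, dirs_exist_ok=True)  # Python 3.8+
    else:
        shutil.copy2(src_entry, target)
    return src_entry.name


def fetch_dataset(storage_path: str, cache_path: str) -> List[str]:
    """Read the directory information, then give every directory entry its own thread."""
    src_root, dst_root = Path(storage_path), Path(cache_path)
    dst_root.mkdir(parents=True, exist_ok=True)

    # Step 101: directory information of the data set under the storage path.
    entries = sorted(src_root.iterdir())
    if not entries:
        return []

    # Step 102: one worker thread per directory entry (an assumption of this sketch;
    # a real tool would probably cap the pool size).
    with ThreadPoolExecutor(max_workers=len(entries)) as pool:
        # Step 103: each thread acquires the data under its own entry.
        return list(pool.map(lambda entry: fetch_entry(entry, dst_root), entries))
```

Each worker returns as soon as its entry has been copied, which matches the idea of ending a thread once the data of its directory entry has been downloaded.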
The method provided by the embodiments of the present application is explained as follows:
in an exemplary embodiment, the controlling the thread corresponding to each directory entry to perform an obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
In the above exemplary embodiment, either every thread or only some of the threads may perform the above operations. The threads may be selected according to the size of the data under the directory entry; for example, when the data size under a directory entry meets a preset condition indicating a large amount of data, the thread of that directory entry may be configured to perform the above operations.
In the above exemplary embodiment, a thread pool is used to open a separate thread for each directory, and that thread is responsible for caching the part of the data set under the directory. Before caching, all files under the directory are scanned and divided into four categories: pictures, videos, compressed files, and other files. According to the configured caching priority, the specified file types can be cached first, so that the different degrees of urgency with which a user needs different types of data are satisfied.
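The classification and priority ordering described above might look like the following sketch. The four category names come from the text; the extension sets, the function names `classify` and `order_by_priority`, and the idea of expressing the caching priority as an ordered list are assumptions made for illustration.

```python
from pathlib import PurePath
from typing import Dict, List

# Assumed extension sets; the source names only the four categories.
CATEGORY_EXTENSIONS = {
    "picture": {".jpg", ".jpeg", ".png", ".bmp", ".gif"},
    "video": {".mp4", ".avi", ".mkv", ".mov"},
    "compressed": {".zip", ".tar", ".gz", ".tgz", ".rar", ".7z"},
}


def classify(name: str) -> str:
    """Map a file name to one of: picture, video, compressed, other."""
    suffix = PurePath(name).suffix.lower()
    for category, extensions in CATEGORY_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return "other"


def order_by_priority(names: List[str], priority: List[str]) -> List[str]:
    """Group file names by category and emit the groups in the configured order."""
    groups: Dict[str, List[str]] = {category: [] for category in priority}
    for name in names:
        groups.setdefault(classify(name), []).append(name)

    ordered: List[str] = []
    for category in priority:
        ordered.extend(groups[category])
    # Categories that the configuration does not mention are cached last.
    for category, members in groups.items():
        if category not in priority:
            ordered.extend(members)
    return ordered
```

For example, with the priority `["picture", "video", "compressed", "other"]`, all pictures in a directory are queued before any video, and files of unknown type are queued last.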
In an exemplary embodiment, the batch-wise acquiring different types of files according to a preset data acquisition policy includes:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
In the above exemplary embodiment, the files of each preferentially acquired file type are sorted by file size and the small files of that type are cached first, so that whenever downloading ends, as many files as possible have been completely downloaded.
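A small sketch of the size-based ordering follows. The helper names `file_sizes` and `smallest_first` are hypothetical, and breaking ties by file name is an assumption added only to make the order deterministic.

```python
import os
from typing import Dict, List


def file_sizes(directory: str) -> Dict[str, int]:
    """Collect the size in bytes of every regular file directly under a directory."""
    with os.scandir(directory) as entries:
        return {entry.name: entry.stat().st_size for entry in entries if entry.is_file()}


def smallest_first(sizes: Dict[str, int]) -> List[str]:
    """Return file names ordered from the smallest file to the largest."""
    return sorted(sizes, key=lambda name: (sizes[name], name))
```

Applied to the files of one type, `smallest_first(file_sizes(directory))` yields the order in which that type would be acquired.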
In an exemplary embodiment, the controlling the thread corresponding to each directory entry to perform an obtaining operation on the data under the respective directory entry includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
in the process of acquiring the files under the directory items corresponding to the threads, if the interruption of the file acquisition operation is detected, recording the information of the files subjected to the interruption operation;
and when the acquisition operation of the file in the directory corresponding to the thread is detected to be restarted, continuously executing the acquisition operation according to the information of the file with the interrupted operation.
In the above exemplary embodiment, the data set to be cached is checked first: whether the data set already exists at the specified cache directory location is determined. If it does not exist, the latest data set is cached. If it exists, whether the operation is a resumed transfer (breakpoint resume) is determined. If it is a resumed transfer, the breakpoint information record is read and caching of the data set continues from the breakpoint. If it is not, whether the data set has changed is determined according to the size, number and date of the files in the directory; if it has changed, the data set is cached again, and if it has not changed, prompt information is output.
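The decision flow above can be written as a single function. This is a sketch under assumptions: the `Snapshot` fields stand in for "size, number and date of the files", the returned action names are invented labels, and a breakpoint record is represented simply as an optional string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Summary of a directory used to detect changes (assumed representation)."""
    file_count: int
    total_size: int
    latest_mtime: float


def decide_action(source: Snapshot,
                  cached: Optional[Snapshot],
                  breakpoint_record: Optional[str]) -> str:
    """Choose what to do before caching starts."""
    if cached is None:
        return "cache_latest"            # nothing at the cache location yet
    if breakpoint_record is not None:
        return "resume_from_breakpoint"  # an earlier transfer was interrupted
    if cached != source:
        return "recache"                 # size, number or date of files differs
    return "already_cached"              # unchanged: only output a prompt
```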
In an exemplary embodiment, before the obtaining the directory information of the data set under the storage path, the method further includes:
obtaining the type of a storage system used for storing the data set on the computing node which initiates the obtaining request;
the controlling the thread corresponding to each directory entry to perform the obtaining operation on the data under the respective directory entry includes:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
In the above exemplary embodiment, different users may store their data sets in different storage systems, such as BeeGFS, NFS (Network File System), HDFS (Hadoop Distributed File System), or cloud storage systems, and caching a data set requires connecting to the corresponding storage system. By acquiring the type of the storage system and controlling the thread corresponding to each directory entry to acquire data according to the storage format corresponding to that type, storage format mismatches that arise when the data set is used are reduced, and the training task can enter the running state quickly.
The method examples provided in the examples of the present application are further illustrated below:
the method provided by the embodiment of the application can be applied to a server running a Linux system, and can be embedded in an AI (artificial intelligence) platform as a tool module. For example, the execution steps of the method are encapsulated into a Linux command that is used directly and is 6-7 times faster than the Linux cp copy command; the tool can also be embedded in an AI platform as a module for caching data sets, or exposed as an API and used directly through that API.
In addition, the method can cache the data set required for AI model training and can interface with the BeeGFS storage system, the HDFS storage system, the NFS storage system and the Dahua cloud storage system, caching the data set in the storage system to the server; this shortens the time needed to cache the data set and allows the training task to enter the running state quickly.
Fig. 2 is a flowchart of a data set acquisition method according to an embodiment of the present application. As shown in fig. 2, the method includes:
step 201, reading configuration parameters of the acquisition operation of the data set;
step 202, receiving an operation command and parameters for executing the acquisition operation;
the operation command is to cache the data set to the local server, and the parameter can be the designated position of the data set storage system in the local server;
in addition, instead of caching the data set locally from the interfaced storage system, a file or directory copy may be made directly at the server, similar to the linux cp copy command.
Step 203, judging the type of the storage system;
storage parameters for interfacing with the BeeGFS storage system, the NFS storage system, the HDFS storage system and the Dahua cloud storage system are configured in advance, and the corresponding storage system can be connected automatically according to the configured or input storage type parameter;
the mount-type storage systems (BeeGFS and NFS), the non-mount-type storage systems (HDFS and the Dahua cloud storage system) and the Linux local storage system are each wrapped separately, and when the tool is run to cache a data set, it automatically selects the corresponding storage method according to the parameters passed in.
When the data set is cached, the caching speed is affected by the network and the storage device: the better the network conditions and the higher the bandwidth, the higher the caching speed; caching to memory is faster than caching to an SSD (solid state disk), and caching to an SSD is faster than caching to an HDD (mechanical hard disk). Even on an ordinary HDD, the caching speed of the tool is 7 times faster than the cp copy command of the Linux system itself, which greatly shortens the copy time of a large data set.
If the corresponding storage system is matched, steps 204 to 208 are executed; otherwise, step 209 is executed and an error prompt message is output;
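Step 203 amounts to a lookup from the storage type parameter to a wrapped access routine. The sketch below assumes a plain dictionary as the registry; the handler functions are placeholders that only describe what a real handler would do, and the type keys (`"beegfs"`, `"nfs"`, `"hdfs"`, `"local"`) are illustrative names rather than the tool's actual parameter values. The Dahua cloud storage system is omitted because its client interface is not described in the text.

```python
from typing import Callable, Dict


def fetch_from_mount(location: str) -> str:
    """Placeholder for mount-type storage (BeeGFS, NFS) and local disks."""
    return f"copy files from mounted path {location}"


def fetch_from_hdfs(location: str) -> str:
    """Placeholder for non-mount-type storage reached through a client."""
    return f"download files from HDFS path {location}"


# Assumed registry: storage type parameter -> wrapped access routine.
FETCHERS: Dict[str, Callable[[str], str]] = {
    "beegfs": fetch_from_mount,
    "nfs": fetch_from_mount,
    "local": fetch_from_mount,
    "hdfs": fetch_from_hdfs,
}


def select_fetcher(storage_type: str) -> Callable[[str], str]:
    """Return the routine for a storage type, or raise if no system matches."""
    try:
        return FETCHERS[storage_type.lower()]
    except KeyError:
        # Corresponds to step 209: no storage system matched, report an error.
        raise ValueError(f"unsupported storage type: {storage_type}") from None
```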
Step 204, after the corresponding storage system is matched, checking whether the data set to be cached already exists at the specified cache directory location. If it does not exist, the latest data set is cached. If it exists, whether the operation is a resumed transfer (breakpoint resume) is determined. If it is a resumed transfer, the cached data set is verified, and after the verification passes, the breakpoint information record is read and caching of the data set starts from the breakpoint. If it is not a resumed transfer, whether the data set has changed is determined according to the size, number and date of the files in the directory; if it has changed, the data set is cached again, and if it has not changed, prompt information is output.
Step 205, when caching of the data set starts, a thread pool is used to open a separate thread for each directory, and that thread is responsible for caching the part of the data set under the directory. Before caching, all files under the directory are scanned and divided into four categories, namely pictures, videos, compressed files and other files, and the specified file types can be cached first according to the configured caching priority.
Step 206, when the data set is cached according to the priority, the files of each preferentially cached file type are sorted by file size, and small files are cached first.
Step 207, if an abnormal interruption occurs during caching, the tool stores the breakpoint information in a breakpoint file or a database according to the configuration information.
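A breakpoint file for step 207 could be as simple as a JSON document. The record layout (`completed`, `interrupted`) and the function names below are assumptions for illustration; the text does not specify what the breakpoint information contains beyond the file whose acquisition was interrupted.

```python
import json
import os
from typing import Dict, List, Optional


def save_breakpoint(record_path: str, completed: List[str], interrupted: str) -> None:
    """Persist which files are done and which file was being acquired when interrupted."""
    tmp_path = record_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump({"completed": completed, "interrupted": interrupted}, handle)
    os.replace(tmp_path, record_path)  # swap in the new record in one step


def load_breakpoint(record_path: str) -> Optional[Dict]:
    """Return the stored record, or None when no breakpoint file exists."""
    try:
        with open(record_path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None


def remaining_files(all_files: List[str], record: Optional[Dict]) -> List[str]:
    """Files still to acquire; the interrupted file is acquired again."""
    if record is None:
        return list(all_files)
    done = set(record["completed"])
    return [name for name in all_files if name not in done]
```

On restart, `remaining_files` gives the work left for the thread of that directory, beginning with the file that was interrupted if it appears first in the acquisition order.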
Step 208, when the data set has been cached completely, a completion prompt message is output.
According to the method provided by the embodiment of the application, resumable transfer (breakpoint resume) is supported when the data set is cached; the file types to cache first are selected by setting the caching priority configuration; recursive directory caching is supported, with the thread pool opening a thread for each directory to cache all files under that directory; files are sorted by size so that small files are cached first in the thread pool, while large files are cut into slices and cached through a pipeline. Caching the data set required for AI model training with this method shortens the caching time; in addition, different storage systems can be connected and the data set can be cached from them to the server where the training task is located, so that the training task enters the running state quickly and the total model training time is shortened.
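The text does not detail how large files are sliced or what the pipeline consists of. One plausible reading, sketched here purely as an assumption, is that a large file is read in fixed-size slices by a generator and each slice is written out as it arrives, so the whole file never has to be held in memory. The names `read_slices` and `copy_in_slices` and the 4 MiB default are hypothetical.

```python
from typing import BinaryIO, Iterator


def read_slices(source: BinaryIO, slice_size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most slice_size bytes until the source is exhausted."""
    while True:
        chunk = source.read(slice_size)
        if not chunk:
            return
        yield chunk


def copy_in_slices(source: BinaryIO, target: BinaryIO,
                   slice_size: int = 4 * 1024 * 1024) -> int:
    """Stream a large file slice by slice and return the number of bytes written."""
    written = 0
    for chunk in read_slices(source, slice_size):
        target.write(chunk)
        written += len(chunk)
    return written
```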
An apparatus for acquiring a data set on a server is provided in an embodiment of the present application, and includes a processor and a memory, where the memory stores a computer program, and the processor is configured to call the computer program in the memory to implement operations including:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of controlling the thread corresponding to each directory entry to perform an obtaining operation on data under the respective directory entry, and the operation includes:
controlling at least one thread to execute the following operations on the files under the directory entry corresponding to the thread, wherein the operations comprise:
acquiring a file format of data under a directory entry;
when the file formats are at least two, classifying the files according to the file formats to obtain at least two types of files;
and acquiring different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
In an exemplary embodiment, the processor is configured to call a computer program in the memory to implement the operation of batch-wise acquiring different types of files according to a preset data acquisition policy, and includes:
acquiring file size information of each file in the same type of files;
sorting the files in the same type of files according to the sequence of the sizes of the files from small to large to obtain sorting information;
and according to the sorting information, acquiring the files in the same type of files.
In an exemplary embodiment, when the processor calls the computer program in the memory to control the thread corresponding to each directory entry to acquire the data under the respective directory entry, the operation includes:
controlling at least one thread to perform the following operations on the files under the directory entry corresponding to the thread:
while acquiring the files under the directory entry corresponding to the thread, if an interruption of the file acquisition operation is detected, recording information about the file whose acquisition was interrupted;
and when a restart of the acquisition operation on the files under the directory entry corresponding to the thread is detected, continuing the acquisition operation according to the information about the file whose acquisition was interrupted.
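A minimal Python sketch of the interruption handling above is given below for illustration. It assumes that the files are acquired in a fixed order, that an interruption surfaces as an exception, and that the recorded information is simply the path of the interrupted file stored in a small JSON checkpoint; the names fetch_with_resume, checkpoint and copy are hypothetical.

```python
import json
import shutil
from pathlib import Path


def fetch_with_resume(files, dest: Path, checkpoint: Path, copy=shutil.copy2):
    """Copy files in order; on failure record the interrupted file and re-raise.

    If a checkpoint exists, copying restarts at the file recorded in it.
    Returns the files handled in this run.
    """
    dest.mkdir(parents=True, exist_ok=True)
    names = [str(f) for f in files]
    start = 0
    if checkpoint.exists():
        interrupted = json.loads(checkpoint.read_text())["file"]
        if interrupted in names:
            start = names.index(interrupted)
    for name in names[start:]:
        try:
            copy(name, dest / Path(name).name)
        except Exception:
            # Record which file was being acquired when the operation broke.
            checkpoint.write_text(json.dumps({"file": name}))
            raise
    checkpoint.unlink(missing_ok=True)
    return names[start:]
```

The interrupted file is copied again from the beginning on restart; resuming within a partially transferred file would need additional bookkeeping that this sketch does not attempt.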
In an exemplary embodiment, before implementing the operation of obtaining the directory information of the data set under the storage path, the processor is further configured to call the computer program in the memory to implement the following operation:
obtaining the type of the storage system used for storing the data set on the computing node that initiated the acquisition request;
when the processor calls the computer program in the memory to control the thread corresponding to each directory entry to acquire the data under the respective directory entry, the operation includes:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
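The lookup of a storage format from the storage system type can be sketched as a simple table, shown here in Python for illustration only. The application does not enumerate storage system types or storage formats, so every key and value in STORAGE_FORMAT_BY_SYSTEM, as well as the function names, is a hypothetical placeholder.

```python
# Hypothetical mapping from storage system type to storage format.
STORAGE_FORMAT_BY_SYSTEM = {
    "nfs": "posix_files",
    "hdfs": "hdfs_blocks",
    "object_store": "objects",
}


def storage_format_for(system_type: str) -> str:
    """Return the storage format that matches the node's storage system."""
    try:
        return STORAGE_FORMAT_BY_SYSTEM[system_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported storage system type: {system_type!r}") from None


def plan_fetch(entries, system_type: str):
    """Pair every directory entry with the format its thread should write."""
    fmt = storage_format_for(system_type)
    return [(entry, fmt) for entry in entries]
```

The format is resolved once, before any thread starts, so that all threads in this sketch write the data set in the same form.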
According to the device provided by the embodiments of the present application, after an acquisition request for a data set is received, the directory information of the data set under the storage path is acquired, a corresponding thread is configured for each directory entry in the directory information, and the thread corresponding to each directory entry is controlled to acquire the data under the respective directory entry. Without altering the original design of the server, at least two threads are set according to the directory information, and the data of the data set is downloaded by the at least two threads respectively, which shortens the acquisition time of the data set and improves its acquisition efficiency.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
Claims (10)
1. A method of obtaining a data set on a server, comprising:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
2. The method according to claim 1, wherein controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry comprises:
controlling at least one thread to perform the following operations on the files under the directory entry corresponding to the thread:
acquiring the file formats of the data under the directory entry;
when there are at least two file formats, classifying the files by file format to obtain at least two types of files;
and acquiring the different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
3. The method according to claim 2, wherein acquiring the different types of files in batches according to the preset data acquisition strategy comprises:
acquiring file size information of each file among the files of the same type;
sorting the files of the same type in ascending order of file size to obtain sorting information;
and acquiring the files of the same type according to the sorting information.
4. The method according to claim 1, wherein controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry comprises:
controlling at least one thread to perform the following operations on the files under the directory entry corresponding to the thread:
while acquiring the files under the directory entry corresponding to the thread, if an interruption of the file acquisition operation is detected, recording information about the file whose acquisition was interrupted;
and when a restart of the acquisition operation on the files under the directory entry corresponding to the thread is detected, continuing the acquisition operation according to the information about the file whose acquisition was interrupted.
5. The method according to claim 1, wherein:
before obtaining the directory information of the data set under the storage path, the method further comprises:
obtaining the type of the storage system used for storing the data set on the computing node that initiated the acquisition request;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry comprises:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
6. An apparatus for obtaining a data set on a server, comprising a processor and a memory, wherein the memory stores a computer program, and the processor is configured to invoke the computer program in the memory to perform operations comprising:
after receiving an acquisition request of a data set, acquiring directory information of the data set under a storage path;
configuring a corresponding thread for each directory entry in the directory information;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry.
7. The apparatus of claim 6, wherein, when the processor invokes the computer program in the memory to control the thread corresponding to each directory entry to acquire the data under the respective directory entry, the operation comprises:
controlling at least one thread to perform the following operations on the files under the directory entry corresponding to the thread:
acquiring the file formats of the data under the directory entry;
when there are at least two file formats, classifying the files by file format to obtain at least two types of files;
and acquiring the different types of files in batches according to a preset data acquisition strategy, wherein the data acquisition strategy is set according to the file format.
8. The apparatus of claim 7, wherein, when the processor invokes the computer program in the memory to acquire the different types of files in batches according to the preset data acquisition strategy, the operation comprises:
acquiring file size information of each file among the files of the same type;
sorting the files of the same type in ascending order of file size to obtain sorting information;
and acquiring the files of the same type according to the sorting information.
9. The apparatus of claim 6, wherein, when the processor invokes the computer program in the memory to control the thread corresponding to each directory entry to acquire the data under the respective directory entry, the operation comprises:
controlling at least one thread to perform the following operations on the files under the directory entry corresponding to the thread:
while acquiring the files under the directory entry corresponding to the thread, if an interruption of the file acquisition operation is detected, recording information about the file whose acquisition was interrupted;
and when a restart of the acquisition operation on the files under the directory entry corresponding to the thread is detected, continuing the acquisition operation according to the information about the file whose acquisition was interrupted.
10. The apparatus of claim 6, wherein:
before implementing the operation of obtaining the directory information of the data set under the storage path, the processor is further configured to invoke the computer program in the memory to implement the following operation:
obtaining the type of the storage system used for storing the data set on the computing node that initiated the acquisition request;
and when the processor invokes the computer program in the memory to control the thread corresponding to each directory entry to acquire the data under the respective directory entry, the operation comprises:
acquiring a storage format corresponding to the type of the storage system;
and controlling the thread corresponding to each directory entry to acquire the data under the respective directory entry according to the storage format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911156049.5A CN111104387A (en) | 2019-11-22 | 2019-11-22 | Method and device for acquiring data set on server |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111104387A (en) | 2020-05-05 |
Family
ID=70421137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911156049.5A Pending CN111104387A (en) | Method and device for acquiring data set on server | 2019-11-22 | 2019-11-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111104387A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858538A (en) * | 2020-06-19 | 2020-10-30 | 苏州浪潮智能科技有限公司 | Method, device, equipment and medium for configuring BeeGFS quota by cluster |
CN111966283A (en) * | 2020-07-06 | 2020-11-20 | 云知声智能科技股份有限公司 | Client multi-level caching method and system based on enterprise-level super-computation scene |
CN112214310A (en) * | 2020-09-09 | 2021-01-12 | 苏州浪潮智能科技有限公司 | Data set cache queuing method and device |
CN112600913A (en) * | 2020-12-07 | 2021-04-02 | 焦点科技股份有限公司 | Download management method based on Android |
CN113468119A (en) * | 2021-05-31 | 2021-10-01 | 北京明朝万达科技股份有限公司 | File scanning method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101977231A (en) * | 2010-10-21 | 2011-02-16 | 英业达集团(天津)电子技术有限公司 | Method for downloading mapping file |
CN103858407A (en) * | 2013-12-02 | 2014-06-11 | 华为技术有限公司 | File processing method, device and system |
CN104780120A (en) * | 2015-04-15 | 2015-07-15 | 天脉聚源(北京)教育科技有限公司 | Method and device for transmitting files in local area network |
CN110019024A (en) * | 2019-04-11 | 2019-07-16 | 苏州浪潮智能科技有限公司 | A kind of directory method, system and electronic equipment and storage medium |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858538A (en) * | 2020-06-19 | 2020-10-30 | 苏州浪潮智能科技有限公司 | Method, device, equipment and medium for configuring BeeGFS quota by cluster |
CN111858538B (en) * | 2020-06-19 | 2022-05-24 | 苏州浪潮智能科技有限公司 | Method, device, equipment and medium for configuring BeeGFS quota by cluster |
CN111966283A (en) * | 2020-07-06 | 2020-11-20 | 云知声智能科技股份有限公司 | Client multi-level caching method and system based on enterprise-level super-computation scene |
CN112214310A (en) * | 2020-09-09 | 2021-01-12 | 苏州浪潮智能科技有限公司 | Data set cache queuing method and device |
CN112214310B (en) * | 2020-09-09 | 2022-08-02 | 苏州浪潮智能科技有限公司 | Data set cache queuing method and device |
CN112600913A (en) * | 2020-12-07 | 2021-04-02 | 焦点科技股份有限公司 | Download management method based on Android |
CN113468119A (en) * | 2021-05-31 | 2021-10-01 | 北京明朝万达科技股份有限公司 | File scanning method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200505 |