CN112579297A - Data processing method and device

Info

Publication number: CN112579297A
Application number: CN202011565212.6A
Authority: CN
Inventors: 徐翰章
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-03-30

Abstract

Embodiments of the present application provide a data processing method and device. In the method, a management server obtains the time each server in a distributed storage system takes to handle a file processing request and, according to that time, migrates file blocks stored on a server to other servers. The management server can thus identify hot spot servers in the distributed storage system from their file processing times and migrate file blocks stored on a hot spot server to non-hot-spot servers, achieving dynamic migration of hot spot data and hot spot tasks.

Description

Data processing method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus.

Background

With the increasing expansion of user data size, the need for analysis and processing of large-scale data is increasing. As a result, distributed storage systems are becoming more widely used. The distributed storage system belongs to one type of server cluster and comprises a plurality of servers. In a distributed storage system, a large file may be broken into multiple file blocks, and the multiple file blocks may be stored on different servers.

Upon receiving a file processing task, the distributed storage system may determine, according to the task, the one or more servers that store file blocks of the file to be processed. The task may then be sent to those servers, each of which processes the file blocks it stores. Finally, the processing results of the file blocks are aggregated to obtain the processing result of the file to be processed, which is returned to the user or to the specific upper-layer application.

However, if the distributed storage system receives a plurality of file processing tasks and the files corresponding to the file processing tasks are stored in the same server, the processing pressure of the server is greatly increased, and the speed of processing the files by the server is greatly reduced.

Moreover, existing distributed storage systems of this kind perform only fault-tolerance management, consistency management, and the like on files; they do not detect the existence of a hot spot server, nor do they evacuate or dynamically migrate the corresponding file blocks online. Existing data processing systems of this kind can logically disperse file blocks or the request tasks for files, but cannot assist the distributed file system in completing this function.

Disclosure of Invention

In view of this, embodiments of the present application provide a data processing method and apparatus that aim to dynamically migrate file blocks stored in a distributed storage system according to the time servers take to process file requests, where that time is continuously recorded and updated.

In a first aspect, an embodiment of the present application provides a data processing method, where the method includes:

a management server receives a file processing request, wherein the file processing request comprises an identifier of a file to be processed, the file to be processed comprises at least one file block, and the at least one file block is stored in one or more servers of a distributed storage system;

the management server determines a first server set according to the identification of the file to be processed, wherein the first server set comprises at least one first server, and the first server stores one or more of the at least one file block;

the management server acquires a first time length corresponding to each first server in the first server set, wherein the first time length is the total time length from the time when the management server receives a file processing request to the time when the first server finishes processing the file processing request;

the management server determines a second server set from the first server set according to the first time length, wherein the second server set comprises at least one second server;

the management server copies any one or more of the at least one file block stored by each of the at least one second server to a target server set, wherein the target server set comprises at least one target server, and the target server is a server other than the second server set.

Optionally, the obtaining the first duration corresponding to each first server in the first server set includes:

the management server records the waiting time before each first server in the first server set processes the file processing request;

the management server records the time spent by each first server in the first server set in processing the file processing request;

and the management server determines the first time length according to the waiting time and the time for processing the file processing request.

Optionally, the determining, by the management server, the second server set from the first server set according to the first duration includes:

and the management server determines the first server with the longest first time length as a second server.

Optionally, the action of copying the file block is triggered by a file processing request, and before copying the file block, the method further includes:

the management server acquires a third server set, wherein the third server set comprises at least one third server, and the third server is a server which does not store the file blocks of the file to be processed;

the management server acquires the remaining computing resources of each third server in the third server set;

the management server selects a target server from the third set of servers based on the remaining computing resources.

Optionally, the file processing request is generated by a file management system;

the method further comprises the following steps:

the management server acquires the identification of each target server in the target server set;

the management server sends a directory update request to the file management system, wherein the directory update request comprises an identifier of a target server.

In a second aspect, an embodiment of the present application provides a data processing apparatus, where the apparatus is located in a management server, and the apparatus includes:

a receiving unit, configured to receive a file processing request, where the file processing request includes an identifier of a file to be processed, the file to be processed includes at least one file block, and the at least one file block is stored in one or more servers of a distributed storage system;

a processing unit, configured to: determine a first server set according to the identifier of the file to be processed, where the first server set includes at least one first server, and the first server stores one or more of the at least one file block; acquire a first duration corresponding to each first server in the first server set, where the first duration is the total time from when the management server receives the file processing request to when the first server finishes processing the file processing request; determine a second server set from the first server set according to the first duration, where the second server set includes at least one second server; and copy any one or more of the at least one file block stored by each of the at least one second server to a target server set, where the target server set includes at least one target server, and the target server is a server outside the second server set.

Optionally, the processing unit is configured to record a waiting time before each first server in the first set of servers processes the file processing request; recording the time spent by each first server in the first server set in processing the file processing request; and determining the first time length according to the waiting time and the time for processing the file processing request.

Optionally, the processing unit is configured to determine the first server with the longest first duration as the second server.

Optionally, the processing unit is further configured to: obtain a third server set, where the third server set includes at least one third server, the third server is a server that does not store a file block of the file to be processed, and the determination of the third server set is triggered each time a file request arrives and is based on completed historical file processing requests; acquire the remaining computing resources of each third server in the third server set; and select a target server from the third server set based on the remaining computing resources.

Optionally, the file processing request is generated by a file management system; the processing unit is further configured to obtain an identifier of each target server in the target server set; sending a directory update request to the file management system, the directory update request including an identification of the target server.

In a third aspect, an embodiment of the present application provides a distributed file processing system, where the distributed file processing system includes a management server, and the management server is configured to execute the data processing method according to the first aspect.

The embodiments of the present application provide a data processing method and device. The method is applied to a management server. The management server receives a file processing request for a file to be processed; the request includes an identifier of the file to be processed, and the file may be divided into at least one file block stored in one or more servers of a distributed storage system. Based on the identifier, the management server may determine, from the servers of the distributed storage system, at least one first server that stores file blocks of the file to be processed; these servers form a first server set. The management server may then process the file using the first servers and record a first duration for each of them. The first duration is the total time from receipt of the file processing request to completion of processing of the file blocks stored by that first server. Next, the management server may determine at least one second server from the first server set according to the first duration, obtaining a second server set. Finally, the management server may copy at least one file block stored by each second server in the second server set to any target server in a target server set, where a target server may be any server in the distributed storage system other than the second servers. In this way, each time a file request arrives, the management server can identify hot spot servers in the distributed storage system from the recorded durations of previously processed file requests and migrate the file blocks associated with the current file processing request from a hot spot server to a non-hot-spot server, achieving dynamic migration of hot spot data and hot spot tasks.

Drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a data processing method according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.

Detailed Description

A distributed storage system may split a file into multiple file blocks and store them in one or more servers. Any one of the servers may store one or more file blocks. After receiving a file processing request, the management server may determine, according to the identifier of the file to be processed, which servers store the file, and send the file processing request to those servers so that each processes the file blocks it stores.
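As a rough illustration of the splitting and placement described above, the following Python sketch cuts a file into fixed-size blocks and assigns them to servers in turn. The function names, the fixed block size, and the round-robin placement policy are assumptions made for this example; the application does not specify how blocks are sized or placed.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_round_robin(blocks: List[bytes], servers: List[str]) -> Dict[str, List[int]]:
    """Assign block indexes to servers in turn (an assumed placement policy)."""
    placement: Dict[str, List[int]] = {s: [] for s in servers}
    for index in range(len(blocks)):
        placement[servers[index % len(servers)]].append(index)
    return placement


blocks = split_into_blocks(b"abcdefghij", 4)
print(blocks)                                 # [b'abcd', b'efgh', b'ij']
print(place_round_robin(blocks, ["X", "Y"]))  # {'X': [0, 2], 'Y': [1]}
```

In this example server X ends up holding two of the three blocks, so a request for the whole file places more work on X than on Y.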

However, a server's capacity to process data is limited. When a server node receives multiple file processing requests, the requests that cannot be processed immediately may be added to a request queue. The server processes the requests in the queue one by one, so files are processed in the order in which their requests arrived. When the queue holds many requests, a file processing request waits a long time, and the server therefore takes a long time to process the file to be processed.
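The queueing effect described above can be sketched as follows. The function name `fifo_waiting_times` and the service times are hypothetical; the sketch assumes a single server that handles requests strictly in arrival order, with all requests already queued at time zero.

```python
from typing import List


def fifo_waiting_times(service_times: List[float]) -> List[float]:
    """Return the waiting time of each queued request on one server.

    Assumes the server handles requests one by one in arrival order and
    that every request is already in the queue at t = 0.
    """
    waits = []
    elapsed = 0.0
    for service in service_times:
        waits.append(elapsed)  # time spent behind earlier requests
        elapsed += service     # this request then occupies the server
    return waits


# Three queued requests that take 2 s, 3 s and 1 s to process.
print(fifo_waiting_times([2.0, 3.0, 1.0]))  # [0.0, 2.0, 5.0]
```

The last request needs only one second of processing but waits five seconds, which is the kind of delay the waiting time is meant to expose.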

In addition, if a server stores a large number of file blocks of the same file to be processed, it must process a relatively large number of file blocks when that file is processed, so it needs a long time to handle the file processing request. Because the total time required to process the file to be processed depends on the server that takes longest, the time required to process the entire file is relatively long.
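A small worked example of this bottleneck, assuming the servers process their blocks in parallel (the function name and the numbers are illustrative only):

```python
from typing import Dict


def total_processing_time(per_server_seconds: Dict[str, float]) -> float:
    """With parallel processing, the slowest server sets the total time."""
    return max(per_server_seconds.values(), default=0.0)


durations = {"X": 1.5, "Y": 6.0, "Z": 2.0}
print(total_processing_time(durations))  # 6.0
```

Shortening the time on X or Z would not help here; only relieving server Y reduces the total.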

To adjust the storage location of file blocks according to the time servers take to process them, on the basis of an existing distributed storage system, the present application provides a data processing method. A preferred embodiment is described below from the perspective of a management server of the distributed storage system. Note that the management server in the embodiments of the present application may be the management server of the distributed storage system itself or an independent management server.

Referring to Fig. 1, Fig. 1 is a flowchart of a data processing method provided in an embodiment of the present application. The method includes:

S101: The management server receives a file processing request.

The management server may receive a file processing request, which may include an identification of a file to be processed. The file to be processed may be divided into at least one file block for storage in one or more servers of the distributed storage system. Optionally, different file blocks may be stored in the same server or in different servers.

In the embodiment of the present application, the file processing request may be sent by the user through the terminal device, or may be generated by the file management system according to the request of the user. For example, a user may directly send a file processing request for extracting a file to a management server of the distributed storage system using a mobile terminal such as a mobile phone, or may send a file acquisition request to a server of the file management system, and the server of the file management system generates a file processing request according to the request of the user.

S102: The management server determines a first server set according to the identifier of the file to be processed.

According to the identification of the file to be processed, the management server determines which server or servers in the distributed storage system store the file blocks of the file to be processed, and determines the server storing the file blocks of the file to be processed as a first server to obtain a first server set.

Optionally, when the file to be processed is stored in the distributed storage system, each server storing a file block of the file may actively notify the management server. For example, suppose file A is split into file block a1 and file block a2 when it is stored in the distributed storage system, where file block a1 is stored at server X and file block a2 is stored at server Y. Server X may send the identification of file A and the identification of file block a1 to the management server, and server Y may send the identification of file A and the identification of file block a2 to the management server. Thus, upon receiving a processing request for file A, the management server may determine that file A is split into two file blocks, a1 and a2, which are stored in server X and server Y, respectively, thereby determining that the first server set includes server X and server Y.
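The notification scheme in this example could be modeled with a small in-memory index on the management server. The class `BlockRegistry` and its method names are hypothetical and serve only to illustrate how the first server set might be derived from the reports sent by servers X and Y.

```python
from collections import defaultdict
from typing import Dict, Set


class BlockRegistry:
    """Hypothetical in-memory index kept by the management server."""

    def __init__(self) -> None:
        # file id -> block id -> servers holding a copy of that block
        self._locations: Dict[str, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def report(self, server_id: str, file_id: str, block_id: str) -> None:
        """Record a server's announcement that it stores a block."""
        self._locations[file_id][block_id].add(server_id)

    def first_server_set(self, file_id: str) -> Set[str]:
        """All servers storing at least one block of the file."""
        blocks = self._locations.get(file_id, {})
        return set().union(*blocks.values()) if blocks else set()


registry = BlockRegistry()
registry.report("X", "A", "a1")  # server X stores block a1 of file A
registry.report("Y", "A", "a2")  # server Y stores block a2 of file A
print(sorted(registry.first_server_set("A")))  # ['X', 'Y']
```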

After determining the first server set, the management server may control each first server in the first server set to process the file block of the file to be processed, which is not described herein again.

S103: the management server acquires a first time length corresponding to each first server in the first server set.

The management server may obtain a first duration corresponding to each first server in the first server set. The first duration is the interval from the time the management server receives the file processing request to the time the first server finishes processing the file blocks of the file to be processed that it stores. It includes the waiting time before the first server starts processing those file blocks and the time the first server actually spends processing them. This information is fed back from completed historical file processing requests; if a file block has never been processed, the first duration is 0.

Optionally, the management server may record the waiting time of each first server in the first server set before it processes the file processing request, i.e., the time the first server spends on other file processing requests before it begins this one. The waiting time of the first server indicates how busy it is: the longer the waiting time, the more file processing requests are queued in the first server's waiting queue and the busier the first server is. The management server may then transfer data stored in the first server to another server to reduce the load on the first server and improve processing efficiency.

The management server may also record the actual time each first server in the first server set spends processing the file processing request. The more time the first server takes to process a file processing request, the more file blocks of the file to be processed it stores, or the more complex the logic for processing a given file block. The management server may then transfer the data that is stored in the first server and associated with the current file processing request to another server, so as to reduce the load on the first server and improve processing efficiency.
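One possible way to record the quantities described in S103 is sketched below. The `RequestTiming` record, its field names, and the `first_durations` helper are assumptions made for illustration; the only facts taken from the text are that the first duration covers waiting time plus processing time and that it is 0 when no completed request is on record.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class RequestTiming:
    received_at: float  # management server received the request
    started_at: float   # first server began processing its blocks
    finished_at: float  # first server finished processing its blocks

    @property
    def waiting_time(self) -> float:
        return self.started_at - self.received_at

    @property
    def processing_time(self) -> float:
        return self.finished_at - self.started_at

    @property
    def first_duration(self) -> float:
        # Waiting time plus actual processing time.
        return self.waiting_time + self.processing_time


def first_durations(history: Dict[str, RequestTiming],
                    servers: Iterable[str]) -> Dict[str, float]:
    """First duration per server; 0 when no completed request is recorded."""
    return {s: history[s].first_duration if s in history else 0.0
            for s in servers}


history = {
    "X": RequestTiming(received_at=0.0, started_at=4.0, finished_at=6.5),
    "Y": RequestTiming(received_at=0.0, started_at=0.5, finished_at=1.5),
}
print(first_durations(history, ["X", "Y", "Z"]))
# {'X': 6.5, 'Y': 1.5, 'Z': 0.0}
```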

S104: the management server determines a second server set from the first server set according to the first time length.

After determining the first time duration of each first server in the first server set, the management server may determine at least one second server from the first server set according to the first time duration corresponding to each first server, so as to obtain a second server set. The first duration of the second server may be longer than the first duration of the non-second servers in the first set of servers.

Optionally, the management server may determine the first server with the longest first duration in the first server set as the second server. The management server may also determine the N first servers with the longest first durations in the first server set as second servers, where N is a positive integer greater than 1 and less than the number of first servers in the first server set.
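The selection in S104 amounts to a top-N query over the recorded first durations. A minimal sketch, with a hypothetical function name and sample durations, might look like this:

```python
import heapq
from typing import Dict, List


def select_second_servers(first_durations: Dict[str, float], n: int = 1) -> List[str]:
    """Return the n first servers with the longest first duration.

    n = 1 picks the single slowest server; otherwise n must be greater
    than 1 and less than the number of first servers.
    """
    if not first_durations:
        return []
    if n < 1 or (n > 1 and n >= len(first_durations)):
        raise ValueError("n must be 1, or greater than 1 and less than "
                         "the number of first servers")
    return heapq.nlargest(n, first_durations, key=first_durations.get)


durations = {"X": 6.5, "Y": 1.5, "Z": 3.0}
print(select_second_servers(durations))     # ['X']
print(select_second_servers(durations, 2))  # ['X', 'Z']
```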

S105: the management server copies any one or more of the at least one file blocks stored by each of the at least one second server to the set of target servers.

The second server is a server with the longest first duration in the first server set, which indicates that its waiting time is longer and/or it stores more file blocks; the second server is the key server limiting the speed at which the distributed storage system processes the file to be processed. The management server may therefore copy any one or more of the file blocks stored in the second server to the target server. In this way, after receiving the next file processing request for the file to be processed, the management server can use the target server to process the copied file blocks, reducing the pressure on the second server.

In the embodiment of the present application, the target server may be another server in the distributed storage system except the second server, for example, the target server may be the first server or a server that does not store the file blocks of the file to be processed.

In some possible implementations, the management server may determine the target server based on the remaining computing resources of the server or the network bandwidth between the servers. Specifically, the management server may first obtain a third server set, where the third server set includes at least one third server, and the third server is a server that does not store file blocks of the file to be processed in the distributed storage system. Next, the management server may obtain the remaining computing resources of each third server in the third server set, such as parameters of CPU usage, memory usage, and network bandwidth occupancy of the third server. Based on the remaining computing resources of the third server, the management server may determine a target server set from the third server set, for example, determine M third servers with the highest remaining computing resources as target servers, where M is a positive integer greater than 0 and less than the number of servers in the third server set.
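The target selection described above can be sketched as a ranking of third servers by spare capacity. The text does not say how CPU, memory, and bandwidth figures are combined, so the averaging in `remaining()` is purely an assumption, as are the class and function names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerLoad:
    server_id: str
    cpu_usage: float        # fraction in [0, 1]
    memory_usage: float     # fraction in [0, 1]
    bandwidth_usage: float  # fraction in [0, 1]

    def remaining(self) -> float:
        """Assumed score: average of unused CPU, memory and bandwidth."""
        used = (self.cpu_usage + self.memory_usage + self.bandwidth_usage) / 3.0
        return 1.0 - used


def select_targets(third_servers: List[ServerLoad], m: int = 1) -> List[str]:
    """Return the ids of the m third servers with the most remaining resources."""
    ranked = sorted(third_servers, key=lambda s: s.remaining(), reverse=True)
    return [s.server_id for s in ranked[:m]]


candidates = [
    ServerLoad("P", 0.9, 0.8, 0.7),
    ServerLoad("Q", 0.2, 0.3, 0.1),
    ServerLoad("R", 0.5, 0.5, 0.5),
]
print(select_targets(candidates, 2))  # ['Q', 'R']
```

A weighted score or a hard threshold on any single resource would be equally plausible choices; the sketch only shows the ranking step.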

Optionally, the management server may delete the file blocks stored in the second server after copying them to the target server. Of course, the management server may also retain the file blocks stored in the second server.
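The copy step of S105, with an optional deletion of the source copy, can be illustrated on a toy in-memory model of server storage. The `Storage` type, the function name, and the `delete_source` flag are hypothetical; a real system would transfer blocks over the network.

```python
from typing import Dict, Iterable

# Toy model: server id -> block id -> block data
Storage = Dict[str, Dict[str, bytes]]


def copy_blocks(storage: Storage, source: str, target: str,
                block_ids: Iterable[str], delete_source: bool = False) -> None:
    """Copy the given blocks from the second server to a target server."""
    for block_id in block_ids:
        storage.setdefault(target, {})[block_id] = storage[source][block_id]
        if delete_source:
            # Turn the copy into a move by dropping the original.
            del storage[source][block_id]


storage = {"X": {"a1": b"one", "a3": b"three"}, "Y": {"a2": b"two"}, "Z": {}}
copy_blocks(storage, "X", "Z", ["a3"])
print(sorted(storage["Z"]))  # ['a3']
print(sorted(storage["X"]))  # ['a1', 'a3']
```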

When the management server retains the file block stored in the second server and the file processing request is generated by the file management system, the management server may obtain the identifier of each target server in the target server set and send the identifiers to the file management system in a directory update request. In this way, when a new file processing request is received, the file management system can determine that the file to be processed is stored in the target server rather than the second server, and the target server is used to process the file.
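The application does not define the format of the directory update request. As one hypothetical encoding, the request could be a small JSON message that lists, for each copied block, the target servers now holding it; the field names below are assumptions for this sketch only.

```python
import json
from typing import Dict, List


def build_directory_update(file_id: str, moved: Dict[str, List[str]]) -> str:
    """Serialize a directory update request.

    `moved` maps each copied block id to the target servers now holding it.
    The JSON layout is an assumption made for this sketch.
    """
    message = {"type": "directory_update", "file": file_id, "targets": moved}
    return json.dumps(message, sort_keys=True)


print(build_directory_update("A", {"a3": ["Z"]}))
# {"file": "A", "targets": {"a3": ["Z"]}, "type": "directory_update"}
```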

The embodiment of the present application provides a data processing method. The method is applied to a management server. The management server receives a file processing request for a file to be processed; the request includes an identifier of the file to be processed, and the file may be divided into at least one file block stored in one or more servers of a distributed storage system. Based on the identifier, the management server may determine, from the servers of the distributed storage system, at least one first server that stores file blocks of the file to be processed; these servers form a first server set. The management server may then process the file using the first servers and record a first duration for each of them. The first duration is the total time from receipt of the file processing request to completion of processing of the file blocks stored by that first server. Next, the management server may determine at least one second server from the first server set according to the first duration, obtaining a second server set. Finally, the management server may copy at least one file block stored by each second server in the second server set to any target server in a target server set, where a target server may be any server in the distributed storage system other than the second servers. The management server can therefore identify hot spot servers in the distributed storage system from the time taken to process files and migrate the file blocks associated with the current file processing request from a hot spot server to a non-hot-spot server, achieving dynamic migration of hot spot data and hot spot tasks.
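Steps S101 to S105 can be tied together in a compact planning function. This is a simplified sketch: the function name and data layout are hypothetical, only the single slowest first server is treated as the second server, all of its blocks for the file are moved, and the target is chosen by lowest recorded duration rather than by remaining computing resources.

```python
from typing import Dict, List


def rebalance(file_id: str,
              block_locations: Dict[str, Dict[str, str]],
              last_durations: Dict[str, float],
              all_servers: List[str]) -> Dict[str, str]:
    """Return a plan {block id: target server} for one file processing request.

    block_locations: file id -> block id -> server storing the block
    last_durations:  server id -> first duration from completed requests
    """
    blocks = block_locations[file_id]            # S101: request names the file
    first_set = set(blocks.values())             # S102: first server set
    durations = {s: last_durations.get(s, 0.0)   # S103: 0 if never processed
                 for s in first_set}
    second = max(durations, key=durations.get)   # S104: slowest first server
    candidates = [s for s in all_servers if s != second]
    if not candidates:
        return {}                                # nowhere to copy to
    # S105: assumed policy, pick the candidate with the lowest recorded duration.
    target = min(candidates, key=lambda s: last_durations.get(s, 0.0))
    return {b: target for b, s in blocks.items() if s == second}


locations = {"A": {"a1": "X", "a2": "Y", "a3": "X"}}
recorded = {"X": 6.5, "Y": 1.5}
print(rebalance("A", locations, recorded, ["X", "Y", "Z"]))
# {'a1': 'Z', 'a3': 'Z'}
```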

The foregoing provides some specific implementation manners of the data processing method for the embodiments of the present application, and based on this, the present application also provides a corresponding apparatus. The data processing apparatus provided in the embodiments of the present application will be described below from the perspective of functional modularization.

Referring to the schematic structural diagram of the data processing apparatus shown in fig. 2, the apparatus 200 includes:

A receiving unit 210, configured to receive a file processing request, where the file processing request includes an identifier of a file to be processed, and the file to be processed includes at least one file block, and the at least one file block is stored in one or more servers of the distributed storage system.

A processing unit 220, configured to: determine a first server set according to the identifier of the file to be processed, where the first server set includes at least one first server, and the first server stores one or more of the at least one file block; acquire a first duration corresponding to each first server in the first server set, where the first duration is the total time from when the management server receives the file processing request to when the first server finishes processing the file processing request; determine a second server set from the first server set according to the first duration, where the second server set includes at least one second server; and copy any one or more of the at least one file block stored by each of the at least one second server to a target server set, where the target server set includes at least one target server, and the target server is a server outside the second server set.

Optionally, in some possible implementations, the processing unit 220 is configured to record a waiting time before each first server in the first set of servers processes the file processing request; recording the time spent by each first server in the first server set in processing the file processing request; and determining the first time length according to the waiting time and the time for processing the file processing request.

Optionally, in some possible implementation manners, the processing unit 220 is configured to determine the first server with the longest first duration as the second server.

Optionally, in some possible implementations, the processing unit 220 is further configured to: obtain a third server set, where the third server set includes at least one third server, and the third server is a server that does not store a file block of the file to be processed; acquire the remaining computing resources of each third server in the third server set; and select a target server from the third server set based on the remaining computing resources.

Optionally, in some possible implementations, the file processing request is generated by a file management system; the processing unit 220 is further configured to obtain an identifier of each target server in the set of target servers; sending a directory update request to the file management system, the directory update request including an identification of the target server.

The embodiment of the present application provides a data processing apparatus. The apparatus is applied to a management server. The management server receives a file processing request for a file to be processed; the request includes an identifier of the file to be processed, and the file may be divided into at least one file block stored in one or more servers of a distributed storage system. Based on the identifier, the management server may determine, from the servers of the distributed storage system, at least one first server that stores file blocks of the file to be processed; these servers form a first server set. The management server may then process the file using the first servers and record a first duration for each of them. The first duration is the total time from receipt of the file processing request to completion of processing of the file blocks stored by that first server. Next, the management server may determine at least one second server from the first server set according to the first duration, obtaining a second server set. Finally, the management server may copy at least one file block stored by each second server in the second server set to any target server in a target server set, where a target server may be any server in the distributed storage system other than the second servers. The management server can therefore identify hot spot servers in the distributed storage system from the time taken to process files and migrate the file blocks associated with the current file processing request from a hot spot server to a non-hot-spot server, achieving dynamic migration of hot spot data and hot spot tasks.

In the embodiments of the present application, the names "first" and "second" in the names "first server" and "second server" are used merely as name labels, and do not represent the first and second in sequence.

As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in this specification are described in a progressive manner. For the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly, and reference may be made to the corresponding descriptions of the method embodiment. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only an exemplary embodiment of the present application, and is not intended to limit the scope of the present application.

Claims

1. A method of data processing, the method comprising:

2. The method of claim 1, wherein the obtaining the first duration corresponding to each first server in the first set of servers comprises:

the management server determines the first duration according to the waiting time and the time spent processing the file processing request, wherein the first duration is obtained only by recording file processing requests that were actually executed and is not based on any estimate of duration.

3. The method of claim 1, wherein the determining, by the management server, the second set of servers from the first set of servers based on the first duration comprises:

4. The method of claim 1, wherein prior to copying the file block, the method further comprises:

5. The method of claim 1, wherein the file processing request is generated by a file management system;

the method further comprises:

6. A data processing apparatus, wherein the apparatus is located at a management server, comprising:

a processing unit, configured to: determine a first server set according to the identifier of the file to be processed, wherein the first server set comprises at least one first server, and the first server stores one or more of the at least one file block; acquire a first duration corresponding to each first server in the first server set, wherein the first duration is the total time from when the management server receives the file processing request to when the first server finishes processing the file processing request, and the first duration is updated after the file processing request is completed; determine a second server set from the first server set according to the first duration, wherein the second server set comprises at least one second server; and copy any one or more of the at least one file block stored by each of the at least one second server to a target server in a target server set, wherein the target server set comprises at least one target server, and the target server is a server other than the servers in the second server set.

7. The apparatus of claim 6,

the processing unit is configured to: record the waiting time before each first server in the first server set processes the file processing request; record the time spent by each first server in the first server set in processing the file processing request; and determine the first duration according to the waiting time and the time spent processing the file processing request, wherein the first duration is the result of actually executing the completed request.

8. The apparatus of claim 6,

the processing unit is configured to determine the first server with the longest first duration as the second server.

9. The apparatus of claim 6,

the processing unit is further configured to: acquire a third server set, wherein the third server set comprises at least one third server, and the third server is a server that does not store file blocks of the file to be processed; acquire the remaining computing resources of each third server in the third server set; and select a target server from the third server set based on the remaining computing resources.

10. The apparatus of claim 6, wherein the file processing request is generated by a file management system;

the processing unit is further configured to: obtain an identifier of each target server in the target server set; and send a directory update request to the file management system, the directory update request including the identifier of the target server.
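As a non-limiting illustration, the operations recited in claims 7, 9 and 10 might be sketched in Python as follows. All names (`RequestTiming`, `third_servers`, `select_target`, `directory_update_request`), the rule of picking the server with the most remaining computing resources, and the dictionary layout of the directory update request are assumptions introduced here; the claims do not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps in seconds recorded for one first server (hypothetical fields)."""
    received_at: float  # management server received the file processing request
    started_at: float   # the first server began processing the request
    finished_at: float  # the first server completed the request

    @property
    def waiting_time(self) -> float:
        return self.started_at - self.received_at

    @property
    def processing_time(self) -> float:
        return self.finished_at - self.started_at

    @property
    def first_duration(self) -> float:
        # Derived from recorded timestamps of a completed request, not an estimate.
        return self.waiting_time + self.processing_time


def third_servers(blocks_by_server, file_blocks):
    """Names of servers that store no file block of the file to be processed."""
    return [name for name, blocks in blocks_by_server.items()
            if not (blocks & file_blocks)]


def select_target(candidates, remaining_resources):
    """Pick the candidate with the most remaining computing resources.

    Assumed rule: the claims only state that the selection is based on the
    remaining computing resources, not how they are compared.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda name: remaining_resources[name])


def directory_update_request(target_ids):
    """Build a hypothetical directory update message for the file management system."""
    return {"type": "directory_update", "target_servers": list(target_ids)}


if __name__ == "__main__":
    timing = RequestTiming(received_at=0.0, started_at=3.0, finished_at=8.0)
    print(timing.first_duration)
    idle = third_servers({"s1": {"f.b0"}, "s2": set(), "s3": {"g.b0"}}, {"f.b0"})
    target = select_target(idle, {"s2": 0.25, "s3": 0.75})
    print(directory_update_request([target]))
```

The remaining computing resources are modeled as a single number per server for brevity; a real system would likely combine several measurements such as CPU, memory and I/O load.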