WO2018045820A1 - File synchronization method, device, and system

Info

Publication number: WO2018045820A1
Application number: PCT/CN2017/092523
Authority: WO
Inventors: 赵彦荣; 梁殿鹏; 崔鑫
Original assignee: 华为技术有限公司
Priority date: 2016-09-07
Filing date: 2017-07-11
Publication date: 2018-03-15
Also published as: CN106372221B; CN106372221A

Abstract

A file synchronization method, device, and system. The method comprises the following steps: a copy management device sending a last directory operation number of a previously synchronized file to a name node of a first HDFS cluster; receiving information about a file to be synchronized sent by the name node of the first HDFS cluster, the information about the file to be synchronized being information about a file corresponding to a directory operation number that is subsequent to the last directory operation number and determined by the name node of the first HDFS cluster; determining a target synchronization task according to the information about the file to be synchronized; a target copy execution device synchronizing, according to information about at least one file in the target synchronization task, the at least one file from a source data node to a destination data node. The synchronization method does not need to scan the entire file directory, thus improving the efficiency of file synchronization.

Description

Method, device, and system for file synchronization

Technical field

The present invention relates to the field of data processing technologies, and in particular to a method, device, and system for file synchronization.

Background

The Hadoop Distributed File System (HDFS) is a highly fault-tolerant system that provides high-throughput data access and is well suited to deploying big data services.

An HDFS deployment can include multiple HDFS clusters. Each HDFS cluster uses a master-slave layout: it includes one master node (NameNode, NN) and several data nodes (DataNode, DN). The master node manages the metadata of the HDFS cluster, for example the information about the files stored in the cluster. The data nodes store the files.

In the current big data environment, data synchronization, especially across regions and across clusters, is becoming increasingly important and has broad application demand.

At present, file synchronization for HDFS clusters uses the distributed copy (Distcp) scheme provided by Hadoop. Distcp is a tool for file synchronization within or between large-scale clusters. Distcp uses Map/Reduce to implement file synchronization, that is, files are synchronized along the path source data node, Map/Reduce node, destination data node. Two versions of Distcp currently exist: Distcp1 and Distcp2.

Distcp1 scans the entire directory and generates a directory list, then partitions that list by file name and file size into several splits. One map task corresponds to one split, and each split is a file list containing several files. Map/Reduce nodes with map tasks are then started to perform the copy, and each map task is responsible for synchronizing all the files in one split.
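The partitioning step described above can be illustrated with a minimal sketch. The text only says that the directory list is divided by file name and file size; the greedy size-capped grouping below, the function name `make_splits`, and the `max_bytes` parameter are assumptions chosen for illustration and are not taken from Distcp itself.

```python
from typing import List, Tuple


def make_splits(listing: List[Tuple[str, int]], max_bytes: int) -> List[List[str]]:
    """Group (path, size) entries into splits, one split per map task.

    Assumed policy: fill a split until adding the next file would exceed
    max_bytes, then start a new split. A file larger than max_bytes gets
    a split of its own.
    """
    splits: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for path, size in listing:
        if current and current_bytes + size > max_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# The listing itself comes from a scan of the entire directory, which is
# the cost the background section identifies.
listing = [("/a", 60), ("/b", 50), ("/c", 30), ("/d", 200)]
splits = make_splits(listing, max_bytes=100)
```

Whatever the split policy, the input `listing` has to cover every file in the directory, so the scan cost grows with the directory size rather than with the number of changed files.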

Distcp2 is an improvement on Distcp1, but it still has to scan the entire directory to generate the directory list.

Thus, the copy schemes of both Distcp1 and Distcp2 need to scan the entire directory, which makes file synchronization inefficient.

Summary of the invention

To solve the prior-art problem of inefficient file synchronization between HDFS clusters, embodiments of the present invention provide a file synchronization method. Based on the directory end operation number of the previous synchronization, the method directly determines the information about the files corresponding to the directory operation numbers that follow that directory end operation number, thereby determines the information about the files to be synchronized, and synchronizes those files. The entire file directory does not need to be scanned, which improves the efficiency of file synchronization. Embodiments of the present invention also provide corresponding devices and systems.

A first aspect of the present invention provides a file synchronization method applied to a Hadoop Distributed File System (HDFS). The HDFS includes a copy management device, at least one copy execution device, and multiple HDFS clusters. Each HDFS cluster includes a master node and at least two data nodes; for each HDFS cluster, the at least two data nodes store files, and the master node maintains information about the files stored by the at least two data nodes in that cluster. The multiple HDFS clusters include a first HDFS cluster. The method includes: the copy management device sends the directory end operation number of the previously synchronized files to the master node of the first HDFS cluster; the copy management device receives information about the files to be synchronized from the master node of the first HDFS cluster, where this information is the information, determined by the master node of the first HDFS cluster, about the files corresponding to the directory operation numbers that follow the directory end operation number; the copy management device determines at least one synchronization task according to the information about the files to be synchronized, where each synchronization task contains information about at least one of the files to be synchronized; and, after receiving a task request from a target copy execution device, the copy management device sends a target synchronization task to the target copy execution device. The target synchronization task is used by the target copy execution device to synchronize the at least one file from a source data node to a destination data node according to the information about the at least one file in the target synchronization task, where the source data node belongs to the first HDFS cluster.

File synchronization may be performed periodically, for example once per period, and the period length may be preset. The current synchronization builds on the immediately preceding one; "previous" and "current" refer to two consecutive periods. The directory end operation number may be the largest directory operation number among those covered by the previous synchronization. "Following" means contiguous in order, for example in ascending order: if the directory end operation number is 123, the following directory operation number is 124. There may be more than one following directory operation number: every operation number that continues the sequence from the directory end operation number and is larger than it can be a following directory operation number, so 124, 125, and 126 all follow 123. The files to be synchronized include at least one file. The target copy execution device is one of the at least one copy execution device, and the target synchronization task is one of the at least one synchronization task. As the first aspect shows, the current synchronization can use the directory end operation number of the previous synchronization to directly determine the information about the files corresponding to the directory operation numbers that follow it, and thereby the information about the files to be synchronized. The entire file directory does not need to be scanned, which improves the efficiency of file synchronization.
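The copy management device's role in the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class `CopyManager`, the callable type `FetchPending`, the `files_per_task` grouping, and the choice to advance the stored end operation number as soon as a round is planned are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

# Hypothetical interface to the master node of the first cluster: given the
# directory end operation number of the previous synchronization, it returns
# (operation number, file path) pairs for every later operation.
FetchPending = Callable[[int], List[Tuple[int, str]]]


class CopyManager:
    def __init__(self, fetch_pending: FetchPending, last_op_no: int,
                 files_per_task: int = 2) -> None:
        self.fetch_pending = fetch_pending
        self.last_op_no = last_op_no
        self.files_per_task = files_per_task
        self.tasks: Deque[List[str]] = deque()

    def plan_round(self) -> None:
        """Ask the master node for files changed since last_op_no and queue tasks."""
        pending = self.fetch_pending(self.last_op_no)
        if not pending:
            return
        # A file touched by several operations is synchronized once.
        paths = list(dict.fromkeys(path for _, path in pending))
        for i in range(0, len(paths), self.files_per_task):
            self.tasks.append(paths[i:i + self.files_per_task])
        # Assumption: the end number advances at planning time; a real system
        # would more likely advance it only after the tasks succeed.
        self.last_op_no = max(op_no for op_no, _ in pending)

    def on_task_request(self) -> Optional[List[str]]:
        """Hand the next queued task to a copy execution device that asks for one."""
        return self.tasks.popleft() if self.tasks else None


def fake_master_node(last_op_no: int) -> List[Tuple[int, str]]:
    ops = [(122, "/logs/a"), (123, "/logs/b"), (124, "/logs/c"),
           (125, "/logs/d"), (126, "/logs/c")]
    return [(n, p) for n, p in ops if n > last_op_no]


manager = CopyManager(fake_master_node, last_op_no=123)
manager.plan_round()
first_task = manager.on_task_request()
```

With an end operation number of 123, only operations 124 to 126 are considered, so the task covers `/logs/c` and `/logs/d` without any directory scan.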

With reference to the first aspect, in a first possible implementation, the multiple HDFS clusters further include a second HDFS cluster, and the destination data node belongs to the second HDFS cluster. The information about the at least one file corresponds to address information of the source data blocks that store the at least one file. The information about the at least one file is used by the target copy execution device to obtain, from the master node of the first HDFS cluster, the address information of the source data blocks corresponding to that information. The address information of the source data blocks is used by the target copy execution device to determine the number of source data blocks. The number of source data blocks is used by the target copy execution device to obtain address information of destination data blocks from the master node of the second HDFS cluster; the master node of the second HDFS cluster allocates this address information for the at least one file according to the number of source data blocks. The address information of the source data blocks and of the destination data blocks is used by the target copy execution device to instruct the source data node to synchronize the at least one file from the source data blocks to the destination data blocks. As this implementation shows, file synchronization can be cross-cluster. Determining the files directly from the directory operation numbers that follow the directory end operation number of the previous synchronization improves the efficiency of file synchronization between clusters.

A second aspect of the present invention provides a file synchronization method applied to a Hadoop Distributed File System (HDFS). The HDFS includes a copy management device, at least one copy execution device, and multiple HDFS clusters. Each HDFS cluster includes a master node and at least two data nodes; for each HDFS cluster, the at least two data nodes store files, and the master node maintains information about the files stored by the at least two data nodes in that cluster. The multiple HDFS clusters include a first HDFS cluster. The method includes: a copy execution device receives a target synchronization task sent by the copy management device, where the target synchronization task is one of at least one synchronization task determined by the copy management device according to information about the files to be synchronized, each synchronization task contains information about at least one of the files to be synchronized, and the information about the files to be synchronized is the information, determined by the master node of the first HDFS cluster after the copy management device sends it the directory end operation number of the previously synchronized files, about the files corresponding to the directory operation numbers that follow the directory end operation number; and the copy execution device synchronizes the at least one file from a source data node to a destination data node according to the information about the at least one file in the target synchronization task, where the source data node belongs to the first HDFS cluster. As the second aspect shows, the current synchronization can use the directory end operation number of the previous synchronization to directly determine the information about the files corresponding to the directory operation numbers that follow it, and thereby the information about the files to be synchronized. The entire file directory does not need to be scanned, which improves the efficiency of file synchronization.

With reference to the second aspect, in a first possible implementation, the multiple HDFS clusters include a second HDFS cluster, and the destination data node belongs to the second HDFS cluster. The information about the at least one file corresponds to address information of the source data blocks that store the at least one file. The step in the second aspect in which the copy execution device synchronizes the at least one file from the source data node to the destination data node according to the information about the at least one file in the target synchronization task includes: the target copy execution device obtains, from the master node of the first HDFS cluster and according to the information about the at least one file, the address information of the source data blocks corresponding to that information; the target copy execution device determines the number of source data blocks according to the address information of the source data blocks; the target copy execution device obtains address information of destination data blocks from the master node of the second HDFS cluster according to the number of source data blocks, where the master node of the second HDFS cluster allocates this address information for the at least one file according to the number of source data blocks; and the target copy execution device sends a synchronization indication message to the source data node, where the message contains the address information of the source data blocks and of the destination data blocks, which the source data node uses to synchronize the at least one file from the source data blocks to the destination data blocks. As this implementation shows, file synchronization can be cross-cluster. Determining the files directly from the directory operation numbers that follow the directory end operation number of the previous synchronization improves the efficiency of file synchronization between clusters.
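The four steps the copy execution device performs for one file can be sketched as below. The two callables stand in for requests to the master nodes of the first and second clusters; their names, the string form of the block addresses, and the `SyncInstruction` record are hypothetical and only illustrate the order of the steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SyncInstruction:
    """One (source block, destination block) pair for a synchronization indication."""
    source_block: str
    dest_block: str


def plan_file_copy(
    path: str,
    lookup_source_blocks: Callable[[str], List[str]],
    allocate_dest_blocks: Callable[[str, int], List[str]],
) -> List[SyncInstruction]:
    # Step 1: ask the master node of the first cluster where the file's blocks are.
    source_blocks = lookup_source_blocks(path)
    # Step 2: derive the number of source blocks from the address information.
    count = len(source_blocks)
    # Step 3: ask the master node of the second cluster for that many destination blocks.
    dest_blocks = allocate_dest_blocks(path, count)
    if len(dest_blocks) != count:
        raise RuntimeError("destination cluster returned the wrong number of blocks")
    # Step 4: pair the addresses; each pair is the payload of an indication
    # that would be sent to the source data node.
    return [SyncInstruction(s, d) for s, d in zip(source_blocks, dest_blocks)]


# Stand-ins for the two master nodes (illustrative data only).
source_index: Dict[str, List[str]] = {"/logs/c": ["dn1:blk_1", "dn2:blk_2"]}


def lookup(path: str) -> List[str]:
    return source_index[path]


def allocate(path: str, count: int) -> List[str]:
    return [f"dn9:blk_{i}" for i in range(count)]


instructions = plan_file_copy("/logs/c", lookup, allocate)
```

In this sketch the file data never passes through the copy execution device; the device only brokers addresses, which matches the description that the source data node performs the block-to-block synchronization.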

With reference to the first possible implementation of the second aspect, in a second possible implementation, the step in which the copy execution device sends the synchronization indication message to the source data node includes: when the address information of the source data blocks indicates that there are multiple source data blocks, the copy execution device sends the synchronization indication message to the source data node in parallel, one message per source data block. As this implementation shows, the synchronization indication messages can be sent in parallel for the individual source data blocks, so a file can be synchronized in parallel, which further improves the efficiency of file synchronization.
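A minimal sketch of per-block parallel sending follows. The text does not say how the parallelism is realized; the thread pool, the function `send_all`, and the `send` callable (a placeholder for whatever transport carries the synchronization indication message) are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def send_all(
    pairs: List[Tuple[str, str]],
    send: Callable[[str, str], str],
    max_workers: int = 4,
) -> List[str]:
    """Send one indication per (source block, destination block) pair.

    With a single block there is nothing to parallelize, so the message is
    sent inline; with several blocks the sends run concurrently.
    """
    if len(pairs) <= 1:
        return [send(src, dst) for src, dst in pairs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(lambda pair: send(*pair), pairs))


def fake_send(src: str, dst: str) -> str:
    # Placeholder transport: a real system would contact the source data node here.
    return f"{src}->{dst}"


results = send_all([("blk_1", "blk_a"), ("blk_2", "blk_b")], fake_send)
```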

A third aspect of the present invention provides a file synchronization method applied to a Hadoop Distributed File System (HDFS). The HDFS includes a copy management device, at least one copy execution device, and multiple HDFS clusters. Each HDFS cluster includes a master node and at least two data nodes; for each HDFS cluster, the at least two data nodes store files, and the master node maintains information about the files stored by the at least two data nodes in that cluster. The multiple HDFS clusters include a first HDFS cluster. When the master node belongs to the first HDFS cluster, the method includes: the master node receives the directory end operation number of the previously synchronized files sent by the copy management device; the master node determines, from the directory operation numbers of the files, the directory operation numbers that follow the directory end operation number, and determines the information about the files to be synchronized corresponding to those following directory operation numbers; and the master node sends the information about the files to be synchronized to the copy management device. The information about the files to be synchronized is used by the copy management device to determine at least one synchronization task, where each synchronization task contains information about at least one of the files to be synchronized, and the information about the at least one file is used by the target copy execution device to synchronize the at least one file from a source data node to a destination data node, where the source data node belongs to the first HDFS cluster. As the third aspect shows, the current synchronization can use the directory end operation number of the previous synchronization to directly determine the information about the files corresponding to the directory operation numbers that follow it, and thereby the information about the files to be synchronized. The entire file directory does not need to be scanned, which improves the efficiency of file synchronization.
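On the master node side, finding the operations that follow a given end operation number amounts to a lookup in an ordered log. The sketch below assumes that operation numbers are strictly increasing and that each operation is associated with one file path; the class `DirectoryOpLog` and its methods are hypothetical and do not correspond to an actual NameNode interface.

```python
from bisect import bisect_right
from typing import List, Tuple


class DirectoryOpLog:
    """Hypothetical in-memory record of numbered directory operations."""

    def __init__(self) -> None:
        self._numbers: List[int] = []  # operation numbers, ascending
        self._paths: List[str] = []    # file touched by each operation

    def record(self, op_no: int, path: str) -> None:
        if self._numbers and op_no <= self._numbers[-1]:
            raise ValueError("operation numbers must be strictly increasing")
        self._numbers.append(op_no)
        self._paths.append(path)

    def pending_after(self, last_op_no: int) -> List[Tuple[int, str]]:
        """Return every (operation number, path) recorded after last_op_no."""
        # Binary search for the first number greater than last_op_no, so the
        # cost depends on the log position, not on the size of the directory.
        start = bisect_right(self._numbers, last_op_no)
        return list(zip(self._numbers[start:], self._paths[start:]))


log = DirectoryOpLog()
for op_no, path in [(121, "/data/x"), (122, "/data/y"), (123, "/data/z"),
                    (124, "/data/p"), (125, "/data/q")]:
    log.record(op_no, path)

pending = log.pending_after(123)
```

The result contains only operations 124 and 125, which mirrors the example in the first aspect where the numbers after 123 are the following directory operation numbers.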

With reference to the third aspect, in a first possible implementation, the information about the at least one file corresponds to address information of the source data blocks that store the at least one file. After the step in the third aspect in which the master node sends the information about the files to be synchronized to the copy management device, the method further includes: the master node receives the information about the at least one file sent by the target copy execution device; the master node determines the address information of the source data blocks corresponding to the information about the at least one file; and the master node sends the address information of the source data blocks to the target copy execution device, which uses it to determine the number of source data blocks. As this implementation shows, the master node can determine the address information of the source data blocks, so the target copy execution device can send synchronization indication messages to the source data node in parallel, one per source data block, and the file can be synchronized in parallel, which further improves the efficiency of file synchronization.

With reference to the third aspect, in a second possible implementation, the multiple HDFS clusters further include a second HDFS cluster, and the destination data node belongs to the second HDFS cluster. When the master node belongs to the second HDFS cluster, the method further includes: the master node receives the number of source data blocks sent by the target copy execution device; the master node creates destination data blocks for the at least one file according to the number of source data blocks and allocates address information for the destination data blocks; and the master node sends the address information of the destination data blocks to the target copy execution device. The address information of the source data blocks and of the destination data blocks is used by the target copy execution device to instruct the source data node to synchronize the at least one file from the source data blocks to the destination data blocks. As this implementation shows, the master node can create the required number of destination data blocks in one step according to the number of source data blocks instead of creating them one by one, which further improves the efficiency of file synchronization.

A fourth aspect of the present invention provides a file synchronization method applied to a Hadoop Distributed File System (HDFS). The HDFS includes a copy management device and multiple HDFS clusters. Each HDFS cluster includes a master node and at least two data nodes; for each HDFS cluster, the at least two data nodes store files, and the master node maintains information about the files stored by the at least two data nodes in that cluster. The multiple HDFS clusters include a first HDFS cluster. The method includes: the copy management device sends a synchronization message to the master node of the first HDFS cluster, where the synchronization message instructs that master node to scan the file directory and generate a list of the file information it currently maintains; after receiving the list, the copy management device compares it with the list of the previously synchronized files to determine the information about the files to be synchronized; and the copy management device sends a synchronization task to the source data node, where the synchronization task instructs the source data node to synchronize the files to be synchronized directly to the destination data node. The files to be synchronized may be distributed over multiple source data blocks, and the synchronization may be performed in parallel: the source data node may synchronize the files on the individual source data blocks to the destination data blocks of the destination data node in parallel. As the fourth aspect shows, the source data node synchronizes the files to be synchronized directly to the destination data node, without relaying through the copy management device, which shortens the synchronization path and improves the efficiency of file synchronization. In addition, file synchronization can proceed in parallel per data block, which further improves the efficiency of file synchronization.
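The list comparison in the fourth aspect can be sketched as a dictionary difference. The text does not specify what the file information in the list contains; the sketch assumes each entry maps a path to a modification stamp, and the function name `files_to_sync` is hypothetical.

```python
from typing import Dict, List


def files_to_sync(current: Dict[str, int], previous: Dict[str, int]) -> List[str]:
    """Return paths that are new or whose stamp differs from the previous list.

    Assumption: each list maps a file path to a modification stamp. Files
    that were deleted since the previous round are ignored in this sketch.
    """
    return sorted(
        path for path, stamp in current.items() if previous.get(path) != stamp
    )


previous_listing = {"/a": 1, "/b": 1}
current_listing = {"/a": 1, "/b": 2, "/c": 1}
changed = files_to_sync(current_listing, previous_listing)
```

Unlike the first aspect, this variant still relies on a scan to build `current_listing`; the gain described in the text comes from the source data node copying directly to the destination data node.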

A fifth aspect of the present invention provides a file synchronization apparatus configured to implement the functions of the method provided by the first aspect or any optional implementation of the first aspect. The apparatus is implemented in software that includes units corresponding to those functions. The units may include a receiving unit, a processing unit, and a sending unit, which are communicatively connected: the receiving unit implements the corresponding receiving functions, the sending unit implements the corresponding sending functions, and the processing unit implements the corresponding processing functions.

A sixth aspect of the present invention provides a file synchronization apparatus configured to implement the functions of the method provided by the second aspect or any optional implementation of the second aspect. The apparatus is implemented in software that includes units corresponding to those functions. The units may include a receiving unit, a processing unit, and a sending unit, which are communicatively connected: the receiving unit implements the corresponding receiving functions, the sending unit implements the corresponding sending functions, and the processing unit implements the corresponding processing functions.

A seventh aspect of the present invention provides a file synchronization apparatus configured to implement the functions of the method provided by the third aspect or any optional implementation of the third aspect. The apparatus is implemented in software that includes units corresponding to those functions. The units may include a receiving unit, a processing unit, and a sending unit, which are communicatively connected: the receiving unit implements the corresponding receiving functions, the sending unit implements the corresponding sending functions, and the processing unit implements the corresponding processing functions.

An eighth aspect of the present invention provides a file synchronization apparatus configured to implement the functions of the method provided by the fourth aspect. The apparatus is implemented in software that includes units corresponding to those functions. The units may include a receiving unit, a processing unit, and a sending unit, which are communicatively connected: the receiving unit implements the corresponding receiving functions, the sending unit implements the corresponding sending functions, and the processing unit implements the corresponding processing functions.

A ninth aspect of the present invention provides a copy management device configured to implement the functions of the method provided by the first aspect or any optional implementation of the first aspect. The device is implemented in hardware that includes components corresponding to those functions. The components may include a transceiver, a processor, and a memory connected by a bus: the memory stores the program that the processor executes to perform file synchronization, the transceiver implements the corresponding sending and receiving functions, and the processor implements the corresponding processing functions.

本发明第十方面提供一种复制执行设备，该复制执行设备被配置实现上述第二方面或第二方面任一可选的实现方式所提供的方法的功能，由硬件实现，其硬件包括与上述功能相应的器件，与上述功能相应的器件可以包括收发器、处理器和存储器，该收发器、所述处理器和所述存储器通过总线连接，存储器用于存储处理器执行文件同步的程序，收发器用于实现相应的收发功能，处理器用于实现相应的处理功能。A tenth aspect of the present invention provides a copy execution device, which is configured to implement the functions of the method provided by any of the foregoing second aspect or the optional implementation of the second aspect, implemented by hardware, and the hardware includes the foregoing The function corresponding device, the device corresponding to the above function may include a transceiver, a processor and a memory, the transceiver, the processor and the memory are connected by a bus, and the memory is used for storing a program for executing a file synchronization by the processor, and transmitting and receiving The device is used to implement the corresponding transceiving function, and the processor is used to implement the corresponding processing function.

本发明第十一方面提供一种主节点，该主节点被配置实现上述第三方面或第三方面任一可选的实现方式所提供的方法的功能，由硬件实现，其硬件包括与上述功能相应的器件，与上述功能相应的器件可以包括收发器、处理器和存储器，该收发器、所述处理器和所述存储器通过总线连接，存储器用于存储处理器执行文件同步的程序，收发器用于实现相应的收发功能，处理器用于实现相应的处理功能。An eleventh aspect of the present invention provides a master node, where the master node is configured to implement the functions of the method provided by any of the foregoing third aspect or the third aspect, which is implemented by hardware, and the hardware includes the foregoing functions. Corresponding device, the device corresponding to the above function may include a transceiver, a processor and a memory, the transceiver, the processor and the memory are connected by a bus, and the memory is used for storing a program for executing a file synchronization by the processor, for the transceiver To implement the corresponding transceiving function, the processor is used to implement the corresponding processing function.

本发明第十二方面提供一种复制管理设备，该复制管理设备被配置实现上述第四方面所提供的方法的功能，由硬件实现，其硬件包括与上述功能相应的器件，与上述功能相应的器件可以包括收发器、处理器和存储器，该收发器、所述处理器和所述存储器通过总线连接，存储器用于存储处理器执行文件同步的程序，收发器用于实现相应的收发功能，处理器用于实现相应的处理功能。A twelfth aspect of the present invention provides a copy management device configured to implement the functions of the method provided by the foregoing fourth aspect, implemented by hardware, the hardware comprising a device corresponding to the foregoing function, corresponding to the foregoing function The device can include a transceiver, a processor, and a memory, the transceiver, the processor and the memory are connected by a bus, the memory is used to store a program for the processor to perform file synchronization, and the transceiver is configured to implement a corresponding transceiving function, and the processor To achieve the corresponding processing functions.

A thirteenth aspect of the present invention provides a computer storage medium storing a file synchronization program for the first aspect or any optional implementation of the first aspect.

A fourteenth aspect of the present invention provides a computer storage medium storing a file synchronization program for the second aspect or any optional implementation of the second aspect.

A fifteenth aspect of the present invention provides a computer storage medium storing a file synchronization program for the third aspect or any optional implementation of the third aspect.

A sixteenth aspect of the present invention provides a computer storage medium storing a file synchronization program for the fourth aspect.

A seventeenth aspect of the present invention provides a Hadoop distributed file system, including a replication management device, at least one replication execution device, and multiple HDFS clusters. Each HDFS cluster includes a master node and at least two data nodes. For each HDFS cluster, the at least two data nodes store files, and the master node maintains information about the files stored on the at least two data nodes in that cluster. The multiple HDFS clusters include a first HDFS cluster. The replication management device is the file synchronization apparatus described in the fifth aspect, the replication execution device is the file synchronization apparatus described in the sixth aspect, and the master node is the file synchronization apparatus described in the seventh aspect.

An eighteenth aspect of the present invention provides a Hadoop distributed file system, including a replication management device, at least one replication execution device, and multiple HDFS clusters. Each HDFS cluster includes a master node and at least two data nodes. For each HDFS cluster, the at least two data nodes store files, and the master node maintains information about the files stored on the at least two data nodes in that cluster. The multiple HDFS clusters include a first HDFS cluster. The replication management device is the replication management device described in the thirteenth aspect, the replication execution device is the replication execution device described in the fourteenth aspect, and the master node is the master node described in the fifteenth aspect.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a network structure of a Hadoop distributed file system (HDFS);

FIG. 2 is a schematic diagram of a network structure of a Hadoop distributed file system (HDFS) according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an example of parallel synchronization by data block according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of another network structure of a Hadoop distributed file system (HDFS) according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an embodiment of a file synchronization method according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another embodiment of a file synchronization method according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a replication management device, a replication execution device, or a master node that takes the form of a host, according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention provide a file synchronization method in which, based on the directory end operation number of the previous synchronization, the information about the files corresponding to the directory operation numbers that follow that directory end operation number is determined directly. This yields the information about the files to be synchronized, and those files are then synchronized without scanning the entire file directory, which improves file synchronization efficiency. Embodiments of the present invention also provide corresponding devices and a system. Each is described in detail below.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

FIG. 1 is a schematic diagram of a network structure of a Hadoop Distributed File System (HDFS).

A current HDFS usually includes a map/reduce device and multiple HDFS clusters. Each HDFS cluster includes a master node and at least two data nodes. For each HDFS cluster, the at least two data nodes store files, and the master node maintains information about the files stored on the at least two data nodes in that cluster.

As shown in FIG. 1, the HDFS includes a map/reduce device 10, a first HDFS cluster, and a second HDFS cluster. FIG. 1 shows only the first and second HDFS clusters, but this should not be understood as a limitation on the number of HDFS clusters. The first HDFS cluster includes a master node 20A, a data node 30A, and a data node 30B. The second HDFS cluster includes a master node 20B, a data node 30C, and a data node 30D. Only two data nodes are shown in each cluster, but this should not be understood as a limitation on the number of data nodes in an HDFS cluster.

The current file synchronization process in HDFS is as follows. The map/reduce device 10 sends a synchronization message to the master node 20A in the first HDFS cluster. The synchronization message instructs the master node 20A to scan the file directory and generate a list of the file information that the master node 20A currently maintains. After receiving the list, the map/reduce device 10 compares it with the list of files from the previous synchronization to determine the information about the files to be synchronized. Based on that information, the map/reduce device 10 determines that a file to be synchronized is on the data node 30B and that the destination data node to which the file must be synchronized is the data node 30C. The map/reduce device 10 sends a request for the file to the data node 30B, the data node 30B returns the file to the map/reduce device 10 in response, and the map/reduce device 10 sends the received file to the destination data node 30C. As this description shows, the current file synchronization scheme has to scan the entire directory, which makes file synchronization inefficient.

To address the low efficiency of current file synchronization, an embodiment of the present invention provides a Hadoop distributed file system.

The Hadoop distributed file system provided by this embodiment includes a replication management device, at least one replication execution device, and multiple HDFS clusters. Each HDFS cluster includes a master node and at least two data nodes. For each HDFS cluster, the at least two data nodes store files, and the master node maintains information about the files stored on the at least two data nodes in that cluster.

As shown in FIG. 2, the HDFS includes a replication management device 40, a replication execution device 50, a first HDFS cluster, and a second HDFS cluster. FIG. 2 shows only the first and second HDFS clusters, but this should not be understood as a limitation on the number of HDFS clusters. The first HDFS cluster includes a master node 20A, a data node 30A, and a data node 30B. The second HDFS cluster includes a master node 20B, a data node 30C, and a data node 30D. Only two data nodes are shown in each cluster, but this should not be understood as a limitation on the number of data nodes in an HDFS cluster. Likewise, only one replication execution device 50 is shown, but this should not be understood as a limitation on the number of replication execution devices.

In this embodiment, the replication management device 40 runs the Replication Manager program, and the replication execution device 50 runs the Replication Executor program. In practice, the Replication Manager may run on a standalone device, or on a user device, a master node, or a data node. Whichever device it runs on acts as the replication management device 40 of this embodiment. The Replication Executor may run on a data node, in which case that data node acts as the replication execution device 50 of this embodiment. A single data node may run multiple Replication Executor programs, that is, one data node may play the role of multiple replication execution devices 50. This paragraph explains the forms that the replication management device 40 and the replication execution device 50 may take in the HDFS, but these forms need not be considered when performing file synchronization: whatever form the two devices take, the file synchronization process is the same.

The file synchronization process in this embodiment is described below with reference to FIG. 2.

The replication management device 40 sends the directory end operation number of the previously synchronized files to the master node 20A of the first HDFS cluster. File synchronization may be performed periodically, for example once per period, and the period length may be preset. The current synchronization builds on the immediately preceding one: "previous" and "current" refer to two consecutive periods. The directory end operation number may be the largest directory operation number among those covered by the previous synchronization.

After receiving the directory end operation number of the previously synchronized files from the replication management device 40, the master node 20A determines, from the directory operation numbers of the files, the directory operation numbers that follow the directory end operation number, and determines the information about the files to be synchronized that corresponds to those numbers. "Follow" means being next in sequence. For example, with numbers assigned in ascending order, if the directory end operation number is 123, the following directory operation number is 124. There may be more than one following directory operation number: every operation number that is larger than the directory end operation number and continues the sequence from it counts. For example, 124, 125, and 126 are all directory operation numbers that follow 123.

In each master node, every operation produces a directory operation number. The operation may be a creation, a modification, a deletion, and so on. Whatever the type of operation, the master node records it and generates a corresponding directory operation number. In this embodiment, directory operation numbers may be recorded starting from 1 and incremented by 1 each time. Other numbering schemes are also possible as long as the numbers change in a fixed direction. The specific scheme is not limited.
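The numbering scheme described above can be illustrated with a minimal sketch. The `OperationLog` class, its `record` method, and the example paths are hypothetical names introduced only for illustration. They are not part of HDFS or of the described embodiment, which only requires that each operation receives a number that changes in a fixed direction.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class OperationLog:
    """Hypothetical in-memory stand-in for the master node's operation record."""

    # Numbers start at 1 and increase by 1, as in the example numbering scheme.
    _numbers: count = field(default_factory=lambda: count(1))
    entries: list = field(default_factory=list)

    def record(self, op_type: str, path: str) -> int:
        """Record one operation of any type and return its directory operation number."""
        number = next(self._numbers)
        self.entries.append((number, op_type, path))
        return number


log = OperationLog()
log.record("create", "/data/fileA")
log.record("modify", "/data/fileA")
log.record("delete", "/data/fileB")
```

Every operation type receives a number, so the largest recorded number always marks the most recent change.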

Each directory operation number corresponds to the information about a file, and multiple directory operation numbers may correspond to the same file. For example, numbers 100 through 200 may all correspond to one file, or numbers 100-105 and 130-137 may correspond to one file. The numbers are generated purely in the order in which operations occur and are not reserved for any particular file. Nevertheless, each directory operation number maps to the information about a file, and that information may include the identifier of the source data node where the file is stored and the identifier of the destination data node to which it is to be synchronized. For example, the information may be fileA/fromDN30BtoDN30C, meaning that file A is to be synchronized from the data node 30B to the data node 30C. This is only an example, and the information may be expressed in other forms.

If the directory end operation number is 1200, the master node can determine that the largest directory operation number covered by the previous synchronization is 1200, so the files corresponding to directory operation numbers up to and including 1200 do not need to be synchronized again. The current synchronization only needs to start from 1201: all directory operation numbers greater than 1200 were generated after the information about the previously synchronized files was sent. The master node can therefore determine the directory operation numbers that follow the directory end operation number. If the current directory operation number has reached 1880, the master node can determine that all directory operation numbers from 1201 to 1880 are new, so the following directory operation numbers are all those from 1201 to 1880. Once these numbers are determined, the master node determines the information about the file to be synchronized that corresponds to each of them. There may be multiple files to be synchronized, each with its own information. The master node 20A then sends the information about the files to be synchronized to the replication management device.
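The selection step in this example can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the `Operation` record, the `files_to_sync` function, and the dictionary layout of the file information are all hypothetical. The sketch keeps only operations numbered above the directory end operation number and merges several operations on the same file into one entry.

```python
from typing import NamedTuple


class Operation(NamedTuple):
    """Hypothetical record: one directory operation and the file info it maps to."""
    number: int
    path: str
    source: str       # identifier of the source data node
    destination: str  # identifier of the destination data node


def files_to_sync(operations, last_synced):
    """Return one info entry per file touched after last_synced, without a full scan."""
    pending = {}
    for op in sorted(operations, key=lambda o: o.number):
        if op.number > last_synced:
            # Later operations on the same file overwrite earlier ones.
            pending[op.path] = {
                "file": op.path,
                "from": op.source,
                "to": op.destination,
                "last_op": op.number,
            }
    return list(pending.values())


operations = [
    Operation(1199, "fileZ", "DN30A", "DN30D"),
    Operation(1200, "fileY", "DN30A", "DN30D"),
    Operation(1201, "fileA", "DN30B", "DN30C"),
    Operation(1202, "fileA", "DN30B", "DN30C"),
    Operation(1880, "fileB", "DN30B", "DN30C"),
]
to_sync = files_to_sync(operations, last_synced=1200)
next_end_number = max(info["last_op"] for info in to_sync)
```

Here `next_end_number` plays the role of the directory end operation number that would be sent at the start of the next synchronization.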

The replication management device 40 receives the information about the files to be synchronized from the master node 20A of the first HDFS cluster and, based on that information, determines at least one synchronization task. Each synchronization task contains the information about at least one of the files to be synchronized. Synchronization tasks are usually divided one file per task: if there are three files to be synchronized this time, three synchronization tasks can be determined. Alternatively, one synchronization task may include multiple files to be synchronized.
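The task division described above might be sketched as a simple chunking function. The name `build_sync_tasks` and the `files_per_task` parameter are assumptions made for illustration. The embodiment states only that each task holds the information about at least one file.

```python
def build_sync_tasks(file_infos, files_per_task=1):
    """Split the files to be synchronized into tasks of at most files_per_task files."""
    if files_per_task < 1:
        raise ValueError("files_per_task must be at least 1")
    return [
        file_infos[start:start + files_per_task]
        for start in range(0, len(file_infos), files_per_task)
    ]


file_infos = ["fileA", "fileB", "fileC"]
one_file_per_task = build_sync_tasks(file_infos)
batched_tasks = build_sync_tasks(file_infos, files_per_task=2)
```

With the default of one file per task, three files produce three tasks, matching the example in the text. A larger `files_per_task` groups several files into one task.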

After receiving a task request from the target replication execution device 50, the replication management device 40 sends a target synchronization task to the target replication execution device. The target replication execution device 50 may be one of multiple replication execution devices, and the target synchronization task may be one of multiple synchronization tasks.
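The request-driven hand-out of tasks can be illustrated with a small queue. The `ReplicationManager` class and its `handle_task_request` method are hypothetical names. First-in, first-out ordering is also an assumption, since the text does not say how the target synchronization task is chosen.

```python
from collections import deque


class ReplicationManager:
    """Hypothetical sketch: hands out one pending task per executor request."""

    def __init__(self, tasks):
        self._pending = deque(tasks)

    def handle_task_request(self, executor_id):
        """Return the next task for the requesting executor, or None if none remain."""
        if not self._pending:
            return None
        return {"executor": executor_id, "task": self._pending.popleft()}


manager = ReplicationManager([["fileA"], ["fileB"]])
first = manager.handle_task_request("executor-50")
second = manager.handle_task_request("executor-51")
third = manager.handle_task_request("executor-50")
```

Because executors pull work only when they ask for it, an executor that finishes early can simply request another task.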

After receiving the target synchronization task from the replication management device 40, the target replication execution device 50 synchronizes at least one file from the source data node 30B to the destination data node 30C according to the information about the at least one file in the target synchronization task. The source data node 30B belongs to the first HDFS cluster.

As the above description shows, the current synchronization uses the directory end operation number of the previous synchronization to determine directly the information about the files corresponding to the directory operation numbers that follow it. This yields the information about the files to be synchronized, and those files are synchronized without scanning the entire file directory, which improves file synchronization efficiency. Moreover, in this scheme the source data node 30B can synchronize the files directly to the destination data node 30C without relaying them through the target replication execution device, which further improves synchronization efficiency.

It should be noted that the files to be synchronized may be located on different data nodes. This embodiment uses a single source data node only as an example. In practice there may be multiple source data nodes and multiple destination data nodes, and the scenario shown in FIG. 2 should not be understood as a limitation on the number of source data nodes.

The process described for FIG. 2 is a cross-cluster file synchronization process in which the destination data node 30C belongs to the second HDFS cluster. The same process can also be applied to file synchronization within a cluster. In that case, too, the entire file directory does not need to be scanned, which likewise improves file synchronization efficiency.

In addition, the master node stores the size of each file and information about the data blocks in which the file resides. The size of a data block is usually fixed, typically 64 MB or 128 MB. A data block is a region of storage space carved out of the storage resources for storing data. Each data block has corresponding address information.
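As a worked illustration of how file size relates to the number of data blocks, the following sketch assumes that a file occupies the smallest number of fixed-size blocks that can hold it and that an empty file occupies no blocks. The `block_count` function is a hypothetical helper, and the 128 MB default is simply one of the two sizes mentioned above.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Return how many fixed-size blocks are needed to hold a file of the given size."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    # Integer ceiling division: a partially filled last block still counts as one block.
    return -(-file_size_bytes // block_size_bytes)


blocks_for_300mb = block_count(300 * MB)
blocks_for_300mb_small = block_count(300 * MB, block_size_bytes=64 * MB)
```

Under these assumptions, a 300 MB file spans three 128 MB blocks or five 64 MB blocks.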

The information about the at least one file in each synchronization task corresponds to the address information of the source data blocks that store the at least one file. Therefore, the step in which the replication execution device synchronizes the at least one file from the source data node to the destination data node according to the information about the at least one file in the target synchronization task may include the following:

The target replication execution device 50 obtains, from the master node 20A of the first HDFS cluster and according to the information about the at least one file, the address information of the source data blocks corresponding to that information;

The target replication execution device 50 determines the number of source data blocks according to the address information of the source data blocks;

The target replication execution device 50 sends the number of source data blocks to the master node 20B of the second HDFS cluster.

The master node 20B of the second HDFS cluster creates destination data blocks for the at least one file according to the number of source data blocks and allocates address information to the destination data blocks.

The master node 20B of the second HDFS cluster sends the address information of the destination data blocks to the target replication execution device.

After receiving the address information of the destination data blocks, the target replication execution device 50 sends a synchronization indication message to the source data node 30B. The synchronization indication message contains the address information of the source data blocks and the address information of the destination data blocks.

The source data node 30B synchronizes the at least one file from the source data blocks to the destination data blocks according to the address information of the source data blocks and the address information of the destination data blocks.
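The exchange above can be sketched from the point of view of the replication execution device. Everything here is a simplified stand-in: `build_sync_instructions` is a hypothetical function, the address strings are invented, and `fake_master_20b` merely imitates a destination master node that creates the requested number of blocks in a single call.

```python
def build_sync_instructions(source_block_addresses, allocate_destination_blocks):
    """Pair each source block address with a destination block address.

    allocate_destination_blocks stands in for the request to the destination
    cluster's master node: it takes a block count and returns that many addresses.
    """
    block_count = len(source_block_addresses)
    destination_block_addresses = allocate_destination_blocks(block_count)
    if len(destination_block_addresses) != block_count:
        raise ValueError("destination master node returned an unexpected number of blocks")
    return [
        {"source_block": src, "destination_block": dst}
        for src, dst in zip(source_block_addresses, destination_block_addresses)
    ]


def fake_master_20b(count):
    """Hypothetical allocator: creates all requested destination blocks at once."""
    return [f"DN30C/blk_{index}" for index in range(count)]


instructions = build_sync_instructions(["DN30B/blk_7", "DN30B/blk_8"], fake_master_20b)
```

Each resulting entry carries the two addresses that the synchronization indication message is described as containing.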

Furthermore, when the address information of the source data blocks indicates that there are multiple source data blocks, the replication execution device sends the synchronization indication messages for the individual source data blocks to the source data node in parallel, and the source data node 30B synchronizes the file content in the source data blocks to the destination data blocks in parallel. The parallel synchronization of file content in data blocks can be understood with reference to FIG. 3. As shown in FIG. 3, the source data node 30B can synchronize the file content in each source data block to the corresponding destination data block in parallel. The master node can thus create the required number of destination data blocks directly from the number of source data blocks, without creating them one by one, which further improves file synchronization efficiency. In addition, the source data node can synchronize in parallel at data block granularity, which improves file synchronization efficiency still further. Finally, in this embodiment files are synchronized directly from the source data node to the destination data node. The prior art requires the map/reduce device 10 to participate, and a map must finish one synchronization before it can accept the next task, so a slowly executing task causes a long-tail effect. In this embodiment, once the target replication execution device has sent the synchronization task to the source data node, the source data node synchronizes the files directly to the destination data node, and the target replication execution device does not need to relay the data. This also resolves the long-tail problem of the prior art.
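Block-level parallel synchronization can be illustrated with a thread pool. This is a sketch only: `sync_blocks_in_parallel`, `copy_block`, and the two dictionaries standing in for block storage are hypothetical, and a real data node would transfer block contents over the network rather than copy values between dictionaries.

```python
from concurrent.futures import ThreadPoolExecutor


def sync_blocks_in_parallel(instructions, copy_block, max_workers=4):
    """Run copy_block once per instruction concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(copy_block, instructions))


# In-memory stand-ins for the source and destination data blocks.
source_blocks = {"src_0": b"part-0", "src_1": b"part-1", "src_2": b"part-2"}
destination_blocks = {}


def copy_block(instruction):
    """Copy one block's content from its source address to its destination address."""
    source_address, destination_address = instruction
    destination_blocks[destination_address] = source_blocks[source_address]
    return destination_address


completed = sync_blocks_in_parallel(
    [("src_0", "dst_0"), ("src_1", "dst_1"), ("src_2", "dst_2")],
    copy_block,
)
```

Because each block has its own source and destination address, the copies do not depend on one another and can proceed concurrently.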

Comparing FIG. 1 with FIG. 2 shows that, in this embodiment, the file synchronization path changes from one that is relayed through the map/reduce device to one that runs directly from the source data node to the destination data node. This shortens the synchronization path and improves file synchronization efficiency.

An embodiment of the present invention also provides another network structure for a Hadoop distributed file system. As shown in FIG. 4, if the Hadoop distributed file system still uses the map/reduce device 10 shown in FIG. 1, the entire file directory still has to be scanned to determine the information about the files to be synchronized. However, the synchronization step itself is adjusted slightly: the map/reduce device 10 can instruct the source data node to synchronize the files directly to the destination data node, so the files no longer need to be relayed through the map/reduce device 10. This also improves file synchronization efficiency.

The file synchronization process has been described above with reference to the network structure diagrams of the Hadoop distributed file system. The file synchronization method in the embodiments of the present invention is described below based on the network architecture of FIG. 2 and in terms of the interaction among the devices in the distributed file system.

如图5所示，本发明实施例提供的文件同步的方法的一实施例包括：As shown in FIG. 5, an embodiment of a method for file synchronization provided by an embodiment of the present invention includes:

601. The copy management device sends the directory end operation number of the previously synchronized files to the master node of the first HDFS cluster.

602. After receiving the directory end operation number of the previously synchronized data files sent by the copy management device, the master node determines, from the directory operation numbers of the files, the directory operation numbers that follow the directory end operation number, and determines the information of the files to be synchronized that correspond to those subsequent directory operation numbers.

603. The copy management device receives the information of the files to be synchronized sent by the master node of the first HDFS cluster.

604. The copy management device determines at least one synchronization task according to the information of the files to be synchronized.

605. The copy management device receives a task request sent by the target copy execution device.

606. The copy management device sends a target synchronization task to the target copy execution device.

607. After receiving the target synchronization task sent by the copy management device, the target copy execution device sends a synchronization indication message to the source data node, where the synchronization indication message contains the information of at least one file, and the source data node belongs to the first HDFS cluster.

608. The source data node synchronizes the at least one file from the source data node to the destination data node according to the synchronization indication message.

Steps 601 to 608 above describe the file synchronization process based on the Hadoop distributed file system shown in FIG. 2. The features and processes involved in steps 601 to 608 can be understood with reference to the corresponding descriptions and examples given for FIG. 2, and are not repeated here.
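Steps 602 and 604 can be sketched as follows: the master node selects only the files whose directory operation numbers follow the last synchronized one, and the copy management device groups them into tasks. This is a minimal sketch under assumptions that are not in the original: `DirOp`, `files_after`, and `split_into_tasks` are hypothetical names, the operation log is an in-memory list, and tasks are formed by a fixed file count, whereas the source does not specify how tasks are divided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirOp:
    """One directory operation recorded by the master node (hypothetical shape)."""
    op_number: int
    path: str


def files_after(op_log, last_synced_op):
    """Step 602 sketch: paths whose operation number follows last_synced_op.

    No directory scan is needed; only the tail of the operation log is read.
    Each path is reported once, in order of its first qualifying operation.
    """
    paths = []
    for op in sorted(op_log, key=lambda o: o.op_number):
        if op.op_number > last_synced_op and op.path not in paths:
            paths.append(op.path)
    return paths


def split_into_tasks(paths, files_per_task):
    """Step 604 sketch: group files to be synchronized into tasks."""
    if files_per_task < 1:
        raise ValueError("files_per_task must be at least 1")
    return [paths[i:i + files_per_task] for i in range(0, len(paths), files_per_task)]
```

For example, with operations numbered 1 to 5 and a last synchronized number of 2, only the files touched by operations 3, 4, and 5 are returned, and each resulting task can be handed to a copy execution device when it requests one.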

The file synchronization method described above can be applied within an HDFS cluster or between HDFS clusters. Another embodiment of file synchronization in the embodiments of the present invention is described below with reference to FIG. 6.

As shown in FIG. 6, another embodiment of file synchronization provided by the embodiments of the present invention includes:

Steps 701 to 706 are the same as steps 601 to 606 in the foregoing embodiment and can be understood with reference to steps 601 to 606.

The master node also stores the size of each file and information about the data blocks in which the file is located, and the information of the at least one file in each synchronization task corresponds to the address information of the source data blocks that store the at least one file. Therefore, after step 706, the method may further include:

707. The target copy execution device sends a query request for the data blocks of the at least one file to the master node of the first HDFS cluster.

The at least one file is the file indicated by the information of the at least one file contained in the target synchronization task.

708. The master node of the first HDFS cluster determines the address information of the source data blocks that store the at least one file.

709. The master node of the first HDFS cluster sends the address information of the source data blocks to the target copy execution device.

710. The target copy execution device determines the number of source data blocks according to the address information of the source data blocks.

In this step, if the master node of the first HDFS cluster directly returns the size of the at least one file to the target copy execution device, the number of source data blocks may instead be determined according to the size of the at least one file.

711. The target copy execution device sends the number of source data blocks to the master node of the second HDFS cluster.

712. The master node of the second HDFS cluster creates destination data blocks for the at least one file according to the number of source data blocks, and allocates address information for the destination data blocks.

713. The master node of the second HDFS cluster sends the address information of the destination data blocks to the target copy execution device.

714. After receiving the address information of the destination data blocks, the target copy execution device sends a synchronization indication message to the source data node in parallel for each source data block, where the synchronization indication message contains the address information of the source data block and the address information of the destination data block.

715. According to the address information of the source data blocks and the address information of the destination data blocks, the source data node synchronizes the file content of the at least one file contained in each source data block from the source data block to the destination data block.

The parallel synchronization process for data blocks can be understood with reference to FIG. 3.
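Steps 710 to 714 can be sketched as follows: derive the block count, allocate all destination blocks in one call, and build one synchronization indication message per source block. This is a sketch under assumptions that are not in the original: the function names, the string block addresses, the `dst-blk` prefix, and the dictionary message format are all hypothetical, and the block size is a free parameter.

```python
def count_source_blocks(file_size, block_size):
    """Step 710 variant: derive the block count from a file size.

    Uses integer ceiling division, so a partly filled last block still counts.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    return (file_size + block_size - 1) // block_size


def allocate_destination_blocks(count, prefix="dst-blk"):
    """Step 712 sketch: create all destination blocks at once, not one by one."""
    return [f"{prefix}-{i}" for i in range(count)]


def build_sync_messages(source_addrs, dest_addrs):
    """Step 714 sketch: one synchronization indication message per source block."""
    if len(source_addrs) != len(dest_addrs):
        raise ValueError("source and destination block counts differ")
    return [{"source": s, "destination": d} for s, d in zip(source_addrs, dest_addrs)]
```

Allocating by count in a single request is what lets the destination master node avoid a round trip per block; the messages can then be dispatched in parallel.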

It can be seen that the file synchronization solution provided by the embodiments of the present invention does not need to scan the entire file directory: the information of the files to be synchronized can be determined directly from the directory end operation number, which improves the efficiency of file synchronization. In addition, the master node can directly create the corresponding number of destination data blocks according to the number of source data blocks, without creating them one by one, which further improves efficiency. The source data node can also synchronize in parallel on a per-data-block basis, which improves efficiency still further. Finally, because files in this embodiment are synchronized directly from the source data node to the destination data node, the mapping/reduction device 10 need not participate as it must in the prior art. In the prior art, a map must finish one synchronization before it can receive the next task, and a slowly executing task causes a long-tail phenomenon. In this embodiment, after the target copy execution device sends the synchronization task to the source data node, the source data node synchronizes the files to be synchronized directly from the source data node to the destination data node, and the target copy execution device does not relay the data, so the long-tail problem of the prior art is also solved.

The above describes the file synchronization process in the embodiments of the present invention from the perspective of the system and the method. The embodiments of the present invention further provide a corresponding file synchronization apparatus, which is configured to implement the functions of the methods performed by the copy management device, the copy execution device, or the master node. The apparatus is implemented in software, and the software includes units corresponding to those functions. These units may include a receiving unit, a processing unit, and a sending unit, which are communicatively connected. The receiving unit implements the corresponding receiving functions, the sending unit implements the corresponding sending functions, and the processing unit implements the corresponding processing functions.

When the file synchronization apparatus is configured to implement the functions of the copy management device, the sending unit may perform steps 601 and 606 in the embodiment corresponding to FIG. 5 and steps 701 and 706 in the embodiment corresponding to FIG. 6. The receiving unit may perform steps 603 and 605 in the embodiment corresponding to FIG. 5 and steps 703 and 705 in the embodiment corresponding to FIG. 6. The processing unit may perform step 604 in the embodiment corresponding to FIG. 5 and step 704 in the embodiment corresponding to FIG. 6.

When the file synchronization apparatus is configured to implement the functions of the copy execution device, the receiving unit may perform step 607 in the embodiment corresponding to FIG. 5 and step 714 in the embodiment corresponding to FIG. 6. The sending unit may perform steps 707 and 711 in the embodiment corresponding to FIG. 6. The processing unit may perform step 710 in the embodiment corresponding to FIG. 6.

When the file synchronization apparatus is configured to implement the functions of the master node, the receiving unit may perform step 602 in the embodiment corresponding to FIG. 5 and step 702 in the embodiment corresponding to FIG. 6. The sending unit may perform steps 707, 709, and 713 in the embodiment corresponding to FIG. 6. The processing unit may perform steps 708 and 712 in the embodiment corresponding to FIG. 6.
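The division into receiving, processing, and sending units can be sketched for the copy management device as follows. This is only one possible arrangement, under assumptions that are not in the original: the class name, method names, and message dictionary are hypothetical, each unit is modeled as a method, and each task holds a single file for simplicity.

```python
class CopyManagementApparatus:
    """Hypothetical software apparatus playing the copy management role."""

    def __init__(self, last_end_op=0):
        self.last_end_op = last_end_op  # directory end operation number
        self.pending_tasks = []         # synchronization tasks not yet handed out

    def send_end_op(self):
        """Sending unit, step 601: message carrying the last end operation number."""
        return {"type": "end_op", "op_number": self.last_end_op}

    def receive_file_info(self, paths):
        """Receiving unit, step 603, followed by the processing unit, step 604."""
        # Assumption: one file per task; the source allows any grouping.
        self.pending_tasks = [[path] for path in paths]

    def handle_task_request(self):
        """Receiving unit, step 605, then sending unit, step 606.

        Returns the next task, or None when no task is pending.
        """
        return self.pending_tasks.pop(0) if self.pending_tasks else None
```

Because tasks are handed out only on request, a copy execution device that finishes early can simply ask again.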

Further, the copy management device, the copy execution device, or the master node in the foregoing embodiments may be presented in the form of functional modules. A "module" here may refer to an application-specific integrated circuit (ASIC), a circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or another component that can provide the functions described above. In a simple embodiment, each module can also be implemented by the host 800 in FIG. 7.

The host 800 may be a server, a mainframe, a minicomputer, or the like. FIG. 7 is a schematic structural diagram of a host according to an embodiment of the present invention. The host 800 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 822 (for example, one or more processors), a transceiver 860, a memory 832, and one or more storage media 830 (for example, one or more mass storage devices) that store application programs 842 or data. The memory 832 may be a volatile storage medium, and the storage medium 830 may be a non-volatile storage medium. The program stored in the storage medium 830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the host. Further, the central processing unit 822 may be configured to communicate with the storage medium 830 and to execute, on the host 800, the series of instruction operations stored in the storage medium 830.

The host 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™, and may also include the application programs 842.

The steps performed by the copy management device, the copy execution device, or the master node in the foregoing embodiments may be based on the host structure shown in FIG. 7.

The processor 820 executes program instructions that cause the host to perform the methods performed by the copy management device, the copy execution device, or the master node in the embodiments corresponding to FIG. 2, FIG. 5, and FIG. 6.

A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the storage system, apparatus, and units described above may be understood with reference to the corresponding processes in the foregoing method embodiments, and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

A person of ordinary skill in the art can understand that all or some of the steps of the methods in the foregoing embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a ROM, a RAM, a magnetic disk, an optical disc, or the like.

The file synchronization method, device, and system provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the foregoing embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

A file synchronization method, wherein the method is applied to a Hadoop distributed file system (HDFS), the HDFS includes a copy management device, at least one copy execution device, and multiple HDFS clusters, each HDFS cluster includes a master node and at least two data nodes, for each HDFS cluster the at least two data nodes store files and the master node maintains information of the files stored by the at least two data nodes in that cluster, the multiple HDFS clusters include a first HDFS cluster, and the method includes:

sending, by the copy management device, a directory end operation number of the previously synchronized files to the master node of the first HDFS cluster;

receiving, by the copy management device, information of files to be synchronized sent by the master node of the first HDFS cluster, where the information of the files to be synchronized is information of the files that correspond to directory operation numbers following the directory end operation number, as determined by the master node of the first HDFS cluster;

determining, by the copy management device, at least one synchronization task according to the information of the files to be synchronized, where each synchronization task contains information of at least one file among the files to be synchronized; and

after receiving a task request sent by a target copy execution device, sending, by the copy management device, a target synchronization task to the target copy execution device, where the target synchronization task is used by the target copy execution device to synchronize the at least one file from a source data node to a destination data node according to the information of the at least one file in the target synchronization task, and the source data node belongs to the first HDFS cluster.

The method according to claim 1, wherein the multiple HDFS clusters further include a second HDFS cluster, and the destination data node belongs to the second HDFS cluster;

the information of the at least one file corresponds to address information of the source data blocks that store the at least one file; and

the information of the at least one file is used by the target copy execution device to obtain, from the master node of the first HDFS cluster, the address information of the source data blocks corresponding to the information of the at least one file; the address information of the source data blocks is used by the target copy execution device to determine the number of source data blocks; the number of source data blocks is used by the target copy execution device to obtain address information of destination data blocks from the master node of the second HDFS cluster, where the address information of the destination data blocks is allocated for the at least one file by the master node of the second HDFS cluster according to the number of source data blocks; and the address information of the source data blocks and the address information of the destination data blocks are used by the target copy execution device to instruct the source data node to synchronize the at least one file from the source data blocks to the destination data blocks.

A file synchronization method, wherein the method is applied to a Hadoop distributed file system (HDFS), the HDFS includes a copy management device, at least one copy execution device, and multiple HDFS clusters, each HDFS cluster includes a master node and at least two data nodes, for each HDFS cluster the at least two data nodes store files and the master node maintains information of the files stored by the at least two data nodes in that cluster, the multiple HDFS clusters include a first HDFS cluster, and the method includes:

receiving, by the copy execution device, a target synchronization task sent by the copy management device, where the target synchronization task is one of at least one synchronization task determined by the copy management device according to information of files to be synchronized, each synchronization task contains information of at least one file among the files to be synchronized, and the information of the files to be synchronized is information of the files that correspond to directory operation numbers following a directory end operation number, as determined by the master node of the first HDFS cluster after the copy management device sends the directory end operation number of the previously synchronized files to the master node of the first HDFS cluster; and

synchronizing, by the copy execution device, the at least one file from a source data node to a destination data node according to the information of the at least one file in the target synchronization task, where the source data node belongs to the first HDFS cluster.

The method according to claim 3, wherein the multiple HDFS clusters include a second HDFS cluster, and the destination data node belongs to the second HDFS cluster;

the information of the at least one file corresponds to address information of the source data blocks that store the at least one file; and

the synchronizing, by the copy execution device, the at least one file from the source data node to the destination data node according to the information of the at least one file in the target synchronization task includes:

obtaining, by the target copy execution device according to the information of the at least one file, the address information of the source data blocks corresponding to the information of the at least one file from the master node of the first HDFS cluster;

determining, by the target copy execution device, the number of source data blocks according to the address information of the source data blocks;

obtaining, by the target copy execution device according to the number of source data blocks, address information of destination data blocks from the master node of the second HDFS cluster, where the address information of the destination data blocks is allocated for the at least one file by the master node of the second HDFS cluster according to the number of source data blocks; and

sending, by the target copy execution device, a synchronization indication message to the source data node, where the synchronization indication message contains the address information of the source data blocks and the address information of the destination data blocks, which are used by the source data node to synchronize the at least one file from the source data blocks to the destination data blocks.

The method according to claim 4, wherein the sending, by the copy execution device, the synchronization indication message to the source data node includes:

when the address information of the source data blocks indicates that there are multiple source data blocks, sending, by the copy execution device, the synchronization indication message to the source data node in parallel for each source data block.

A file synchronization method, wherein the method is applied to a Hadoop distributed file system (HDFS), the HDFS includes a copy management device, at least one copy execution device, and multiple HDFS clusters, each HDFS cluster includes a master node and at least two data nodes, for each HDFS cluster the at least two data nodes store files and the master node maintains information of the files stored by the at least two data nodes in that cluster, the multiple HDFS clusters include a first HDFS cluster, and when the master node belongs to the first HDFS cluster, the method includes:

receiving, by the master node, a directory end operation number of the previously synchronized data files sent by the copy management device;

determining, by the master node, from the directory operation numbers of the files, the directory operation numbers that follow the directory end operation number, and determining the information of the files to be synchronized that correspond to those subsequent directory operation numbers; and

sending, by the master node, the information of the files to be synchronized to the copy management device, where the information of the files to be synchronized is used by the copy management device to determine at least one synchronization task, each synchronization task contains information of at least one file among the files to be synchronized, the information of the at least one file is used by the target copy execution device to synchronize the at least one file from a source data node to a destination data node, and the source data node belongs to the first HDFS cluster.

The method according to claim 6, wherein the information of the at least one file corresponds to address information of the source data blocks that store the at least one file; and

after the master node sends the information of the files to be synchronized to the copy management device, the method further includes:

receiving, by the master node, the information of the at least one file sent by the target copy execution device;

determining, by the master node, the address information of the source data blocks corresponding to the information of the at least one file; and

sending, by the master node, the address information of the source data blocks to the target copy execution device, where the address information of the source data blocks is used by the target copy execution device to determine the number of source data blocks.
The method according to claim 7, wherein the plurality of HDFS clusters further comprise a second HDFS cluster, the destination data node belongs to the second HDFS cluster, and when the master node belongs to the second HDFS cluster, the method further comprises:

receiving, by the master node, the number of source data blocks sent by the target copy execution device;

creating, by the master node, a destination data block for the at least one file according to the number of source data blocks, and allocating address information to the destination data block; and

sending, by the master node, the address information of the destination data block to the target copy execution device, where the address information of the source data block and the address information of the destination data block are used by the target copy execution device to instruct the source data node to synchronize the at least one file from the source data block to the destination data block.
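As a rough sketch of the allocation step on the second cluster's master node: one destination block is created per source block, and each is given an address. The class name `DestinationMasterNode`, the round-robin placement over data nodes, and the `blk_N` identifiers are assumptions made for illustration; the claim does not specify a placement policy.

```python
import itertools


class DestinationMasterNode:
    """Hypothetical master node of the second HDFS cluster."""

    def __init__(self, data_nodes):
        # Assumption: destination blocks are placed round-robin over data nodes.
        self._nodes = itertools.cycle(data_nodes)
        self._block_ids = itertools.count(1)
        self.files = {}  # file path -> list of (data node, block id)

    def allocate(self, path, source_block_count):
        """Create one destination block per source block and return the addresses."""
        blocks = [
            (next(self._nodes), f"blk_{next(self._block_ids)}")
            for _ in range(source_block_count)
        ]
        self.files[path] = blocks
        return blocks


master = DestinationMasterNode(["dn1", "dn2"])
allocated = master.allocate("/data/a.txt", 3)
print(allocated)
```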
A copy management device, wherein the copy management device is applied to a Hadoop distributed file system (HDFS), the HDFS further comprises at least one copy execution device and a plurality of HDFS clusters, each HDFS cluster comprises a master node and at least two data nodes, for each HDFS cluster the at least two data nodes store files and the master node maintains information about the files stored by the at least two data nodes in that cluster, the plurality of HDFS clusters comprise a first HDFS cluster, and the copy management device comprises a transceiver, a processor, and a memory that are connected by a bus;

wherein the memory is configured to store a program by which the processor performs file synchronization;

the transceiver is configured to send, to the master node of the first HDFS cluster, the last directory operation number of the previously synchronized file, and to receive information about a file to be synchronized sent by the master node of the first HDFS cluster, where the information about the file to be synchronized is information, determined by the master node of the first HDFS cluster, about the file corresponding to a directory operation number subsequent to the last directory operation number;

the processor is configured to determine at least one synchronization task according to the information about the file to be synchronized, where each synchronization task includes information about at least one file among the files to be synchronized; and

the transceiver is further configured to send, after receiving a task request sent by a target copy execution device, a target synchronization task to the target copy execution device, where the target synchronization task is used by the target copy execution device to synchronize, according to the information about at least one file in the target synchronization task, the at least one file from a source data node to a destination data node, and the source data node belongs to the first HDFS cluster.
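The task handling described for the copy management device might look like the sketch below. The class `CopyManager`, the fixed `files_per_task` grouping, and the first-in-first-out hand-out order are illustrative assumptions; the claim only requires that each task contain information about at least one file and that a task be sent in response to a task request.

```python
from collections import deque


class CopyManager:
    """Hypothetical copy management device that splits files into tasks."""

    def __init__(self, files_per_task=2):
        self.files_per_task = files_per_task  # assumed grouping policy
        self.last_op_number = 0               # last directory operation number synced
        self.tasks = deque()

    def plan(self, file_infos, new_last_op_number):
        """Group the files to be synchronized into tasks of at least one file each."""
        for start in range(0, len(file_infos), self.files_per_task):
            self.tasks.append(file_infos[start:start + self.files_per_task])
        self.last_op_number = new_last_op_number

    def handle_task_request(self):
        """Return the next task for a requesting copy execution device, or None."""
        return self.tasks.popleft() if self.tasks else None


manager = CopyManager(files_per_task=2)
manager.plan(["/data/a.txt", "/data/b.txt", "/data/c.txt"], new_last_op_number=7)
first_task = manager.handle_task_request()
second_task = manager.handle_task_request()
print(first_task, second_task)
```

Because tasks are pulled by the copy execution devices rather than pushed, an idle device naturally picks up the next task, which is one plausible reading of the request-driven hand-out in the claim.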
A copy execution device, wherein the copy execution device is applied to a Hadoop distributed file system (HDFS), the HDFS further comprises a copy management device and a plurality of HDFS clusters, each HDFS cluster comprises a master node and at least two data nodes, for each HDFS cluster the at least two data nodes store files and the master node maintains information about the files stored by the at least two data nodes in that cluster, the plurality of HDFS clusters comprise a first HDFS cluster, and the copy execution device comprises a transceiver, a processor, and a memory that are connected by a bus;

wherein the memory is configured to store a program by which the processor performs file synchronization;

the transceiver is configured to receive a target synchronization task sent by the copy management device, where the target synchronization task is one of at least one synchronization task determined by the copy management device according to information about a file to be synchronized, each synchronization task includes information about at least one file among the files to be synchronized, and the information about the file to be synchronized is information, determined by the master node of the first HDFS cluster after the copy management device sends the last directory operation number of the previously synchronized file to the master node of the first HDFS cluster, about the file corresponding to a directory operation number subsequent to the last directory operation number; and

the processor is configured to synchronize, according to the information about at least one file in the target synchronization task, the at least one file from a source data node to a destination data node, where the source data node belongs to the first HDFS cluster.
The copy execution device according to claim 10, wherein the plurality of HDFS clusters comprise a second HDFS cluster, the destination data node belongs to the second HDFS cluster, and the information about the at least one file corresponds to address information of a source data block that stores the at least one file;

the processor is specifically configured to:

obtain, from the master node of the first HDFS cluster according to the information about the at least one file, the address information of the source data block corresponding to the information about the at least one file;

determine the number of source data blocks according to the address information of the source data block; and

obtain, from the master node of the second HDFS cluster according to the number of source data blocks, address information of a destination data block, where the address information of the destination data block is allocated by the master node of the second HDFS cluster to the at least one file according to the number of source data blocks; and

the transceiver is further configured to send a synchronization indication message to the source data node, where the synchronization indication message includes the address information of the source data block and the address information of the destination data block, and the address information of the source data block and the address information of the destination data block are used by the source data node to synchronize the at least one file from the source data block to the destination data block.
The copy execution device according to claim 11, wherein

the transceiver is specifically configured to send, when the address information of the source data block indicates that there are a plurality of source data blocks, the synchronization indication message to the source data node in parallel for each source data block.
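Sending one synchronization indication per source block in parallel can be sketched with a thread pool. This is only an illustration under stated assumptions: `send_sync_instruction` is a hypothetical placeholder that returns a string instead of performing a network call, and the one-to-one pairing of source and destination blocks by position is assumed rather than taken from the claims.

```python
from concurrent.futures import ThreadPoolExecutor


def send_sync_instruction(pair):
    """Placeholder for sending one synchronization indication message.

    A real device would send the pair to the source data node over the network;
    here the message is only rendered as a string.
    """
    source_block, destination_block = pair
    return f"{source_block}->{destination_block}"


def synchronize_blocks(source_blocks, destination_blocks):
    """Send one instruction per source block in parallel, keeping block order."""
    if len(source_blocks) != len(destination_blocks):
        raise ValueError("each source block needs exactly one destination block")
    pairs = list(zip(source_blocks, destination_blocks))
    # One worker per block; max(1, ...) keeps the pool valid for an empty list.
    with ThreadPoolExecutor(max_workers=max(1, len(pairs))) as pool:
        return list(pool.map(send_sync_instruction, pairs))


acks = synchronize_blocks(["src_1", "src_2"], ["dst_1", "dst_2"])
print(acks)
```

`ThreadPoolExecutor.map` returns results in input order, so the acknowledgements line up with the source blocks even though the sends run concurrently.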
A master node, wherein the master node is applied to a Hadoop distributed file system (HDFS), the HDFS comprises a copy management device, at least one copy execution device, and a plurality of HDFS clusters, each HDFS cluster comprises the master node and at least two data nodes, for each HDFS cluster the at least two data nodes store files and the master node maintains information about the files stored by the at least two data nodes in that cluster, the plurality of HDFS clusters comprise a first HDFS cluster, and the master node comprises a transceiver, a processor, and a memory that are connected by a bus;

when the master node belongs to the first HDFS cluster,

the transceiver is configured to receive the last directory operation number of the previously synchronized data file sent by the copy management device;

the processor is configured to determine, from the directory operation numbers of files, a directory operation number subsequent to the last directory operation number, and to determine the information about the file to be synchronized that corresponds to the subsequent directory operation number; and

the transceiver is further configured to send the information about the file to be synchronized to the copy management device, where the information about the file to be synchronized is used by the copy management device to determine at least one synchronization task, each synchronization task includes information about at least one file among the files to be synchronized, the information about the at least one file is used by the target copy execution device to synchronize the at least one file from a source data node to a destination data node, and the source data node belongs to the first HDFS cluster.
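The selection of files by directory operation number can be illustrated as follows. The sketch assumes that directory operation numbers are monotonically increasing integers and that "subsequent to" means strictly greater than the last synchronized number; `DirectoryOperation` and `files_to_synchronize` are hypothetical names, not part of HDFS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryOperation:
    op_number: int  # assumed: monotonically increasing directory operation number
    file_path: str  # file affected by the operation


def files_to_synchronize(op_log, last_op_number):
    """Return (file paths, new last operation number) for operations after last_op_number."""
    pending = sorted(
        (op for op in op_log if op.op_number > last_op_number),
        key=lambda op: op.op_number,
    )
    # Deduplicate paths while keeping first-seen order.
    paths = list(dict.fromkeys(op.file_path for op in pending))
    new_last = pending[-1].op_number if pending else last_op_number
    return paths, new_last


op_log = [
    DirectoryOperation(1, "/data/a.txt"),
    DirectoryOperation(2, "/data/b.txt"),
    DirectoryOperation(3, "/data/a.txt"),
    DirectoryOperation(4, "/data/c.txt"),
]
paths, new_last = files_to_synchronize(op_log, last_op_number=2)
print(paths, new_last)
```

Only the operations recorded after the last synchronized number are inspected, which is why the full file directory does not need to be scanned.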
The master node according to claim 13, wherein

the transceiver is further configured to receive, after sending the information about the file to be synchronized to the copy management device, the information about the at least one file sent by the target copy execution device;

the processor is further configured to determine address information of a source data block corresponding to the information about the at least one file, where the information about the at least one file corresponds to the address information of the source data block that stores the at least one file; and

the transceiver is further configured to send the address information of the source data block to the target copy execution device, where the address information of the source data block is used by the target copy execution device to determine the number of source data blocks.
The master node according to claim 13, wherein the plurality of HDFS clusters further comprise a second HDFS cluster, the destination data node belongs to the second HDFS cluster, and when the master node belongs to the second HDFS cluster:

the transceiver is further configured to receive the number of source data blocks sent by the target copy execution device;

the processor is further configured to create a destination data block for the at least one file according to the number of source data blocks, and to allocate address information to the destination data block; and

the transceiver is further configured to send the address information of the destination data block to the target copy execution device, where the address information of the source data block and the address information of the destination data block are used by the target copy execution device to instruct the source data node to synchronize the at least one file from the source data block to the destination data block.
A Hadoop distributed file system, comprising: a copy management device, at least one copy execution device, and a plurality of HDFS clusters, wherein each HDFS cluster comprises a master node and at least two data nodes, for each HDFS cluster the at least two data nodes store files and the master node maintains information about the files stored by the at least two data nodes in that cluster, and the plurality of HDFS clusters comprise a first HDFS cluster;

the copy management device is the copy management device according to claim 9;

the copy execution device is the copy execution device according to any one of claims 10 to 12; and

the master node is the master node according to any one of claims 13 to 15.