CN106649676B - HDFS (Hadoop Distributed File System)-based deduplication method and device for stored files - Google Patents


Info

Publication number
CN106649676B
Authority
CN
China
Prior art keywords
file
identifier
storage node
deduplicated
storage
Prior art date: 2016-12-15
Legal status: Active (granted)
Application number: CN201611159251.XA
Other languages: Chinese (zh)
Other versions: CN106649676A
Inventor: 张为锋 (Zhang Weifeng)
Current Assignee: Beijing Ruian Technology Co Ltd
Original Assignee: Beijing Ruian Technology Co Ltd
Application filed by Beijing Ruian Technology Co Ltd; priority and filing date: 2016-12-15 (CN201611159251.XA)
Publication of CN106649676A: 2017-05-10
Application granted; publication of CN106649676B: 2020-06-19

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/1748: De-duplication implemented within the file system, e.g. based on file segments


Abstract

Embodiments of the invention disclose an HDFS (Hadoop Distributed File System)-based method and device for deduplicating stored files. The method comprises the following steps: comparing the file fingerprint of a file to be deduplicated with the file fingerprints of stored files; if an identical fingerprint is found, calculating a link identifier according to the file identifier of the file to be deduplicated; and replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and storing the result in the storage node as the key value of the file identifier of the file to be deduplicated. This scheme effectively removes files with duplicate content, reduces the number of files, saves storage space, and improves system performance.

Description

HDFS (Hadoop Distributed File System)-based deduplication method and device for stored files
Technical Field
Embodiments of the invention relate to unstructured data storage technology, and in particular to an HDFS (Hadoop Distributed File System)-based method and device for deduplicating stored files.
Background
The Hadoop Distributed File System (HDFS) provides reliable storage for very-large-scale data sets and is designed around a "write once, read many times" access pattern, delivering high-bandwidth data streams to user applications. HDFS is highly fault-tolerant and can run on low-cost hardware clusters. It adopts a MASTER/SLAVE architecture: an HDFS cluster consists of one Namenode (management node) and several Datanode (storage) nodes. The management node is a central server responsible for managing the file system's metadata and the clients' access to files. Because the management node holds the metadata of all files in memory, its memory capacity limits the number of files. By default, HDFS splits a file into storage blocks, e.g. 64 MB per block, stores each block on a storage node as a key-value pair, and keeps the mapping of the pair in memory. Every file, storage block, and index directory is held in the management node's memory as an object, each occupying about 150 bytes. For example, with 1,000,000 small files, each occupying one storage block, every file contributes a file object and a block object, so the management node needs at least 2 × 150 bytes × 1,000,000, i.e. about 300 MB of memory; storing 100 million or more files requires 20 GB or more. One solution is to build a supporting in-memory database for the cluster, but that increases system cost. Too many small files therefore consume excessive memory resources and degrade cluster performance, so small files must be merged to reduce the file count.
In actual Internet applications, however, small files exist in huge numbers. In particular, the rise of social websites such as blogs, microblogs, and Facebook has changed how content is stored on the Internet. Users have essentially become the creators of Internet content, and their data is massive, diverse, and dynamically changing, generating vast numbers of small files such as status files, user data, and avatars. By storage format, this data can be divided into structured and unstructured data. Structured data shares a common hierarchical, grid-like structure and can be described by numbers or words; information that cannot be represented numerically or by a uniform structure, such as scanned images, faxes, photographs, computer-generated reports, word-processed documents, spreadsheets, presentations, and audio and video, is unstructured data. After structured information is extracted from unstructured data, the original file still needs to be kept for subsequent use.
In many fields the proportion of unstructured data is far higher than that of structured data. The volume of unstructured information is very large; storing it directly in a database greatly inflates database capacity and reduces maintenance and application efficiency. Unstructured data obtained from the Internet is often repetitive: a hot event attracts wide attention within a short time, so a small set of unstructured items is copied many times over, occupying system storage space. The prior art compresses data by a certain ratio using compression techniques, but unstructured data has no strict structure and is harder to standardize and manage than structured information. At present, after massive unstructured small files stored in HDFS are merged into large files using the MapFile technique, no further compression is applied and the occupied storage space remains large, so how to remove duplicate content from massive unstructured data and save storage space is a problem in urgent need of a solution.
Disclosure of Invention
Embodiments of the invention provide an HDFS (Hadoop Distributed File System)-based method and device for deduplicating stored files, so that when HDFS stores massive small unstructured files, duplicate files are effectively removed and storage space is saved.
In a first aspect, an embodiment of the present invention provides a method for deduplicating files stored on an HDFS, including:
comparing the file fingerprint of the file to be deduplicated with the file fingerprint of the stored file;
if the fingerprints are the same, calculating a link identifier according to the file identifier of the file to be deduplicated;
and replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and storing the result in the storage node as the key value of the file identifier of the file to be deduplicated.
Preferably, before comparing the file fingerprint of the file to be deduplicated with the file fingerprint of the stored file, the method further includes:
storing received files into a set area in the storage node and marking the area as a non-deduplicated area;
and acquiring files one by one from the non-deduplicated area as files to be deduplicated.
Preferably, the storing the received file in the set area of the storage node includes:
generating a primary key for the received file as a file identifier;
and converting the file content of the file into binary data, and storing the binary data and the file identifier into a set area in the storage node correspondingly.
Preferably, the storing the received file in the set area of the storage node includes:
and storing the received file into different set areas in the storage node according to the receiving date of the file.
Preferably, calculating the link identifier according to the file identifier of the file to be deduplicated includes:
and calculating a 32-bit MD5 value of the file identifier of the file to be deduplicated as the link identifier.
Preferably, after the file content of the file to be deduplicated is replaced with the link identifier and the storage address of the identical stored file in the storage node, and the result is stored in the storage node as the key value of the file identifier of the file to be deduplicated, the method further includes:
rewriting the index file of the storage node according to each file identifier and the storage position of its corresponding key value in the storage node.
Preferably, the method further comprises:
acquiring the file identifier of a file to be read according to a received file reading request;
calculating the corresponding link identifier according to the file identifier;
reading the set-position data of the corresponding key value from the storage node according to the file identifier;
if the link identifier matches the set-position data, reading a storage address from the key value;
and locating the corresponding file in the storage node according to the storage address, and responding to the file reading request after reading.
In a second aspect, an embodiment of the present invention further provides an HDFS-based stored-file deduplication apparatus, including:
the fingerprint comparison module is used for comparing the file fingerprint of the file to be deduplicated with the file fingerprint of the stored file;
the link identifier calculation module is used for calculating a link identifier according to the file identifier of the file to be deduplicated if the comparison results are the same;
and the content replacing module is used for replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and storing the result in the storage node as the key value of the file identifier of the file to be deduplicated.
Preferably, the apparatus further comprises:
the file storage module is used for storing received files into a set area in the storage node and marking the area as a non-deduplicated area before the file fingerprint of the file to be deduplicated is compared with the file fingerprint of the stored file;
and the file acquisition module is used for acquiring files one by one from the non-deduplicated area as files to be deduplicated.
Preferably, the file storage module includes:
a primary key generating unit, configured to generate a primary key for the received file as a file identifier;
and the content conversion unit is used for converting the file content of the file into binary data and storing the binary data and the file identifier into a set area in the storage node correspondingly.
Preferably, the file storage module is specifically configured to:
and storing the received file into different set areas in the storage node according to the receiving date of the file.
Preferably, the link identifier calculation module is specifically configured to:
and calculating a 32-bit MD5 value of the file identifier of the file to be deduplicated as the link identifier.
Preferably, the apparatus further comprises:
and the index rewriting module is used for rewriting the index file of the storage node according to each file identifier and the storage position of its corresponding key value, after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address of the identical stored file in the storage node and the result has been stored in the storage node as the key value of the file identifier of the file to be deduplicated.
Preferably, the apparatus further comprises:
the file identifier reading module is used for acquiring the file identifier of the file to be read according to the received file reading request;
the corresponding identifier calculating module is used for calculating the corresponding link identifier according to the file identifier;
the set-position data reading module is used for reading the set-position data of the corresponding key value from the storage node according to the file identifier;
the matching module is used for reading a storage address from the key value if the link identifier matches the set-position data;
and the file searching module is used for locating the corresponding file in the storage node according to the storage address and responding to the file reading request after reading.
For massive unstructured files in HDFS with identical content, embodiments of the invention keep only one copy: the content of any file whose fingerprint matches a stored file is deleted and replaced with a link identifier and a link address. This effectively removes files with duplicate content, reduces the number of files, saves a large amount of storage space, releases memory resources, and improves system performance, while still satisfying the requirements of fast storage and correct reading.
Drawings
Fig. 1A is a flowchart of an HDFS-based stored-file deduplication method according to a first embodiment of the present invention;
Fig. 1B is a schematic diagram of an HDFS-based stored-file deduplication method according to the first embodiment of the present invention;
Fig. 2 is a flowchart of an HDFS-based stored-file deduplication method according to a second embodiment of the present invention;
Fig. 3 is a flowchart of an HDFS-based stored-file deduplication method according to a third embodiment of the present invention;
Fig. 4A is a schematic structural diagram of an HDFS-based stored-file deduplication apparatus according to a fourth embodiment of the present invention;
Fig. 4B is a schematic structural diagram of an HDFS-based stored-file deduplication apparatus according to the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Embodiment 1
Fig. 1A is a flowchart of an HDFS-based stored-file deduplication method according to the first embodiment of the present invention. This embodiment is applicable to a Hadoop distributed file system, which generally includes a management node and a plurality of storage nodes. The method can be executed by an HDFS-based stored-file deduplication device, which can be implemented in software and/or hardware and is generally integrated in the management node of the Hadoop distributed file system.
The method of the first embodiment of the invention specifically comprises the following steps:
s101, comparing the file fingerprint of the file to be deduplicated with the file fingerprint of the stored file.
The file to be deduplicated is a received file. The file may first be stored in a storage node and then deduplicated offline, or deduplication may be performed online as the file is received. Because online deduplication occupies substantial resources, runs slowly, and lengthens response time, offline deduplication is preferred: files that have not yet been deduplicated are extracted from the storage node as files to be deduplicated.
Specifically, the file fingerprint is calculated from the content of each file; as long as the content does not change, the fingerprint remains the same regardless of changes to the file name. If the content of the file to be deduplicated is identical to the content of a stored file, the calculated fingerprints are identical. The file fingerprint may be computed with the Message-Digest Algorithm 5 (MD5), the Secure Hash Algorithm 1 (SHA-1), or a Cyclic Redundancy Check (CRC32). The MD5 value is highly discriminative: a small change in the original content produces a large change in the MD5 value, so reliability is high. In this embodiment, it is preferable to take the first 1 KB and the last 1 KB of the file's binary data, compute an MD5 value over them, and use the result as the file fingerprint.
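For illustration, this fingerprinting step can be sketched in Java as follows; the class name and file-access details are assumptions, not part of the claimed method:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch: file fingerprint = MD5 over the first 1 KB and last 1 KB of the file.
    public final class FileFingerprint {

        private static final int SAMPLE = 1024; // 1 KB sampled from each end

        public static String fingerprint(String path)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
                byte[] head = new byte[(int) Math.min(SAMPLE, f.length())];
                f.readFully(head);
                md5.update(head);
                if (f.length() > SAMPLE) { // avoid re-hashing the same bytes for tiny files
                    byte[] tail = new byte[(int) Math.min(SAMPLE, f.length() - SAMPLE)];
                    f.seek(f.length() - tail.length);
                    f.readFully(tail);
                    md5.update(tail);
                }
            }
            return toHex(md5.digest()); // 32 hexadecimal characters
        }

        static String toHex(byte[] digest) {
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }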
In the offline state, the file fingerprints of files to be deduplicated are compared with the fingerprints of stored files at regular intervals. After 00:00 each day, the MapReduce computing model of the Hadoop distributed file system is used to compare, offline, the fingerprints of the files to be deduplicated with the fingerprints of the stored files, to screen out files to be deduplicated whose content is identical to a stored file, and to obtain the corresponding stored file and its storage address in the data storage node.
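For illustration, such an offline comparison can be sketched in the MapReduce model as follows (class names and the exact job wiring are assumptions): the mapper emits (fingerprint, file identifier) pairs, and the reducer keeps one file per fingerprint while emitting the rest as duplicate-to-canonical mappings:

    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch: group stored files by fingerprint; files after the first in a group are duplicates.
    public class OfflineDedupJob {

        static final int SAMPLE = 1024;

        // Input: (file identifier, file bytes) pairs read from the non-deduplicated area.
        public static class FingerprintMapper extends Mapper<Text, BytesWritable, Text, Text> {
            @Override
            protected void map(Text fileKey, BytesWritable content, Context ctx)
                    throws IOException, InterruptedException {
                byte[] data = content.copyBytes();
                try {
                    MessageDigest md5 = MessageDigest.getInstance("MD5");
                    md5.update(data, 0, Math.min(SAMPLE, data.length)); // first 1 KB
                    if (data.length > SAMPLE) {                          // last 1 KB
                        int tail = Math.min(SAMPLE, data.length - SAMPLE);
                        md5.update(data, data.length - tail, tail);
                    }
                    StringBuilder fp = new StringBuilder(32);
                    for (byte b : md5.digest()) fp.append(String.format("%02x", b));
                    ctx.write(new Text(fp.toString()), new Text(fileKey));
                } catch (NoSuchAlgorithmException e) {
                    throw new IOException(e);
                }
            }
        }

        // The first file of each fingerprint group is retained; the rest link to it.
        public static class DuplicateReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text fingerprint, Iterable<Text> fileKeys, Context ctx)
                    throws IOException, InterruptedException {
                Text canonical = null;
                for (Text key : fileKeys) {
                    if (canonical == null) {
                        canonical = new Text(key);     // the retained copy
                    } else {
                        ctx.write(key, canonical);     // duplicate -> file to link to
                    }
                }
            }
        }
    }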
S102, if the comparison results are the same, calculating a link identifier according to the file identifier of the file to be deduplicated.
Specifically, when a file is written into the Hadoop distributed file system, it is stored in a mapping file (MapFile) as a key-value pair (Key-Value). The primary Key is the file identifier, a character string allocated at storage time that uniquely identifies the file; the Value is the binary data corresponding to the Key, i.e. all binary data of the file content. If the fingerprint comparison results are the same, a link identifier is calculated according to the file identifier Key of the file to be deduplicated; the link identifier serves as a special mark for files that have been deduplicated. In the file reading stage, if the link identifier, rather than actual binary data, is read from the key value of a file, the file has been deduplicated. If the fingerprint comparison results differ, the file's content differs from that of all stored files, so its content is retained and no deduplication is performed.
Preferably, step S102 includes:
and calculating a 32-bit MD5 value of the file identifier of the file to be deduplicated as the link identifier.
In this embodiment, a 32-character MD5 value (commonly called a 32-bit MD5 value) is calculated from the file identifier Key of the file to be deduplicated and used as the link identifier. Analogous to an encryption step, the deduplicated file is thus specially marked; in response to a file reading request, the same 32-character MD5 value is recomputed from the file identifier and compared, so whether the file has been deduplicated can be recognized in the reading stage.
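For illustration, the link-identifier computation can be sketched as follows (the class name is an assumption; MessageDigest is the standard JDK API for MD5):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch: link identifier = 32-hex-character MD5 digest of the file identifier (Key).
    public final class LinkIdentifier {
        public static String of(String fileKey) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(fileKey.getBytes(StandardCharsets.UTF_8));
                StringBuilder sb = new StringBuilder(32);
                for (byte b : digest) sb.append(String.format("%02x", b));
                return sb.toString(); // e.g. "9e107d9d372bb6826bd81d3542a419d6"
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e); // every JDK ships MD5
            }
        }
    }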
S103, replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and storing the result in the storage node as the key value of the file identifier of the file to be deduplicated.
In this embodiment, the key value of a deduplicated file no longer stores the binary data of the file content; instead, it stores the link identifier and the storage address, and the content stored at that address is exactly the same as the content of the deduplicated file. As shown in fig. 1B, assuming the content of the file corresponding to Key2 is the same as that of a stored file, the binary data corresponding to Key2's content is removed, and the link identifier and storage address, i.e. the 32-character MD5 value and the actual address of the identical stored file, are written in its place, completing the replacement of the content of the file to be deduplicated. The contents of the files corresponding to Key1 and Key3 differ from all stored files, so their binary data is retained.
Preferably, after step S103, the method further includes:
rewriting the index file of the storage node according to each file identifier and the storage position of its corresponding key value in the storage node.
Specifically, data is located quickly through an index file. Since the key value data corresponding to some file identifiers in the storage node has been replaced, the original index file no longer correctly represents the new mapping relationships, and the index file of the storage node must be rewritten according to each file identifier and the storage position of its corresponding key value after replacement.
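For illustration, the layout of the replacement key value, the 32-character link identifier followed by the storage address, can be sketched as follows (the concatenated format and class name are assumptions based on fig. 1B):

    import java.nio.charset.StandardCharsets;

    // Sketch: value stored for a deduplicated file =
    //   [32-char link identifier][storage address of the identical stored file]
    public final class DedupValue {

        public static byte[] encode(String linkId, String storageAddress) {
            // linkId is assumed to be exactly 32 hex characters (see LinkIdentifier above).
            return (linkId + storageAddress).getBytes(StandardCharsets.UTF_8);
        }

        // Returns the storage address if the value starts with the expected link
        // identifier, or null when the value is ordinary (non-deduplicated) content.
        public static String decode(byte[] value, String expectedLinkId) {
            String s = new String(value, StandardCharsets.UTF_8);
            if (s.length() > 32 && s.startsWith(expectedLinkId)) {
                return s.substring(32); // strip the leading 32 characters
            }
            return null;
        }
    }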
According to the HDFS-based stored-file deduplication method provided by this embodiment of the invention, file fingerprints are compared and data is deduplicated in an offline state, so the processing window can be extended appropriately, system reliability is increased, memory resources are saved, hardware requirements are lowered, and substantial equipment cost is saved, while files with duplicate content are effectively removed, the number of files is reduced, and storage space is saved.
Embodiment 2
Fig. 2 is a flowchart of an HDFS-based stored-file deduplication method according to a second embodiment of the present invention. This embodiment optimizes and extends the first embodiment and further illustrates how the offline deduplication operation is performed. As shown in fig. 2, the second embodiment specifically includes:
s201, storing the received file into a set area in the storage node, and marking the file as an unremoved processing area.
In this embodiment, the Hadoop distributed file system includes a plurality of mapping files, which are used to archive massive unstructured small files and to generate the mapping relationships for the archived files. The system continuously receives and caches files; when the cache occupancy reaches a capacity threshold or the receiving time reaches a preset limit, the system writes the files into the mapping files of the storage nodes in the order the unstructured files were received and marks them as belonging to the non-deduplicated area. The capacity threshold may range from 128 MB to 2 GB, the preset time limit from 5 to 20 minutes, and writing may be performed concurrently by multiple threads to guarantee write speed.
Preferably, step S201 includes:
generating a primary key for the received file as a file identifier;
and converting the file content of the file into binary data, and storing the binary data and the file identifier into a set area in the storage node correspondingly.
Specifically, the system generates a primary Key for each received file and uses it as the file identifier for index storage, converts the file content into binary data used as the key Value corresponding to the primary Key, and stores the primary Key and its Value into the mapping file of the storage node as a Key-Value pair.
Preferably, step S201 further includes:
and storing the received file into different set areas in the storage node according to the receiving date of the file.
Specifically, when storing files, received files are stored into different mapping files in the storage node according to their receiving dates. For each day's written files, a new directory is created in the Hadoop distributed file system, so files are partitioned and stored by day.
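For illustration, this daily write path can be sketched with Hadoop's MapFile API as follows (the "/archive/yyyy-MM-dd" directory layout and the sorted in-memory cache are assumptions):

    import java.io.IOException;
    import java.time.LocalDate;
    import java.util.Map;
    import java.util.SortedMap;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    // Sketch: flush cached files into a per-day MapFile as (primary Key, binary content) pairs.
    public class DailyMapFileWriter {

        public static void flush(Configuration conf, SortedMap<String, byte[]> cached)
                throws IOException {
            // One set area (MapFile directory) per receiving date; the layout is an assumption.
            Path dir = new Path("/archive/" + LocalDate.now());
            try (MapFile.Writer writer = new MapFile.Writer(conf, dir,
                    MapFile.Writer.keyClass(Text.class),
                    MapFile.Writer.valueClass(BytesWritable.class))) {
                // MapFile requires keys in sorted order, hence the SortedMap of cached files.
                for (Map.Entry<String, byte[]> e : cached.entrySet()) {
                    writer.append(new Text(e.getKey()), new BytesWritable(e.getValue()));
                }
            }
        }
    }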
S202, acquiring files one by one from the non-deduplicated area as files to be deduplicated.
S203, comparing the file fingerprint of the file to be deduplicated with the file fingerprint of the stored file.
S204, if the comparison results are the same, calculating a link identifier according to the file identifier of the file to be deduplicated.
S205, replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and storing the result in the storage node as the key value of the file identifier of the file to be deduplicated.
According to the HDFS-based stored-file deduplication method provided by this embodiment of the invention, files are partitioned and stored according to their receiving dates, which facilitates offline processing; files stored on the current day are temporarily not deduplicated, which guarantees storage efficiency, satisfies the requirement for fast data storage, and improves the real-time performance of data storage.
Embodiment 3
Fig. 3 is a flowchart of an HDFS-based stored-file deduplication method according to a third embodiment of the present invention. This embodiment optimizes and extends the second embodiment and further illustrates how file content is obtained after a file has been deduplicated. As shown in fig. 3, the third embodiment specifically includes:
s301, acquiring a file identifier of a file to be read according to the received file reading request.
And S302, calculating a corresponding link identifier according to the file identifier.
And S303, reading the set bit data of the corresponding key value from the storage node according to the file identifier.
S304, if the link identification is matched with the set bit data through comparison, reading a storage address from the key value.
S305, positioning and searching a corresponding file in the storage node according to the storage address, and responding to the file reading request after reading.
In this embodiment, the process of obtaining file content hides the internal processing flow from the file user. The system obtains the file identifier (primary Key) of the file to be read from the received file reading request and calculates the corresponding link identifier from the primary Key, e.g. as an MD5 value. It then reads the set-position data, i.e. the leading 32 characters, of the corresponding key value from the storage node according to the primary Key and compares it with the MD5 value of the link identifier. If they match, the file to be read has been deduplicated: what is stored is not the real file content but the storage address of the identical stored file, so the leading 32 characters are stripped, the storage address is read from the key value, the corresponding file is located in the storage node according to that address, and the file reading request is answered after reading it. If the MD5 value of the link identifier does not match the leading 32 characters, the file to be read has not been deduplicated, and the file reading request is answered after the file content is read directly from the key value.
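For illustration, this read path can be sketched as follows, reusing the LinkIdentifier and DedupValue sketches above; the MapFile.Reader lookup is standard Hadoop API, while the "directory:key" address format used to follow the link is an assumption:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    // Sketch of the read path: the dedup indirection is invisible to the caller.
    public class DedupReader {

        public static byte[] read(Configuration conf, String mapFileDir, String fileKey)
                throws IOException {
            try (MapFile.Reader reader = new MapFile.Reader(new Path(mapFileDir), conf)) {
                BytesWritable value = new BytesWritable();
                if (reader.get(new Text(fileKey), value) == null) {
                    return null;                              // unknown file identifier
                }
                String expected = LinkIdentifier.of(fileKey); // recompute the link identifier
                String address = DedupValue.decode(value.copyBytes(), expected);
                if (address == null) {
                    return value.copyBytes();                 // not deduplicated: real content
                }
                // Deduplicated: follow the stored address once to the identical file.
                String[] parts = address.split(":", 2);       // assumed "directory:key" format
                return read(conf, parts[0], parts[1]);
            }
        }
    }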
According to the HDFS-based stored-file deduplication method provided by the third embodiment of the invention, only the corresponding storage address is stored for a duplicated unstructured file, and the internal processing flow is hidden from the reader when the file is read, which satisfies the requirement of correct reading, saves storage space, and improves system performance.
Embodiment 4
Fig. 4A is a schematic structural diagram of an HDFS-based stored-file deduplication apparatus according to a fourth embodiment of the present invention; the apparatus is applied to a Hadoop distributed file system. As shown in fig. 4A, the apparatus includes:
the fingerprint comparison module 401 is configured to compare a file fingerprint of a file to be deduplicated with a file fingerprint of a stored file;
a link identifier calculating module 402, configured to calculate a link identifier according to the file identifier of the file to be deduplicated if the comparison results are the same;
a content replacing module 403, configured to replace the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in a storage node, and store the result in the storage node as the key value of the file identifier of the file to be deduplicated.
Preferably, the link identifier calculation module is specifically configured to:
and calculating a 32-bit MD5 value of the file identifier of the file to be deduplicated as the link identifier.
Preferably, the apparatus further comprises:
and an index rewriting module 404, configured to rewrite the index file of the storage node according to each file identifier and the storage position of its corresponding key value, after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address of the identical stored file in the storage node and the result has been stored in the storage node as the key value of the file identifier of the file to be deduplicated.
Specifically, in the offline state, the fingerprint comparison module compares the file fingerprint of the file to be deduplicated with the fingerprints of stored files, screens out files to be deduplicated whose content is identical to a stored file, and obtains the corresponding stored file and its storage address in the data storage node. If the fingerprint comparison results are the same, the link identifier calculation module calculates a 32-character MD5 value from the file identifier Key of the file to be deduplicated and uses it as the link identifier, which marks files that have been deduplicated. The content replacing module replaces the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and stores the result in the storage node as the key value of the file identifier of the file to be deduplicated. The index rewriting module then rewrites the index file of the storage node according to each file identifier and the storage position of its corresponding key value in the storage node.
Preferably, as shown in fig. 4A, the apparatus further comprises:
a file storage module 405, configured to store received files into a set area in the storage node and mark the area as a non-deduplicated area before the file fingerprint of the file to be deduplicated is compared with the file fingerprint of the stored file;
a file obtaining module 406, configured to obtain files one by one from the non-deduplicated area as files to be deduplicated.
Preferably, the file storage module includes:
a primary key generating unit, configured to generate a primary key for the received file as a file identifier;
and the content conversion unit is used for converting the file content of the file into binary data and storing the binary data and the file identifier into a set area in the storage node correspondingly.
Preferably, the file storage module is specifically configured to:
and storing the received file into different set areas in the storage node according to the receiving date of the file.
Specifically, the file storage module continuously receives and caches files; when the cache occupancy reaches the capacity threshold or the receiving time reaches the preset limit, the system writes the files, with multiple concurrent threads, into the mapping files of the storage nodes according to the files' receiving dates and in the order the unstructured files were received. The capacity threshold may range from 128 MB to 2 GB, and the preset time limit from 5 to 20 minutes. The primary key generating unit generates a primary Key for each received file and uses it as the file identifier; the content conversion unit converts the file content into binary data used as the key Value corresponding to the primary Key; and the primary Key and its Value are stored into the mapping file of the storage node as a Key-Value pair. The file obtaining module then obtains files one by one from the non-deduplicated area as files to be deduplicated.
Preferably, as shown in fig. 4B, the apparatus further includes:
a file identifier reading module 407, configured to obtain the file identifier of the file to be read according to the received file reading request;
a corresponding identifier calculating module 408, configured to calculate the corresponding link identifier according to the file identifier;
a set-position data reading module 409, configured to read the set-position data of the corresponding key value from the storage node according to the file identifier;
a matching module 410, configured to read a storage address from the key value if the link identifier matches the set-position data;
a file searching module 411, configured to locate the corresponding file in the storage node according to the storage address and respond to the file reading request after reading.
Specifically, the file identifier reading module obtains the file identifier (primary Key) of the file to be read from the received file reading request, and the corresponding identifier calculating module calculates the MD5 value corresponding to the primary Key. The set-position data reading module reads the leading 32 characters of the corresponding key value from the storage node according to the primary Key, and the matching module compares them with the MD5 value of the link identifier. If they match, the file to be read has been deduplicated: the key value stores the storage address of the identical stored file rather than the real content, so the leading 32 characters are stripped, and the file searching module reads the storage address from the key value, locates the corresponding file in the storage node according to that address, and answers the file reading request after reading. If they do not match, the file has not been deduplicated, and the file reading request is answered after the file content is read from the key value.
The HDFS-based stored-file deduplication device provided by the fourth embodiment of the invention can effectively remove files with duplicate content, reduce the number of files, save a large amount of storage space, release memory resources, improve system performance, and satisfy the requirements of fast storage and correct reading.
The device provided by the embodiment of the invention can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (12)

1. A method for deduplicating files stored on an HDFS (Hadoop Distributed File System), characterized by comprising the following steps:
comparing the file fingerprint of the file to be deduplicated with the file fingerprint of the stored file;
if the comparison result is the same, calculating a link identifier according to the file identifier of the file to be deduplicated;
replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and storing the result in the storage node as a key value of the file identifier of the file to be deduplicated;
wherein, before comparing the file fingerprint of the file to be deduplicated with the file fingerprint of the stored file, the method further comprises:
storing received files into a set area in the storage node and marking the area as a non-deduplicated area;
and acquiring files one by one from the non-deduplicated area as files to be deduplicated.
2. The method of claim 1, wherein storing the received file in a set area in the storage node comprises:
generating a primary key for the received file as a file identifier;
and converting the file content of the file into binary data, and storing the binary data and the file identifier into a set area in the storage node correspondingly.
3. The method of claim 1, wherein storing the received file in a set area in the storage node comprises:
and storing the received file into different set areas in the storage node according to the receiving date of the file.
4. The method of claim 1, wherein calculating the link identifier according to the file identifier of the file to be deduplicated comprises:
and calculating a 32-bit MD5 value of the file identifier of the file to be deduplicated as the link identifier.
5. The method according to claim 1, wherein after replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node and storing the result in the storage node as the key value of the file identifier of the file to be deduplicated, the method further comprises:
and rewriting the index file of the storage node according to the file identifications and the storage positions of the corresponding key values in the storage node.
6. The method of any of claims 1-5, further comprising:
acquiring the file identifier of a file to be read according to a received file reading request;
calculating the corresponding link identifier according to the file identifier;
reading the set-position data of the corresponding key value from the storage node according to the file identifier;
if the link identifier matches the set-position data, reading a storage address from the key value;
and locating the corresponding file in the storage node according to the storage address, and responding to the file reading request after reading.
7. An HDFS-based stored-file deduplication apparatus, characterized by comprising:
the fingerprint comparison module is used for comparing the file fingerprint of the file to be deduplicated with the file fingerprint of the stored file;
the link identifier calculation module is used for calculating a link identifier according to the file identifier of the file to be deduplicated if the comparison results are the same;
the content replacement module is used for replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and storing the result in the storage node as the key value of the file identifier of the file to be deduplicated;
the file storage module is used for storing received files into a set area in the storage node and marking the area as a non-deduplicated area before the file fingerprint of the file to be deduplicated is compared with the file fingerprint of the stored file;
and the file acquisition module is used for acquiring files one by one from the non-deduplicated area as files to be deduplicated.
8. The apparatus of claim 7, wherein the file storage module comprises:
a primary key generating unit, configured to generate a primary key for the received file as a file identifier;
and the content conversion unit is used for converting the file content of the file into binary data and storing the binary data and the file identifier into a set area in the storage node correspondingly.
9. The apparatus of claim 7, wherein the file storage module is specifically configured to:
and storing the received file into different set areas in the storage node according to the receiving date of the file.
10. The apparatus of claim 7, wherein the link identifier computation module is specifically configured to:
and calculating a 32-bit MD5 value of the file identifier of the file to be deduplicated as the link identifier.
11. The apparatus of claim 7, further comprising:
and the index rewriting module is used for rewriting the index file of the storage node according to each file identifier and the storage position of its corresponding key value, after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address of the identical stored file in the storage node and the result has been stored in the storage node as the key value of the file identifier of the file to be deduplicated.
12. The apparatus of any of claims 7-11, further comprising:
the file identifier reading module is used for acquiring the file identifier of the file to be read according to the received file reading request;
the corresponding identifier calculating module is used for calculating the corresponding link identifier according to the file identifier;
the set-position data reading module is used for reading the set-position data of the corresponding key value from the storage node according to the file identifier;
the matching module is used for reading a storage address from the key value if the link identifier matches the set-position data;
and the file searching module is used for locating the corresponding file in the storage node according to the storage address and responding to the file reading request after reading.
CN201611159251.XA 2016-12-15 2016-12-15 HDFS (Hadoop Distributed File System)-based deduplication method and device for stored files Active CN106649676B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611159251.XA CN106649676B (en) 2016-12-15 2016-12-15 HDFS (Hadoop Distributed File System)-based deduplication method and device for stored files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611159251.XA CN106649676B (en) 2016-12-15 2016-12-15 HDFS (Hadoop Distributed File System)-based deduplication method and device for stored files

Publications (2)

Publication Number Publication Date
CN106649676A CN106649676A (en) 2017-05-10
CN106649676B true CN106649676B (en) 2020-06-19

Family

ID=58822292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611159251.XA Active CN106649676B (en) 2016-12-15 2016-12-15 HDFS (Hadoop distributed File System) -based duplicate removal method and device for stored files

Country Status (1)

Country Link
CN (1) CN106649676B (en)


Also Published As

Publication number Publication date
CN106649676A (en) 2017-05-10


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant