WO2020140622A1 - Distributed storage system, storage node device and data duplicate deletion method

Distributed storage system, storage node device and data duplicate deletion method

Info

Publication number
WO2020140622A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2019/118009
Inventor
宋小兵
姜文峰
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2020140622A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • The present application relates to the field of distributed storage technology, and in particular to a distributed storage system, a storage node device, a data deduplication method, and a computer-readable storage medium.
  • Duplicate data deletion, also known as data deduplication, is a technology used to globally identify and eliminate redundant data in storage systems, and it has become a hotspot in storage system research in recent years.
  • Data deduplication uniquely identifies a data block by calculating a secure hash digest of the data block (such as a SHA-1 fingerprint), which avoids character-by-character matching of the data. The storage system only needs to maintain an index table of secure hash digests to recognize duplicate data quickly and easily, with good scalability. For duplicate data content, only the corresponding data pointer information needs to be recorded, which saves storage space. Data deduplication can therefore greatly save storage space and improve the resource utilization of storage devices.
  • The deduplication process for a data piece in a storage node of a distributed storage system usually includes the following steps: calculate the fingerprint of the data piece, then query whether the fingerprint exists in the fingerprint database of the storage node; if not, query the fingerprint databases of the other storage nodes in the distributed storage system to confirm whether the data piece already exists in the distributed storage system.
  • The disadvantage of this method is that the number of storage nodes in a distributed storage system is usually large. If a storage node needs to query the fingerprint databases of multiple other storage nodes, it must communicate with them one by one, which is slow and inefficient.
  • The main purpose of the present application is to provide a distributed storage system, a storage node device, a data deduplication method, and a computer-readable storage medium, aiming to improve the deduplication efficiency of the distributed storage system.
  • The distributed storage system includes multiple storage node devices and multiple shared fingerprint libraries.
  • The storage node devices and the shared fingerprint library are communicatively connected.
  • A local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library. The storage node device is used to: receive a data slice write request that includes several data slices to be written and the fingerprint of each data slice to be written; determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and look up whether each fingerprint to be deduplicated exists in the local fingerprint library,
  • the local fingerprint library including the fingerprints of the data pieces stored in the storage node device; when one or more fingerprints to be deduplicated exist in the local fingerprint library, delete the data pieces to be written corresponding to the one or more fingerprints to be deduplicated; when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, the one or more fingerprints
  • The present application also proposes a data deduplication method, which is applicable to a distributed storage system.
  • The distributed storage system includes multiple storage node devices and several shared fingerprint libraries.
  • The storage node devices are communicatively connected to the shared fingerprint library, and each storage node device is provided with a local fingerprint library or is communicatively connected to a corresponding local fingerprint library. The method includes the following steps. Receiving step: the storage node device receives a data slice write request, which includes a plurality of data slices to be written and the fingerprint of each data slice to be written. Query step: the storage node device determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looks up whether each fingerprint to be deduplicated exists in the local fingerprint library, the local fingerprint library including the fingerprints of the data pieces stored in the storage node device. First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, the storage node device deletes the data piece to be written
  • The present application also proposes a storage node device that is communicatively connected to a shared fingerprint library. The storage node device is provided with a local fingerprint library, or is communicatively connected to a corresponding local fingerprint library. The storage node device includes a memory and a processor, and a data deduplication program is stored in the memory.
  • Receiving step: receive a data slice write request, which includes a plurality of data slices to be written and the fingerprint of each data slice to be written. Query step: determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and look up whether each fingerprint to be deduplicated exists in the local fingerprint library, the local fingerprint library including the fingerprints of the data pieces stored in the storage node device. First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, delete the data pieces to be written corresponding to the one or more fingerprints to be deduplicated. Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, use the one or more fingerprints to be deduplicated as fingerprints to be processed, and look up each fingerprint to be processed in the shared fingerprint library.
  • the shared database includes all the storage node devices.
  • The present application also proposes a computer-readable storage medium, which is suitable for a storage node device. The storage node device is communicatively connected to a shared fingerprint library, and a local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library. The computer-readable storage medium stores a data deduplication program, and the data deduplication program may be executed by at least one processor to cause the at least one processor to perform the following steps. Receiving step: receive a data slice write request, which includes a plurality of data slices to be written and the fingerprint of each data slice to be written. Query step: determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and look up whether each fingerprint to be deduplicated exists in the local fingerprint library, the local fingerprint library including the fingerprints of the data pieces stored in the storage node device. First deduplication step: when one or more fingerprints
  • When a storage node device performs data deduplication in this embodiment, if the fingerprint of a data piece to be written is not found in the local fingerprint library, the storage node device can query the shared fingerprint library directly to determine whether the fingerprint is a duplicate. There is no need to communicate with and query the other storage node devices one by one, so the data deduplication efficiency of the distributed storage system is improved.
  • FIG. 1 is a schematic diagram of the system architecture of an embodiment of the distributed storage system of the present application.
  • FIG. 2 is a schematic diagram of the operating environment of an embodiment of the data deduplication program of the present application.
  • FIG. 3 is a program module diagram of an embodiment of the data deduplication program of the present application.
  • FIG. 4 is a schematic flowchart of an embodiment of the data deduplication method of the present application.
  • Refer to FIG. 1, which is a system architecture diagram of an embodiment of the distributed storage system of the present application.
  • The distributed storage system includes a plurality of storage node devices 1 and a plurality of shared fingerprint libraries 2. The storage node devices 1 are communicatively connected to the shared fingerprint library 2 (for example, through a network 4), and each storage node device 1 is provided with a local fingerprint library 3 or is communicatively connected to a corresponding local fingerprint library 3.
  • The local fingerprint library 3 includes the fingerprints corresponding to the data pieces stored in the storage node device 1,
  • and the shared fingerprint library 2 includes the fingerprints of the stored data pieces of all the storage node devices 1.
  • The storage node device 1 is used to:
  • receive a data slice write request that includes a plurality of data slices to be written and the fingerprint of each data slice to be written;
  • when one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, use the one or more fingerprints to be deduplicated as fingerprints to be processed, look up each fingerprint to be processed in the shared fingerprint library 2, and, when one or more fingerprints to be processed are found in the shared fingerprint library 2, delete the data pieces to be written corresponding to the found fingerprints.
  • The storage node device 1 receives a data slice write request, which includes a plurality of data slices to be written and the fingerprints of those data slices.
  • The data pieces to be written are obtained by dividing the data to be written (the data types of the data to be written include block-level data and file-level data).
  • The segmentation operation may be performed by the storage node device 1 or by any other suitable device (e.g., a client), and the segmentation method includes:
  • where M is a natural number greater than 1, determining the data slice size corresponding to the data file to be written, and splitting off M-1 data blocks of that size one by one according to the determined size, with the remainder forming the M-th data block.
  • The size of a data piece to be written may be 4KB, 8KB, 12KB, 16KB, or another granularity.
  • The fingerprint of each data piece to be written is calculated, for example with Message-Digest Algorithm 5 (MD5) or Secure Hash Algorithm 1 (SHA-1). At the same time, the arrangement order of the data pieces to be written (i.e., the data piece fingerprint sequence) is recorded, so that when the data is subsequently read, the data pieces can be assembled back into the data according to the data piece fingerprint sequence.
  • The storage node device 1 can also save the data piece fingerprint sequence to the local fingerprint library 3 and the shared fingerprint library 2.
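  • The segmentation and fingerprinting described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `split_into_slices`, `fingerprint`, and `build_write_request` are hypothetical, the 4KB default is one of the sizes the text lists, and SHA-1 is used only because the text names it as an example.

```python
import hashlib


def split_into_slices(data: bytes, slice_size: int = 4096) -> list[bytes]:
    """Cut data into fixed-size slices; the last slice holds the remainder."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]


def fingerprint(piece: bytes) -> str:
    """Return the SHA-1 digest of one slice as a hex string."""
    return hashlib.sha1(piece).hexdigest()


def build_write_request(data: bytes, slice_size: int = 4096):
    """Return the slices plus their ordered fingerprint sequence.

    The sequence keeps one entry per slice, in order, so the data can
    later be reassembled even if duplicate slices are stored only once.
    """
    slices = split_into_slices(data, slice_size)
    sequence = [fingerprint(s) for s in slices]
    return slices, sequence
```

  • With a 4KB slice size, data of 12KB plus 100 bytes yields three full slices and a 100-byte remainder; identical slices produce identical fingerprints, which is what the later lookup steps rely on.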
  • The storage node device 1 determines the fingerprints to be deduplicated among the fingerprints of the data pieces to be written as follows: determine whether there are redundant fingerprints among the fingerprints of the data pieces to be written; if so, delete the redundant fingerprints and use the remaining fingerprints as the fingerprints to be deduplicated; if not, use all fingerprints of the data pieces to be written as the fingerprints to be deduplicated.
  • The storage node device 1 determines whether identical fingerprints exist among all fingerprints of the data pieces to be written. If identical fingerprints exist, each set of identical fingerprints is treated as a fingerprint group. After all fingerprint groups are found, one fingerprint in each group is selected and retained, and the unselected fingerprints are deleted as redundant fingerprints. The device then judges whether there are ungrouped fingerprints; if so, each ungrouped fingerprint is taken as a fingerprint to be deduplicated; if not, the process ends. If no identical fingerprints exist, all fingerprints of the data pieces to be written are regarded as ungrouped fingerprints, and each ungrouped fingerprint is taken as a fingerprint to be deduplicated.
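  • A minimal sketch of removing redundant fingerprints within a single write request is shown below. It follows the summary rule (drop repeats, keep the remaining fingerprints) rather than the group-by-group wording; the function name `fingerprints_to_deduplicate` and the choice to keep the first occurrence of each fingerprint are assumptions made for illustration.

```python
def fingerprints_to_deduplicate(fingerprints: list[str]) -> tuple[list[str], list[int]]:
    """Split a request's fingerprints into unique ones and redundant positions.

    Returns (unique fingerprints in first-seen order,
             indexes of the redundant repeats within the request).
    """
    seen: set[str] = set()
    unique: list[str] = []
    redundant: list[int] = []
    for index, fp in enumerate(fingerprints):
        if fp in seen:
            redundant.append(index)  # a repeat inside the same request
        else:
            seen.add(fp)
            unique.append(fp)
    return unique, redundant
```

  • Filtering repeats inside the request first means each distinct fingerprint is looked up in the fingerprint libraries at most once.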
  • After identifying the fingerprints to be deduplicated, the storage node device 1 searches the local fingerprint library 3 for each fingerprint to be deduplicated. When one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the corresponding data pieces to be written are duplicate data pieces, so the storage node device 1 deletes them to save storage space.
  • When one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, the storage node device 1 takes them as fingerprints to be processed and looks up each fingerprint to be processed in the shared fingerprint library 2. When one or more fingerprints to be processed are found in the shared fingerprint library 2, it deletes the data pieces to be written corresponding to the found fingerprints.
  • Since the shared fingerprint library 2 holds the full set of fingerprint data, if the storage node device 1 does not find a fingerprint to be deduplicated in the local fingerprint library 3, it goes on to query the shared fingerprint library 2 for that fingerprint. If the fingerprint exists there, the corresponding data piece to be written already exists on another storage node and is a duplicate data piece, so it does not need to be stored.
  • When a storage node device 1 performs data deduplication in this embodiment, if the fingerprint of a data piece to be written is not found in the local fingerprint library 3, the storage node device 1 can query the shared fingerprint library 2 directly to determine whether the fingerprint is a duplicate. There is no need to communicate with the other storage node devices 1 one by one, so the data deduplication efficiency of the distributed storage system is improved.
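  • The two-stage lookup (local fingerprint library first, shared fingerprint library second) can be sketched as below. Plain Python sets stand in for both libraries, which is an assumption for illustration only; in the described system the shared library is reached over a network or shared disk. The function name `deduplicate` and its return shape are hypothetical.

```python
def deduplicate(pieces: dict[str, bytes],
                local_library: set[str],
                shared_library: set[str]):
    """Classify data pieces (keyed by fingerprint) using a two-stage lookup.

    Returns (to_store, dropped_local, dropped_shared):
      to_store       - pieces found in neither library, which must be saved
      dropped_local  - fingerprints already present in the local library
      dropped_shared - fingerprints found only in the shared library
    """
    # Stage 1: local lookup; hits are duplicates on this node.
    dropped_local = [fp for fp in pieces if fp in local_library]
    pending = [fp for fp in pieces if fp not in local_library]
    # Stage 2: one shared lookup replaces querying every other node.
    dropped_shared = [fp for fp in pending if fp in shared_library]
    to_store = {fp: pieces[fp] for fp in pending if fp not in shared_library}
    return to_store, dropped_local, dropped_shared
```

  • Only fingerprints that miss locally reach the shared library, so the number of remote queries is bounded by the number of local misses rather than by the number of storage nodes.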
  • the storage node device 1 is also used to:
  • The storage node device 1 saves all remaining data pieces to be written (i.e., non-duplicate data pieces), and saves the storage location information corresponding to the remaining data pieces to be written to the local fingerprint library 3 and the shared fingerprint library 2.
  • The distributed storage system further includes a control node device 5, which is communicatively connected to the storage node device 1 and the shared fingerprint library 2 (e.g., through the network 4).
  • The shared fingerprint library 2 may be placed on a shared disk (such as an NVMe disk mounted via NVMe-oF), and the shared disk may be located in the control node device 5 or be independent of the control node device 5.
  • the storage node device 1 is also used to:
  • the control node device 5 is used to:
  • the cumulative reference count of the fingerprint of each data piece to be written (the cumulative reference count of a fingerprint represents the total number of times the data piece corresponding to the fingerprint is referenced by stored data).
  • the storage node device 1 is also used to:
  • receive a deletion request for data to be deleted, acquire the data piece fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the acquired data piece fingerprint sequence (for example, determine that the reference count change value of each fingerprint in the data piece fingerprint sequence is -1), and send the reference count change value of each fingerprint in the data piece fingerprint sequence to the control node device 5.
  • the control node device 5 is also used to:
  • The storage node device 1 cannot directly delete the data pieces of the data to be deleted, because it cannot determine whether those data pieces are also referenced by other data; deleting them directly may cause data loss. Therefore, it is only necessary to update the cumulative reference count of each fingerprint in the data piece fingerprint sequence of the data to be deleted, and to delete that data piece fingerprint sequence.
  • The control node device 5 is also used to:
  • the fingerprint is deleted, and the corresponding storage node device 1 is notified to delete the fingerprint and the data piece corresponding to the fingerprint.
  • When the control node device 5 detects that the cumulative reference count of a fingerprint is zero, it waits for a preset period of time before deleting the data piece corresponding to the fingerprint, and during that period it continues to receive, in real time, the reference count change values of the fingerprint reported by each storage node device 1. This avoids erroneous deletion of data caused by a storage node device 1 not reporting a reference count change value in time.
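  • The reference counting with delayed deletion can be sketched as follows. This is an illustrative model under stated assumptions, not the control node's actual design: the class name `ReferenceTracker`, its methods, and the use of caller-supplied timestamps (`now`, in seconds) are all hypothetical. A fingerprint is released only after its cumulative reference count has stayed at zero for longer than the preset duration.

```python
class ReferenceTracker:
    """Tracks cumulative reference counts and a zero-count grace period."""

    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds
        self.counts: dict[str, int] = {}
        self.zero_since: dict[str, float] = {}  # fingerprint -> time it hit zero

    def apply_changes(self, changes: dict[str, int], now: float) -> None:
        """Apply reference count change values reported by a storage node."""
        for fp, delta in changes.items():
            count = self.counts.get(fp, 0) + delta
            if count < 0:
                raise ValueError(f"reference count for {fp} would go negative")
            self.counts[fp] = count
            if count == 0:
                self.zero_since.setdefault(fp, now)  # start the grace period
            else:
                self.zero_since.pop(fp, None)        # a late reference cancels it

    def collect_expired(self, now: float) -> list[str]:
        """Return and forget fingerprints at zero for longer than the grace period."""
        expired = [fp for fp, since in self.zero_since.items()
                   if now - since > self.grace_seconds]
        for fp in expired:
            del self.counts[fp]
            del self.zero_since[fp]
        return expired
```

  • Passing the clock in explicitly keeps the sketch deterministic; a late +1 report that arrives during the grace period simply removes the fingerprint from the pending-deletion set.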
  • the storage node device 1 is also used to:
  • This application also proposes a data deduplication program.
  • FIG. 2 is a schematic diagram of an operating environment of an embodiment of the data deduplication program 10 of the present application.
  • the data deduplication program 10 is installed and runs in the storage node device 1.
  • The storage node device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a server.
  • the storage node device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
  • FIG. 2 only shows the storage node device 1 having the components 11-13, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
  • In some embodiments, the memory 11 may be an internal storage unit of the storage node device 1, such as a hard disk or a memory of the storage node device 1. In other embodiments, the memory 11 may be an external storage device of the storage node device 1, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the storage node device 1. Further, the memory 11 may include both the internal storage unit and the external storage device of the storage node device 1. The memory 11 is used to store application software installed on the storage node device 1 and various types of data, such as the program code of the data deduplication program 10. The memory 11 may also be used to temporarily store data that has been output or will be output.
  • The processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, which is used to run program code or process data stored in the memory 11, for example to execute the data deduplication program 10.
  • The display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, or the like.
  • the display 13 is used to display information processed in the storage node device 1 and to display a visual user interface.
  • The components 11-13 of the storage node device 1 communicate with each other through a system bus.
  • FIG. 3 is a program module diagram of an embodiment of the data deduplication program 10 of the present application.
  • The data deduplication program 10 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to implement this application.
  • The data deduplication program 10 may be divided into a receiving module 101, a preprocessing module 102, a query module 103, a first deduplication module 104, and a second deduplication module 105.
  • A module, as referred to in this application, is a series of computer program instruction segments capable of performing a specific function, and it is better suited than the program as a whole for describing the execution process of the data deduplication program 10 in the storage node device 1, wherein:
  • The receiving module 101 is configured to receive a data slice write request, which includes a plurality of data slices to be written and the fingerprints of those data slices.
  • The data pieces to be written are obtained by dividing the data to be written (the data types of the data to be written include block-level data and file-level data). The dividing operation may be performed by the receiving module 101 or by any other suitable device (for example, a client), and the segmentation method includes:
  • where M is a natural number greater than 1, determining the data slice size corresponding to the data file to be written, and splitting off M-1 data blocks of that size one by one according to the determined size, with the remainder forming the M-th data block.
  • The size of a data piece to be written may be 4KB, 8KB, 12KB, 16KB, or another granularity.
  • The fingerprint of each data piece to be written is calculated, for example with Message-Digest Algorithm 5 (MD5) or Secure Hash Algorithm 1 (SHA-1).
  • the preprocessing module 102 is configured to determine the fingerprint to be deduplicated among the fingerprints of the data piece to be written.
  • the method for the preprocessing module 102 to determine the fingerprint to be deduplicated includes:
  • the query module 103 is used to find whether each fingerprint to be deduplicated exists in the local fingerprint library 3.
  • the first deduplication module 104 is configured to delete the piece of data to be written corresponding to the one or more fingerprints to be deduplicated when one or more fingerprints to be deduplicated exist in the local fingerprint library 3.
  • The data pieces to be written corresponding to these fingerprints to be deduplicated are duplicate data pieces, and they are deleted to save storage space.
  • The second deduplication module 105 is configured to, when one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, use the one or more fingerprints to be deduplicated as fingerprints to be processed and look up each fingerprint to be processed in the shared fingerprint library 2, and, when one or more fingerprints to be processed are found in the shared fingerprint library 2, delete the data pieces to be written corresponding to the found fingerprints.
  • The second deduplication module 105 goes on to query the shared fingerprint library 2 for the fingerprint to be deduplicated. If the fingerprint exists there, the corresponding data piece to be written already exists on another storage node and is a duplicate data piece, so it does not need to be stored.
  • When a storage node device 1 performs data deduplication in this embodiment, if the fingerprint of a data piece to be written is not found in the local fingerprint library 3, the storage node device 1 can query the shared fingerprint library 2 directly to determine whether the fingerprint is a duplicate. There is no need to communicate with the other storage node devices 1 one by one, so the data deduplication efficiency of the distributed storage system is improved.
  • the data deduplication program 10 further includes a storage module (not shown in the figure) for:
  • the data deduplication program 10 further includes a reference update module (not shown in the figure) for:
  • the data deduplication program 10 further includes a deletion module (not shown in the figure), which is used to:
  • receive a deletion request for data to be deleted, acquire the data piece fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the acquired data piece fingerprint sequence (for example, determine that the reference count change value of each fingerprint in the data piece fingerprint sequence is -1), and send the reference count change value of each fingerprint in the data piece fingerprint sequence to the control node device 5.
  • The control node device 5 updates the cumulative reference count of each fingerprint in the data piece fingerprint sequence according to the reference count change value of each fingerprint in the sequence, deletes the data piece fingerprint sequence of the data to be deleted from the shared fingerprint library 2, and notifies the storage node device 1 to delete the data piece fingerprint sequence of the data to be deleted from the local fingerprint library 3.
  • The storage node device 1 cannot directly delete the data pieces of the data to be deleted, because it cannot determine whether those data pieces are also referenced by other data; deleting them directly may cause data loss. Therefore, it is only necessary to update the cumulative reference count of each fingerprint in the data piece fingerprint sequence of the data to be deleted, and to delete that data piece fingerprint sequence.
  • When the control node device 5 detects in the shared fingerprint library 2 that the cumulative reference count of a fingerprint is zero (that is, the data piece corresponding to the fingerprint is not referenced by any data), it records how long the fingerprint has remained at a cumulative reference count of zero. When the duration is greater than the preset duration, the control node device 5 deletes the fingerprint and notifies the corresponding storage node device 1 to delete the fingerprint and the data piece corresponding to the fingerprint. When the duration is less than or equal to the preset duration, no deletion is performed.
  • When the control node device 5 detects that the cumulative reference count of a fingerprint is zero, it waits for a preset period of time before deleting the data piece corresponding to the fingerprint, and during that period it continues to receive, in real time, the reference count change values of the fingerprint reported by each storage node device 1. This avoids erroneous deletion of data caused by a storage node device 1 not reporting a reference count change value in time.
  • the data deduplication program 10 further includes a reading module (not shown in the figure) for:
  • This application also proposes a data deduplication method, which is applicable to the above distributed storage system.
  • FIG. 4 is a schematic flowchart of an embodiment of a data deduplication method according to this application.
  • the method includes:
  • Step S10: the storage node device 1 receives a data piece write request, which includes a plurality of data pieces to be written and the fingerprint of each data piece to be written.
  • The data pieces to be written are obtained by splitting the data to be written (whose data type includes block-level data and file-level data). The splitting may be performed by the storage node device 1 or by any other suitable device (for example, a client), and the splitting method includes:
  • Splitting the data file to be written into a preset number of data pieces of the same size. Alternatively, denoting the preset number as M, where M is a natural number greater than 1, determining the data piece size for the data file to be written, splitting off M-1 data pieces of that size one by one, and taking the remainder as the M-th data piece.
  • The size of a data piece to be written may be 4 KB, 8 KB, 12 KB, 16 KB, or another granularity.
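The splitting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_pieces` and the 4 KB default are hypothetical, and the source does not prescribe any particular implementation.

```python
def split_into_pieces(data: bytes, piece_size: int = 4096) -> list[bytes]:
    """Split data into fixed-size pieces; the last piece holds the remainder.

    Hypothetical helper: for data that yields M pieces, the first M-1 pieces
    have exactly piece_size bytes and the M-th piece is whatever is left.
    """
    if piece_size <= 0:
        raise ValueError("piece_size must be positive")
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]
```

Joining the returned pieces in order reproduces the original data, which is what the later read path relies on.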
  • After the data to be written is split into data pieces, the fingerprint of each data piece to be written is calculated, for example with Message-Digest Algorithm 5 (MD5) or a Secure Hash Algorithm (SHA1).
  • At the same time, the order of the data pieces to be written is recorded (that is, the data piece fingerprint sequence), so that when the data is later read, the data pieces can be assembled back into the original data according to this sequence.
  • The data piece fingerprint sequence can also be saved in the local fingerprint library 3 and the shared fingerprint library 2.
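As a rough illustration of fingerprinting, the sketch below hashes each data piece with SHA1 from Python's standard `hashlib` module and keeps the fingerprints in piece order. The function names are assumptions made for this example; the source only says that MD5, SHA1, or a similar algorithm may be used.

```python
import hashlib


def fingerprint(piece: bytes) -> str:
    """Return the SHA1 digest of one data piece as a hex string."""
    return hashlib.sha1(piece).hexdigest()


def fingerprint_sequence(pieces: list[bytes]) -> list[str]:
    """Return the fingerprints in the same order as the pieces.

    The ordered list plays the role of the data piece fingerprint sequence:
    it records how to reassemble the data later, so repeated fingerprints
    are deliberately kept here.
    """
    return [fingerprint(p) for p in pieces]
```

Identical pieces always map to the same fingerprint, which is the property the deduplication steps depend on.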
  • Step S20: the storage node device 1 determines the fingerprints to be deduplicated among the fingerprints of the data pieces to be written.
  • The method for determining the fingerprints to be deduplicated includes: judging whether redundant fingerprints exist among the fingerprints of the data pieces to be written; if so, deleting the redundant fingerprints and using the remaining fingerprints as the fingerprints to be deduplicated; if not, using all fingerprints of the data pieces to be written as the fingerprints to be deduplicated.
  • Step S20 includes steps S21 to S26 (not shown in the figure), specifically:
  • Step S21: it is determined whether identical fingerprints exist among all fingerprints of the data pieces to be written.
  • Step S22: if identical fingerprints exist, each set of identical fingerprints is treated as a fingerprint group. After all fingerprint groups are found, one fingerprint in each group is selected and retained, and the unselected fingerprints are deleted as redundant fingerprints. It is then determined whether any ungrouped fingerprints exist. If so, each ungrouped fingerprint is taken as a fingerprint to be deduplicated; if not, the process ends.
  • Step S23: if no identical fingerprints exist, all fingerprints of the data pieces to be written are regarded as ungrouped fingerprints, and each ungrouped fingerprint is taken as a fingerprint to be deduplicated.
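One way to read step S20 is that repeated fingerprints inside a single write request are collapsed so that each distinct fingerprint is looked up only once. The sketch below follows that reading (the retained fingerprint of each group plus all ungrouped fingerprints); the function name is hypothetical and the grouping is done implicitly rather than step by step as in S21 to S23.

```python
def fingerprints_to_deduplicate(fingerprints: list[str]) -> list[str]:
    """Drop repeated fingerprints within one request, keeping first occurrences.

    dict.fromkeys preserves insertion order, so the result keeps the order in
    which each distinct fingerprint first appeared in the request.
    """
    return list(dict.fromkeys(fingerprints))
```

For example, a request whose pieces have fingerprints `a, b, a, c, b` needs only three lookups, for `a`, `b` and `c`.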
  • Step S30: the storage node device 1 looks up each fingerprint to be deduplicated in the local fingerprint library 3.
  • Step S40: when one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the storage node device 1 deletes the data pieces to be written corresponding to those fingerprints.
  • When one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the data pieces to be written corresponding to these fingerprints are duplicate data pieces. To save storage space, these duplicate data pieces are deleted.
  • Step S50: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, the storage node device 1 takes those fingerprints as fingerprints to be processed and looks up each fingerprint to be processed in the shared fingerprint library 2. When one or more fingerprints to be processed are found in the shared fingerprint library 2, the data pieces to be written corresponding to the found fingerprints are deleted.
  • Since the shared fingerprint library 2 holds the full set of fingerprint data, if the storage node device 1 does not find a fingerprint to be deduplicated in the local fingerprint library 3, it goes on to query the shared fingerprint library 2 for that fingerprint. If the fingerprint exists there, the corresponding data piece to be written already exists on another storage node and is a duplicate data piece, so it does not need to be stored.
  • Compared with the prior art, when a storage node device 1 in this embodiment performs data deduplication and does not find the fingerprint of a data piece to be written in the local fingerprint library 3, it can query the shared fingerprint library 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other storage node devices 1 one by one. This improves the data deduplication efficiency of the distributed storage system.
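Steps S30 to S50 amount to a two-tier lookup: local library first, shared library second. The sketch below models both libraries as in-memory Python sets, which is purely an assumption for illustration; in the described system the shared fingerprint library is reached over a network, and the name `classify_pieces` is hypothetical.

```python
def classify_pieces(pieces_by_fp: dict[str, bytes],
                    local_library: set[str],
                    shared_library: set[str]):
    """Split pieces into duplicates (to drop) and new pieces (to store).

    Returns (duplicates, to_store), where duplicates maps each fingerprint to
    the library in which it was found and to_store maps fingerprints that are
    in neither library to their data pieces.
    """
    duplicates: dict[str, str] = {}
    to_store: dict[str, bytes] = {}
    for fp, piece in pieces_by_fp.items():
        if fp in local_library:        # S30/S40: already stored on this node
            duplicates[fp] = "local"
        elif fp in shared_library:     # S50: already stored on another node
            duplicates[fp] = "shared"
        else:                          # not stored anywhere yet
            to_store[fp] = piece
    return duplicates, to_store
```

The point of the design is visible in the `elif` branch: a local miss costs one query to the shared library rather than one query per peer node.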
  • Step S60: the method further includes:
  • The storage node device 1 saves all remaining data pieces to be written (that is, the non-duplicate data pieces), and saves the storage location information of these data pieces to the local fingerprint library 3 and the shared fingerprint library 2.
  • The distributed storage system further includes a control node device 5 communicatively connected to each storage node device 1 and to the shared fingerprint library 2, and after step S20 the method further includes:
  • The storage node device 1 determines the reference count change value of the fingerprint of each data piece to be written (for example, it determines that the reference count change value of each fingerprint to be deduplicated is +1), and sends these reference count change values to the control node device 5.
  • The control node device 5 updates the cumulative reference count of the fingerprint of each data piece to be written according to the reported reference count change value (the cumulative reference count of a fingerprint is the total number of times the corresponding data piece is referenced by stored data).
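The reference counting on the write path can be sketched as follows. The class and function names are hypothetical, the counts are kept in an in-memory `Counter` rather than in the shared fingerprint library, and message transport between nodes is left out.

```python
from collections import Counter


def write_deltas(fingerprints_to_dedup: list[str]) -> dict[str, int]:
    """Storage node side: report +1 for each fingerprint to be deduplicated."""
    return {fp: 1 for fp in fingerprints_to_dedup}


class ControlNode:
    """Control node side: accumulates the change values reported by nodes."""

    def __init__(self) -> None:
        self.cumulative: Counter = Counter()

    def apply(self, deltas: dict[str, int]) -> None:
        """Add each reported change value to the cumulative reference count."""
        for fp, delta in deltas.items():
            self.cumulative[fp] += delta
```

Two separate writes that both contain a piece with fingerprint `a` leave `cumulative["a"]` at 2, so the piece survives until both pieces of data are deleted.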
  • The method further includes steps S60 to S80 (not shown in the figure), specifically:
  • Step S60: the storage node device 1 receives a deletion request for data to be deleted.
  • Step S70: the storage node device 1 acquires the data piece fingerprint sequence of the data to be deleted, determines the reference count change value of each fingerprint in that sequence (for example, it determines that the reference count change value of each fingerprint in the sequence is -1), and sends these reference count change values to the control node device 5.
  • Step S80: the control node device 5 updates the cumulative reference count of each fingerprint in the data piece fingerprint sequence according to the reported reference count change values, deletes the data piece fingerprint sequence of the data to be deleted from the shared fingerprint library 2, and notifies the storage node device 1 to delete that sequence from the local fingerprint library 3.
  • The storage node device 1 cannot directly delete the data pieces of the data to be deleted, because it cannot determine whether those data pieces are also referenced by other data; deleting them directly could cause data loss. Therefore, only the cumulative reference count of each fingerprint in the data piece fingerprint sequence is updated, and the data piece fingerprint sequence of the data to be deleted is removed.
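A minimal sketch of the delete path, assuming in-memory dictionaries for the stored fingerprint sequences and the cumulative counts. The function name is hypothetical, and the choice to decrement once per distinct fingerprint is an assumption made here to mirror a +1 per distinct fingerprint on write; the source only says the change value is -1.

```python
def handle_delete(name: str,
                  sequences: dict[str, list[str]],
                  cumulative: dict[str, int]) -> list[str]:
    """Remove the fingerprint sequence of `name` and decrement reference counts.

    The data pieces themselves are NOT deleted here, because other data may
    still reference them. Returns the removed fingerprint sequence.
    """
    sequence = sequences.pop(name)      # S80: drop the fingerprint sequence
    for fp in set(sequence):            # S70: -1 per distinct fingerprint (assumed)
        cumulative[fp] -= 1
    return sequence
```

After the call, a fingerprint whose count reached zero is only a candidate for removal; the actual data piece stays in place until a separate cleanup decides it is safe to delete.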
  • The method further includes:
  • When the control node device 5 detects in the shared fingerprint library 2 that the cumulative reference count of a fingerprint is zero (that is, the data piece corresponding to the fingerprint is not referenced by any data), it records how long the fingerprint has remained in the zero-count state.
  • When this duration exceeds the preset duration, the control node device 5 deletes the fingerprint and notifies the corresponding storage node device 1 to delete the fingerprint and the data piece corresponding to it.
  • After detecting that the cumulative reference count of a fingerprint is zero, the control node device 5 waits for a preset period before deleting the corresponding data piece, and during that period it continues to receive, in real time, the reference count change values of the fingerprint reported by each storage node device 1. This avoids erroneous deletion of data caused by a storage node device 1 not reporting a reference count change value in time.
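The delayed deletion can be illustrated with a small tracker that remembers when each fingerprint's count first reached zero. Everything here is an assumption for the sake of the example: the class name, passing timestamps in explicitly (which keeps the sketch easy to test), and the in-memory dictionary.

```python
class ZeroCountTracker:
    """Tracks how long each fingerprint has had a cumulative count of zero."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        self.zero_since: dict[str, float] = {}

    def observe(self, fp: str, count: int, now: float) -> None:
        """Record the current count of a fingerprint at time `now`."""
        if count == 0:
            # Keep the earliest time at which the count was seen as zero.
            self.zero_since.setdefault(fp, now)
        else:
            # A late-reported reference arrived: cancel the pending deletion.
            self.zero_since.pop(fp, None)

    def expired(self, now: float) -> list[str]:
        """Fingerprints whose zero-count duration exceeds the grace period."""
        return [fp for fp, since in self.zero_since.items()
                if now - since > self.grace_seconds]
```

Only fingerprints returned by `expired` would be deleted; a count that rises above zero during the waiting period removes the fingerprint from consideration, which is the protection against late reports that the text describes.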
  • The method further includes step S90 (not shown in the figure).
  • Step S90: on receiving a read request for data to be read, the storage node device 1 acquires the data piece fingerprint sequence of the data to be read and the storage location information of the data piece corresponding to each fingerprint in that sequence. It then fetches each data piece according to its storage location information and assembles the fetched data pieces into the data to be read, in the order given by the data piece fingerprint sequence.
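The read path in step S90 can be sketched as a lookup followed by concatenation. The dictionaries standing in for the fingerprint library (`locations`) and the storage backend (`piece_store`) are assumptions for illustration, as is the function name.

```python
def read_data(sequence: list[str],
              locations: dict[str, int],
              piece_store: dict[int, bytes]) -> bytes:
    """Reassemble data from its fingerprint sequence.

    sequence    : ordered fingerprints recorded when the data was written
    locations   : fingerprint -> storage location of the corresponding piece
    piece_store : storage location -> stored data piece
    """
    return b"".join(piece_store[locations[fp]] for fp in sequence)
```

A fingerprint that appears several times in the sequence is fetched from the same location each time, so deduplicated data is restored byte for byte.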
  • The present application also proposes a computer-readable storage medium that stores a data deduplication program 10. Embodiments of the data deduplication program 10 have been described in detail above and are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to distributed storage technology and discloses a distributed storage system, a storage node device and a data deduplication method. When the storage node device of the present application performs data deduplication and does not find the fingerprint of a data piece to be written in a local fingerprint library, it can query a shared fingerprint library directly to determine whether the fingerprint is a duplicate, without having to query the other storage node devices one by one, thus improving the data deduplication efficiency of the distributed storage system.

Description

Distributed storage system, storage node device and data deduplication method
This application claims priority to Chinese patent application No. 201910007367.9, filed with the China Patent Office on January 4, 2019 and entitled "Distributed storage system, storage node device and data deduplication method", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of distributed storage technology, and in particular to a distributed storage system, a storage node device, a data deduplication method, and a computer-readable storage medium.
Background
Data deduplication is a technique used in storage systems to identify and eliminate redundant data globally, and it has become a focus of storage system research in recent years. Data deduplication uniquely identifies a data block by computing its secure hash digest (for example, a SHA1 fingerprint), which avoids byte-by-byte comparison of data. The storage system only needs to maintain an index table of secure hash digests to identify duplicate data quickly and conveniently, and the approach scales well. For duplicate data content, only the corresponding data pointer information needs to be recorded, which saves storage space. Data deduplication can therefore save a great deal of storage space and improve the resource utilization of storage devices.
At present, a storage node in a distributed storage system typically deduplicates a data piece as follows: it calculates the fingerprint of the data piece and queries its own fingerprint library for that fingerprint; if the fingerprint is not found, it queries the fingerprint libraries of the other storage nodes in the distributed storage system, in order to confirm whether the data piece already exists in the system. The drawback of this method is that a distributed storage system usually has many storage nodes. If a storage node needs to query the fingerprint libraries of many other storage nodes, it has to communicate with them one by one, which is slow and inefficient.
Therefore, how to improve the deduplication efficiency of a distributed storage system has become an urgent problem to be solved.
Summary of the invention
The main purpose of the present application is to provide a distributed storage system, a storage node device, a data deduplication method and a computer-readable storage medium, with the aim of improving the deduplication efficiency of a distributed storage system.
To achieve the above purpose, the present application proposes a distributed storage system. The distributed storage system includes multiple storage node devices and several shared fingerprint libraries, the storage node devices being communicatively connected to the shared fingerprint libraries. A local fingerprint library is provided in each storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library. The storage node device is used to: receive a data piece write request, the data piece write request including a plurality of data pieces to be written and the fingerprint of each data piece to be written; determine the fingerprints to be deduplicated among the fingerprints of the data pieces to be written, and look up each fingerprint to be deduplicated in the local fingerprint library, the local fingerprint library including the fingerprints of the data pieces already stored in the storage node device; when one or more fingerprints to be deduplicated exist in the local fingerprint library, delete the data pieces to be written corresponding to those fingerprints; when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, take those fingerprints as fingerprints to be processed and look up each fingerprint to be processed in the shared fingerprint library, the shared fingerprint library including the fingerprints of the data pieces already stored in all storage node devices, and when one or more fingerprints to be processed are found in the shared fingerprint library, delete the data pieces to be written corresponding to the found fingerprints.
In addition, to achieve the above purpose, the present application also proposes a data deduplication method applicable to a distributed storage system. The distributed storage system includes multiple storage node devices and several shared fingerprint libraries, the storage node devices being communicatively connected to the shared fingerprint libraries. A local fingerprint library is provided in each storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library. The method includes the following steps. Receiving step: the storage node device receives a data piece write request, the data piece write request including a plurality of data pieces to be written and the fingerprint of each data piece to be written. Query step: the storage node device determines the fingerprints to be deduplicated among the fingerprints of the data pieces to be written, and looks up each fingerprint to be deduplicated in the local fingerprint library, the local fingerprint library including the fingerprints of the data pieces already stored in the storage node device. First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, the storage node device deletes the data pieces to be written corresponding to those fingerprints. Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, the storage node device takes those fingerprints as fingerprints to be processed and looks up each fingerprint to be processed in the shared fingerprint library, the shared fingerprint library including the fingerprints of the data pieces already stored in all storage node devices, and when one or more fingerprints to be processed are found in the shared fingerprint library, deletes the data pieces to be written corresponding to the found fingerprints.
In addition, to achieve the above purpose, the present application also proposes a storage node device. The storage node device is communicatively connected to a shared fingerprint library, and a local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library. The storage node device includes a memory and a processor, the memory storing a data deduplication program which, when executed by the processor, implements the following steps. Receiving step: receive a data piece write request, the data piece write request including a plurality of data pieces to be written and the fingerprint of each data piece to be written. Query step: determine the fingerprints to be deduplicated among the fingerprints of the data pieces to be written, and look up each fingerprint to be deduplicated in the local fingerprint library, the local fingerprint library including the fingerprints of the data pieces already stored in the storage node device. First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, delete the data pieces to be written corresponding to those fingerprints. Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, take those fingerprints as fingerprints to be processed and look up each fingerprint to be processed in the shared fingerprint library, the shared fingerprint library including the fingerprints of the data pieces already stored in all storage node devices, and when one or more fingerprints to be processed are found in the shared fingerprint library, delete the data pieces to be written corresponding to the found fingerprints.
In addition, to achieve the above purpose, the present application also proposes a computer-readable storage medium suitable for a storage node device. The storage node device is communicatively connected to a shared fingerprint library, and a local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library. The computer-readable storage medium stores a data deduplication program that can be executed by at least one processor to cause the at least one processor to perform the following steps. Receiving step: receive a data piece write request, the data piece write request including a plurality of data pieces to be written and the fingerprint of each data piece to be written. Query step: determine the fingerprints to be deduplicated among the fingerprints of the data pieces to be written, and look up each fingerprint to be deduplicated in the local fingerprint library, the local fingerprint library including the fingerprints of the data pieces already stored in the storage node device. First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, delete the data pieces to be written corresponding to those fingerprints. Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, take those fingerprints as fingerprints to be processed and look up each fingerprint to be processed in the shared fingerprint library, the shared fingerprint library including the fingerprints of the data pieces already stored in all storage node devices, and when one or more fingerprints to be processed are found in the shared fingerprint library, delete the data pieces to be written corresponding to the found fingerprints.
Compared with the prior art, when a storage node device in this embodiment performs data deduplication and does not find the fingerprint of a data piece to be written in the local fingerprint library, it can query the shared fingerprint library directly to determine whether the fingerprint is a duplicate, without querying the other storage node devices one by one. This improves the data deduplication efficiency of the distributed storage system.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from the structures shown in these drawings without creative effort.
FIG. 1 is a schematic diagram of the system architecture of an embodiment of the distributed storage system of the present application;
FIG. 2 is a schematic diagram of the operating environment of an embodiment of the data deduplication program of the present application;
FIG. 3 is a program module diagram of an embodiment of the data deduplication program of the present application;
FIG. 4 is a schematic flowchart of an embodiment of the data deduplication method of the present application.
The realization of the purpose, the functional characteristics and the advantages of the present application are further described below in conjunction with the embodiments and with reference to the drawings.
Detailed description
The principles and features of the present application are described below in conjunction with the drawings. The examples given are only used to explain the present application and are not intended to limit its scope.
FIG. 1 is a system architecture diagram of an embodiment of the distributed storage system of the present application.
In this embodiment, the distributed storage system includes multiple storage node devices 1 and several shared fingerprint libraries 2. The storage node devices 1 and the shared fingerprint libraries 2 are communicatively connected (for example, through a network 4). A local fingerprint library 3 is provided in each storage node device 1, or the storage node device 1 is communicatively connected to a corresponding local fingerprint library 3. The local fingerprint library 3 includes the fingerprints of the data pieces already stored in the corresponding storage node device 1, and the shared fingerprint library 2 includes the fingerprints of the data pieces already stored in all storage node devices 1.
The storage node device 1 is used to:
receive a data piece write request, the data piece write request including a plurality of data pieces to be written and the fingerprint of each data piece to be written;
determine the fingerprints to be deduplicated among the fingerprints of the data pieces to be written, and look up each fingerprint to be deduplicated in the local fingerprint library 3;
when one or more fingerprints to be deduplicated exist in the local fingerprint library 3, delete the data pieces to be written corresponding to those fingerprints;
when one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, take those fingerprints as fingerprints to be processed and look up each fingerprint to be processed in the shared fingerprint library 2, and when one or more fingerprints to be processed are found in the shared fingerprint library 2, delete the data pieces to be written corresponding to the found fingerprints.
In this embodiment, the storage node device 1 receives a data piece write request, which includes a plurality of data pieces to be written and the fingerprint of each data piece to be written. The data pieces to be written are obtained by splitting the data to be written (whose data type includes block-level data and file-level data). The splitting may be performed by the storage node device 1 or by any other suitable device (for example, a client), and the splitting method includes:
Splitting the data file to be written into a preset number of data pieces of the same size. Alternatively, denoting the preset number as M, where M is a natural number greater than 1, determining the data piece size for the data file to be written, splitting off M-1 data pieces of that size one by one, and taking the remainder as the M-th data piece. The size of a data piece to be written may be 4 KB, 8 KB, 12 KB, 16 KB, or another granularity.
After the data to be written is split into data pieces, the fingerprint of each data piece to be written is calculated, for example with Message-Digest Algorithm 5 (MD5) or a Secure Hash Algorithm (SHA1). At the same time, the order of the data pieces to be written is recorded (that is, the data piece fingerprint sequence), so that when the data is later read, the data pieces can be assembled back into the original data according to this sequence. In addition, the storage node device 1 can save the data piece fingerprint sequence to the local fingerprint library 3 and the shared fingerprint library 2.
Next, the storage node device 1 determines the fingerprints to be deduplicated among the fingerprints of the data pieces to be written. The method for determining the fingerprints to be deduplicated includes: judging whether redundant fingerprints exist among the fingerprints of the data pieces to be written; if so, deleting the redundant fingerprints and using the remaining fingerprints as the fingerprints to be deduplicated; if not, using all fingerprints of the data pieces to be written as the fingerprints to be deduplicated.
For example, the storage node device 1 determines whether identical fingerprints exist among all fingerprints of the data pieces to be written. If identical fingerprints exist, each set of identical fingerprints is treated as a fingerprint group. After all fingerprint groups are found, one fingerprint in each group is selected and retained, and the unselected fingerprints are deleted as redundant fingerprints. It is then determined whether any ungrouped fingerprints exist. If so, each ungrouped fingerprint is taken as a fingerprint to be deduplicated; if not, the process ends. If no identical fingerprints exist, all fingerprints of the data pieces to be written are regarded as ungrouped fingerprints, and each ungrouped fingerprint is taken as a fingerprint to be deduplicated.
After the fingerprints to be deduplicated have been identified, the storage node device 1 looks up each of them in the local fingerprint library 3. When one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the data pieces to be written that correspond to those fingerprints are duplicate data pieces, and the storage node device 1 deletes them to save storage space.
Finally, when one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, the storage node device 1 takes those fingerprints as fingerprints to be processed and looks up each of them in the shared fingerprint library 2. When one or more fingerprints to be processed are found in the shared fingerprint library 2, the data pieces to be written that correspond to the found fingerprints are deleted.
Because the shared fingerprint library 2 holds the full set of fingerprint data, if the storage node device 1 does not find a fingerprint to be deduplicated in the local fingerprint library 3, it continues by querying the shared fingerprint library 2 for that fingerprint. If the fingerprint exists there, the corresponding data piece to be written already exists on another storage node and is a duplicate, so it does not need to be stored again.
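The two-level lookup described above can be sketched as follows. The libraries are modelled as plain Python sets purely for illustration; how the local fingerprint library 3 and the shared fingerprint library 2 are actually stored and queried is not specified here, and `classify_fingerprints` is a hypothetical name.

```python
def classify_fingerprints(
    candidates: list[str],
    local_library: set[str],
    shared_library: set[str],
) -> tuple[list[str], list[str], list[str]]:
    """Split candidates into local duplicates, shared duplicates, and new
    fingerprints. The shared library is consulted only for fingerprints
    that missed in the local library."""
    local_hits = [fp for fp in candidates if fp in local_library]
    pending = [fp for fp in candidates if fp not in local_library]
    shared_hits = [fp for fp in pending if fp in shared_library]
    new = [fp for fp in pending if fp not in shared_library]
    return local_hits, shared_hits, new


local_hits, shared_hits, new = classify_fingerprints(
    ["a", "b", "c"],
    local_library={"a"},
    shared_library={"a", "b"},
)
```

Data pieces whose fingerprints fall in the first two lists are dropped as duplicates; only pieces in the last list still need to be stored.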
Compared with the prior art, when a storage node device 1 of this embodiment performs data deduplication and does not find the fingerprint of a data piece to be written in the local fingerprint library 3, it can query the shared fingerprint library 2 directly to determine whether the fingerprint is a duplicate, without querying the other storage node devices 1 one by one. This improves the data deduplication efficiency of the distributed storage system.
Further, in this embodiment, the storage node device 1 is also used to:
save all remaining data pieces to be written (that is, the non-duplicate data pieces), and save the storage location information of the remaining data pieces to be written to the local fingerprint library 3 and the shared fingerprint library 2.
Further, in this embodiment, the distributed storage system also includes a control node device 5, which is communicatively connected to the storage node devices 1 and to the shared fingerprint library 2 (for example, through the network 4). The shared fingerprint library 2 may be placed on a shared disk (such as an NVMe disk mounted through NVMe-oF); the shared disk may be located in the control node device 5 or be set up independently of it.
The storage node device 1 is also used to:
determine the reference count change value of the fingerprint of each data piece to be written (for example, determine that the reference count change value of each fingerprint to be deduplicated is +1), and send these reference count change values to the control node device 5.
The control node device 5 is used to:
update the cumulative reference count of the fingerprint of each data piece to be written according to its reference count change value (the cumulative reference count of a fingerprint is the total number of times the data piece corresponding to that fingerprint is referenced by stored data).
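The reference count bookkeeping on the write path can be sketched as below. The class `ControlNode`, the helper `write_changes`, and the dictionary used to hold the counts are all assumptions made for illustration; the message format between a storage node device and the control node device is not specified here.

```python
class ControlNode:
    """Hypothetical stand-in for the control node device's counters."""

    def __init__(self) -> None:
        self.cumulative_counts: dict[str, int] = {}

    def apply_changes(self, changes: dict[str, int]) -> None:
        """Add each reported change value to the cumulative reference count."""
        for fp, delta in changes.items():
            self.cumulative_counts[fp] = self.cumulative_counts.get(fp, 0) + delta


def write_changes(fingerprints: list[str]) -> dict[str, int]:
    """Change values a storage node would report for a write: +1 each."""
    return {fp: 1 for fp in fingerprints}


node = ControlNode()
node.apply_changes(write_changes(["a", "b"]))  # first write references a and b
node.apply_changes(write_changes(["a"]))       # second write references a again
```

Sending change values rather than absolute counts keeps the storage nodes stateless with respect to the totals; only the control node holds the cumulative count.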
Further, in this embodiment, the storage node device 1 is also used to:
receive a deletion request for data to be deleted, obtain the data piece fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained sequence (for example, determine that the reference count change value of every fingerprint in the sequence is -1), and send these reference count change values to the control node device 5.
The control node device 5 is also used to:
update the cumulative reference count of each fingerprint in the data piece fingerprint sequence according to its reference count change value, delete the data piece fingerprint sequence of the data to be deleted from the shared fingerprint library 2, and notify the storage node device 1 to delete that sequence from the local fingerprint library 3.
In this embodiment, when data is to be deleted, the storage node device 1 must not delete the data pieces of that data directly, because it cannot determine whether those data pieces are also referenced by other data; deleting them directly could cause data loss. Therefore, only the cumulative reference count of each fingerprint in the data piece fingerprint sequence of the data to be deleted is updated, and the data piece fingerprint sequence itself is deleted.
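A deletion that touches only metadata can be sketched as follows. The dictionaries `sequences`, `ref_counts`, and `stored_pieces` and the function `delete_data` are hypothetical. The sketch also assumes one -1 per distinct fingerprint in the sequence, mirroring the +1 per fingerprint to be deduplicated on write; the text does not say how a fingerprint that appears several times in one sequence is counted.

```python
def delete_data(
    name: str,
    sequences: dict[str, list[str]],
    ref_counts: dict[str, int],
) -> list[str]:
    """Remove the fingerprint sequence of `name` and decrement the
    cumulative reference counts. Stored data pieces are not touched."""
    sequence = sequences.pop(name)
    for fp in dict.fromkeys(sequence):  # distinct fingerprints, in order
        ref_counts[fp] -= 1
    return sequence


stored_pieces = {"a": b"alpha", "b": b"beta"}
sequences = {"file1": ["a", "b", "a"], "file2": ["a"]}
ref_counts = {"a": 2, "b": 1}

delete_data("file1", sequences, ref_counts)
```

After the call, piece "a" is still referenced by "file2" and must stay on disk, which is exactly the case that makes direct deletion of data pieces unsafe.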
Further, in this embodiment, the control node device 5 is also used to:
when it detects in the shared fingerprint library 2 that the cumulative reference count of a fingerprint is zero (that is, the data piece corresponding to the fingerprint is not referenced by any data), record how long the fingerprint remains in the state of having a cumulative reference count of zero.
When that duration is greater than a preset duration, the control node device 5 deletes the fingerprint and notifies the corresponding storage node device 1 to delete the fingerprint and the data piece corresponding to it.
When that duration is less than or equal to the preset duration, no deletion is performed.
In this embodiment, when the control node device 5 detects that the cumulative reference count of a fingerprint is zero, it waits for a preset duration before deleting the data piece corresponding to that fingerprint, and during the preset duration it continues to receive, in real time, the reference count change values for that fingerprint reported by the storage node devices 1. This avoids erroneous deletion of data caused by a storage node device 1 failing to report a reference count change value in time.
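The delayed deletion rule can be sketched as below. The class `ZeroCountTracker`, its method `observe`, and the use of caller-supplied timestamps in seconds are assumptions made for illustration; how the control node device measures the duration is not specified here.

```python
class ZeroCountTracker:
    """Tracks how long each fingerprint has had a cumulative count of zero."""

    def __init__(self, preset_seconds: float) -> None:
        self.preset_seconds = preset_seconds
        self.zero_since: dict[str, float] = {}

    def observe(self, fp: str, count: int, now: float) -> bool:
        """Record one observation; return True when fp may be deleted."""
        if count != 0:
            # A late +1 arrived: leave the zero state and reset the timer.
            self.zero_since.pop(fp, None)
            return False
        started = self.zero_since.setdefault(fp, now)
        # Delete only when the duration is strictly greater than the preset.
        return now - started > self.preset_seconds


tracker = ZeroCountTracker(preset_seconds=10.0)
results = [
    tracker.observe("a", 0, now=0.0),   # zero state begins
    tracker.observe("a", 0, now=10.0),  # equal to preset: no deletion
    tracker.observe("a", 0, now=10.5),  # greater than preset: deletable
    tracker.observe("a", 1, now=11.0),  # referenced again
    tracker.observe("a", 0, now=12.0),  # timer restarts from here
]
```

The reset on a non-zero count is what lets a late report from a storage node rescue a data piece that would otherwise have been removed.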
Further, in this embodiment, the storage node device 1 is also used to:
when a read request for data to be read is received, obtain the data piece fingerprint sequence of that data, obtain the storage location information of the data piece corresponding to each fingerprint in the sequence, fetch those data pieces according to the obtained storage location information, and assemble the fetched data pieces into the data to be read according to the data piece fingerprint sequence.
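The read path can be sketched as follows. The mapping from fingerprint to storage location and the `(node, offset)` tuples used as locations are hypothetical stand-ins; the actual form of the storage location information is not specified here.

```python
def read_data(
    sequence: list[str],
    locations: dict[str, tuple[str, int]],
    storage: dict[tuple[str, int], bytes],
) -> bytes:
    """Fetch the piece for each fingerprint in sequence order and
    concatenate the pieces into the original data."""
    return b"".join(storage[locations[fp]] for fp in sequence)


# Hypothetical layout: piece "a" is on node1, piece "b" on node2.
locations = {"a": ("node1", 0), "b": ("node2", 0)}
storage = {("node1", 0): b"alpha", ("node2", 0): b"beta"}

restored = read_data(["a", "b", "a"], locations, storage)
```

The fingerprint "a" appears twice in the sequence but is stored once, so a single stored piece serves both positions during reassembly.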
This application proposes a data deduplication program.
Please refer to FIG. 2, which is a schematic diagram of the operating environment of an embodiment of the data deduplication program 10 of this application.
In this embodiment, the data deduplication program 10 is installed and runs in the storage node device 1. The storage node device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a server. The storage node device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13. FIG. 2 shows only the storage node device 1 with components 11-13, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
In some embodiments, the memory 11 may be an internal storage unit of the storage node device 1, such as its hard disk or internal memory. In other embodiments, the memory 11 may be an external storage device of the storage node device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the storage node device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the storage node device 1. The memory 11 is used to store the application software installed on the storage node device 1 and various types of data, such as the program code of the data deduplication program 10. The memory 11 may also be used to temporarily store data that has been output or is about to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code or process the data stored in the memory 11, for example to execute the data deduplication program 10.
In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display 13 is used to display information processed in the storage node device 1 and to display a visual user interface. The components 11-13 of the storage node device 1 communicate with each other through a program bus.
Please refer to FIG. 3, which is a program module diagram of an embodiment of the data deduplication program 10 of this application. In this embodiment, the data deduplication program 10 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out this application. For example, in FIG. 3, the data deduplication program 10 may be divided into a receiving module 101, a preprocessing module 102, a query module 103, a first deduplication module 104, and a second deduplication module 105. A module in this application is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program as a whole for describing the execution of the data deduplication program 10 in the storage node device 1, wherein:
The receiving module 101 is used to receive a data piece write request, the data piece write request including several data pieces to be written and the fingerprint of each data piece to be written.
The data pieces to be written are obtained by dividing the data to be written (the data types of the data to be written include block-level data and file-level data). The division may be performed by the receiving module 101 or by any other suitable device (for example, a client), and the division method includes:
dividing the data file to be written into a preset number of data pieces of equal size. Alternatively, denoting the preset number as M, when M is a natural number greater than 1, determining the data piece size for dividing the data file to be written, cutting off M-1 data blocks of that size one by one, and taking what remains after the cutting as the M-th data block. The size of a data piece to be written may be 4 KB, 8 KB, 12 KB, 16 KB, or another granularity.
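The second division method (M-1 blocks of one size, with the remainder as the M-th block) can be sketched as below. The function name `split_into_pieces` is hypothetical, and the 4 KB default is just one of the granularities listed above.

```python
def split_into_pieces(data: bytes, piece_size: int = 4096) -> list[bytes]:
    """Cut `data` into consecutive pieces of `piece_size` bytes; the last
    piece holds whatever remains and may be shorter."""
    if piece_size <= 0:
        raise ValueError("piece_size must be positive")
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


# With a piece size of 4 bytes, 10 bytes of data give M = 3 pieces:
# two full pieces and a 2-byte remainder.
example_pieces = split_into_pieces(b"abcdefghij", piece_size=4)
```

Concatenating the pieces in order reproduces the input exactly, which is what allows the data to be reassembled from its pieces later.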
After the data to be written has been divided into several data pieces to be written, a fingerprint is calculated for each data piece, for example with Message-Digest Algorithm 5 (MD5) or the Secure Hash Algorithm (SHA1). At the same time, the order of the data pieces to be written is recorded (that is, the data piece fingerprint sequence), so that when the data is read later, the data pieces can be assembled back into the data according to that fingerprint sequence. In addition, the data piece fingerprint sequence may be saved to the local fingerprint library 3 and the shared fingerprint library 2.
The preprocessing module 102 is used to determine the fingerprints to be deduplicated among the fingerprints of the data pieces to be written.
The method by which the preprocessing module 102 determines the fingerprints to be deduplicated includes:
judging whether the fingerprints of the data pieces to be written contain redundant fingerprints; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking all fingerprints of the data pieces to be written as the fingerprints to be deduplicated.
For example, it is judged whether identical fingerprints exist among the fingerprints of all data pieces to be written. If identical fingerprints exist, the identical fingerprints are taken as one fingerprint group; after all fingerprint groups have been found, one fingerprint is selected and retained in each fingerprint group, and the unselected fingerprints are deleted as redundant fingerprints. It is then judged whether ungrouped fingerprints exist; if so, each ungrouped fingerprint is taken as a fingerprint to be deduplicated, and if not, the process ends. If no identical fingerprints exist, all fingerprints of the data pieces to be written are treated as ungrouped fingerprints, and each ungrouped fingerprint is taken as a fingerprint to be deduplicated.
The query module 103 is used to look up each fingerprint to be deduplicated in the local fingerprint library 3.
The first deduplication module 104 is used to delete, when one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the data pieces to be written that correspond to those fingerprints.
When one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the data pieces to be written that correspond to those fingerprints are duplicate data pieces, and they are deleted to save storage space.
The second deduplication module 105 is used to take, when one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, those fingerprints as fingerprints to be processed, look up each fingerprint to be processed in the shared fingerprint library 2, and, when one or more fingerprints to be processed are found in the shared fingerprint library 2, delete the data pieces to be written that correspond to the found fingerprints.
Because the shared fingerprint library 2 holds the full set of fingerprint data, if the first deduplication module 104 does not find a fingerprint to be deduplicated in the local fingerprint library 3, the second deduplication module 105 continues by querying the shared fingerprint library 2 for that fingerprint. If the fingerprint exists there, the corresponding data piece to be written already exists on another storage node and is a duplicate, so it does not need to be stored again.
Compared with the prior art, when a storage node device 1 of this embodiment performs data deduplication and does not find the fingerprint of a data piece to be written in the local fingerprint library 3, it can query the shared fingerprint library 2 directly to determine whether the fingerprint is a duplicate, without querying the other storage node devices 1 one by one. This improves the data deduplication efficiency of the distributed storage system.
Further, in this embodiment, the data deduplication program 10 also includes a storage module (not shown in the figure), used to:
save all remaining data pieces to be written (that is, the non-duplicate data pieces), and save the storage location information of the remaining data pieces to be written to the local fingerprint library 3 and the shared fingerprint library 2.
Further, in this embodiment, the data deduplication program 10 also includes a reference update module (not shown in the figure), used to:
determine the reference count change value of the fingerprint of each data piece to be written (for example, determine that the reference count change value of each fingerprint to be deduplicated is +1), and send these reference count change values to the control node device 5, so that the control node device 5 updates the cumulative reference count of the fingerprint of each data piece to be written according to its reference count change value (the cumulative reference count of a fingerprint is the total number of times the data piece corresponding to that fingerprint is referenced by stored data).
Further, in this embodiment, the data deduplication program 10 also includes a deletion module (not shown in the figure), used to:
receive a deletion request for data to be deleted, obtain the data piece fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained sequence (for example, determine that the reference count change value of every fingerprint in the sequence is -1), and send these reference count change values to the control node device 5, so that the control node device 5 updates the cumulative reference count of each fingerprint in the sequence according to its reference count change value, deletes the data piece fingerprint sequence of the data to be deleted from the shared fingerprint library 2, and notifies the storage node device 1 to delete that sequence from the local fingerprint library 3.
In this embodiment, when data is to be deleted, the storage node device 1 must not delete the data pieces of that data directly, because it cannot determine whether those data pieces are also referenced by other data; deleting them directly could cause data loss. Therefore, only the cumulative reference count of each fingerprint in the data piece fingerprint sequence of the data to be deleted is updated, and the data piece fingerprint sequence itself is deleted.
When it detects in the shared fingerprint library 2 that the cumulative reference count of a fingerprint is zero (that is, the data piece corresponding to the fingerprint is not referenced by any data), the control node device 5 records how long the fingerprint remains in the state of having a cumulative reference count of zero. When that duration is greater than a preset duration, the control node device 5 deletes the fingerprint and notifies the corresponding storage node device 1 to delete the fingerprint and the data piece corresponding to it. When that duration is less than or equal to the preset duration, no deletion is performed.
In this embodiment, when the control node device 5 detects that the cumulative reference count of a fingerprint is zero, it waits for a preset duration before deleting the data piece corresponding to that fingerprint, and during the preset duration it continues to receive, in real time, the reference count change values for that fingerprint reported by the storage node devices 1. This avoids erroneous deletion of data caused by a storage node device 1 failing to report a reference count change value in time.
Further, in this embodiment, the data deduplication program 10 also includes a reading module (not shown in the figure), used to:
when a read request for data to be read is received, obtain the data piece fingerprint sequence of that data, obtain the storage location information of the data piece corresponding to each fingerprint in the sequence, fetch those data pieces according to the obtained storage location information, and assemble the fetched data pieces into the data to be read according to the data piece fingerprint sequence.
In addition, this application proposes a data deduplication method. The method is applicable to the distributed storage system described above.
FIG. 4 is a schematic flowchart of an embodiment of the data deduplication method of this application.
In this embodiment, the method includes:
Step S10: the storage node device 1 receives a data piece write request, the data piece write request including several data pieces to be written and the fingerprint of each data piece to be written.
The data pieces to be written are obtained by dividing the data to be written (the data types of the data to be written include block-level data and file-level data). The division may be performed by the storage node device 1 or by any other suitable device (for example, a client), and the division method includes:
dividing the data file to be written into a preset number of data pieces of equal size. Alternatively, denoting the preset number as M, when M is a natural number greater than 1, determining the data piece size for dividing the data file to be written, cutting off M-1 data blocks of that size one by one, and taking what remains after the cutting as the M-th data block. The size of a data piece to be written may be 4 KB, 8 KB, 12 KB, 16 KB, or another granularity.
After the data to be written has been divided into several data pieces to be written, a fingerprint is calculated for each data piece, for example with Message-Digest Algorithm 5 (MD5) or the Secure Hash Algorithm (SHA1). At the same time, the order of the data pieces to be written is recorded (that is, the data piece fingerprint sequence), so that when the data is read later, the data pieces can be assembled back into the data according to that fingerprint sequence. In addition, the data piece fingerprint sequence may be saved to the local fingerprint library 3 and the shared fingerprint library 2.
Step S20: the storage node device 1 determines the fingerprints to be deduplicated among the fingerprints of the data pieces to be written.
The method for determining the fingerprints to be deduplicated includes: judging whether the fingerprints of the data pieces to be written contain redundant fingerprints; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking all fingerprints of the data pieces to be written as the fingerprints to be deduplicated.
For example, step S20 includes steps S21 to S26 (not shown in the figure), wherein:
Step S21: judge whether identical fingerprints exist among the fingerprints of all data pieces to be written.
Step S22: if identical fingerprints exist, take the identical fingerprints as one fingerprint group; after all fingerprint groups have been found, select and retain one fingerprint in each fingerprint group, and delete the unselected fingerprints as redundant fingerprints. Then judge whether ungrouped fingerprints exist; if so, take each ungrouped fingerprint as a fingerprint to be deduplicated, and if not, end the process.
Step S23: if no identical fingerprints exist, treat all fingerprints of the data pieces to be written as ungrouped fingerprints, and take each ungrouped fingerprint as a fingerprint to be deduplicated.
步骤S30,存储节点设备1于本地指纹库3中查找各个待去重指纹是否存在。In step S30, the storage node device 1 searches the local fingerprint library 3 for the existence of each fingerprint to be deduplicated.
步骤S40,当有一个或多个待去重指纹存在于所述本地指纹库3时,存储节点设备1将所述一个或多个待去重指纹对应的待写入数据片删除。In step S40, when one or more fingerprints to be deduplicated exist in the local fingerprint database 3, the storage node device 1 deletes the piece of data to be written corresponding to the one or more fingerprints to be deduplicated.
当有一个或多个待去重指纹存在于所述本地指纹库3时,代表这些待去 重指纹对应的待写入数据片是重复的数据片,为了节省存储空间,将这些重复的数据片删除。When one or more fingerprints to be deduplicated exist in the local fingerprint database 3, it means that the data pieces to be written corresponding to these fingerprints to be deduplicated are duplicate data pieces. In order to save storage space, these duplicate data pieces delete.
Step S50: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, the storage node device 1 takes the one or more fingerprints to be deduplicated as fingerprints to be processed and looks up each fingerprint to be processed in the shared fingerprint library 2. When one or more fingerprints to be processed are found in the shared fingerprint library 2, the data slices to be written that correspond to the one or more fingerprints found are deleted.
Because the shared fingerprint library 2 holds the full set of fingerprint data, if the storage node device 1 does not find a fingerprint to be deduplicated in the local fingerprint library 3, it goes on to query the shared fingerprint library 2 for that fingerprint. If the fingerprint exists there, the corresponding data slice to be written is determined to already exist on another storage node and to be a duplicate data slice, so it does not need to be stored again.
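The two-level lookup of steps S30 to S50 can be sketched as follows. This is an illustrative example only: the function name `deduplicate` is hypothetical, and plain Python sets stand in for the local fingerprint library 3 and the shared fingerprint library 2, whose actual storage form is not specified here.

```python
def deduplicate(pending, local_library, shared_library):
    """Return the data slices that still have to be stored.

    pending        - dict mapping fingerprint -> data slice to be written
    local_library  - set of fingerprints stored on this node (assumed form)
    shared_library - set of fingerprints stored on all nodes (assumed form)
    """
    remaining = {}
    for fp, data_slice in pending.items():
        if fp in local_library:
            # Step S40: duplicate of a slice already on this node.
            continue
        if fp in shared_library:
            # Step S50: duplicate of a slice held by another node.
            continue
        remaining[fp] = data_slice
    return remaining
```

The shared library is consulted only for fingerprints that miss locally, so no per-node queries to other storage node devices are needed.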
Compared with the prior art, in this embodiment, when a storage node device 1 performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint library 3, it can query the shared fingerprint library 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other storage node devices 1 one by one. This improves the data deduplication efficiency of the distributed storage system.
Further, in this embodiment, after step S60, the method further includes:
The storage node device 1 saves all remaining data slices to be written (that is, the non-duplicate data slices) and saves the storage location information of the remaining data slices to be written to the local fingerprint library 3 and the shared fingerprint library 2.
Further, in this embodiment, the distributed storage system further includes a control node device 5 communicatively connected to each storage node device 1 and to the shared fingerprint library 2. After step S20, the method further includes:
The storage node device 1 determines the reference count change value of the fingerprint of each data slice to be written (for example, it determines that the reference count change value of each fingerprint to be deduplicated is +1) and sends these reference count change values to the control node device 5.
The control node device 5 then updates the cumulative reference count of the fingerprint of each data slice to be written according to the corresponding reference count change value (the cumulative reference count of a fingerprint is the total number of times the data slice corresponding to that fingerprint is referenced by stored data).
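The reference count bookkeeping described above can be sketched as follows. The class `ControlNode` and the helper `write_change_values` are hypothetical names, and passing the change values as an in-memory dictionary is an assumption; the text does not specify the message format used between the storage node device 1 and the control node device 5.

```python
from collections import Counter


def write_change_values(fingerprints_to_deduplicate):
    """Storage node side: assign +1 to each fingerprint to be deduplicated."""
    return {fp: 1 for fp in fingerprints_to_deduplicate}


class ControlNode:
    """Control node side: keeps the cumulative reference counts."""

    def __init__(self):
        self.cumulative = Counter()

    def apply_changes(self, change_values):
        # Add each reported change value to the running total.
        for fp, delta in change_values.items():
            self.cumulative[fp] += delta


# Usage: two writes that both reference fingerprint "f2".
node = ControlNode()
node.apply_changes(write_change_values(["f1", "f2"]))
node.apply_changes(write_change_values(["f2", "f3"]))
```

After these two calls the cumulative count of `"f2"` is 2, reflecting that two stored data objects reference the same data slice.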
Further, in this embodiment, the method further includes steps S60 to S80 (not shown in the figures), in which:
Step S60: the storage node device 1 receives a deletion request for data to be deleted.
Step S70: the storage node device 1 obtains the data slice fingerprint sequence of the data to be deleted, determines the reference count change value of each fingerprint in the obtained data slice fingerprint sequence (for example, it determines that the reference count change value of each fingerprint in the sequence is -1), and sends the reference count change values of the fingerprints in the data slice fingerprint sequence to the control node device 5.
Step S80: the control node device 5 updates the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the corresponding reference count change value, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library 2, and notifies the storage node device 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library 3.
In this embodiment, when data is to be deleted, the storage node device 1 cannot directly delete the data slices of that data, because it cannot determine whether those data slices are also referenced by other data; deleting them directly could cause data loss. Therefore, only the cumulative reference count of each fingerprint in the data slice fingerprint sequence of the data to be deleted is updated, and the data slice fingerprint sequence itself is deleted.
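A minimal sketch of steps S60 to S80 is given below. It is illustrative only: the function name `handle_delete` and the dictionaries `sequences` and `ref_counts` are hypothetical stand-ins for the fingerprint libraries and for the counts kept by the control node device 5. The point shown is that a delete request changes counts and removes the fingerprint sequence, but leaves the data slices themselves untouched.

```python
def handle_delete(data_id, sequences, ref_counts):
    """Process a delete request for one stored data object.

    sequences  - dict mapping data id -> data slice fingerprint sequence
    ref_counts - dict mapping fingerprint -> cumulative reference count
    Returns the removed fingerprint sequence.
    """
    # Step S70: obtain the sequence and apply a change value of -1 each.
    sequence = sequences.pop(data_id)  # Step S80: drop the sequence
    for fp in sequence:
        ref_counts[fp] -= 1
    # No data slice is deleted here; a slice may still be referenced.
    return sequence


# Usage: objects "A" and "B" share the slice with fingerprint "f2".
sequences = {"A": ["f1", "f2"], "B": ["f2", "f3"]}
ref_counts = {"f1": 1, "f2": 2, "f3": 1}
removed = handle_delete("A", sequences, ref_counts)
```

After deleting `"A"`, the count of `"f2"` is still 1 because `"B"` references it, which is why the slice cannot be removed at this point.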
Further, in this embodiment, the method further includes:
When the cumulative reference count of a fingerprint in the shared fingerprint library 2 is detected to be zero (that is, the data slice corresponding to the fingerprint is not referenced by any data), the control node device 5 records how long the fingerprint remains in the state of having a cumulative reference count of zero.
When this duration exceeds a preset duration, the fingerprint is deleted, and the corresponding storage node device 1 is notified to delete the fingerprint and the data slice corresponding to the fingerprint.
When this duration is less than or equal to the preset duration, no deletion is performed.
In this embodiment, when the control node device 5 detects that the cumulative reference count of a fingerprint is zero, it waits for the preset duration before deleting the data slice corresponding to the fingerprint, and during that time it continues to receive, in real time, the reference count change values for that fingerprint reported by the storage node devices 1. This avoids erroneous deletion of data caused by a storage node device 1 failing to report a reference count change value in time.
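The delayed deletion described above can be sketched as follows. This is an illustrative example under stated assumptions: the class name `ZeroCountTracker` is hypothetical, timestamps are passed in explicitly as plain numbers, and the value of the preset duration is left to the caller because the text does not specify it.

```python
class ZeroCountTracker:
    """Tracks how long each fingerprint has had a zero reference count."""

    def __init__(self, preset_duration):
        self.preset_duration = preset_duration
        self.zero_since = {}  # fingerprint -> time the count reached zero

    def observe(self, fp, cumulative_count, now):
        if cumulative_count == 0:
            # Keep the earliest time at which the count was seen as zero.
            self.zero_since.setdefault(fp, now)
        else:
            # A late report raised the count again: stop the timer.
            self.zero_since.pop(fp, None)

    def expired(self, now):
        """Fingerprints whose zero state lasted longer than the preset duration."""
        return [fp for fp, since in self.zero_since.items()
                if now - since > self.preset_duration]
```

Only fingerprints returned by `expired` would be deleted together with their data slices; a duration equal to the preset duration does not qualify, matching the "less than or equal to" case above.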
Further, in this embodiment, the method further includes step S90 (not shown in the figures).
Step S90: upon receiving a read request for data to be read, the storage node device 1 obtains the data slice fingerprint sequence of the data to be read, obtains the storage location information of the data slice corresponding to each fingerprint in the sequence, retrieves those data slices according to the obtained storage location information, and then assembles the retrieved data slices into the data to be read in the order given by the data slice fingerprint sequence.
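The read path of step S90 can be sketched as follows. The function name `read_data`, the dictionaries `sequences` and `locations`, and the `fetch` callback are hypothetical; they only stand in for the fingerprint libraries and for whatever mechanism retrieves a data slice from a storage location.

```python
def read_data(data_id, sequences, locations, fetch):
    """Reassemble stored data from its data slice fingerprint sequence.

    sequences - dict mapping data id -> ordered list of fingerprints
    locations - dict mapping fingerprint -> storage location information
    fetch     - callable returning the data slice (bytes) at a location
    """
    sequence = sequences[data_id]
    # Slices are fetched and concatenated in sequence order; a fingerprint
    # that appears several times is fetched once per occurrence.
    return b"".join(fetch(locations[fp]) for fp in sequence)


# Usage with an in-memory store standing in for the storage nodes.
store = {"node1:0": b"he", "node2:7": b"l", "node1:1": b"o"}
sequences = {"greeting": ["f1", "f2", "f2", "f3"]}
locations = {"f1": "node1:0", "f2": "node2:7", "f3": "node1:1"}
data = read_data("greeting", sequences, locations, store.__getitem__)
```

Here the slice for `"f2"` is stored once but referenced twice, so the assembled result contains it twice.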
Further, the present application also provides a computer-readable storage medium storing a data deduplication program 10. Embodiments of the data deduplication program 10 have been described in detail above and are not repeated here.
The above are merely preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structural transformation made using the contents of the specification and drawings of the present application under its inventive concept, or any direct or indirect application in other related technical fields, falls within the scope of patent protection of the present application.

Claims (20)

  1. A distributed storage system, characterized in that the distributed storage system comprises a plurality of storage node devices and a number of shared fingerprint libraries, the storage node devices are communicatively connected to the shared fingerprint library, and each storage node device is provided with a local fingerprint library or is communicatively connected to a corresponding local fingerprint library, the storage node device being configured to:
    receive a data slice write request, the data slice write request comprising a number of data slices to be written and the fingerprint of each data slice to be written;
    determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and look up in the local fingerprint library whether each fingerprint to be deduplicated exists, the local fingerprint library comprising the fingerprints of the data slices already stored in the storage node device;
    when one or more fingerprints to be deduplicated exist in the local fingerprint library, delete the data slices to be written that correspond to the one or more fingerprints to be deduplicated;
    when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, take the one or more fingerprints to be deduplicated as fingerprints to be processed and look up each fingerprint to be processed in the shared fingerprint library, the shared database comprising the fingerprints of the data slices already stored in all storage node devices, and, when one or more fingerprints to be processed are found in the shared fingerprint library, delete the data slices to be written that correspond to the one or more fingerprints to be processed that were found.
  2. The distributed storage system according to claim 1, characterized in that the data slices to be written are obtained by splitting data to be written, the data slice write request further comprises a data slice fingerprint sequence, and the data slice fingerprint sequence comprises the fingerprints of the data slices to be written arranged in order;
    determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written comprises: determining whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
    the storage node device is further configured to:
    save the data slice fingerprint sequence to the local fingerprint library and the shared fingerprint library;
    the storage node device is further configured to:
    save all remaining data slices to be written, and save the storage location information of the remaining data slices to be written to the local fingerprint library and the shared fingerprint library.
  3. The distributed storage system according to claim 1, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, the storage node device being further configured to:
    determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
    the control node device being configured to:
    update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
  4. The distributed storage system according to claim 2, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, the storage node device being further configured to:
    determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
    the control node device being configured to:
    update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
  5. The distributed storage system according to claim 2, characterized in that the storage node device is further configured to:
    receive a deletion request for data to be deleted;
    obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence, and send the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
    the control node device being further configured to:
    update the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, delete the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notify the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
    when the cumulative reference count of a fingerprint in the shared fingerprint library is detected to be zero, record how long the fingerprint remains in the state of having a cumulative reference count of zero, and, when this duration exceeds a preset duration, delete the fingerprint and notify the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
  6. A data deduplication method applicable to a distributed storage system, characterized in that the distributed storage system comprises a plurality of storage node devices and a number of shared fingerprint libraries, the storage node devices are communicatively connected to the shared fingerprint library, and each storage node device is provided with a local fingerprint library or is communicatively connected to a corresponding local fingerprint library, the method comprising the steps of:
    a receiving step: the storage node device receives a data slice write request, the data slice write request comprising a number of data slices to be written and the fingerprint of each data slice to be written;
    a query step: the storage node device determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looks up in the local fingerprint library whether each fingerprint to be deduplicated exists, the local fingerprint library comprising the fingerprints of the data slices already stored in the storage node device;
    a first deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, the storage node device deletes the data slices to be written that correspond to the one or more fingerprints to be deduplicated;
    a second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, the storage node device takes the one or more fingerprints to be deduplicated as fingerprints to be processed and looks up each fingerprint to be processed in the shared fingerprint library, the shared database comprising the fingerprints of the data slices already stored in all storage node devices, and, when one or more fingerprints to be processed are found in the shared fingerprint library, deletes the data slices to be written that correspond to the one or more fingerprints to be processed that were found.
  7. The data deduplication method according to claim 6, characterized in that the data slices to be written are obtained by splitting data to be written, the data slice write request further comprises a data slice fingerprint sequence, and the data slice fingerprint sequence comprises the fingerprints of the data slices to be written arranged in order;
    determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written comprises: determining whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
    the storage node device is further configured to:
    save the data slice fingerprint sequence to the local fingerprint library and the shared fingerprint library;
    the storage node device is further configured to:
    save all remaining data slices to be written, and save the storage location information of the remaining data slices to be written to the local fingerprint library and the shared fingerprint library.
  8. The data deduplication method according to claim 6, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, the storage node device being further configured to:
    determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
    the control node device being configured to:
    update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
  9. The data deduplication method according to claim 7, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, the storage node device being further configured to:
    determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
    the control node device being configured to:
    update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
  10. The data deduplication method according to claim 7, characterized in that the storage node device is further configured to:
    receive a deletion request for data to be deleted;
    obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence, and send the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
    the control node device being further configured to:
    update the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, delete the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notify the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
    when the cumulative reference count of a fingerprint in the shared fingerprint library is detected to be zero, record how long the fingerprint remains in the state of having a cumulative reference count of zero, and, when this duration exceeds a preset duration, delete the fingerprint and notify the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
  11. A storage node device, characterized in that the storage node device is communicatively connected to a shared fingerprint library, and the storage node device is provided with a local fingerprint library or is communicatively connected to a corresponding local fingerprint library; the storage node device comprises a memory and a processor, the memory stores a data deduplication program, and the data deduplication program, when executed by the processor, implements the following steps:
    a receiving step: receiving a data slice write request, the data slice write request comprising a number of data slices to be written and the fingerprint of each data slice to be written;
    a query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looking up in the local fingerprint library whether each fingerprint to be deduplicated exists, the local fingerprint library comprising the fingerprints of the data slices already stored in the storage node device;
    a first deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written that correspond to the one or more fingerprints to be deduplicated;
    a second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking the one or more fingerprints to be deduplicated as fingerprints to be processed and looking up each fingerprint to be processed in the shared fingerprint library, the shared database comprising the fingerprints of the data slices already stored in all storage node devices, and, when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written that correspond to the one or more fingerprints to be processed that were found.
  12. The storage node device according to claim 11, characterized in that the data slices to be written are obtained by splitting data to be written, the data slice write request further comprises a data slice fingerprint sequence, and the data slice fingerprint sequence comprises the fingerprints of the data slices to be written arranged in order;
    determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written comprises: determining whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
    the storage node device is further configured to:
    save the data slice fingerprint sequence to the local fingerprint library and the shared fingerprint library;
    the storage node device is further configured to:
    save all remaining data slices to be written, and save the storage location information of the remaining data slices to be written to the local fingerprint library and the shared fingerprint library.
  13. The storage node device according to claim 11, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, the storage node device being further configured to:
    determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
    the control node device being configured to:
    update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
  14. The storage node device according to claim 12, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, the storage node device being further configured to:
    determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
    the control node device being configured to:
    update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
  15. The storage node device according to claim 12, characterized in that the storage node device is further configured to:
    receive a deletion request for data to be deleted;
    obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence, and send the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
    the control node device being further configured to:
    update the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, delete the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notify the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
    when the cumulative reference count of a fingerprint in the shared fingerprint library is detected to be zero, record how long the fingerprint remains in the state of having a cumulative reference count of zero, and, when this duration exceeds a preset duration, delete the fingerprint and notify the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
  16. A computer-readable storage medium, applicable to a storage node device, characterized in that the storage node device is communicatively connected to a shared fingerprint library; a local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library; and the computer-readable storage medium stores a data deduplication program executable by at least one processor to cause the at least one processor to perform the following steps:
    a receiving step: receiving a data slice write request, the data slice write request comprising a plurality of data slices to be written and the fingerprint of each data slice to be written;
    a query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looking up in the local fingerprint library whether each fingerprint to be deduplicated exists, the local fingerprint library comprising the fingerprints of the data slices already stored in the storage node device;
    a first deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written that correspond to the one or more fingerprints to be deduplicated;
    a second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking the one or more fingerprints to be deduplicated as fingerprints to be processed and looking up each fingerprint to be processed in the shared fingerprint library, the shared fingerprint library comprising the fingerprints of the data slices already stored in all storage node devices; and, when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written that correspond to the one or more fingerprints to be processed that were found.
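The receiving, query, and two deduplication steps of claim 16 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `dedup_slices`, the use of in-memory sets for the local and shared fingerprint libraries, and the dictionary keyed by fingerprint for the write request are all assumptions made for illustration.

```python
from typing import Dict, Set


def dedup_slices(
    request: Dict[str, bytes],   # hypothetical write request: fingerprint -> slice
    local_lib: Set[str],         # assumed stand-in for the local fingerprint library
    shared_lib: Set[str],        # assumed stand-in for the shared fingerprint library
) -> Dict[str, bytes]:
    """Return the slices that survive both deduplication steps."""
    remaining = dict(request)
    pending = []  # fingerprints to be processed against the shared library

    # Query step and first deduplication step: a slice whose fingerprint is
    # already in the local library is dropped without a remote lookup.
    for fp in request:
        if fp in local_lib:
            del remaining[fp]
        else:
            pending.append(fp)

    # Second deduplication step: only fingerprints missing locally are
    # looked up in the shared library; hits are dropped as well.
    for fp in pending:
        if fp in shared_lib:
            del remaining[fp]

    return remaining
```

Checking the local library first means the shared library is consulted only for fingerprints the node has not seen, which is the ordering the claim describes.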
  17. The computer-readable storage medium according to claim 16, characterized in that the data slices to be written are obtained by splitting data to be written, and the data slice write request further comprises a data slice fingerprint sequence, the data slice fingerprint sequence comprising the fingerprints of the data slices to be written arranged in order;
    the determining of the fingerprints to be deduplicated among the fingerprints of the data slices to be written comprises: determining whether the fingerprints of the data slices to be written contain redundant fingerprints; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking all the fingerprints of the data slices to be written as the fingerprints to be deduplicated;
    the storage node device is further configured to:
    save the data slice fingerprint sequence to the local fingerprint library and the shared fingerprint library;
    the storage node device is further configured to:
    save all remaining data slices to be written, and save the storage location information corresponding to the remaining data slices to be written to the local fingerprint library and the shared fingerprint library.
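As a hedged illustration of claim 17, the sketch below splits data into slices, builds the ordered data slice fingerprint sequence, and removes redundant fingerprints before deduplication. The fixed slice size, the use of SHA-256 as the fingerprint function, and both function names are assumptions; the claim does not specify a splitting scheme or a hash.

```python
import hashlib
from typing import List, Tuple


def split_and_fingerprint(data: bytes, slice_size: int = 4) -> Tuple[List[bytes], List[str]]:
    """Split data into fixed-size slices (an assumption) and fingerprint each one."""
    slices = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    # The fingerprint sequence keeps one entry per slice, in slice order,
    # so the original data can later be reassembled from it.
    sequence = [hashlib.sha256(s).hexdigest() for s in slices]
    return slices, sequence


def fingerprints_to_dedup(sequence: List[str]) -> List[str]:
    """Drop redundant fingerprints while keeping first-occurrence order."""
    return list(dict.fromkeys(sequence))
```

The full sequence, including repeats, is what would be saved to the fingerprint libraries; only the reduced list is used for the lookups.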
  18. The computer-readable storage medium according to claim 16, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, and the storage node device is further configured to:
    determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
    the control node device is configured to:
    update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
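The division of labour in claim 18 can be sketched as follows: the storage node computes per-fingerprint change values, and the control node folds them into the cumulative counts. The rule that a write contributes +1 per occurrence of a fingerprint in the sequence is an assumption, as are the function names and the dictionary representation of the cumulative counts.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def reference_count_changes(sequence: Iterable[str]) -> Counter:
    """Storage node side: assume each occurrence in a written sequence adds one reference."""
    return Counter(sequence)


def apply_changes(cumulative: Dict[str, int], changes: Mapping[str, int]) -> None:
    """Control node side: add each change value to the cumulative reference count."""
    for fp, delta in changes.items():
        cumulative[fp] = cumulative.get(fp, 0) + delta
```

Sending change values rather than absolute counts lets the control node merge updates from many storage nodes into one cumulative count per fingerprint.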
  19. The computer-readable storage medium according to claim 17, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, and the storage node device is further configured to:
    determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
    the control node device is configured to:
    update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
  20. The computer-readable storage medium according to claim 17, characterized in that the storage node device is further configured to:
    receive a deletion request for data to be deleted;
    obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence, and send the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
    the control node device is further configured to:
    update the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to that fingerprint's reference count change value, delete the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notify the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
    when a fingerprint whose cumulative reference count is zero is detected in the shared fingerprint library, record how long the fingerprint's cumulative reference count remains zero, and, when that duration exceeds a preset duration, delete the fingerprint and notify the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
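The delayed deletion described in claim 20 can be sketched as a small control-node model. The class name `ControlNode`, the `grace_seconds` parameter standing in for the preset duration, and the explicit `now` timestamps are all hypothetical; notifying storage nodes is represented only by the list that `sweep` returns.

```python
from typing import Dict, List, Mapping


class ControlNode:
    """Hypothetical control node tracking cumulative reference counts."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds      # assumed "preset duration"
        self.cumulative: Dict[str, int] = {}    # fingerprint -> cumulative count
        self.zero_since: Dict[str, float] = {}  # fingerprint -> time count hit zero

    def apply_changes(self, changes: Mapping[str, int], now: float) -> None:
        for fp, delta in changes.items():
            count = self.cumulative.get(fp, 0) + delta
            self.cumulative[fp] = count
            if count == 0:
                # Start timing the zero state; keep the earliest timestamp.
                self.zero_since.setdefault(fp, now)
            else:
                # A new reference arrived, so the fingerprint is live again.
                self.zero_since.pop(fp, None)

    def sweep(self, now: float) -> List[str]:
        """Delete fingerprints whose count stayed zero longer than the grace period."""
        expired = [fp for fp, t in self.zero_since.items()
                   if now - t > self.grace_seconds]
        for fp in expired:
            del self.cumulative[fp]
            del self.zero_since[fp]
        return expired  # fingerprints the storage nodes would be told to delete
```

Waiting for the preset duration before deleting gives a concurrent write that reuses the same fingerprint a chance to raise the count again, so the underlying slice is not removed while it is about to be referenced.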
PCT/CN2019/118009 2019-01-04 2019-11-13 Distributed storage system, storage node device and data duplicate deletion method WO2020140622A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910007367.9 2019-01-04
CN201910007367.9A CN109800218B (en) 2019-01-04 2019-01-04 Distributed storage system, storage node device and data deduplication method

Publications (1)

Publication Number Publication Date
WO2020140622A1 (en) 2020-07-09

Family

ID=66558525

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118009 WO2020140622A1 (en) 2019-01-04 2019-11-13 Distributed storage system, storage node device and data duplicate deletion method

Country Status (2)

Country Link
CN (1) CN109800218B (en)
WO (1) WO2020140622A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800218B (en) * 2019-01-04 2024-04-09 平安科技(深圳)有限公司 Distributed storage system, storage node device and data deduplication method
CN110457305B (en) * 2019-08-13 2021-11-26 腾讯科技(深圳)有限公司 Data deduplication method, device, equipment and medium
CN111399768A (en) * 2020-02-21 2020-07-10 苏州浪潮智能科技有限公司 Data storage method, system, equipment and computer readable storage medium
CN111459928B (en) * 2020-03-27 2023-07-07 上海爱数信息技术股份有限公司 Data deduplication method applied to data backup scene in cluster range and application
CN111580755B (en) * 2020-05-09 2022-07-05 杭州海康威视系统技术有限公司 Distributed data processing system and distributed data processing method
CN114138756B (en) * 2020-09-03 2023-03-24 金篆信科有限责任公司 Data deduplication method, node and computer-readable storage medium
CN114442931A (en) * 2021-12-23 2022-05-06 天翼云科技有限公司 Data deduplication method and system, electronic device and storage medium
CN117369731B (en) * 2023-12-07 2024-02-27 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229420A (en) * 2017-05-27 2017-10-03 郑州云海信息技术有限公司 Data storage method, read method, delete method and data operation system
CN107391761A (en) * 2017-08-28 2017-11-24 郑州云海信息技术有限公司 A kind of data managing method and device based on data de-duplication technology
CN108008918A (en) * 2017-11-30 2018-05-08 联想(北京)有限公司 Data processing method, memory node and distributed memory system
CN109800218A (en) * 2019-01-04 2019-05-24 平安科技(深圳)有限公司 Distributed memory system, memory node equipment and data duplicate removal method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8495392B1 (en) * 2010-09-02 2013-07-23 Symantec Corporation Systems and methods for securely deduplicating data owned by multiple entities
CN103942292A (en) * 2014-04-11 2014-07-23 华为技术有限公司 Virtual machine mirror image document processing method, device and system
CN103944988A (en) * 2014-04-22 2014-07-23 南京邮电大学 Repeating data deleting system and method applicable to cloud storage
CN106063192A (en) * 2014-05-21 2016-10-26 华为技术有限公司 Transmission method for wireless ethernet interface hard disk, related device, and system
JP6677605B2 (en) * 2016-08-22 2020-04-08 株式会社東芝 Program, storage system, and storage system control method
CN108415669A (en) * 2018-03-15 2018-08-17 深信服科技股份有限公司 The data duplicate removal method and device of storage system, computer installation and storage medium

Also Published As

Publication number Publication date
CN109800218B (en) 2024-04-09
CN109800218A (en) 2019-05-24

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19906942; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 19906942; Country of ref document: EP; Kind code of ref document: A1)