CN106201771A - Data-storage system and data read-write method - Google Patents

Info

Publication number
CN106201771A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510226830.0A
Other languages
Chinese (zh)
Other versions
CN106201771B (en)
Inventor
蒋雄伟
吴锐
李勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201510226830.0A
Publication of CN106201771A
Application granted
Publication of CN106201771B
Legal status: Active

Abstract

The invention discloses a data storage system comprising a central node and deduplication nodes. The central node is configured to assign each Bucket to a corresponding deduplication node according to a preset policy, to create a routing table from the correspondence between Buckets and deduplication nodes, and to synchronize the routing table to each deduplication node. Each deduplication node is configured to store, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks represented by that fingerprint information. The system achieves global deduplicated storage management of raw data at the scale of 100 PB and above and of fingerprint information at the scale of 100 TB and above.

Description

Data-storage system and data read-write method
Technical field
The invention belongs to the field of Internet technology and, specifically, relates to a data storage method, a data storage system, a data read-write method, a client for reading and writing data, and a system for reading and writing data.
Background
In recent years, the volume of data that Internet companies need to back up and archive has grown explosively. For cost reasons, tape has always been the primary storage medium for backup and archiving systems and for virtual machine primary storage systems. However, tape places high demands on the storage environment and has a short lifespan, so its contents generally need to be dumped to new tape every 4-5 years. Once the number of tapes reaches tens or even hundreds of thousands, this migration work becomes a nightmare.
With the development of magnetic disks, capacities have reached 6 TB or even 8 TB, and their cost per unit of capacity is gradually approaching that of tape. The advantage of disk over tape is that it supports random access, which makes data deduplication technology applicable. Combining magnetic disks with data deduplication can greatly reduce the cost of backup and archiving.
Existing commercial deduplication products, such as the EMC DD990, the HP StoreOnce B6200 appliance, and the SEPATON DeltaStor software, essentially operate in single-node mode and have very limited scalability: a maximum usable capacity of 1.6 PB and a maximum throughput of 31 TB/h (8.8 GB/s). In neither capacity nor performance can they meet the storage needs of Internet companies.
The Institute of Computing Technology of the Chinese Academy of Sciences, in "A scalable distributed data deduplication system supporting massive data backup", addresses the shortcomings of single-node mode. To improve the scalability and deduplication efficiency of the deduplication system, it proposes a distributed Bloom filter for routing data among the deduplication nodes of a distributed deduplication system, and fingerprint lookup based on a sampling mechanism to speed up fingerprint queries, implementing the distributed data deduplication system 3D-deduper. This is referred to below as scheme one.
EMC has also developed a cluster deduplication storage system on the basis of its single-node mode. Its approach is to add several backup servers that are responsible for chunking the data stream, computing the fingerprints of the data blocks, packing them into a super chunk, and routing the super chunk to a particular deduplication node for processing according to some policy. This is referred to below as scheme two.
Strictly speaking, neither of the two schemes above can be called a distributed system; both are cluster systems. The basic idea of a cluster system is to balance the task load across multiple reliable single nodes. The basic idea of a distributed system is to distribute data across multiple unreliable single nodes (once the data is evenly distributed, the task load is naturally balanced as well) and to ensure reliability by means such as multiple replicas or check codes.
In both schemes the fingerprint database of the cluster system is decentralized. Although certain measures are taken to route a previously seen data block and its fingerprint to the deduplication node that handled it before, it is hard to avoid routing the block to a deduplication node that has never processed it, where it is misjudged as a new data block and stored again. Scheme one in particular uses a sampling-based sparse index for fingerprints, which further increases the probability of misjudgment. Even a misjudgment rate of only 2% is unacceptable for a system at the 100 PB scale.
Summary of the invention
In view of this, the application provides a data storage method, a data storage system, a data read-write method, a client for reading and writing data, and a system for reading and writing data, which solve the technical problem that, in a large-scale deduplication system, the probability of misjudgment caused by decentralized fingerprint information is high.
To solve the above technical problem, the application discloses a data storage method applied to a data storage system comprising a central node and deduplication nodes. The data storage method comprises: the central node assigning each bucket (Bucket) to a corresponding deduplication node according to a preset policy; the central node creating a routing table from the correspondence between Buckets and deduplication nodes, and synchronizing the routing table to each deduplication node; and each deduplication node storing, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks represented by that fingerprint information.
The deduplication node storing, according to the routing table, the fingerprint information corresponding to each assigned Bucket and the data blocks represented by that fingerprint information comprises: the deduplication node creating a corresponding container (Container) file for each assigned Bucket; and the deduplication node saving the corresponding fingerprint information in each assigned Bucket, and saving the data blocks represented by that fingerprint information in the Container file corresponding to that Bucket.
The deduplication node determines whether the size of the Container file exceeds a preset threshold; when it does, the deduplication node archives the Container file to a background server.
The central node assigning each Bucket to a corresponding deduplication node according to a preset policy comprises: the central node assigning each Bucket to a plurality of corresponding deduplication nodes, and determining one primary node and at least one secondary node among them.
The central node determines whether each deduplication node is available and whether a new deduplication node has been added. When a deduplication node is determined to be unavailable, or a new deduplication node has been added, the central node reassigns the Buckets; the central node updates the routing table and synchronizes it to each deduplication node; and the deduplication nodes migrate data according to the updated routing table.
The deduplication nodes migrating data according to the updated routing table comprises: the primary node initiating the data migration according to the updated routing table.
When a deduplication node is determined to be unavailable, the central node reassigning the Buckets comprises: when the primary node is determined to be unavailable, the central node selecting a new primary node from the at least one secondary node. The deduplication nodes migrating data according to the updated routing table then comprises: the newly selected primary node initiating the data migration according to the updated routing table.
Each deduplication node includes a fingerprint information database. The fingerprint information database is a cuckoo hash table stored on a solid-state drive, and contains the fingerprint information corresponding to each Bucket of the deduplication node and the storage information of the data blocks represented by that fingerprint information.
M cuckoo hash tables run simultaneously on the solid-state drive, and N cuckoo hash functions are used simultaneously, where M × N = 128.
32 cuckoo hash tables run simultaneously on the solid-state drive, and 4-way cuckoo hash functions are used simultaneously.
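The fingerprint database layout above can be illustrated with a minimal in-memory sketch of a 4-way cuckoo hash table. This is not the patented on-SSD structure: the slot count, the SHA-256-derived hash functions, the random eviction choice, the overflow stash, and the way a table is picked for a key (`table_for`) are all assumptions made for illustration. Only the figures 32 tables and 4 hash functions (32 × 4 = 128) come from the text.

```python
import hashlib
import random

TABLES = 32  # M in the text
WAYS = 4     # N in the text; M x N = 128


class CuckooTable:
    """Small in-memory N-way cuckoo hash table with an overflow stash."""

    def __init__(self, slots=64, ways=WAYS, max_kicks=100, seed=0):
        self.slots = slots
        self.ways = ways
        self.max_kicks = max_kicks
        self.table = [None] * slots  # each slot holds one (key, value) pair
        self.stash = {}              # entries that could not be placed
        self.rng = random.Random(seed)

    def _positions(self, key):
        # Derive `ways` candidate slots from salted SHA-256 (an assumption).
        out = []
        for i in range(self.ways):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            out.append(int.from_bytes(digest[:8], "big") % self.slots)
        return out

    def get(self, key):
        for pos in self._positions(key):
            entry = self.table[pos]
            if entry is not None and entry[0] == key:
                return entry[1]
        return self.stash.get(key)

    def put(self, key, value):
        if key in self.stash:
            self.stash[key] = value
            return
        for pos in self._positions(key):
            entry = self.table[pos]
            if entry is not None and entry[0] == key:
                self.table[pos] = (key, value)  # update in place
                return
        item = (key, value)
        for _ in range(self.max_kicks):
            positions = self._positions(item[0])
            for pos in positions:
                if self.table[pos] is None:
                    self.table[pos] = item
                    return
            # All candidate slots are full: evict one occupant and re-place it.
            victim = self.rng.choice(positions)
            item, self.table[victim] = self.table[victim], item
        self.stash[item[0]] = item[1]  # give up displacing; keep the entry in the stash


def table_for(key, tables=TABLES):
    """Pick which of the M tables holds a key (hypothetical selection rule)."""
    return int.from_bytes(hashlib.sha256(b"table" + key).digest()[:4], "big") % tables
```

A lookup probes at most four slots plus the stash, which is the property that makes cuckoo hashing attractive for read-heavy fingerprint queries.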
To solve the above technical problem, the application also discloses a data read-write method, comprising: splitting data into multiple data blocks and computing the fingerprint information of each data block; determining the Bucket corresponding to the fingerprint information of each data block; determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket; sending a fingerprint query request, which includes the fingerprint information of the data blocks, to the deduplication node corresponding to the Bucket; receiving the fingerprint information not found, as returned by the deduplication node corresponding to the Bucket; and uploading the fingerprint information not found, together with the data blocks it represents, to the deduplication node corresponding to the Bucket.
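The client-side write path described above can be sketched roughly as follows. Everything concrete here is an assumption for illustration: fixed-size chunking, SHA-1 fingerprints, the `DedupNode` class, and a routing table that maps each Bucket to a single node. The patent does not prescribe any of these.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 4    # tiny fixed-size chunks, for illustration only
BUCKET_COUNT = 8  # assumed total number of Buckets


class DedupNode:
    """In-memory stand-in for a deduplication node (hypothetical)."""

    def __init__(self):
        self.store = {}  # fingerprint -> data block

    def query_missing(self, fingerprints):
        """Return the fingerprints this node has not stored yet."""
        return [fp for fp in fingerprints if fp not in self.store]

    def upload(self, fp, block):
        self.store[fp] = block


def write(data, routing_table, nodes):
    """Chunk, fingerprint, route by Bucket, query, then upload only missing blocks."""
    blocks = {}
    per_node = defaultdict(list)
    for i in range(0, len(data), CHUNK_SIZE):
        block = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha1(block).digest()
        blocks[fp] = block
        bucket = int.from_bytes(fp, "big") % BUCKET_COUNT
        node_id = routing_table[bucket]
        if fp not in per_node[node_id]:
            per_node[node_id].append(fp)
    uploaded = 0
    for node_id, fps in per_node.items():
        for fp in nodes[node_id].query_missing(fps):
            nodes[node_id].upload(fp, blocks[fp])
            uploaded += 1
    return uploaded  # number of blocks actually transferred
```

Because the query step returns only unknown fingerprints, a second write of the same content transfers no blocks at all.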
Determining the Bucket corresponding to the fingerprint information of each data block comprises: performing a modulo operation on the fingerprint information with the total number of Buckets, and determining the Bucket corresponding to the fingerprint information from the result.
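The modulo mapping can be written in a few lines. The sketch below assumes SHA-1 as the fingerprint function and 1024 Buckets; neither value is given in the text.

```python
import hashlib

BUCKET_COUNT = 1024  # assumed total number of Buckets


def fingerprint(block: bytes) -> bytes:
    """Return a fingerprint for a data block (SHA-1 is an assumption)."""
    return hashlib.sha1(block).digest()


def bucket_for(fp: bytes, bucket_count: int = BUCKET_COUNT) -> int:
    """Map a fingerprint to a Bucket number: fingerprint mod total Buckets."""
    return int.from_bytes(fp, "big") % bucket_count
```

Identical blocks always produce the same fingerprint and therefore land in the same Bucket, which is what lets one node decide globally whether a block is a duplicate.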
The method further comprises: when all the fingerprint information not found and the data blocks it represents have been uploaded, uploading a mapping file of the data to the deduplication nodes. The mapping file contains the fingerprint information of each data block of the data, arranged in the order in which the data blocks were split.
Uploading the mapping file of the data to the deduplication nodes comprises: splitting the mapping file into multiple data blocks and computing the hash value of each data block of the mapping file; determining the Bucket corresponding to the hash value of each data block of the mapping file; determining, according to the routing table, the deduplication node corresponding to that Bucket; and uploading the data block of the mapping file and its hash value to that deduplication node.
Splitting the mapping file into multiple data blocks comprises: splitting off the header of the mapping file as the first of the multiple data blocks. The header of the mapping file contains information such as the total size of the mapping file and the total number of the data blocks.
The method further comprises: obtaining the mapping file of the data from the deduplication nodes; obtaining each data block of the data from the deduplication nodes according to the fingerprint information of each data block in the mapping file; and reassembling the data according to the order of the fingerprint information of the data blocks in the mapping file.
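The role of the mapping file in writing and restoring data can be sketched as below. The mapping is modeled as a plain ordered list of fingerprints and the block store as a dictionary; the header block, the chunking of the mapping file itself, and network transfer are left out. SHA-1 and fixed-size chunks are assumptions.

```python
import hashlib


def build_mapping(data: bytes, chunk_size: int, store: dict) -> list:
    """Split data, store each unique block once, and return the ordered fingerprints."""
    mapping = []
    for i in range(0, len(data), chunk_size):
        block = data[i:i + chunk_size]
        fp = hashlib.sha1(block).digest()
        store.setdefault(fp, block)  # duplicate blocks are stored only once
        mapping.append(fp)           # but every occurrence is recorded, in order
    return mapping


def restore(mapping: list, store: dict) -> bytes:
    """Reassemble the original data by following the fingerprint order."""
    return b"".join(store[fp] for fp in mapping)
```

The mapping keeps one entry per block occurrence, so repeated content costs a fingerprint in the mapping but no additional block storage.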
Obtaining the mapping file of the data from the deduplication nodes comprises: obtaining each data block of the mapping file from the deduplication nodes according to the name of the mapping file and the data block sequence number; and splicing the data blocks of the mapping file into the mapping file of the data.
Determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket comprises: when storing data for the first time, obtaining the routing table from the central node; and determining the deduplication node corresponding to the Bucket according to the routing table obtained from the central node.
Determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket further comprises: sending a request packet to the deduplication node corresponding to the Bucket; receiving a response packet returned by that deduplication node, the response packet including version information of the routing table; determining whether the version information of the routing table in the response packet is the same as that of the routing table obtained from the central node; when the two versions are the same, determining the deduplication node corresponding to the Bucket according to the routing table obtained from the central node; and when the two versions differ, obtaining the updated routing table from the central node and redetermining the deduplication node corresponding to the Bucket according to the updated routing table.
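The version check can be sketched as follows. The class names, the integer version counter, and the method names are hypothetical; the text only states that the response packet carries routing table version information and that the client refetches the table on a mismatch.

```python
class CentralNode:
    """Holds the authoritative routing table and its version (hypothetical API)."""

    def __init__(self):
        self.version = 1
        self.table = {0: "A", 1: "B"}  # Bucket number -> deduplication node

    def get_routing_table(self):
        return self.version, dict(self.table)


class Client:
    def __init__(self, central):
        self.central = central
        # First use: fetch the routing table from the central node.
        self.version, self.table = central.get_routing_table()

    def node_for(self, bucket, response_version):
        """Resolve a Bucket, refreshing the table if a node reported another version."""
        if response_version != self.version:
            self.version, self.table = self.central.get_routing_table()
        return self.table[bucket]
```

The design keeps the central node off the hot path: the client contacts it only on first use and when a deduplication node reveals that the cached table is stale.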
To solve the above technical problem, the application also discloses a data read-write method, comprising:
The central node sending a routing table to a client, the routing table including the correspondence between Buckets and deduplication nodes; a deduplication node receiving a fingerprint query request from the client, the request including fingerprint information corresponding to a Bucket assigned to that deduplication node; the deduplication node querying the fingerprint information and returning the fingerprint information not found to the client; and the deduplication node receiving the fingerprint information not found and the data blocks it represents, as uploaded by the client.
The method further comprises: the deduplication node saving the fingerprint information not found in the assigned Bucket, saving the data block in the Container file corresponding to the assigned Bucket, and returning to the client a message that the data block was saved successfully.
Before the deduplication node returns to the client the message that the data block was saved successfully, the method further comprises: the deduplication node backing up the fingerprint information not found and the data block it represents to the secondary node.
The method further comprises: the deduplication node saving the data blocks of the mapping file uploaded by the client and the corresponding hash values.
Saving the data blocks of the mapping file uploaded by the client and the corresponding hash values comprises: saving the data block of the mapping file in the Container file corresponding to the corresponding Bucket; and saving, in the corresponding Bucket, the hash value of the data block of the mapping file and first storage information.
The first storage information includes: the name of the Container file that holds the data block of the mapping file, the offset of the data block of the mapping file within the Container file, and the size of the data block of the mapping file.
The method further comprises: the deduplication node receiving a request from the client to obtain the data blocks of the mapping file; the deduplication node sending the data blocks of the mapping file to the client; the deduplication node receiving a request from the client to obtain the data block represented by each piece of fingerprint information in the mapping file; and the deduplication node sending the data block represented by each piece of fingerprint information to the client.
The deduplication node sending the data block represented by each piece of fingerprint information to the client comprises: the deduplication node determining second storage information of the data block according to the fingerprint information, the second storage information including the name of the Container file that holds the data block, the offset of the data block within the Container file, and the size of the data block; the deduplication node determining, from the name of the Container file, whether the Container file has been archived to the background server; when the Container file has been archived to the background server, the deduplication node obtaining the data block from the background server according to the offset of the data block within the Container file and the size of the data block, and sending it to the client; and when the Container file is still stored locally, the deduplication node obtaining the data block locally according to the same offset and size, and sending it to the client.
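The read path above comes down to slicing a Container file by offset and size, after deciding where the file currently lives. In the sketch below the local disk and the background server are both modeled as dictionaries of bytes; the `StorageInfo` type and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StorageInfo:
    container: str  # name of the Container file holding the block
    offset: int     # offset of the block inside the Container file
    size: int       # size of the block in bytes


def read_block(info, local_containers, archived_containers):
    """Read a block locally, or from the background server if the file was archived."""
    if info.container in archived_containers:
        source = archived_containers[info.container]
    else:
        source = local_containers[info.container]
    return source[info.offset:info.offset + info.size]
```

Since the storage information is the same triple in both cases, archiving a Container file does not require rewriting any fingerprint entries.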
The central node sending the routing table to the client comprises: when the client stores data for the first time, the central node receiving a routing table request from the client; and the central node sending the routing table to the client.
The central node sending the routing table to the client further comprises: the deduplication node receiving a request packet from the client; the deduplication node sending a response packet to the client, the response packet including the version information of the routing table held by the deduplication node; when the version information of the routing table held by the client is inconsistent with that held by the deduplication node, the central node receiving a routing table request from the client; and the central node sending the updated routing table to the client.
The deduplication node querying the fingerprint information and returning the fingerprint information not found to the client comprises: the deduplication node determining through a Bloom filter whether the fingerprint information exists; when the Bloom filter indicates that the fingerprint information does not exist, determining that the fingerprint information is fingerprint information not found; when the Bloom filter indicates that the fingerprint information exists, querying the fingerprint information database for it; when the fingerprint information is found in the database, determining that it exists; and when it is not found in the database, determining that it is fingerprint information not found.
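The two-stage lookup can be sketched with a small Bloom filter in front of a dictionary that stands in for the fingerprint database. The filter size, the number of hash functions, and the SHA-256-based hashing are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; `bits` must be a multiple of 8."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.array[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def is_new_fingerprint(fp, bloom, fingerprint_db):
    """Return True if the fingerprint must be reported to the client as not found."""
    if not bloom.might_contain(fp):
        return True                  # a Bloom filter never gives false negatives
    return fp not in fingerprint_db  # a positive may be false, so confirm in the database
```

The filter answers most queries for new fingerprints without touching the database, while the database check removes the filter's false positives, so no new block is wrongly treated as a duplicate.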
To solve the above technical problem, the application also discloses a data storage system, comprising a central node and one or more deduplication nodes, wherein the central node is configured to assign each bucket (Bucket) to a corresponding deduplication node according to a preset policy, to create a routing table from the correspondence between Buckets and deduplication nodes, and to synchronize the routing table to each deduplication node; and the deduplication node is configured to store, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks represented by that fingerprint information.
To solve the above technical problem, the application also discloses a client for reading and writing data, comprising: a splitting and computing module, configured to split data into multiple data blocks and compute the fingerprint information of each data block; a Bucket determining module, configured to determine the Bucket corresponding to the fingerprint information of each data block; a node determining module, configured to determine, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket; a request sending module, configured to send a fingerprint query request, which includes the fingerprint information of the data blocks, to the deduplication node corresponding to the Bucket; an information receiving module, configured to receive the fingerprint information not found, as returned by the deduplication node corresponding to the Bucket; and a data uploading module, configured to upload the fingerprint information not found and the data blocks it represents to the deduplication node corresponding to the Bucket.
To solve the above technical problem, the application also discloses a system for reading and writing data, comprising a central node and deduplication nodes, wherein the central node is configured to send a routing table to a client, the routing table including the correspondence between Buckets and deduplication nodes; and the deduplication node is configured to receive a fingerprint query request from the client, the request including fingerprint information corresponding to a Bucket assigned to that deduplication node, to query the fingerprint information and return the fingerprint information not found to the client, and to receive the fingerprint information not found and the data blocks it represents, as uploaded by the client.
Compared with the prior art, the application can achieve technical effects including the following: global deduplicated storage management of raw data at the scale of 100 PB and above and of fingerprint information at the scale of 100 TB and above; and high scalability, in that after a new deduplication node is added to the system, the central node can redistribute data according to the preset policy and the deduplication nodes complete the data migration automatically, so that the performance and capacity of the system can be expanded easily.
Of course, a product implementing the application need not achieve all of the above technical effects at the same time.
Brief description of the drawings
The drawings described here are provided for further understanding of the application and form a part of it. The illustrative embodiments of the application and their description are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic structural diagram of a data storage system (a system for reading and writing data) according to an embodiment of the application;
Fig. 2 is a schematic diagram of a routing table according to an embodiment of the application;
Fig. 3 is a schematic flowchart of a data read-write method according to an embodiment of the application;
Fig. 4 is a schematic flowchart of a data read-write method according to an embodiment of the application;
Fig. 5 is a schematic structural diagram of a client for reading and writing data according to an embodiment of the application.
Detailed description of the invention
Embodiments of the invention are described in detail below with reference to the drawings and examples, so that the process by which the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented accordingly.
Fig. 1 shows the data storage system (hereinafter "the system") provided by an embodiment of the application, comprising a central node 10 and deduplication nodes 11, with the central node 10 coupled to the deduplication nodes 11. In the system, the central node 10 is responsible for the distributed management of the multiple deduplication nodes 11 and for the distribution and replica management of data within the system. The deduplication nodes 11 are responsible for managing and saving the data blocks, the fingerprint information of the data blocks, and the storage information, and complete the replication and migration of data under the distributed management of the central node 10. The deduplication nodes 11 have an abstract storage engine layer, so new storage engines can be added easily.
The system manages data blocks in units of buckets (Bucket). A Bucket is a logical concept in the system. Each Bucket is assigned a Bucket number, which is associated with the fingerprint information of each data block through a preset hash algorithm, so that data blocks are stored separately by Bucket number and the correspondence between data blocks and storage files is established. The central node of the system uses Buckets to manage globally the data blocks stored in the system and their fingerprint information.
The central node assigns each Bucket to a corresponding deduplication node according to a preset policy. The preset policy may be a load-balancing policy. For example, the central node obtains the load data of each deduplication node, determines the current load state of each deduplication node from real-time changes in the load data, and preferentially assigns Buckets to deduplication nodes with a lower current load, so that balanced data distribution achieves load balancing across the deduplication nodes. The preset policy may also be a location security policy. For example, the central node assigns different Buckets to different deduplication nodes according to the confidentiality of the data or the permissions of the client, so that client data of different confidentiality levels or different permissions is saved on different deduplication nodes.
Having assigned each Bucket to a corresponding deduplication node, the central node establishes the correspondence between Buckets and deduplication nodes through the Bucket numbers and the deduplication node identifiers, and creates the routing table from this correspondence. The routing table can be regarded as a mapping table that records the mapping between Buckets and deduplication nodes. Fig. 2 is an example of the routing table in an embodiment of the application, in which the numbers in the horizontal header are Bucket numbers, the numbers in the vertical header are Bucket replica identifiers, and the letters in the table denote different deduplication nodes. As shown in Fig. 2, for the Bucket numbered 0, replica 0 is assigned to deduplication node D and replica 1 to deduplication node A; for the Bucket numbered 1, replica 0 is assigned to deduplication node A and replica 1 to deduplication node B. Fig. 2 merely illustrates the routing table in an embodiment of the application and does not limit the scope of protection of the application. Any number of deduplication nodes may be deployed in the system, each deduplication node may be assigned multiple Buckets, and each Bucket may have one or more backups, with the backups located on different deduplication nodes.
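The two routing table columns spelled out above can be written as a nested mapping. Only the entries for Buckets 0 and 1 are taken from the text; the helper names are hypothetical.

```python
# routing_table[bucket number][replica identifier] -> deduplication node
routing_table = {
    0: {0: "D", 1: "A"},
    1: {0: "A", 1: "B"},
}


def nodes_for(bucket):
    """Return the nodes holding a Bucket, ordered by replica identifier."""
    replicas = routing_table[bucket]
    return [replicas[r] for r in sorted(replicas)]


def buckets_on(node):
    """Return the Buckets that have a replica on the given node."""
    return sorted(b for b, reps in routing_table.items() if node in reps.values())
```

In this example node A holds replica 1 of Bucket 0 and replica 0 of Bucket 1, which shows that a single node can hold replicas of several Buckets under different replica identifiers.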
After creating the routing table, the central node synchronizes it to each deduplication node. Each deduplication node in the system determines from the routing table which Buckets are assigned to it, and stores the fingerprint information corresponding to those Buckets and the data blocks that the fingerprint information represents. When saving the fingerprint information corresponding to each Bucket and the data blocks it represents, the deduplication node creates a corresponding container (Container) file for each assigned Bucket, saves the corresponding fingerprint information in each Bucket, and saves the data blocks represented by the fingerprint information in the Container file corresponding to the Bucket. The correspondence between fingerprint information and Buckets is obtained by taking the fingerprint information modulo the total number of Buckets in the system and determining the Bucket number from the result; this calculation is usually performed by the client that stores data in the system. As more and more fingerprint information accumulates for a Bucket, the data blocks saved in the corresponding Container file increase, and so does the storage space the Container file occupies. To ensure that each deduplication node can store replicas of multiple Buckets and to control the load on the deduplication node, when the size of the Container file corresponding to a Bucket exceeds a preset threshold, the deduplication node archives the Container file to the background server 12. As shown in Fig. 1, each deduplication node is coupled to the background server 12; when the deduplication node subsequently receives a corresponding data block, it stores the block in the Container file located on the background server 12.
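The relationship between a Bucket, its Container file, and the archive threshold can be sketched as follows. The threshold value, the file naming scheme, and the `BucketStore` class are assumptions; archiving is reduced to setting a flag where a real node would move the file to the background server.

```python
ARCHIVE_THRESHOLD = 16  # bytes; an assumed, deliberately tiny value


class BucketStore:
    """One Bucket on a deduplication node: a fingerprint index plus its Container file."""

    def __init__(self, number):
        self.container_name = f"bucket-{number}.container"  # hypothetical naming scheme
        self.index = {}               # fingerprint -> (container name, offset, size)
        self.container = bytearray()  # stand-in for the Container file contents
        self.archived = False         # True once handed to the background server

    def put(self, fp, block):
        """Store a block once; return False if the fingerprint is already known."""
        if fp in self.index:
            return False
        self.index[fp] = (self.container_name, len(self.container), len(block))
        self.container += block
        if len(self.container) > ARCHIVE_THRESHOLD:
            # A real node would archive the file to the background server here.
            self.archived = True
        return True
```

The index entry records the container name, offset, and size, which is the storage information later needed to read the block back.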
When assigning Buckets to deduplication nodes according to the preset policy, the central node may assign each Bucket to multiple deduplication nodes, so that each Bucket has multiple replicas in the system, and may give each replica a different replica identifier. In the routing table shown in Fig. 2, for example, the central node assigns each Bucket to two deduplication nodes, and the replicas of each Bucket on the different deduplication nodes carry replica identifiers 0 and 1 respectively.
When assigning each Bucket to multiple deduplication nodes, the central node designates one primary node and at least one backup node among them. The central node may determine the primary and backup nodes from the replica identifiers: among the replica identifiers of each Bucket, one is designated the primary replica identifier and the others are backup replica identifiers. For example, the replica with identifier 0 serves as the primary replica of each Bucket and the replicas with other identifiers are backup replicas. Among all deduplication nodes, the node holding the replica of a Bucket with identifier 0 is then the primary node of that Bucket, and the nodes holding the other replicas of that Bucket are its backup nodes. This prevents the data of a Bucket from becoming impossible to store, read or write when a deduplication node is unavailable. Each replica of a Bucket includes the fingerprint information corresponding to that Bucket and the Container file holding the data blocks that the fingerprint information represents.
The central node can determine whether each deduplication node is available and whether a new deduplication node has joined the system. It does so through the heartbeat messages exchanged with the deduplication nodes. When the central node determines that a deduplication node is unavailable, or that a new deduplication node has joined the system, it reassigns the Buckets, and the mapping between Buckets and deduplication nodes in the system changes. When a deduplication node is unavailable, the central node reassigns the Buckets of that node to other deduplication nodes according to the preset policy; when a new deduplication node joins the system, the central node redistributes the Buckets in the system according to the preset policy. In both cases the mapping between Buckets and deduplication nodes changes; the central node updates the routing table according to the changed mapping and synchronizes the updated routing table to every deduplication node. Because the deduplication nodes to which Buckets are assigned have changed, the deduplication nodes in the system migrate data according to the updated routing table.
This data migration is initiated by the primary node of the Bucket whose assigned deduplication node has changed. For example, if in the routing table replica 0 (the primary replica) of the Bucket numbered 1 changes from deduplication node A to deduplication node B, deduplication node A initiates the migration of replica 0 of the Bucket numbered 1 to deduplication node B. Deduplication node B then checks from the routing table whether the backup nodes of the Bucket numbered 1 have also changed. If they have, for example from deduplication node D to deduplication node E, the data of the Bucket numbered 1 is backed up again to deduplication node E, and the replica of the Bucket numbered 1 on deduplication node E carries a backup replica identifier. After the migration completes, each deduplication node may delete, according to the updated routing table, the data of any Bucket that is no longer mapped to it. When the deduplication node that is the primary node of a Bucket becomes unavailable, the central node selects a new primary node from the backup nodes of that Bucket, and the new primary node initiates the data migration for that Bucket according to the updated routing table. For example, when deduplication node A, the primary node of the Bucket numbered 1, is unavailable, the central node chooses between the backup nodes of that Bucket, deduplication node B and deduplication node C, and designates deduplication node B as the primary node of the Bucket numbered 1. The replica identifier of the Bucket numbered 1 on deduplication node B then becomes the primary replica identifier (for example 0). If in the updated routing table the backup nodes of the Bucket numbered 1 are deduplication node C and deduplication node D, deduplication node B backs up the data of the Bucket numbered 1 to deduplication node D.
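The failover example above (primary A fails, B is promoted, D becomes a new backup) can be sketched as follows. The preset policy is not specified in the description, so the rule used here, promoting the first surviving backup and filling the gap from an ordered list of spare nodes, is an assumption, as are the function names.

```python
def promote_after_failure(replicas, failed_node, spare_nodes):
    """Return the new replica list of one Bucket after a node becomes unavailable.

    replicas: node names indexed by replica identifier; index 0 is the primary.
    spare_nodes: candidate nodes in the order a (hypothetical) policy prefers.
    """
    survivors = [node for node in replicas if node != failed_node]
    if not survivors:
        raise RuntimeError("no surviving replica to promote")
    new_replicas = list(survivors)  # survivors[0] becomes the new primary
    for node in spare_nodes:
        if len(new_replicas) == len(replicas):
            break
        if node != failed_node and node not in new_replicas:
            new_replicas.append(node)
    return new_replicas


def migration_plan(old_replicas, new_replicas):
    """Return (initiator, targets): the new primary copies data to newly added nodes."""
    targets = [node for node in new_replicas[1:] if node not in old_replicas]
    return new_replicas[0], targets
```

With the replicas of the Bucket numbered 1 on A, B and C, the failure of A yields B as primary with backups C and D, and only D needs to receive a copy.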
Each deduplication node maintains an internal fingerprint information store. The store holds the fingerprint information corresponding to each Bucket on the deduplication node and the storage information of the data blocks that the fingerprint information represents. The store may take the form of a Key-Value Store, with the fingerprint information as the Key and the storage information of the represented data block as the Value. Reading and writing data in the system involves a large number of fingerprint queries and comparisons. Each deduplication node uses a Bloom filter to absorb part of the query requests, but because a Bloom filter can produce false positives, a large number of requests must still be completed by the fingerprint information store. The read performance required of the fingerprint information store (Key-Value Store) is therefore very high. For a small Key-Value Store, the Key-Value Pairs can be kept on an ordinary hard disk in a preset form, such as a log structure, with an index built in memory so that the Key-Value Pairs on disk can be accessed quickly. However, because this system is applied to data storage at the scale of 100PB and above, the volume of fingerprint information and storage information is very large (100PB of original data corresponds to about 50TB of fingerprint information and storage information), and the index can no longer be built in the memory of the device. The inventors therefore recognized that a hash table can be implemented entirely on a solid state drive (SSD) to hold all the Key-Value Pairs of the fingerprint information store. The hash table stored on the solid state drive is a cuckoo hash table. Because the deduplication node first queries and compares fingerprint information through the Bloom filter, false positives caused by hash collisions can occur, and cuckoo hashing is one way of handling hash collisions. Its basic idea is to use two different hash functions to compute two candidate positions for a Key: (1) if both positions are free, one of them is chosen for insertion; (2) if only one position is free, the Key is inserted there; (3) if neither position is free, one of the two positions is chosen at random and the Key occupying it is evicted; the position given by the other hash value of the evicted Key is then computed, and the evicted Key is inserted there if that position is empty; if it is not empty, the Key occupying it is evicted in turn, and so on until a free position is found. This procedure can obviously loop indefinitely, so a maximum number of attempts is normally set, and when that maximum is reached the hash table is considered full. The inventors chose cuckoo hashing because the number of system input/output operations needed to query a Key is typically a constant.
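The three insertion cases above can be sketched with a small in-memory table. This is only an illustration under stated assumptions: one pair per position, a salted BLAKE2b digest from the Python standard library standing in for a seeded murmur2 hash, and a class name and `max_kicks` default chosen for the example. It is not the on-SSD structure of the embodiment.

```python
import hashlib
import random


class CuckooTable:
    """Two-way cuckoo hash table holding one (key, value) pair per position."""

    def __init__(self, size, max_kicks=32, seed=0):
        self.slots = [None] * size
        self.max_kicks = max_kicks      # the maximum number of attempts
        self.rng = random.Random(seed)  # seeded only to make runs repeatable

    def _position(self, key, which):
        # Stand-in for a seeded murmur2: one salted digest per hash function.
        digest = hashlib.blake2b(key, digest_size=8, salt=bytes([which])).digest()
        return int.from_bytes(digest, "big") % len(self.slots)

    def get(self, key):
        # A lookup probes at most two positions, whatever the table size.
        for which in (0, 1):
            entry = self.slots[self._position(key, which)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value):
        first, second = self._position(key, 0), self._position(key, 1)
        for pos in (first, second):      # key already present: update in place
            if self.slots[pos] is not None and self.slots[pos][0] == key:
                self.slots[pos] = (key, value)
                return True
        for pos in (first, second):      # cases (1) and (2): use a free position
            if self.slots[pos] is None:
                self.slots[pos] = (key, value)
                return True
        entry = (key, value)             # case (3): evict and relocate
        pos = self.rng.choice((first, second))
        for _ in range(self.max_kicks):
            entry, self.slots[pos] = self.slots[pos], entry
            a, b = self._position(entry[0], 0), self._position(entry[0], 1)
            pos = b if pos == a else a   # the evicted key's other position
            if self.slots[pos] is None:
                self.slots[pos] = entry
                return True
        # Table treated as full. `entry` is now a displaced pair that is no
        # longer stored; a real store would have to rehash or grow here.
        return False
```

The fixed two-probe `get` is what keeps the number of reads per query constant, which is the property the description gives as the reason for choosing cuckoo hashing.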
A basic cuckoo hash table reaches a utilization of only 49%, so one of two main variants is normally used: 1) increasing the number of hash functions; 2) increasing the number of Keys that each position can hold. Either variant raises the utilization of the cuckoo hash table. The inventors of the present application selected the murmur2 hash function as the base hash function and set different seeds so that the same Key yields different hash values.
Because the bottom layer of a solid state drive (SSD) based on the NVMe (Non-Volatile Memory Express) protocol uses a 4K page (Page) as its basic unit, the fingerprint information store performs every read and write in units of 4K. A Key-Value Pair in the fingerprint information store is 256 bytes, so a 4K Page can hold 16 pieces of fingerprint information. In this cuckoo hash table each position therefore holds 16 Key-Value Pairs, which are written into the Page in insertion order rather than sorted by Key; leaving them unordered avoids the cost of sorting on the solid state drive. According to the inventors' experiments, an asynchronous mode with a concurrency of 128 (queue depth × number of jobs = 128) fully exploits the IOPS (Input/Output Operations Per Second) capability of NVMe (450K). To generate that much concurrency, the inventors optimized two aspects: 1. running multiple Key-Value Stores in the form of cuckoo hash tables on one NVMe drive; 2. using multiple cuckoo hash functions on one NVMe drive together with asynchronous reads. In addition, the number of cuckoo hash tables running on each solid state drive multiplied by the number of cuckoo hash functions must equal 128. The inventors found that as the number of cuckoo hash functions grows, the QPS (Queries Per Second) of a single cuckoo hash table drops markedly: going from 4-way to 8-way cuckoo hashing halves the QPS. With too few cuckoo hash functions, on the other hand, the space utilization of the cuckoo hash table drops markedly. Weighing performance against space utilization, the inventors selected 4-way cuckoo hashing, whose space utilization can reach 98.66%. Accordingly, 32 cuckoo hash tables need to run on one NVMe drive. A further benefit of splitting the store into multiple hash tables is that it reduces the lock granularity of the fingerprint information store.
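The sizing figures in this paragraph follow from simple arithmetic, and the unordered page layout implies a linear scan of at most 16 slots per page. A minimal sketch is given below; the 20-byte key length (a SHA-1 fingerprint), the zero padding of values and the function names are assumptions, since the description only fixes the 256-byte pair size.

```python
PAGE_SIZE = 4 * 1024      # bytes per SSD page, from the text
PAIR_SIZE = 256           # bytes per Key-Value Pair, from the text
TARGET_CONCURRENCY = 128  # queue depth x number of jobs, from the text
HASH_WAYS = 4             # number of cuckoo hash functions selected

PAIRS_PER_PAGE = PAGE_SIZE // PAIR_SIZE            # pairs held by one position
TABLES_PER_DISK = TARGET_CONCURRENCY // HASH_WAYS  # tables per NVMe drive

KEY_LEN = 20  # assumption: a 20-byte fingerprint is used as the Key


def append_pair(page, used, key, value):
    """Write a pair into the next free slot (insertion order); return the new count."""
    if used >= PAIRS_PER_PAGE:
        raise ValueError("page is full")
    if len(key) != KEY_LEN or len(value) > PAIR_SIZE - KEY_LEN:
        raise ValueError("pair does not fit a slot")
    slot = key + value.ljust(PAIR_SIZE - KEY_LEN, b"\0")
    page[used * PAIR_SIZE:(used + 1) * PAIR_SIZE] = slot
    return used + 1


def find_in_page(page, used, key):
    """Scan the used slots in order; pairs are not sorted by Key."""
    for i in range(used):
        slot = bytes(page[i * PAIR_SIZE:(i + 1) * PAIR_SIZE])
        if slot[:KEY_LEN] == key:
            return slot[KEY_LEN:]  # value, still zero padded to the slot width
    return None
```

Scanning 16 unsorted slots in memory is cheap compared with the single 4K read that fetched the page, which is consistent with the choice not to sort pairs on the drive.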
The process by which a client and the above data storage system perform data read and write operations is further described below. When the client writes data to the system, as shown in Fig. 3, the process comprises the following steps.
In step S201, the client splits the data into multiple data blocks and calculates the fingerprint information of each data block.
A hash algorithm with a low collision rate, such as SHA-1 or MD5, is used to calculate the hash value of each data block, and that hash value serves as the fingerprint information.
In step S202, the client determines the Bucket corresponding to the fingerprint information of each data block.
The client takes the fingerprint information of the data block modulo the total number of Buckets in the system and matches the result against the Bucket numbers, thereby determining the Bucket to which the fingerprint information corresponds. For example, if the hash value of the data block is a and the total number of Buckets in the system is p, the modulo operation a % p is performed; if the result is 2, the fingerprint information of the data block corresponds to the Bucket numbered 2.
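Steps S201 and S202 can be sketched in a few lines. SHA-1 is one of the algorithms named above; the fixed-size splitting and the interpretation of the digest as a big-endian integer are assumptions, since the description does not say how the data is split or how the hash value is turned into a number.

```python
import hashlib


def split_fixed(data, block_size):
    """Split data into consecutive blocks (fixed-size splitting is an assumption)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block):
    """Fingerprint information of a block: its SHA-1 digest."""
    return hashlib.sha1(block).digest()


def bucket_for(fp, total_buckets):
    """Bucket number = fingerprint modulo the total number of Buckets (a % p)."""
    return int.from_bytes(fp, "big") % total_buckets
```

Because the Bucket number depends only on the block content and the Bucket count, identical blocks written by different clients are always routed to the same Bucket, which is what makes the deduplication global.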
In step S203, the client determines the deduplication node corresponding to the Bucket according to the routing table obtained from the central node.
The client determines, from the routing table it holds, the deduplication node on which the Bucket corresponding to the fingerprint information resides. When the client writes data to the system for the first time, it may first request the routing table from the central node. For example, suppose the fingerprint information of a data block corresponds to the Bucket numbered 2, and in the routing table the Bucket numbered 2 is assigned to deduplication node A and deduplication node B, where deduplication node A is the primary node of that Bucket and deduplication node B is its backup node. The fingerprint information of the data block then needs to be sent to deduplication node A for a fingerprint query.
In step S204, the client sends a fingerprint query request to the deduplication node corresponding to the Bucket; the fingerprint query request includes the fingerprint information of the data block.
The client includes read threads, a send thread and a logic processing thread. Multiple read threads are each responsible for a different part of the data: each splits its part into blocks, calculates the fingerprint information of the data blocks, and buffers the fingerprint information in query request queues. Each read thread has multiple query request queues, and each query request queue corresponds to one Bucket number, so the client buffers the fingerprint information that corresponds to the same Bucket in the same query request queue. When the amount of data in a query request queue exceeds a certain level, or the delay of the query request queue expires, the read thread places the query requests into a buffer of the send thread.
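The per-Bucket query request queues can be sketched as a small batching helper. The class name, the explicit `now` argument (used instead of a real clock so the example is deterministic) and the use of "at least" rather than "more than" for the size threshold are assumptions made for illustration.

```python
class BucketQueryQueues:
    """Per-Bucket queues of fingerprints, flushed on a size or delay threshold."""

    def __init__(self, max_items, max_delay, hand_off):
        self.max_items = max_items
        self.max_delay = max_delay
        self.hand_off = hand_off  # callable(bucket_number, fingerprints)
        self.queues = {}
        self.oldest = {}          # arrival time of the oldest queued entry

    def add(self, bucket_number, fp, now):
        queue = self.queues.setdefault(bucket_number, [])
        if not queue:
            self.oldest[bucket_number] = now
        queue.append(fp)
        if len(queue) >= self.max_items:
            self.flush(bucket_number)

    def tick(self, now):
        """Flush every queue whose oldest entry has waited max_delay or longer."""
        for bucket_number in list(self.queues):
            waiting = self.queues[bucket_number]
            if waiting and now - self.oldest[bucket_number] >= self.max_delay:
                self.flush(bucket_number)

    def flush(self, bucket_number):
        batch, self.queues[bucket_number] = self.queues[bucket_number], []
        if batch:
            self.hand_off(bucket_number, batch)
```

In the arrangement described above, `hand_off` would place the batch into a buffer of the send thread, so that one request packet carries many fingerprints for the same Bucket.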
The send thread sends a request packet, according to the Bucket to which each query request queue corresponds, to the deduplication node on which that Bucket resides (the primary node of the Bucket). In one embodiment, the send thread has four buffers. Two of them hold requests being transmitted to the system, one for fingerprint query requests and one for data block upload requests; the other two receive new requests from the other threads of the client, again one for fingerprint query requests and one for data block upload requests. Providing two kinds of buffers separates the requests being transmitted to the system from the new requests arriving from other threads, and prevents the other threads from blocking for a long time while writing new requests.
When the send thread receives a response packet returned by a deduplication node, it passes the response packet to the logic processing thread for processing. According to the fingerprint query result returned by the deduplication node, the logic processing thread passes an upload request for the fingerprint information that was not found and the data blocks it represents to the send thread, and the send thread transmits that fingerprint information and those data blocks to the corresponding deduplication node. Dividing the work among threads in this way keeps the transmission of requests continuous and smooth.
The response packet received by the send thread includes the version information of the routing table currently stored on the deduplication node. The client checks whether the version information of the routing table in the response packet is the same as that of the routing table it obtained from the central node. If the two differ, the central node has updated the routing table and synchronized it to the deduplication nodes in the system. The client then obtains the updated routing table from the central node through the send thread, redetermines the deduplication node corresponding to the Bucket according to the updated routing table, and thereby redetermines the deduplication node to which the fingerprint information that was not found and the data blocks it represents should be uploaded. If the two are the same, the deduplication node corresponding to the Bucket is still determined according to the routing table obtained from the central node, and the deduplication node to which the fingerprint information that was not found and the data blocks it represents should be uploaded is unchanged.
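The routing table version check can be sketched as a single function. The dictionary layout with `version` and `primary` keys, the function name and the `fetch_routing_table` callable are hypothetical; the description only states that a version difference triggers a refetch from the central node.

```python
def node_for_upload(bucket_number, response_version, cached_table, fetch_routing_table):
    """Choose the upload target, refreshing the routing table if it is stale.

    cached_table: hypothetical dict {"version": int, "primary": {bucket: node}}.
    fetch_routing_table: callable that returns the current table from the central node.
    """
    if response_version != cached_table["version"]:
        # The deduplication node holds a different version: refresh before routing.
        cached_table = fetch_routing_table()
    return cached_table, cached_table["primary"][bucket_number]
```

Piggybacking the version on every response packet lets clients detect a reassignment without polling the central node.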
In step S205, the deduplication node queries the fingerprint information and returns the fingerprint information that was not found to the client.
The deduplication node includes a Bloom filter and a fingerprint information store. The Bloom filter holds a hash index of all the fingerprint information currently stored on the deduplication node; the fingerprint information store holds, in the form of Key-Value Pairs, all the fingerprint information and the storage information of the data blocks that the fingerprint information represents. The deduplication node passes every piece of fingerprint information in the fingerprint query request first through the Bloom filter and then, where necessary, through the fingerprint information store. The hash index of each piece of fingerprint information is calculated by the Bloom filter and compared with the hash indexes in the Bloom filter. If it differs from all of them, the deduplication node holds no identical fingerprint information and no data block represented by it. If it matches a hash index in the Bloom filter, it cannot be concluded that the fingerprint information already exists, because the Bloom filter is subject to hash collisions; the fingerprint information store must then be queried to check whether it contains the fingerprint information. If the fingerprint information is present in the fingerprint information store, it is determined to exist; if it is absent, it is determined not to exist. Querying the Bloom filter first improves the efficiency of fingerprint queries on the deduplication node, and the fingerprint information store compensates for the false positives that the Bloom filter may produce because of hash collisions, improving the accuracy of fingerprint queries on the deduplication node. The deduplication node places all the fingerprint information in the fingerprint query request that was not found into a response packet and returns it to the client. The response packet also includes the version information of the routing table currently stored on the deduplication node, which the client uses to decide whether it needs to update its routing table.
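The two-stage lookup in step S205 can be sketched as follows. The Bloom filter here is a generic textbook construction built on salted BLAKE2b digests; its size, number of hash functions and the use of a plain dictionary as the fingerprint information store are assumptions for illustration, not the parameters of the embodiment.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits=1 << 16, hashes=4):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.blake2b(item, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def fingerprints_not_found(fingerprints, bloom, store):
    """Return the fingerprints the node does not hold.

    A Bloom miss is definitive; a Bloom hit is confirmed against the store.
    """
    missing = []
    for fp in fingerprints:
        if not bloom.might_contain(fp) or fp not in store:
            missing.append(fp)
    return missing
```

Only fingerprints that pass the Bloom filter reach the fingerprint information store, so the store sees the confirmations and the false positives but none of the definite misses.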
In step S206, the client uploads the fingerprint information that was not found and the data blocks it represents to the deduplication node corresponding to the Bucket.
The client uploads the fingerprint information listed as not found in the response packet, together with the data blocks it represents, to the deduplication node on which the Bucket corresponding to that fingerprint information resides. If the version information of the routing table has not changed, that deduplication node is the same node that performed the fingerprint query in step S205. The other fingerprint information in the fingerprint query request already exists on the deduplication node and need not be uploaded again, which prevents the system from storing identical data blocks more than once.
In step S207, the deduplication node saves the fingerprint information that was not found in the Bucket assigned to it, and saves the data blocks in the Container file corresponding to that Bucket.
The deduplication node receives the fingerprint information that was not found and the data blocks it represents, saves the fingerprint information in the Bucket to which it corresponds, and saves the data blocks in the Container file corresponding to that Bucket. The name of a Container file consists of the number of the Bucket to which the Container file corresponds, a universally unique identifier (UUID) used inside the system, and a date (Date), for example 2_abcd234_010515. To ensure that a data block reaches the disk, the data block is written to the corresponding Container file with the O_SYNC flag set, so that each write returns only after it has completed. After pwrite returns, the fingerprint information of the data block is written to the fingerprint information store: the fingerprint information of the data block is the Key, the second storage information of the data block is the Value, and the two are saved in the fingerprint information store as a Key-Value Pair. The second storage information includes the name of the Container file that holds the data block, the offset (Offset) of the data block within that Container file, and the size (Chunksize) of the data block. The hash index of the newly saved Key-Value Pair is also added to the Bloom filter for use in subsequent deduplication queries.
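The write path of step S207 can be sketched with the POSIX-style calls exposed by the Python `os` module. The function names and the use of a dictionary as the fingerprint information store are assumptions; the sketch also assumes a Unix-like platform where `os.O_SYNC` and `os.pwrite` are available, and it leaves out the Bloom filter update.

```python
import os


def container_name(bucket_number, system_uuid, date):
    """Container file name: Bucket number + UUID + date, joined by underscores."""
    return f"{bucket_number}_{system_uuid}_{date}"


def save_block(directory, name, fp, block, fingerprint_store):
    """Append a block to its Container file, then record its storage information."""
    path = os.path.join(directory, name)
    # O_SYNC: each write returns only after the data has reached the device.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        offset = os.fstat(fd).st_size  # append at the current end of the file
        written = 0
        while written < len(block):    # pwrite may write fewer bytes than asked
            written += os.pwrite(fd, block[written:], offset + written)
    finally:
        os.close(fd)
    # Only after the write has returned is the fingerprint made visible.
    info = (name, offset, len(block))  # Container name, Offset, Chunksize
    fingerprint_store[fp] = info
    return info
```

Recording the fingerprint only after the synchronous write returns means a fingerprint found in the store always refers to a block that is already on disk.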
In step S208, the deduplication node returns a message to the client indicating that the data block was saved successfully.
After the fingerprint information that was not found and the data blocks it represents have been saved, the deduplication node returns a message to the client indicating that the data blocks were saved successfully. Alternatively, in one embodiment, when the Bucket corresponding to that fingerprint information has backup nodes in the system, the primary node, after saving the fingerprint information and data blocks uploaded by the client, backs them up to the backup nodes of the Bucket and returns the success message to the client only after the backup has completed.
In step S209, when all the fingerprint information that was not found and the data blocks it represents have been uploaded, the client uploads the mapping file of the data to the deduplication node.
The mapping file includes the fingerprint information of every data block of the data, arranged in the order in which the client split the data into data blocks, so that the data can be mapped back correctly.
The client also uploads the mapping file in blocks. It splits the mapping file into multiple data blocks and calculates the hash value of each; for example, the client calculates the hash value of each data block of the mapping file with the murmur2 hash function. The client determines the Bucket corresponding to the hash value of each data block of the mapping file, determines from the routing table the deduplication node corresponding to that Bucket, and uploads the data block of the mapping file and its hash value to that deduplication node. When uploading the data blocks of the mapping file, the client likewise performs a fingerprint query with the hash value of each data block and uploads only those data blocks whose hash values were not found, which avoids uploading duplicate mapping file data blocks. When splitting the mapping file into multiple data blocks, the client makes the header of the mapping file the first of those data blocks; the header includes information such as the total size of the mapping file and the total number of data blocks.
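The block layout of the mapping file can be sketched as below. Several details are assumptions, since the description does not fix them: the binary header layout, whether the block count includes the header block (here it does), whether the total size counts the header (here it does not), and the `name#sequence` key format. For brevity the sketch keys the blocks themselves by name and sequence number, whereas in the embodiment the stored Value is the storage information of each block.

```python
import struct

# Hypothetical header layout: total size (8 bytes) and block count (4 bytes).
HEADER = struct.Struct(">QI")


def build_mapping_blocks(name, fingerprints, fps_per_block):
    """Split an ordered fingerprint list into blocks, with the header as block 0."""
    body_blocks = [
        b"".join(fingerprints[i:i + fps_per_block])
        for i in range(0, len(fingerprints), fps_per_block)
    ]
    total_size = sum(len(block) for block in body_blocks)
    header = HEADER.pack(total_size, len(body_blocks) + 1)
    blocks = [header] + body_blocks
    return {f"{name}#{seq}": block for seq, block in enumerate(blocks)}


def read_mapping(name, fetch):
    """Fetch block 0 first, then use its header to request the remaining blocks."""
    total_size, count = HEADER.unpack(fetch(f"{name}#0"))
    body = b"".join(fetch(f"{name}#{seq}") for seq in range(1, count))
    if len(body) != total_size:
        raise ValueError("mapping file is incomplete")
    return body
```

Placing the header in the first block lets a reader learn how many further blocks to request knowing only the mapping file name.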
In step S210, the deduplication node saves the data blocks of the mapping file uploaded by the client and their hash values.
The hash value of each data block of the mapping file and its first storage information are saved in the Bucket corresponding to that hash value, and the data block of the mapping file is saved in the Container file corresponding to that Bucket. The first storage information includes the name of the Container file that holds the data block of the mapping file, the offset of the data block within the Container file, and the size of the data block. In addition, with the name of the mapping file plus the data block sequence number as the Key and the first storage information of the data block as the Value, a Key-Value Pair is added to the fingerprint information store. This completes the process by which the client writes data to the system.
Fig. 4 shows the process by which the client reads data from the system in an embodiment of the present application. The process comprises the following steps.
In step S301, the client requests the mapping file from the deduplication node according to the mapping file name of the data and a data block sequence number.
The client first requests the first data block of the mapping file from the deduplication node. The first data block contains the header of the mapping file, which includes the size of the mapping file and the total number of its data blocks. Using the header, the client sends requests to the deduplication node for the other data blocks of the mapping file.
In step S302, the deduplication node sends the data blocks of the mapping file to the client.
The deduplication node matches the mapping file name and data block sequence number sent by the client against the Keys in the fingerprint information store, finds the Key-Value Pair of the data block of the mapping file, and obtains the first storage information that corresponds to that Key. From the Container file name in the first storage information it determines which Container file holds the data block of the mapping file, and from the offset of the data block within the Container file and the size of the data block it reads the data block from the Container file.
In step S303, the client assembles the mapping file from its data blocks and requests data blocks from the deduplication node according to the fingerprint information of each data block in the mapping file.
The client assembles the complete mapping file according to the sequence numbers of its data blocks. The mapping file includes the fingerprint information of all the data blocks, arranged in the order in which the data was split. The client determines the Bucket corresponding to each piece of fingerprint information, then determines from the routing table the deduplication node on which that Bucket resides, and sends that deduplication node a request for the corresponding data block.
In step S304, the deduplication node sends the data blocks represented by the fingerprint information of the mapping file to the client.
The deduplication node queries the fingerprint information store with the fingerprint information in the data block request and obtains the second storage information that corresponds to it. From the Container file name in the second storage information it determines which Container file holds the data block represented by the fingerprint information, and from the offset of the data block within the Container file and the size of the data block it reads the data block from the Container file. In one embodiment, after determining from the Container file name which Container file holds the data block, the deduplication node checks whether that Container file has been archived to the backend server; if it has, the deduplication node reads the data block from the Container file stored on the backend server and sends the data block to the client.
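Reading a block back from its storage information (Container file name, offset, size) can be sketched with a positional read. The function name and the local directory argument are assumptions, and the sketch assumes a Unix-like platform where `os.pread` is available; the case of a Container file archived to the backend server is not covered.

```python
import os


def read_block(directory, storage_info):
    """Read one block given (Container file name, offset, size)."""
    name, offset, size = storage_info
    fd = os.open(os.path.join(directory, name), os.O_RDONLY)
    try:
        data = os.pread(fd, size, offset)  # positional read, no seek needed
    finally:
        os.close(fd)
    if len(data) != size:
        raise IOError("short read from Container file")
    return data
```

A single positional read per block is enough because the fingerprint information store already records exactly where the block lies in the Container file.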
In step S305, the client assembles the data from the data blocks according to the order of their fingerprint information in the mapping file.
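Step S305 amounts to concatenating blocks in mapping file order. In the sketch below, `fetch_block` is a hypothetical callable standing in for the request to the deduplication node, and the SHA-1 integrity check is an optional addition that the description does not mention.

```python
import hashlib


def reassemble(ordered_fingerprints, fetch_block):
    """Rebuild the data by fetching blocks in the order given by the mapping file."""
    parts = []
    for fp in ordered_fingerprints:
        block = fetch_block(fp)
        # Optional integrity check (an assumption, not part of the description).
        if hashlib.sha1(block).digest() != fp:
            raise ValueError("block does not match its fingerprint")
        parts.append(block)
    return b"".join(parts)
```

A fingerprint that appears several times in the mapping file is served from the same stored block each time, which is how deduplicated data is expanded back to its original form.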
As shown in Fig. 5, a client for reading and writing data in an embodiment of the present application includes:
a splitting and calculation module 501, configured to split data into multiple data blocks and calculate the fingerprint information of each data block;
a Bucket determination module 502, configured to determine the Bucket corresponding to the fingerprint information of each data block;
a node determination module 503, configured to determine the deduplication node corresponding to the Bucket according to the routing table obtained from the central node;
a request sending module 504, configured to send a fingerprint query request to the deduplication node corresponding to the Bucket, the fingerprint query request including the fingerprint information of the data blocks;
an information receiving module 505, configured to receive the fingerprint information that was not found, as returned by the deduplication node corresponding to the Bucket;
a data upload module 506, configured to upload the fingerprint information that was not found and the data blocks it represents to the deduplication node corresponding to the Bucket and, when all of that fingerprint information and those data blocks have been uploaded, to upload the mapping file of the data to the deduplication node, the mapping file including the fingerprint information of each data block of the data, arranged in the order in which the data blocks were split.
In addition, an embodiment of the present application discloses a system for reading and writing data. Referring to Fig. 1, the system includes a central node 10 and one or more deduplication nodes 11, wherein:
the central node 10 is configured to send a routing table to the client, the routing table including the correspondence between Buckets and deduplication nodes;
the deduplication node 11 is configured to receive a fingerprint query request from the client, the fingerprint query request including fingerprint information corresponding to a Bucket assigned to the deduplication node; to query the fingerprint information and return the fingerprint information that was not found to the client; and to receive the fingerprint information that was not found and the data blocks it represents, as uploaded by the client.
It should be noted that the features of the system for reading and writing data shown in Fig. 1 correspond to those of the embodiments shown in Figs. 3 and 4, and the features of the client for reading and writing data shown in Fig. 5 likewise correspond to those of the embodiments shown in Figs. 3 and 4. For matters not described in detail in the embodiments of Figs. 1 and 5, reference may therefore be made to the description of the embodiments shown in Figs. 3 and 4, which is not repeated here.
The data storage method, data storage system, data read-write method, client for reading and writing data, and system for reading and writing data provided by the embodiments of the present application achieve global deduplicated storage management of raw data on the order of 100 PB and above and of fingerprint information on the order of 100 TB and above. The system is highly scalable: after a new deduplication node joins, the central node redistributes the data according to the preset policy and the deduplication nodes complete the data migration automatically, so the performance and capacity of the system can be expanded easily. Each deduplication node implements a high-performance fingerprint information store based on a solid state drive. Building a large-capacity cuckoo hash mapping table on the solid state drive overcomes the technical difficulty that, when the volume of fingerprint information is too large, an index cannot be built in memory and deduplication queries therefore cannot be performed, while ensuring the efficiency of fingerprint queries and improving their accuracy.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, in forms such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. The description and claims do not distinguish components by differences in name, but by differences in function. The term "comprising", as used throughout the description and claims, is open-ended and should therefore be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, a person skilled in the art can solve the stated technical problem and basically achieve the stated technical effect. In addition, the term "coupled" here covers any direct or indirect means of electrical coupling. Therefore, where the text describes a first device as coupled to a second device, the first device may be directly electrically coupled to the second device, or indirectly electrically coupled to it through other devices or coupling means. The description sets out preferred embodiments of the invention; it is given to illustrate the general principles of the invention and is not intended to limit its scope. The scope of protection of the invention is defined by the appended claims.
It should also be noted that the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a product or system that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a product or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the product or system that includes the element.
The above description shows and describes several preferred embodiments of the invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention may be used in various other combinations, modifications, and environments, and may be modified within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the relevant art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims (33)

1. A data storage method, characterized in that it is applied to a data storage system comprising a central node and deduplication nodes, the data storage method comprising:
the central node assigning each bucket (Bucket) to a corresponding deduplication node according to a preset policy;
the central node creating a routing table according to the correspondence between Buckets and deduplication nodes, and synchronizing the routing table to each deduplication node;
each deduplication node storing, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks that the fingerprint information represents.
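As an illustration only (not part of the claim), the Bucket assignment and routing table of claim 1 might be sketched as follows. The claim does not specify the preset policy; round-robin is assumed here, and the names `RoutingTable`, `assign_buckets`, and the `version` field are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingTable:
    # Hypothetical structure: a version number plus the Bucket -> node mapping.
    version: int
    bucket_to_node: dict = field(default_factory=dict)


def assign_buckets(num_buckets: int, nodes: list, version: int = 1) -> RoutingTable:
    """Assign every Bucket to a deduplication node.

    Round-robin stands in for the unspecified "preset policy".
    """
    if not nodes:
        raise ValueError("at least one deduplication node is required")
    mapping = {bucket: nodes[bucket % len(nodes)] for bucket in range(num_buckets)}
    return RoutingTable(version=version, bucket_to_node=mapping)


# The central node would then synchronize this table to every deduplication node.
table = assign_buckets(8, ["node-a", "node-b", "node-c"])
```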
2. The data storage method as claimed in claim 1, characterized in that the deduplication node storing, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks that the fingerprint information represents comprises:
the deduplication node creating a corresponding container (Container) file for each Bucket assigned to it;
the deduplication node saving the corresponding fingerprint information in each Bucket assigned to it, and saving the data blocks that the fingerprint information represents in the Container file corresponding to each such Bucket.
3. The data storage method as claimed in claim 2, characterized in that:
the deduplication node determines whether the size of the Container file exceeds a preset threshold;
when the size of the Container file exceeds the preset threshold, the deduplication node archives the Container file to a backend server.
4. The data storage method as claimed in claim 1, characterized in that the central node assigning each Bucket to a corresponding deduplication node according to the preset policy comprises:
the central node assigning each Bucket to a plurality of corresponding deduplication nodes, among which one primary node and at least one standby node are designated.
5. The data storage method as claimed in claim 1 or 4, characterized in that:
the central node determines whether each deduplication node is available, or whether a new deduplication node has been added;
when a deduplication node is determined to be unavailable, or a new deduplication node has been added, the central node reassigns the Buckets;
the central node updates the routing table and synchronizes it to each deduplication node;
the deduplication nodes perform data migration according to the updated routing table.
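As an illustration only, the reassignment and migration steps of claim 5 can be sketched by comparing the routing table before and after a node joins. Round-robin reassignment is an assumption made for brevity; it moves many Buckets, and a real policy would likely try to minimize movement. The function names are hypothetical.

```python
def reassign(num_buckets: int, nodes: list) -> dict:
    # Round-robin stand-in for the unspecified preset policy.
    return {bucket: nodes[bucket % len(nodes)] for bucket in range(num_buckets)}


def plan_migration(old: dict, new: dict) -> list:
    """Return (bucket, source_node, target_node) for each Bucket whose owner changed."""
    return [
        (bucket, old[bucket], new[bucket])
        for bucket in sorted(new)
        if bucket in old and old[bucket] != new[bucket]
    ]


old_table = reassign(6, ["node-a", "node-b"])
new_table = reassign(6, ["node-a", "node-b", "node-c"])  # node-c has just joined
moves = plan_migration(old_table, new_table)
```

Each tuple in `moves` describes one Bucket whose fingerprint information and data blocks would have to be migrated from the old owner to the new one.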
6. The data storage method as claimed in claim 5, characterized in that the deduplication nodes performing data migration according to the updated routing table comprises:
the primary node initiating the data migration according to the updated routing table.
7. The data storage method as claimed in claim 5, characterized in that, when a deduplication node is determined to be unavailable, the central node reassigning the Buckets comprises:
when the primary node is determined to be unavailable, the central node designating a new primary node from among the at least one standby node;
and the deduplication nodes performing data migration according to the updated routing table comprises:
the newly designated primary node initiating the data migration according to the updated routing table.
8. The data storage method as claimed in claim 1, characterized in that each deduplication node includes a fingerprint information store, the fingerprint information store being a cuckoo hash mapping table stored on a solid state drive, which contains the fingerprint information corresponding to each Bucket of the deduplication node and the storage information of the data blocks that the fingerprint information represents.
9. The data storage method as claimed in claim 8, characterized in that M cuckoo hash mapping tables run simultaneously on the solid state drive, and N cuckoo hash functions are used simultaneously, where M × N = 128.
10. The data storage method as claimed in claim 9, characterized in that 32 cuckoo hash mapping tables run simultaneously on the solid state drive, and 4-way cuckoo hash functions are used simultaneously.
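As an illustration only, the 32-table, 4-way arrangement of claims 9 and 10 might look like the in-memory sketch below. The claims keep the tables on a solid state drive; memory is used here purely for brevity. The salted SHA-1 hash functions, the rule for picking one of the 32 tables, and all class names are assumptions.

```python
import hashlib


class CuckooTable:
    """One N-way cuckoo hash table (in memory; the claims place it on SSD)."""

    def __init__(self, num_slots: int, num_hashes: int = 4, max_kicks: int = 64):
        self.slots = [None] * num_slots
        self.num_hashes = num_hashes
        self.max_kicks = max_kicks

    def _positions(self, key: bytes) -> list:
        # N candidate slots, derived by salting the key with the function index.
        return [
            int.from_bytes(hashlib.sha1(bytes([i]) + key).digest()[:8], "big")
            % len(self.slots)
            for i in range(self.num_hashes)
        ]

    def get(self, key: bytes):
        for pos in self._positions(key):
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key: bytes, value) -> bool:
        for pos in self._positions(key):  # overwrite if already present
            if self.slots[pos] is not None and self.slots[pos][0] == key:
                self.slots[pos] = (key, value)
                return True
        entry = (key, value)
        for kick in range(self.max_kicks):
            positions = self._positions(entry[0])
            for pos in positions:
                if self.slots[pos] is None:
                    self.slots[pos] = entry
                    return True
            # All candidates are full: evict one occupant and re-place it.
            victim = positions[kick % self.num_hashes]
            entry, self.slots[victim] = self.slots[victim], entry
        return False  # gave up; a real store would grow or rehash here


class FingerprintStore:
    """M tables x N hash functions, with M = 32 and N = 4 so that M x N = 128."""

    def __init__(self, num_tables: int = 32, num_hashes: int = 4, slots: int = 64):
        self.tables = [CuckooTable(slots, num_hashes) for _ in range(num_tables)]

    def _table(self, fp: bytes) -> CuckooTable:
        # Assumed selection rule: last fingerprint byte modulo the table count.
        return self.tables[fp[-1] % len(self.tables)]

    def put(self, fp: bytes, info) -> bool:
        return self._table(fp).put(fp, info)

    def get(self, fp: bytes):
        return self._table(fp).get(fp)
```

A lookup touches at most N = 4 slots in a single table, which is what makes this structure attractive when the index is too large for memory.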
11. A data read-write method, characterized in that it comprises:
splitting data into a plurality of data blocks and computing the fingerprint information of each data block;
determining the Bucket corresponding to the fingerprint information of each data block;
determining, according to a routing table obtained from a central node, the deduplication node corresponding to the Bucket;
sending a fingerprint query request to the deduplication node corresponding to the Bucket, the fingerprint query request including the fingerprint information of the data blocks;
receiving the fingerprint information that was not found, as returned by the deduplication node corresponding to the Bucket;
uploading the fingerprint information that was not found and the data blocks it represents to the deduplication node corresponding to the Bucket.
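As an illustration only, the client-side write flow of claim 11 might be sketched as below. Fixed-size splitting, SHA-1 fingerprints, the modulo Bucket mapping, and the `InMemoryDedupNode` class are all assumptions; the real client would talk to deduplication nodes over the network.

```python
import hashlib


class InMemoryDedupNode:
    """Hypothetical stand-in for a remote deduplication node."""

    def __init__(self):
        self.blocks = {}

    def query_missing(self, fingerprints: list) -> list:
        # Return the fingerprints that this node has not stored yet.
        return [fp for fp in fingerprints if fp not in self.blocks]

    def upload(self, fp: bytes, block: bytes) -> None:
        self.blocks[fp] = block


def write_data(data: bytes, routing: dict, num_buckets: int, block_size: int = 4):
    """Split, fingerprint, query, and upload only the blocks that are missing."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    fingerprints = [hashlib.sha1(block).digest() for block in blocks]
    block_by_fp = dict(zip(fingerprints, blocks))

    # Group the distinct fingerprints by the Bucket they map to.
    fps_by_bucket = {}
    for fp in block_by_fp:
        bucket = int.from_bytes(fp, "big") % num_buckets
        fps_by_bucket.setdefault(bucket, []).append(fp)

    uploaded = 0
    for bucket, fps in fps_by_bucket.items():
        node = routing[bucket]  # routing table: Bucket -> deduplication node
        for fp in node.query_missing(fps):
            node.upload(fp, block_by_fp[fp])
            uploaded += 1
    # The ordered fingerprint list is what a mapping file would record.
    return fingerprints, uploaded
```

Writing the same data a second time uploads nothing, because every fingerprint is already found by the query.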
12. The method as claimed in claim 11, characterized in that determining the Bucket corresponding to the fingerprint information of each data block comprises:
performing a modulo operation on the fingerprint information with respect to the total number of Buckets, and determining the Bucket corresponding to the fingerprint information according to the result of the modulo operation.
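As an illustration only, the modulo mapping of claim 12 can be written in a few lines. The claim does not name the fingerprint algorithm; SHA-1 is assumed here, and interpreting the fingerprint as a big-endian integer is likewise an assumption.

```python
import hashlib


def fingerprint(block: bytes) -> bytes:
    # Assumed fingerprint function; the claim does not specify one.
    return hashlib.sha1(block).digest()


def bucket_for(fp: bytes, num_buckets: int) -> int:
    """Treat the fingerprint as an integer and take it modulo the Bucket count."""
    return int.from_bytes(fp, "big") % num_buckets
```

Because the mapping depends only on the fingerprint and the total number of Buckets, every client computes the same Bucket for the same data block without consulting the central node.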
13. The method as claimed in claim 11, characterized in that the method further comprises:
when all of the fingerprint information that was not found and the data blocks it represents have been uploaded, uploading a mapping file of the data to the deduplication node, the mapping file including the fingerprint information of each data block of the data, arranged in the order in which the data blocks were split.
14. The method as claimed in claim 13, characterized in that uploading the mapping file of the data to the deduplication node comprises:
splitting the mapping file into a plurality of data blocks and computing the hash value of each data block of the mapping file;
determining the Bucket corresponding to the hash value of each data block of the mapping file;
determining, according to the routing table, the deduplication node corresponding to the Bucket that corresponds to the hash value of the data block of the mapping file;
uploading the data block of the mapping file and its hash value to the deduplication node corresponding to the Bucket that corresponds to the hash value of the data block of the mapping file.
15. The method as claimed in claim 14, characterized in that splitting the mapping file into a plurality of data blocks comprises:
splitting off the header of the mapping file as the first of the plurality of data blocks, the header of the mapping file including information such as the size of the mapping file and the total number of the plurality of data blocks.
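As an illustration only, the header-first splitting of claim 15 might be sketched as follows. The binary header layout (an 8-byte size followed by a 4-byte block count), the choice to count the header in the size, and the number of fingerprints per block are all assumptions; the claim only says the header carries the mapping file size and the total number of blocks.

```python
import struct

# Hypothetical header layout: mapping-file size (8 bytes), block count (4 bytes).
HEADER = struct.Struct(">QI")


def split_mapping_file(fingerprints: list, fps_per_block: int = 2) -> list:
    """Return the mapping file as blocks, with the header as the first block."""
    body = [
        b"".join(fingerprints[i:i + fps_per_block])
        for i in range(0, len(fingerprints), fps_per_block)
    ]
    total_blocks = len(body) + 1  # body blocks plus the header block
    total_size = HEADER.size + sum(len(block) for block in body)
    return [HEADER.pack(total_size, total_blocks)] + body
```

A reader that fetches the first block can unpack the header and learn how many further blocks to request.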
16. The method as claimed in claim 13, characterized in that the method further comprises:
obtaining the mapping file of the data from the deduplication node;
obtaining each data block of the data from the deduplication nodes according to the fingerprint information in the mapping file;
splicing the data blocks together, in the order of their fingerprint information in the mapping file, to reconstruct the data.
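As an illustration only, the reassembly step of claim 16 reduces to fetching each block by fingerprint and concatenating in mapping-file order. The `fetch_block` callable is a hypothetical stand-in for the request to the responsible deduplication node, and SHA-1 is an assumed fingerprint function.

```python
import hashlib


def read_data(mapping_fingerprints: list, fetch_block) -> bytes:
    """Rebuild the data by fetching blocks in the order the mapping file lists them."""
    return b"".join(fetch_block(fp) for fp in mapping_fingerprints)


# A dict plays the role of the deduplication nodes in this sketch.
original_blocks = [b"ab", b"cd", b"ab"]
block_store = {hashlib.sha1(b).digest(): b for b in original_blocks}
mapping = [hashlib.sha1(b).digest() for b in original_blocks]
restored = read_data(mapping, block_store.__getitem__)
```

The duplicate block is stored once but appears twice in the mapping, so the original byte sequence is recovered exactly.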
17. The method as claimed in claim 16, characterized in that obtaining the mapping file of the data from the deduplication node comprises:
obtaining each data block of the mapping file from the deduplication node according to the name of the mapping file and the data block sequence number;
splicing the data blocks of the mapping file together into the mapping file of the data.
18. The method as claimed in claim 11, characterized in that determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket comprises:
when data is stored for the first time, obtaining the routing table from the central node;
determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket.
19. The method as claimed in claim 18, characterized in that determining, according to the routing table obtained from the central node, the deduplication node corresponding to the Bucket further comprises:
sending a request packet to the deduplication node corresponding to the Bucket;
receiving a response packet returned by the deduplication node corresponding to the Bucket, the response packet including version information of the routing table;
determining whether the version information of the routing table in the response packet is the same as the version information of the routing table obtained from the central node;
when the version information of the routing table in the response packet is the same as the version information of the routing table obtained from the central node, determining the deduplication node corresponding to the Bucket according to the routing table obtained from the central node;
when the version information of the routing table in the response packet differs from the version information of the routing table obtained from the central node, obtaining the updated routing table from the central node, and re-determining the deduplication node corresponding to the Bucket according to the updated routing table.
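As an illustration only, the version check of claim 19 might be sketched as below. The integer version field, the `RoutingTable` class, and the `fetch_latest` callable (standing in for a request to the central node) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RoutingTable:
    version: int
    bucket_to_node: dict


def resolve_node(bucket: int, cached: RoutingTable, node_version: int, fetch_latest):
    """Pick the node for a Bucket, refreshing the cached table if versions differ.

    node_version is the routing-table version reported in the node's response packet.
    Returns the node and the routing table the client should keep caching.
    """
    if node_version != cached.version:
        cached = fetch_latest()  # ask the central node for the updated table
    return cached.bucket_to_node[bucket], cached
```

The client contacts the central node only when a deduplication node reports a different version, which keeps the central node off the normal read and write path.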
20. A data read-write method, characterized in that it comprises:
a central node sending a routing table to a client, the routing table including the correspondence between Buckets and deduplication nodes;
a deduplication node receiving a fingerprint query request from the client, the fingerprint query request including fingerprint information corresponding to a Bucket assigned to the deduplication node;
the deduplication node looking up the fingerprint information and returning to the client the fingerprint information that was not found;
the deduplication node receiving the fingerprint information that was not found, and the data blocks it represents, as uploaded by the client.
21. The method as claimed in claim 20, characterized in that the method further comprises:
the deduplication node saving the fingerprint information that was not found in the Bucket assigned to it, and saving the data block in the Container file corresponding to that Bucket;
the deduplication node returning to the client a message indicating that the data block was saved successfully.
22. The method as claimed in claim 21, characterized in that, before the deduplication node returns to the client the message indicating that the data block was saved successfully, the method further comprises:
the deduplication node backing up the fingerprint information that was not found, and the data block it represents, to a standby node.
23. The method as claimed in claim 21, characterized in that the method further comprises:
the deduplication node saving the data blocks of the mapping file uploaded by the client and the corresponding hash values.
24. The method as claimed in claim 23, characterized in that saving the data blocks of the mapping file uploaded by the client and the corresponding hash values comprises:
saving the data block of the mapping file in the Container file corresponding to the corresponding Bucket;
saving, in the corresponding Bucket, the hash value of the data block of the mapping file and first storage information.
25. The method as claimed in claim 24, characterized in that the first storage information includes: the name of the Container file in which the data block of the mapping file is saved, the offset of the data block of the mapping file within the Container file, and the size of the data block of the mapping file.
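As an illustration only, the storage information of claim 25 (Container file name, offset, size) might be modeled as below. The append-only in-memory `Container` class and the file name used in the example are hypothetical; the claim itself only lists the three fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageInfo:
    container_name: str  # name of the Container file holding the block
    offset: int          # byte offset of the block within that file
    size: int            # length of the block in bytes


class Container:
    """Append-only in-memory stand-in for a Container file."""

    def __init__(self, name: str):
        self.name = name
        self.buffer = bytearray()

    def append(self, block: bytes) -> StorageInfo:
        info = StorageInfo(self.name, len(self.buffer), len(block))
        self.buffer += block
        return info

    def read(self, info: StorageInfo) -> bytes:
        return bytes(self.buffer[info.offset:info.offset + info.size])


container = Container("bucket-7.container")
first = container.append(b"hello")
second = container.append(b"world!")
```

The Bucket would store each hash value next to its `StorageInfo`, so a later read needs only one seek into the Container file.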
26. The method as claimed in claim 23, characterized in that the method further comprises:
the deduplication node receiving a request from the client to obtain the data blocks of the mapping file;
the deduplication node sending the data blocks of the mapping file to the client;
the deduplication node receiving a request from the client to obtain the data block represented by each piece of fingerprint information in the mapping file;
the deduplication node sending the data block represented by each piece of fingerprint information to the client.
27. The method as claimed in claim 26, characterized in that the deduplication node sending the data block represented by each piece of fingerprint information to the client comprises:
the deduplication node determining second storage information of the data block according to the fingerprint information, the second storage information including the name of the Container file in which the data block is saved, the offset of the data block within the Container file, and the size of the data block;
the deduplication node determining, according to the name of the Container file, whether the Container file has been archived to the backend server;
when the Container file has been archived to the backend server, the deduplication node obtaining the data block from the backend server according to the offset of the data block within the Container file and the size of the data block, and sending it to the client;
when the Container file is still saved locally, the deduplication node obtaining the data block locally according to the offset of the data block within the Container file and the size of the data block, and sending it to the client.
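As an illustration only, the local-versus-archived read path of claim 27 might be sketched as follows. The dictionaries standing in for local storage and the backend server, the `archived_names` set, and the function name are hypothetical.

```python
from typing import NamedTuple


class StorageInfo(NamedTuple):
    container_name: str
    offset: int
    size: int


def read_block(info: StorageInfo, archived_names: set, local: dict, backend: dict) -> bytes:
    """Read one block by offset and size from wherever its Container file lives.

    local and backend map Container file names to their bytes.
    """
    if info.container_name in archived_names:
        container = backend[info.container_name]  # archived to the backend server
    else:
        container = local[info.container_name]    # still held on the node
    return container[info.offset:info.offset + info.size]
```

Because the offset and size are kept with the fingerprint, the node can request just the needed byte range rather than the whole archived Container file.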
28. The method as claimed in claim 20, characterized in that the central node sending the routing table to the client comprises:
when the client stores data for the first time, the central node receiving a request from the client to obtain the routing table;
the central node sending the routing table to the client.
29. The method as claimed in claim 28, characterized in that the central node sending the routing table to the client further comprises:
the deduplication node receiving a request packet from the client;
the deduplication node sending a response packet to the client, the response packet including version information of the routing table saved by the deduplication node;
when the version information of the routing table saved by the client is inconsistent with the version information of the routing table saved by the deduplication node, the central node receiving a routing table request from the client;
the central node sending the updated routing table to the client.
30. The method as claimed in claim 20, characterized in that the deduplication node looking up the fingerprint information and returning to the client the fingerprint information that was not found comprises:
the deduplication node determining, by means of a Bloom filter, whether the fingerprint information exists;
when the Bloom filter indicates that the fingerprint information does not exist, determining that the fingerprint information is fingerprint information that was not found;
when the Bloom filter indicates that the fingerprint information exists, querying the fingerprint information store to check whether the fingerprint information exists;
when the fingerprint information is found in the fingerprint information store, determining that the fingerprint information exists;
when the fingerprint information is not found in the fingerprint information store, determining that the fingerprint information is fingerprint information that was not found.
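As an illustration only, the two-stage lookup of claim 30 might be sketched as below. The Bloom filter parameters, the salted SHA-1 bit positions, and the use of a plain dict in place of the SSD-resident fingerprint information store are assumptions.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def find_missing(fingerprints: list, bloom: BloomFilter, store: dict) -> list:
    """Return the fingerprints that were not found, using the two-stage check."""
    missing = []
    for fp in fingerprints:
        if not bloom.might_contain(fp):
            missing.append(fp)  # filter says absent: no store lookup needed
        elif fp not in store:
            missing.append(fp)  # filter false positive: the store is authoritative
    return missing
```

The filter answers most queries for new fingerprints from memory, so the slower store lookup is reserved for fingerprints that are probably already present.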
31. A data storage system, characterized in that it comprises a central node and one or more deduplication nodes, wherein:
the central node is configured to assign each bucket (Bucket) to a corresponding deduplication node according to a preset policy, to create a routing table according to the correspondence between Buckets and deduplication nodes, and to synchronize the routing table to each deduplication node;
the deduplication node is configured to store, according to the routing table, the fingerprint information corresponding to each Bucket assigned to it and the data blocks that the fingerprint information represents.
32. A client for reading and writing data, characterized in that it comprises:
a splitting and computing module, configured to split data into a plurality of data blocks and to compute the fingerprint information of each data block;
a Bucket determining module, configured to determine the Bucket corresponding to the fingerprint information of each data block;
a node determining module, configured to determine, according to a routing table obtained from a central node, the deduplication node corresponding to the Bucket;
a request sending module, configured to send a fingerprint query request to the deduplication node corresponding to the Bucket, the fingerprint query request including the fingerprint information of the data blocks;
an information receiving module, configured to receive the fingerprint information that was not found, as returned by the deduplication node corresponding to the Bucket;
a data upload module, configured to upload the fingerprint information that was not found and the data blocks it represents to the deduplication node corresponding to the Bucket.
33. A system for reading and writing data, characterized in that it comprises a central node and a deduplication node, wherein:
the central node is configured to send a routing table to a client, the routing table including the correspondence between Buckets and deduplication nodes;
the deduplication node is configured to receive a fingerprint query request from the client, the fingerprint query request including fingerprint information corresponding to a Bucket assigned to the deduplication node; to look up the fingerprint information and return to the client the fingerprint information that was not found; and to receive the fingerprint information that was not found, and the data blocks it represents, as uploaded by the client.
CN201510226830.0A 2015-05-06 2015-05-06 Data-storage system and data read-write method Active CN106201771B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510226830.0A CN106201771B (en) 2015-05-06 2015-05-06 Data-storage system and data read-write method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510226830.0A CN106201771B (en) 2015-05-06 2015-05-06 Data-storage system and data read-write method

Publications (2)

Publication Number Publication Date
CN106201771A true CN106201771A (en) 2016-12-07
CN106201771B CN106201771B (en) 2019-07-05

Family

ID=57459493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510226830.0A Active CN106201771B (en) 2015-05-06 2015-05-06 Data-storage system and data read-write method

Country Status (1)

Country Link
CN (1) CN106201771B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539950A (en) * 2009-05-08 2009-09-23 成都市华为赛门铁克科技有限公司 Data storage method and device
US20120323860A1 (en) * 2011-06-14 2012-12-20 Netapp, Inc. Object-level identification of duplicate data in a storage system
CN102968498A (en) * 2012-12-05 2013-03-13 华为技术有限公司 Method and device for processing data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANSHENG WEI et al.: "MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services", 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766478A (en) * 2017-10-11 2018-03-06 复旦大学 A kind of design method of concurrent index structure towards high competition scene
CN107832341B (en) * 2017-10-12 2022-01-28 千寻位置网络有限公司 AGNSS user duplicate removal statistical method
CN107832341A (en) * 2017-10-12 2018-03-23 千寻位置网络有限公司 AGNSS user's duplicate removal statistical method
CN109725842A (en) * 2017-10-30 2019-05-07 伊姆西Ip控股有限责任公司 Accelerate random writing layout with the system and method for mixing the distribution of the bucket in storage system
CN109725842B (en) * 2017-10-30 2022-10-11 伊姆西Ip控股有限责任公司 System and method for accelerating random write placement for bucket allocation within a hybrid storage system
CN108093024A (en) * 2017-11-14 2018-05-29 西北工业大学 A kind of classification method for routing and device based on data frequency
CN108509616A (en) * 2018-03-30 2018-09-07 北京怡生乐居信息服务有限公司 Data processing method and system
CN109740037A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 The distributed online real-time processing method of multi-source, isomery fluidised form big data and system
CN109740037B (en) * 2019-01-02 2023-11-24 山东省科学院情报研究所 Multi-source heterogeneous flow state big data distributed online real-time processing method and system
CN110071964B (en) * 2019-03-26 2022-03-15 罗克佳华科技集团股份有限公司 File synchronization method, device, file sharing network, file sharing system and storage medium
CN110071964A (en) * 2019-03-26 2019-07-30 罗克佳华科技集团股份有限公司 File synchronisation method, device, file sharing network, file are total to system and storage medium
CN110209727A (en) * 2019-04-04 2019-09-06 特斯联(北京)科技有限公司 A kind of date storage method, terminal device and medium
CN110134331A (en) * 2019-04-26 2019-08-16 重庆大学 Routed path planing method, system and readable storage medium storing program for executing
CN110134331B (en) * 2019-04-26 2020-06-05 重庆大学 Routing path planning method, system and readable storage medium
CN110674116A (en) * 2019-09-25 2020-01-10 四川长虹电器股份有限公司 System and method for checking and inserting data repetition of database based on swoole
CN110674116B (en) * 2019-09-25 2022-05-03 四川长虹电器股份有限公司 System and method for checking and inserting data repetition of database based on swoole
CN111158948B (en) * 2019-12-30 2024-04-09 深信服科技股份有限公司 Data storage and verification method and device based on deduplication and storage medium
CN111158948A (en) * 2019-12-30 2020-05-15 深信服科技股份有限公司 Data storage and verification method and device based on deduplication and storage medium
CN112148928A (en) * 2020-09-18 2020-12-29 鹏城实验室 Cuckoo filter based on fingerprint family
CN112148928B (en) * 2020-09-18 2024-02-20 鹏城实验室 Cuckoo filter based on fingerprint family
CN111966649B (en) * 2020-10-21 2021-01-01 中国人民解放军国防科技大学 Lightweight online file storage method and device with efficient deduplication
CN111966649A (en) * 2020-10-21 2020-11-20 中国人民解放军国防科技大学 Lightweight online file storage method and device with efficient deduplication
CN113420400A (en) * 2021-07-06 2021-09-21 北京字跳网络技术有限公司 Routing relation establishing method, request processing method, device and equipment
CN113625968A (en) * 2021-08-12 2021-11-09 网易(杭州)网络有限公司 File authority management method and device, computer equipment and storage medium
CN113625968B (en) * 2021-08-12 2024-03-01 网易(杭州)网络有限公司 File authority management method and device, computer equipment and storage medium
CN115988002A (en) * 2023-02-16 2023-04-18 荣耀终端有限公司 Data transmission method and electronic equipment
CN115988002B (en) * 2023-02-16 2023-08-15 荣耀终端有限公司 Data transmission method and electronic equipment

Also Published As

Publication number Publication date
CN106201771B (en) 2019-07-05

Similar Documents

Publication Publication Date Title
CN106201771A (en) Data-storage system and data read-write method
US10810161B1 (en) System and method for determining physical storage space of a deduplicated storage system
US10380073B2 (en) Use of solid state storage devices and the like in data deduplication
US9798486B1 (en) Method and system for file system based replication of a deduplicated storage system
US9967298B2 (en) Appending to files via server-side chunking and manifest manipulation
CN104077423B (en) Consistent hash based structural data storage, inquiry and migration method
US7827146B1 (en) Storage system
US9141633B1 (en) Special markers to optimize access control list (ACL) data for deduplication
US8548957B2 (en) Method and system for recovering missing information at a computing device using a distributed virtual file system
US9424185B1 (en) Method and system for garbage collection of data storage systems
US9367448B1 (en) Method and system for determining data integrity for garbage collection of data storage systems
US9547706B2 (en) Using colocation hints to facilitate accessing a distributed data storage system
US7689764B1 (en) Network routing of data based on content thereof
CN102708165B (en) Document handling method in distributed file system and device
US7577808B1 (en) Efficient backup data retrieval
US9965505B2 (en) Identifying files in change logs using file content location identifiers
CN105550371A (en) Big data environment oriented metadata organization method and system
Frey et al. Probabilistic deduplication for cluster-based storage systems
CN104184812B (en) Multipoint data transmission method based on private cloud
CN104408111A (en) Method and device for deleting duplicate data
US9383936B1 (en) Percent quotas for deduplication storage appliance
CN109522283A (en) Data deduplication method and system
US20200349115A1 (en) File system metadata deduplication
US8612717B2 (en) Storage system
CN104951475B (en) Distributed file system and implementation method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant