TW201351126A - Method of data storing and maintenance in a distributed data storage system and corresponding device

Publication number: TW201351126A
Application number: TW102112953A
Authority: TW
Inventors: Anne-Marie Kermarrec; Erwan Le Merrer; Gilles Straub; Alexandre Van Kempen
Original assignee: Thomson Licensing
Priority date: 2012-05-03
Filing date: 2013-04-12
Publication date: 2013-12-16

Abstract

The present invention generally relates to distributed data storage systems. In particular, it relates to a method of storing data in a distributed data storage system that combines clustering of data blocks with random linear combinations of data blocks, making the system efficient in terms of required storage space and inter-device communication cost, both for the storage method and for the associated repair method.

Description

Method and management device for storing data files in a distributed data storage system, and repair method and management device for a failed storage device

The present invention generally relates to distributed data storage systems. More specifically, it relates to a method of data storage in a distributed data storage system that combines high data availability with a low impact on network and data storage resources, in terms of the bandwidth needed to exchange data between network storage devices and the number of network storage devices needed to store a data item. The invention also relates to a method of repairing a failed storage device in such a distributed data storage system, and to devices implementing the invention.

With the rapid deployment of mass data processing devices, such as video and image processing devices, reliable storage of huge amounts of data is urgently needed, either for direct storage or as part of backup storage. As more and more devices are equipped with network connectivity, distributed storage of data in network-connected devices (storage devices) is regarded as a cost-effective solution. In such distributed data storage systems, which may be deployed over unmanaged networks such as the Internet, methods have been developed that copy the same data item to multiple network-connected devices to ensure data availability and resilience to data loss. This is called data replication, or adding redundancy. Redundancy is to be understood in a broad sense: it covers both plain data replication and coding techniques such as erasure codes or regenerating codes (where encoded data is placed on storage devices for resilience). To cope with the risk of permanent data loss due to device failure, or of temporary data loss due to temporary device unavailability, a high level of redundancy is desirable. However, to reduce the cost of communication and of the storage capacity needed (the so-called replication cost), a low level of redundancy is preferable.

Redundancy is therefore a key aspect of any practical system that must provide a reliable service on top of unreliable components. Storage systems are a typical example of such services: they use redundancy to mask the unavoidable unavailability and failure of disks. As discussed above, this redundancy can be provided by basic replication or by coding techniques. Erasure codes can be far more efficient than basic replication, but they are not fully deployed in current systems. Besides the added complexity of encoding and decoding operations, the main concern with erasure codes is the maintenance required after a storage device fails. When a storage device fails, all the blocks of the different files it stored must be replaced to ensure data durability. This means that, for each lost block, the entire file from which the block originated must be downloaded and decoded in order to recreate a single new block. Compared with basic data replication, this overhead in bandwidth and decoding operations considerably limits the use of erasure codes in systems where failures, and hence repairs, are the norm rather than the exception. Network coding, however, can greatly reduce the bandwidth needed during maintenance. This sets the scene for novel distributed storage systems designed specifically around the maintenance of encoded files, which increases the efficiency of erasure codes while mitigating their known drawbacks.

There is thus a need for a distributed data storage solution that achieves a high level of data availability while jointly taking into account availability requirements and replication cost.

The present invention aims at alleviating some of the inconveniences of the prior art.

To achieve optimal data storage in a distributed data storage system, the invention proposes a method of storing data files in a distributed data storage system comprising storage devices interconnected in a network. For each data file to be stored in the distributed data storage system, the method comprises the steps of: splitting the data file into k data blocks and creating at least n encoded data blocks from these k data blocks through random linear combinations of the k data blocks; and storing the at least n encoded data blocks by spreading them over at least n storage devices that are part of the same storage device cluster. Each cluster comprises a distinct set of storage devices, and the at least n encoded data blocks of a file are distributed over the at least n storage devices of a cluster such that each storage device cluster stores encoded data blocks from at least two different files, and each storage device of a cluster stores encoded data blocks from at least two different files.

The invention also comprises a method of repairing a failed storage device in a distributed data storage system in which data is stored according to the storage method of the invention and each stored file is split into k data blocks. The method comprises the steps of: adding a replacement storage device to the storage device cluster to which the failed storage device belongs; receiving, at the replacement storage device, k+1 new random linear combinations from any k+1 remaining storage devices in the cluster, each generated from two encoded data blocks of two different files X and Y stored by the respective storage device; combining the received new random linear combinations with one another, using algebraic operations, to obtain two linear combinations, that is, two blocks, one related only to X and the other related only to Y; and storing the two linear combinations in the replacement storage device.

According to a variant embodiment of the repair method, the repair method comprises reintegrating into the storage device cluster a failed storage device that returns to the distributed data storage system.

The invention also comprises a storage management device for data files in a distributed data storage system comprising storage devices interconnected in a network. The device comprises a data splitter for splitting a data file into k data blocks and for creating at least n encoded data blocks from these k data blocks through random linear combinations of the k data blocks. The device further comprises a data distributor for storing the at least n encoded data blocks by spreading them over at least n storage devices that are part of the same storage device cluster. Each cluster comprises a distinct set of storage devices, and the at least n encoded data blocks of a file are distributed over the at least n storage devices of a cluster such that each storage device cluster stores encoded data blocks from at least two different files, and each storage device of a cluster stores encoded data blocks from at least two different files.

The invention also relates to a repair management device for a failed storage device in a distributed data storage system in which data is stored according to the storage method of the invention. The repair management device comprises: a replacer for adding a replacement storage device to the storage device cluster to which the failed storage device belongs; a distributor for distributing to the replacement storage device k+1 new random linear combinations from any k+1 remaining storage devices in the cluster, each generated from two encoded data blocks of two different files X and Y stored by the respective storage device; a combiner for combining the received new random linear combinations with one another, using algebraic operations, to obtain two linear combinations, that is, two blocks, one related only to X and the other related only to Y; and a data writer for storing the two linear combinations in the replacement storage device.

10‧‧‧File X

12,13‧‧‧k data blocks

15,16,17,18‧‧‧n encoded data blocks Xj

20‧‧‧First cluster 1

21‧‧‧Second cluster 2

39‧‧‧Replacement storage device

30000‧‧‧Cluster

34,45,36‧‧‧New random linear combinations

30,31,32,33‧‧‧Storage devices

200,201,202‧‧‧Storage devices

210,211,212‧‧‧Storage devices

2000,2100‧‧‧First block

2010,2110‧‧‧Second block

2020,2120‧‧‧Third block

300,301,310,311,320,321,330,331‧‧‧Randomly coded blocks

2001,2011,2021,2002,2012,2022‧‧‧Encoded data blocks Xj

2101,2111,2121,2102,2112,2122‧‧‧Encoded data blocks Xj

400‧‧‧Device

410‧‧‧Non-volatile memory (NVM)

411‧‧‧Processing unit

412‧‧‧Clock

413‧‧‧Network interface

414‧‧‧Digital data and address bus

415‧‧‧Connection

420‧‧‧Volatile memory (VM)

4101‧‧‧NVM register

4102‧‧‧Data storage

4201‧‧‧VM register

4202‧‧‧Data storage

700‧‧‧Storage management device

701‧‧‧Data splitter

702‧‧‧Storage distributor

703‧‧‧Network interface

704‧‧‧Internal communication bus

705‧‧‧Network connection

710‧‧‧Repair management device

711‧‧‧Replacer

712‧‧‧Distributor

713‧‧‧Network interface

714‧‧‧Internal communication bus

715‧‧‧Connection

716‧‧‧Combiner

717‧‧‧Data writer

500‧‧‧Start of storage method

501‧‧‧Split the data file into k data blocks and generate n encoded data blocks

502‧‧‧Spread the n encoded data blocks over the storage devices of the same storage device cluster

503‧‧‧End of storage method

600‧‧‧Start of repair method

601‧‧‧Add a replacement storage device to the cluster of the failed storage device

602‧‧‧Receive new random linear combinations from the remaining storage devices in the cluster

603‧‧‧Combine the new random linear combinations and reduce them to X and Y

604‧‧‧Store the results in the replacement storage device

605‧‧‧End of repair method

Figure 1 shows a particular detail of the storage method of the invention; Figure 2 shows an example of data clustering according to the storage method of the invention; Figure 3 shows the repair process after a storage device failure; Figure 4 illustrates a device capable of implementing the invention; Figure 5 shows an algorithm implementing a particular embodiment of the method of the invention; Figure 6a shows a storage management device for data files in a distributed data storage system comprising storage devices interconnected in a network; Figure 6b shows a repair management device for a failed storage device in a distributed data storage system in which data is stored according to the storage method of the invention.

Further advantages of the invention will become apparent from the description of particular, non-limiting embodiments of the invention.

Embodiments are described below with reference to the drawings.

As mentioned above, erasure codes are known to be more efficient than basic replication in data storage systems. In practice, however, their use in such storage systems is not widespread, despite their clear advantages. One reason for this relative lack of adoption is that prior-art coding methods assume that a new storage device can always be found whenever a block needs to be inserted or repaired, that is, they assume an unlimited supply of storage devices. Furthermore, the availability of storage devices is not taken into account. These two prerequisites are practical obstacles to the straightforward application of erasure codes in current distributed data storage systems, and they cause confusion when design choices must be made to address these key issues. To remove these drawbacks, the invention proposes to cluster the storage devices that host the data blocks constituting the redundancy in the distributed data storage system, and further proposes a practical mechanism for using and deploying erasure codes. The invention yields a significant performance gain compared with both simple replication and coding schemes. With the clustering approach of the invention, maintenance takes place at the level of a storage device (a storage device holding many blocks of many files) rather than at the level of a single file, and the use of erasure codes allows efficient data replication; this leverages multiple repairs and improves the performance of the distributed data storage system.

Erasure codes are most efficient when they are Maximum Distance Separable (MDS) codes, in which case they are said to be "optimal". That is, for a given storage overhead, MDS codes provide the best possible efficiency in terms of data availability. An MDS code is such that any subset of k out of the n redundant blocks (that is, encoded data blocks) is sufficient to reconstruct lost data. This means that to reconstruct a file of M bytes, exactly M bytes need to be downloaded. Reed-Solomon (RS) codes are a classical example of MDS codes. Randomness provides a flexible way to construct MDS codes.

The invention proposes a particular method of storing data files in a distributed data storage system comprising storage devices interconnected in a network. For each data file to be stored in the distributed data storage system, the method comprises the following steps: splitting the data file into k data blocks and creating n encoded data blocks from these k blocks through random linear combinations of the k data blocks; and spreading the encoded data blocks of the file over n storage devices that are part of the same storage device cluster. Each cluster comprises a distinct set of storage devices, and the n encoded data blocks of a file are distributed over the n storage devices of a cluster such that each storage device cluster stores encoded data blocks from at least two different files, and each storage device of a cluster stores encoded data blocks from at least two different files.

Figure 1 shows a particular example of the storage method of the invention, in which a file is split into k=2 data blocks and the associated linear combination method generates n=4 encoded data blocks. It proceeds as follows: each file X (10) is cut into k data blocks of equal size (12, 13), and n encoded data blocks Xj (15, 16, 17, 18) are then created as random linear combinations of these k blocks. Each storage device j of the distributed storage system then stores an encoded data block Xj, which is a random linear combination of these k data blocks. The associated random coefficients α (for example, 2 and 7 for block 15) are chosen uniformly at random in the field Fq, where Fq denotes the finite field with q elements. The use of finite fields is required for implementing error-correcting codes and is well known to those skilled in the art. In short, a finite field is a set of numbers, comparable to a set of integers, but with addition and multiplication rules that differ from those usually known for integers.
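The encoding step of Figure 1 can be illustrated with a minimal Python sketch. It is not the implementation described in the patent: the prime field GF(257), the helper names `split` and `encode`, and the sample payload are assumptions made for readability, whereas the text cites field sizes of 2⁸ or 2¹⁶.

```python
import random

P = 257  # assumption: a small prime field GF(257); the text cites sizes 2^8 or 2^16

def split(data: bytes, k: int) -> list[list[int]]:
    """Cut data into k equally sized blocks of field symbols, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [list(padded[i * size:(i + 1) * size]) for i in range(k)]

def encode(blocks: list[list[int]], n: int, rng: random.Random):
    """Create n coded blocks, each a random linear combination of the k input blocks."""
    coded = []
    for _ in range(n):
        # Nonzero coefficients drawn uniformly from 1..P-1 (illustrative choice).
        coeffs = [rng.randrange(1, P) for _ in blocks]
        block = [sum(c * b[i] for c, b in zip(coeffs, blocks)) % P
                 for i in range(len(blocks[0]))]
        coded.append((coeffs, block))  # coefficients are kept with the block
    return coded

blocks = split(b"file X payload", k=2)             # k = 2 data blocks
coded = encode(blocks, n=4, rng=random.Random(0))  # n = 4 coded blocks Xj
```

Each entry of `coded` pairs a coded block with its coefficient vector, one pair per storage device of the cluster.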

In addition to the n encoded data blocks Xj (15-18), the associated random coefficients α must also be stored. Since their size is negligible compared with the size of a block Xj, the storage space needed for these coefficients is negligible as well. In general, when the term (random) linear combination is used here, it includes the associated coefficients.

As an example, consider a file X (10) of size M = 1 gigabyte. The parameters k (number of data blocks) and n (number of random linear combinations of the k data blocks) are chosen such that a code implementation exists, for example k=2 and n=4. The associated random coefficients α can be generated with a prior-art random number generator, parameterized to produce integers in the range 1 to q.

The ratio n/k is chosen according to the level of redundancy desired by the designer of the distributed storage system. For example, for a code with k=2 and n=4, n/k=2, so storing a 1 Gb file requires 2 Gb of storage space in the system. In addition, n/k relates to the number of failures (failed storage devices) that the system can tolerate: in the example with k=2 and n=4, the original file can be recovered as long as k=2 encoded data blocks remain. There is thus a trade-off between the amount of redundancy introduced and the fault tolerance of the distributed storage system.
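The claim that any k=2 of the n=4 encoded blocks suffice can be checked with a small sketch. The prime field GF(257), the helper names `combine` and `decode`, the toy block contents and the fixed coefficient vectors are all assumptions chosen so that the demo is reproducible; they are not taken from the patent.

```python
from itertools import combinations

P = 257  # assumption: small prime field, chosen for readability

def combine(coeffs, blocks):
    """Linear combination of the data blocks with the given coefficients, modulo P."""
    return [sum(c * b[i] for c, b in zip(coeffs, blocks)) % P
            for i in range(len(blocks[0]))]

def decode(pieces):
    """Recover the k data blocks from k (coeffs, block) pairs by Gauss-Jordan
    elimination modulo P. Raises ValueError if the coefficients are dependent."""
    k = len(pieces)
    rows = [list(c) + list(b) for c, b in pieces]  # augmented matrix [A | B]
    for col in range(k):
        pivot = next((r for r in range(col, k) if rows[r][col]), None)
        if pivot is None:
            raise ValueError("coefficient vectors are linearly dependent")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, P)  # modular inverse (Python 3.8+)
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[col])]
    return [row[k:] for row in rows]

original = [[10, 20, 30], [40, 50, 60]]            # k = 2 data blocks (toy data)
coeff_vectors = [(2, 7), (1, 1), (3, 5), (4, 1)]   # n = 4, fixed for reproducibility
coded = [(c, combine(c, original)) for c in coeff_vectors]

# Every pair of surviving blocks is tried in turn.
recovered = [decode(list(pair)) for pair in combinations(coded, 2)]
overhead = len(coded) / len(original)              # n/k
```

With these coefficient vectors all six pairs decode to the original blocks, so in this example up to n-k=2 device failures are survivable at a storage overhead of n/k=2.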

A file stored in this way in the distributed storage system is reconstructed as follows. In mathematical terms, each of the n encoded data blocks Xj, created as a random linear combination of the k data blocks, can be represented as a random vector of the subspace spanned by the k data blocks. To reconstruct file X, it is therefore sufficient to obtain k independent vectors in this subspace. The independence requirement is met because the associated random coefficients α were generated by the random number generator mentioned above when file X was stored. Indeed, every family of k linearly independent vectors forms a non-singular matrix, which can be inverted, so file X can be reconstructed with very high probability (close to 1). In more formal terms: let D be a random variable denoting the dimension of the subspace spanned by the n redundant blocks Xj, which are random vectors over Fq; then the following holds:
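The expression itself is not present in this text. A standard counting result that is consistent with the surrounding definitions, and with the value 0.999985 quoted for q = 2¹⁶ and n = 16, is given here as a hedged reconstruction rather than as the patent's exact formula. It assumes n ≤ k vectors drawn uniformly at random from the k-dimensional vector space over Fq:

```latex
\Pr(D = n) \;=\; \frac{1}{q^{nk}} \prod_{i=0}^{n-1} \left( q^{k} - q^{i} \right)
\;=\; \prod_{i=0}^{n-1} \left( 1 - q^{\,i-k} \right)
```

The i-th factor is the probability that the (i+1)-th vector falls outside the span of the previous i vectors, which contains q^i of the q^k possible vectors.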

This equation gives the probability that the dimension of the subspace spanned by the random vectors is exactly n, that is, that the family of these vectors is linearly independent. For practical field sizes, typically 2⁸ or 2¹⁶, this probability is very close to 1 for every n. As noted above, the field size is the number of elements in the finite field Fq. The values 2⁸ and 2¹⁶ are practical because one element of the finite field then corresponds to one or two bytes (8 or 16 bits) respectively. For example, with a field size of 2¹⁶ and n=16, which are classical and practical values, the probability of being able to reconstruct file X when contacting exactly n=16 storage devices is 0.999985. Random (MDS) codes thus provide a flexible way to encode data optimally. They differ from classical erasure codes, which use a fixed encoding matrix and therefore have a fixed rate k/n, meaning that the redundancy system cannot create more than a fixed number of redundant and independent blocks. When random codes are used as proposed by the invention, the notion of rate disappears, because as many redundant blocks Xj as needed can be generated simply by making new random combinations of the k blocks of file X. This property makes random codes rateless codes, also known as fountain codes. The rateless property makes these codes well suited to distributed storage systems, since it allows storage devices that were wrongly considered "lost" to be reintegrated, as discussed further on.
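The quoted figure can be checked numerically. The sketch below assumes the standard product formula for the probability that n uniformly random vectors of dimension k over Fq are linearly independent (n ≤ k), and takes k = n = 16; both the formula and the function name `prob_independent` are assumptions, since the equation is not reproduced in this text.

```python
from fractions import Fraction

def prob_independent(q: int, k: int, n: int) -> float:
    """Probability that n uniform random vectors of F_q^k are linearly independent.
    Assumed formula: product over i in 0..n-1 of (1 - q^i / q^k), valid for n <= k."""
    p = Fraction(1)
    for i in range(n):
        p *= 1 - Fraction(q ** i, q ** k)
    return float(p)

p16 = prob_independent(2 ** 16, 16, 16)  # field size 2^16, as in the text's example
p8 = prob_independent(2 ** 8, 16, 16)    # field size 2^8, for comparison
```

Under these assumptions `p16` agrees with the 0.999985 quoted above to six decimal places, while the smaller field gives a slightly lower value of about 0.996.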

In addition to the use of MDS erasure codes (with the parameters k, n discussed above), and to make the repair of lost data easy and efficient, the invention proposes a particular data clustering method that leverages the simultaneous repair of lost data belonging to multiple files. The cluster size depends on the type of code. More precisely, if an MDS code generates n encoded data blocks from k blocks, the cluster size is exactly n. An example of such clustering according to the storage method of the invention is shown in Figure 2. The set of all storage devices is partitioned into disjoint clusters, so each storage device belongs to exactly one cluster. Each file to be stored in the distributed storage system organized in this way is stored in one particular cluster. A cluster holds data from different files, and a storage device holds data from different files. Moreover, a storage device holds one data block of every file stored in its cluster.

The example of Figure 2 has six files X1 to X6, each comprising n=3 encoded data blocks Xj, which are random linear combinations of the k blocks of that file. Each of the two storage clusters comprises a set of three storage devices: the first cluster 1 (20) comprises storage devices 1, 2, 3 (200, 201, 202), and the second cluster 2 (21) comprises storage devices 4, 5, 6 (210, 211, 212). The three (n=3) encoded data blocks Xj of file X1 are stored in cluster 1 (20): the first block (2000) on storage device 1 (200), the second block (2010) on storage device 2 (201), and the third block (2020) on storage device 3 (202). The three encoded data blocks Xj of file X2 are stored in cluster 2 (21): the first block (2100) on storage device 4 (210), the second block (2110) on storage device 5 (211), and the third block (2120) on storage device 6 (212). Likewise, cluster 1 also stores the encoded data blocks Xj (2001, 2011, 2021) of file X3 and the encoded data blocks Xj (2002, 2012, 2022) of file X5 on storage devices 1, 2, 3 (200, 201, 202 respectively). Similarly, cluster 2 also stores the encoded data blocks Xj (2101, 2111, 2121) of file X4 and the encoded data blocks Xj (2102, 2112, 2122) of file X6 on storage devices 4, 5, 6 (210, 211, 212 respectively). Files are stored in order of arrival according to the chosen load-balancing policy (for example, file X1 in cluster 1, file X2 in cluster 2, file X3 in cluster 1, and so on).

To manage the files, it suffices to maintain two indexes: one mapping each file to a cluster, the other mapping each storage device to a cluster. According to a particular embodiment of the invention, storage devices are identified by their IP (Internet Protocol) address.
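The two indexes can be sketched as a pair of dictionaries. The class name `ClusterDirectory`, the round-robin policy and the IP addresses below are hypothetical; the text leaves the placement policy open and only states that devices may be identified by IP address.

```python
from itertools import cycle

class ClusterDirectory:
    """Keeps the two indexes: file -> cluster and storage device -> cluster."""

    def __init__(self, clusters: dict[int, list[str]]):
        self.devices_of = clusters
        # Index 2: each storage device (identified by IP address) maps to one cluster.
        self.cluster_of_device = {dev: cid for cid, devs in clusters.items() for dev in devs}
        # Index 1: each file maps to the cluster that stores all of its coded blocks.
        self.cluster_of_file: dict[str, int] = {}
        self._next = cycle(sorted(clusters))  # assumed round-robin load balancing

    def place(self, file_id: str) -> list[str]:
        """Assign a file to the next cluster and return the devices that receive a block."""
        cluster = next(self._next)
        self.cluster_of_file[file_id] = cluster
        return self.devices_of[cluster]

    def devices_for_file(self, file_id: str) -> list[str]:
        return self.devices_of[self.cluster_of_file[file_id]]

directory = ClusterDirectory({
    1: ["10.0.0.1", "10.0.0.2", "10.0.0.3"],  # hypothetical addresses for devices 1-3
    2: ["10.0.0.4", "10.0.0.5", "10.0.0.6"],  # hypothetical addresses for devices 4-6
})
for name in ["X1", "X2", "X3", "X4", "X5", "X6"]:
    directory.place(name)
```

With files arriving in the order X1 to X6, this reproduces the layout of Figure 2: X1, X3 and X5 land in cluster 1, and X2, X4 and X6 in cluster 2.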

The data block placement strategy of the invention implies simple file management that scales well with the number of files stored in the distributed storage system, and it directly serves the maintenance process of such a system, as described further on. Note that how clusters are formed and how they are filled with different files can follow any policy, such as uniform sampling or a specific protocol. Indeed, various placement strategies exist in the prior art; some focus on load balancing, others on availability.

Placement strategy and maintenance (repair) process are usually regarded as two building blocks that are designed independently.

With the invention, the placement strategy directly serves the maintenance process, as explained further on. Distributed data storage systems are prone to failures owing to the scale of their commercial implementations. Typically, a distributed storage system that stores data from Internet subscribers to such a service employs thousands of storage devices equipped with hard disk drives. A reliable maintenance mechanism is therefore needed to repair the data loss caused by these failures. To do so, the system needs to monitor the storage devices, and traditionally uses a timeout-based triggering mechanism to decide whether a repair must be performed. A first practical advantage of the clustering method of the invention is that clusters of storage devices are easy to manage: monitoring can be implemented in a fully decentralized way by creating autonomous clusters that monitor and regenerate themselves (that is, repair data loss) when needed. In current storage systems, by contrast, when a failed storage device is to be repaired, the replacement storage device needs to access all the files associated with each of the redundant blocks that the failed storage device stored; the storage devices to contact may then be located anywhere, which requires the replacement storage device to look up their locations before the repair. This is not needed with the invention, because placement is structured within a given cluster.

If, as in this prior art, access to each stored file is regarded as an independent event, which is typically the case when data is placed uniformly at random over a sufficiently large set of storage devices, and if the redundant blocks of different files are not stored on the same set of storage devices, then the probability of successfully contacting all of these storage devices decreases with the number of blocks. This is because each host storage device is in practice available only with a certain probability, so the more host storage devices must be accessed, the lower the probability that all required blocks can be accessed at a given point in time. In contrast with this prior-art solution, with the clustered placement method of the invention the probability that a repair succeeds no longer depends on the number of blocks stored by the failed storage device, because the storage devices are grouped in such a way that they collectively host the blocks that matter to the replacement storage device. Moreover, the number of storage devices that the replacement storage device needs to connect to does not depend on the number of blocks stored by the failed storage device. Instead, this number depends on the cluster size, which is fixed and predefined by the system operator, thereby reducing the number of connections that the replacement storage device needs to maintain.
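A rough numerical illustration of this argument follows. The model is an assumption, not part of the source: each device is taken to be available independently with probability p, the prior-art repair is taken to contact up to k distinct devices per lost block, and the clustered repair contacts k+1 devices of one cluster. The values of p, k and the number of lost blocks are purely illustrative.

```python
def all_reachable(p: float, devices: int) -> float:
    """Probability that every contacted device is up, assuming independent availability p."""
    return p ** devices

p = 0.99                 # assumed per-device availability (illustrative)
k, blocks_lost = 2, 100  # assumed code parameter and number of blocks on the failed device

# Scattered placement: the set of devices to contact grows with the number of lost blocks.
prior_art = all_reachable(p, k * blocks_lost)
# Clustered placement: always the same k+1 devices, whatever the number of lost blocks.
clustered = all_reachable(p, k + 1)
```

Under these assumptions `clustered` stays at about 0.97 regardless of `blocks_lost`, whereas `prior_art` drops to roughly 0.13 for 100 lost blocks.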

The particular efficiency of the storage method is best explained with the help of Figure 3, which illustrates the repair of a failed storage device and is discussed further below.

In contrast with the invention as illustrated in Figure 3, the prior-art method using classical erasure codes works as follows: to repair one data block of a given file, the replacement storage device must download enough redundant erasure-coded blocks to be able to decode them and recreate the (unencoded, plain data) file. Once this operation is done, the replacement storage device can re-encode the file and regenerate the lost redundant data block; this re-encoding must be repeated for each lost block. This prior-art method has the following drawbacks, which are caused by the use of this type of code:

1. To repair one block, i.e. a small part of a file, the replacement storage device must download all the blocks stored by the other storage devices that hold blocks of the file. This is costly in terms of communication and also time consuming, because the second step (below) cannot start until this first step is completed.

2. Once the first step is completed, the replacement storage device must decode the downloaded blocks in order to regenerate the unencoded plain data file. This is a computation-intensive operation, even more so for large files.

3. Then, using the encoding algorithm, the lost block must be recreated by encoding it from the regenerated plain data file.

In contrast with this prior-art method, the clustered placement strategy of the storage method of the invention, together with the use of random codes, brings significant benefits during the repair process. As explained above, according to the prior-art repair method, multiple blocks of the same file are combined with one another. According to the method of the invention, network coding is used not at the file level but at the system level, i.e. the repair method of the invention comprises combining data blocks of multiple files, which greatly reduces the number of messages exchanged between storage devices during a repair. The encoded data blocks Xj stored by the storage devices are mere algebraic elements, on which algebraic operations can be performed.

What is to be obtained at the end of the repair process is the repair of the failed storage device. In the context of the invention, repairing a failed storage device means creating a random vector for each file for which the failed storage device stored an encoded data block Xj. Any random vector is a redundant, or encoded, data block. The operation required for the repair of a failed storage device is thus not to replace the exact data that was stored on the failed storage device, but rather to regenerate the amount of data that was lost by the failed storage device. This choice provides an additional benefit with regard to so-called storage device reintegration, discussed further below.

Figure 3 illustrates the repair of a failed storage device according to the invention, in a distributed data storage system that uses the data storage method of the invention. Here, a cluster (30000) initially comprises four storage devices (30, 31, 32, 33). Each storage device stores random-code blocks Xj of two files, file X and file Y. Both files X and Y have k=2 (i.e. files X and Y are each split into k=2 blocks). The first storage device (30) stores random-code blocks (i.e. encoded data blocks) 300 and 301. The second storage device (31) stores random-code blocks 310 and 311. The third storage device (32) stores random-code blocks 320 and 321. The fourth storage device (33) stores random-code blocks 330 and 331. Assuming that the fourth storage device (33) fails and must be repaired, the repair is done as follows:

1. A fifth, replacement storage device (39) is added to the cluster (30000). The replacement storage device receives, from each of the k+1 remaining storage devices in the cluster, a new random linear combination (with associated coefficients α) generated from the random-code blocks stored by that storage device. This is illustrated by rectangles 34-36 and arrows 3000-3005.

2. The resulting new random linear combinations are combined with one another in such a way that two linear combinations remain, in which the factors X and Y, respectively, are eliminated, i.e. one linear combination that relates only to X and another that relates only to Y. This elimination is done by carefully choosing the coefficients of these combinations, using for example the classical "Gaussian elimination" algebraic operation.

3. The two remaining linear combinations are stored on the replacement storage device 39. This is illustrated by arrows 3012 and 3013.

The repair operation is now complete, and the system is considered to be in a stable and operational state again.
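Repair steps 1 to 3 above can be sketched in code. The following is a minimal illustration, not the patented implementation: it assumes arithmetic over the small prime field GF(257), assumes every block travels together with its coefficient vector over the original blocks [X_1..X_k, Y_1..Y_k], and uses hypothetical helper names (`combine`, `null_vector`, `repair`, `demo`). With k+1 received combinations and only k coordinates to cancel, a non-trivial set of elimination coefficients always exists.

```python
import random

P = 257  # small prime field, chosen only for illustration (assumption)


def combine(blocks, weights):
    """Linear combination over GF(P) of (coeffs, payload) pairs."""
    def mix(idx):
        width = len(blocks[0][idx])
        return [sum(w * b[idx][i] for w, b in zip(weights, blocks)) % P
                for i in range(width)]
    return mix(0), mix(1)


def null_vector(matrix):
    """Nonzero v with matrix * v == 0 mod P (needs more columns than rows)."""
    m = [row[:] for row in matrix]
    rows, cols = len(m), len(m[0])
    pivots = []  # pivots[i] is the pivot column of reduced row i
    for c in range(cols):
        r = len(pivots)
        if r == rows:
            break
        piv = next((i for i in range(r, rows) if m[i][c]), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        inv = pow(m[r][c], -1, P)  # modular inverse (Python 3.8+)
        m[r] = [x * inv % P for x in m[r]]
        for i in range(rows):
            if i != r and m[i][c]:
                f = m[i][c]
                m[i] = [(x - f * y) % P for x, y in zip(m[i], m[r])]
        pivots.append(c)
    free = next(c for c in range(cols) if c not in pivots)
    v = [0] * cols
    v[free] = 1
    for i, pc in enumerate(pivots):
        v[pc] = -m[i][free] % P
    return v


def repair(received, k):
    """Step 2: derive one X-only and one Y-only block from k+1 combinations."""
    x_part = [[blk[0][i] for blk in received] for i in range(k)]
    y_part = [[blk[0][k + i] for blk in received] for i in range(k)]
    only_x = combine(received, null_vector(y_part))  # Y coordinates cancel
    only_y = combine(received, null_vector(x_part))  # X coordinates cancel
    return only_x, only_y


def demo(seed=0, k=2, length=4):
    rng = random.Random(seed)
    originals = [[rng.randrange(P) for _ in range(length)] for _ in range(2 * k)]

    def stored(file_index):
        # One random-code block of file X (index 0) or file Y (index 1).
        coeffs = [0] * (2 * k)
        for i in range(k):
            coeffs[file_index * k + i] = rng.randrange(1, P)
        payload = [sum(c * o[j] for c, o in zip(coeffs, originals)) % P
                   for j in range(length)]
        return coeffs, payload

    # Step 1: each of the k+1 surviving devices sends one new combination.
    received = [combine([stored(0), stored(1)],
                        [rng.randrange(1, P), rng.randrange(1, P)])
                for _ in range(k + 1)]
    # Step 3 would store the two returned blocks on the replacement device.
    return originals, repair(received, k)
```

In this sketch the replacement device never decodes file X or file Y; it only solves two small linear systems whose size depends on k, not on the file size.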

In most distributed storage systems, the decision to declare a storage device as failed is made using timeouts. The point is that this is a decision made under uncertainty, and is therefore error-prone. In fact, a storage device can be wrongly timed out and can unexpectedly reconnect after the repair has been completed. Of course, the longer the timeout, the fewer errors are made. However, using long timeouts is dangerous, because the reactivity of the storage system is reduced, which may lead to irrecoverable data loss when failure bursts occur. The idea of reintegration is to reintegrate storage devices that have been wrongly timed out. Reintegration has not yet been addressed for the case where erasure codes are used. If reintegration is not implemented, the repair of a storage device that was wrongly declared as failed was unnecessary and is thus a waste of resources, as it does not contribute to tolerating further failures. This is because the repaired storage device does not contain redundancy that is independent of the other storage devices, and therefore brings no additional redundancy benefit.
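The timeout-based decision described above, and the detection of a wrongly timed-out device that reconnects, can be sketched as follows. This is a minimal illustration under assumptions: the class name `FailureDetector`, its methods, and the use of explicit timestamps are hypothetical and are not taken from the patent.

```python
class FailureDetector:
    """Declares a device failed when it has been silent longer than timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}           # device id -> time of last contact
        self.declared_failed = set()  # devices currently considered failed

    def heartbeat(self, device, now):
        """Record contact with a device.

        Returns True when the device had been declared failed, i.e. it was
        (possibly wrongly) timed out and is a candidate for reintegration.
        """
        returned = device in self.declared_failed
        self.declared_failed.discard(device)
        self.last_seen[device] = now
        return returned

    def check(self, now):
        """Return the devices newly declared failed at time now."""
        newly = [d for d, t in self.last_seen.items()
                 if d not in self.declared_failed and now - t > self.timeout]
        self.declared_failed.update(newly)
        return newly
```

A short timeout makes `check` trigger repairs quickly but increases the chance that `heartbeat` later returns True for a device that was never really lost, which is the trade-off discussed above.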

A particularly advantageous embodiment of the invention comprises the reintegration of a storage device that was wrongly declared as failed, i.e. a device that was considered failed by the distributed data storage system, for example upon detection of a connection timeout, but that reconnects to the system. With the invention such reintegration is possible, because it merely adds more redundant data to the cluster: the repair of the wrongly declared storage device, which at first sight appears unnecessary, adds redundancy to the cluster, so that the next time any storage device of the same cluster fails, no repair needs to be executed. This derives from the properties of random codes, combined with the clustering scheme of the invention. Reintegration thus increases the efficiency of the distributed data storage system in terms of resource usage.

Different variant embodiments of the invention are possible that exploit this reintegration of storage devices.

According to a first variant embodiment, the cluster size is maintained at exactly n storage devices. If a storage device fails, it is replaced by a replacement storage device, which is provided with encoded data blocks according to the repair method of the invention. If the failed storage device returns (i.e. it was only temporarily unavailable), it is not reintegrated into the cluster as one of the storage devices of the cluster. Instead, it is integrated as a free device into a pool of storage devices, where it can serve when needed as a replacement device for this cluster or, according to a variant, for another cluster.

According to a second variant embodiment, a failed device that has been repaired (i.e. replaced by a replacement storage device) and that returns to the cluster is reintegrated into the cluster. This means that the cluster, which originally had n storage devices, is now maintained at a level of n+1 storage devices for a certain period of time (i.e. until the next failure). Two cases apply. If no data was changed on the n nodes while the failed device was temporarily absent, the node is simply added to the n storage nodes that are already part of the cluster. If, on the contrary, data was changed, the returning node needs to be synchronized with the remaining n nodes of the cluster. This synchronization does not require the operations needed for a complete repair of a failed node. Rather, for each new file that was stored in the cluster during the absence of the device, it is sufficient to generate one new random linear combination of blocks, as described with the help of Figure 1, and to store the generated new random linear combination on the returning storage device. Of course, as long as the cluster remains at a level of n+1 storage devices, any new file added to the cluster must be spread over the n+1 nodes of the cluster. This continues as long as no device fails. After the next device failure, the cluster size is reduced to n again.

According to a variant embodiment, instead of comprising n storage devices, a cluster comprises n+1 storage devices, or n+2, n+10 or n+m, where m is any integer. This changes neither the data storage method nor the repair method of the invention; it merely has to be taken into account in the storage method that, from a file split into k data blocks, not n but n+m encoded data blocks are created and spread over the n+m storage devices that are part of the cluster. The advantage of having more than n storage devices in a cluster is that there is more redundancy within the cluster, at the cost of more data storage overhead.

Figure 4 shows a device that can be used as a storage device in a distributed storage system that implements the method of storing data items according to the invention. Device 400 can be a general-purpose device that plays the role of either a management device or a storage device. The device comprises the following components, interconnected by a digital data and address bus 414: a processing unit 411 (or CPU, for Central Processing Unit); a non-volatile memory NVM 410; a volatile memory VM 420; a clock 412, which provides a reference clock signal for synchronizing operations between the components of device 400 and for timing purposes; and a network interface 413, which connects device 400, via a connection 415, to other devices connected in the network.

It is noted that the word "register" used in the description of memories 410 and 420 designates, in each of these memories, both a low-capacity memory zone capable of storing some binary data and a high-capacity memory zone capable of storing an executable program or a whole data set.

Processing unit 411 can be implemented as a microprocessor, a custom chip, a dedicated (micro)controller, and so on. Non-volatile memory NVM 410 can be implemented as any form of non-volatile memory, such as a hard disk, non-volatile random-access memory, EPROM (Erasable Programmable ROM), and so on.

The non-volatile memory NVM 410 notably comprises a register 4101 that holds a program representing an executable program comprising the exact repair method according to the invention, and a register 4102 that holds persistent parameters. When powered up, the processing unit 411 loads the instructions comprised in NVM register 4101, copies them to VM register 4201, and executes them.

The VM memory 420 notably comprises: a register 4201 that holds a copy of the program "prog" of NVM register 4101; and a data storage 4202.

A device such as device 400 is suited for implementing the method of storing data items according to the invention, and comprises: means for splitting (CPU 411, VM register 4202) a data file into k data blocks and for creating, from these k data blocks, n encoded data blocks through random linear combination of the k data blocks; and means for spreading (CPU 411, network interface 413) the n encoded data blocks of the file over n storage devices that are part of a same storage device cluster. Each cluster comprises a distinct set of storage devices, and the n encoded data blocks of the file are distributed over the n storage devices of the storage device cluster in such a way that each storage device cluster stores encoded data blocks from at least two different files, and each storage device of a storage device cluster stores encoded data blocks from at least two different files.

According to a variant embodiment, the invention is implemented entirely in hardware, for example as a dedicated component (such as an ASIC, FPGA or VLSI, standing respectively for Application-Specific Integrated Circuit, Field-Programmable Gate Array and Very Large Scale Integration), as distinct electronic components integrated in a device, or as a mix of hardware and software.

Figure 5a shows a flowchart of the method of storing a data file in a distributed data storage system according to the invention.

In a first step 500, the method is initialized. This initialization comprises the initialization of the variables and memory space required for the application of the method. In a step 501, the file to store is split into k data blocks, and n encoded data blocks are created from these k data blocks through random linear combination of the k data blocks. In a step 502, the n encoded data blocks of the file are spread over the storage devices of the distributed data storage system that are part of a same storage device cluster. Each cluster of the distributed data storage system comprises a distinct set of storage devices. The n encoded data blocks of the file are distributed (or spread, in the previously used wording) over a same storage device cluster in such a way that each storage device cluster stores encoded data blocks from two or more files, and each storage device of a storage device cluster stores encoded data blocks from at least two files; see also Figure 2 and its description. In a step 503, the method is done.
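Steps 501 and 502 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: coefficients and symbols live in the small prime field GF(257), each encoded block is kept with its coefficient vector, files are zero-padded to a multiple of k, and the function names (`split_into_blocks`, `encode`, `place`) are hypothetical.

```python
import random

P = 257  # prime field large enough to hold byte values (assumption)


def split_into_blocks(data: bytes, k: int):
    """Step 501, first half: split data into k equal-length blocks."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [list(padded[i * size:(i + 1) * size]) for i in range(k)]


def encode(data: bytes, k: int, n: int, rng: random.Random):
    """Step 501, second half: n random linear combinations of the k blocks."""
    blocks = split_into_blocks(data, k)
    coded = []
    for _ in range(n):
        coeffs = [rng.randrange(P) for _ in range(k)]
        payload = [sum(c * b[j] for c, b in zip(coeffs, blocks)) % P
                   for j in range(len(blocks[0]))]
        coded.append((coeffs, payload))
    return coded


def place(files, cluster_devices, k, rng):
    """Step 502: spread one encoded block of every file on each device."""
    store = {device: {} for device in cluster_devices}
    n = len(cluster_devices)
    for name, data in files.items():
        for device, block in zip(cluster_devices, encode(data, k, n, rng)):
            store[device][name] = block
    return store
```

For example, `place({"X": b"hello world", "Y": b"patent"}, [30, 31, 32, 33], 2, random.Random(1))` gives every device of the four-device cluster one encoded block of X and one of Y, as in the Figure 3 scenario.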

The execution of these steps in a distributed data storage system according to the invention can be carried out by the devices in that system in different ways.

For example, step 501 can be executed by a management device, i.e. a management device that manages the distributed data storage system, or a management device that manages a particular cluster. Instead of being a dedicated device, such a management device can be any device, such as a storage device, that also plays the role of a management device.

Figure 5b shows a flowchart of the method of repairing a failed storage device in a distributed data storage system in which files are split into k data blocks and data is stored according to the storage method of the invention.

In a first step 600, the method is initialized. This initialization comprises the initialization of the variables and memory space required for the application of the method. In a step 601, a replacement storage device is added to the storage device cluster to which the failed storage device belongs. Then, in a step 602, the replacement storage device receives random linear combinations from all k+1 remaining storage devices in the storage device cluster. These combinations are generated from two encoded data blocks from two different files X and Y (note that, according to the data storage method of the invention, each storage device stores encoded data blocks from at least two different files). Then, in a step 603, the received new random linear combinations are combined with one another so as to obtain two linear combinations, one relating only to X and the other relating only to Y. In a final step 604, these two combinations are stored on the replacement device, and the repair is done (step 605).

The repair method can be triggered by detecting that the level of data redundancy has dropped below a predetermined level.

Figure 6a shows a device 700 for managing the storing of data files in a distributed data storage system, the distributed data storage system comprising storage devices interconnected in a network. Device 700 is hereinafter referred to as the storage management device. The storage management device comprises a network interface 703 with a network connection 705 for connection to the network. The storage management device 700 further comprises a data splitter 701, which splits a data file into k data blocks and creates at least n encoded data blocks from these k data blocks through random linear combination of the k data blocks. The storage management device 700 further comprises a storage distributor 702 for storing the at least n encoded data blocks by spreading the at least n encoded data blocks of the file over at least n storage devices that are part of a same storage device cluster. Each cluster comprises a distinct set of storage devices, and the at least n encoded data blocks of the file are distributed by the distributor over the at least n storage devices of the storage device cluster, so that each storage device cluster stores encoded data blocks from at least two different files, and each storage device of a storage device cluster stores encoded data blocks from at least two different files. The data splitter 701, the storage distributor 702 and the network interface 703 are interconnected via a communication bus internal to the storage management device 700.

According to a particular embodiment, the storage management device is itself one of the storage devices in the distributed data storage system.

Figure 6b shows a device 710 for managing the repair of a failed storage device in a distributed data storage system in which data is stored according to the storage method of the invention and a stored file is split into k data blocks. Device 710 is hereinafter referred to as the repair management device. The repair management device 710 comprises: a network interface 713, connected via a connection 715 to the devices in the distributed data storage system; a replacer 711, which adds a replacement storage device to the storage device cluster to which the failed storage device belongs; and a distributor 712, which distributes to the replacement storage device, from any k+1 remaining storage devices in the storage device cluster, k+1 new random linear combinations generated from two encoded data blocks of two different files X and Y stored by each of the k+1 storage devices. The repair management device 710 further comprises a combiner 716, which combines the received new random linear combinations with one another, using algebraic operations, to obtain two linear combinations, i.e. two blocks, one relating only to X and the other relating only to Y. Finally, the repair management device comprises a data writer 717, which stores the two linear combinations on the replacement storage device. The network interface 713, the distributor 712, the replacer 711, the combiner 716 and the data writer 717 are interconnected via an internal communication bus 714.

According to a particular embodiment, the repair management device is itself one of the storage devices of the distributed data storage system.

500: initialization of the storage method

503: end of the storage method

600: initialization of the repair method

605: end of the repair method

Claims

A method of storing data files in a distributed data storage system, the distributed data storage system comprising storage devices interconnected in a network, characterized in that the method, executed for each data file to be stored in the distributed data storage system, comprises the following steps: splitting the data file into k data blocks, and creating at least n encoded data blocks from these k data blocks through random linear combination of the k data blocks; and storing the at least n encoded data blocks by spreading the at least n encoded data blocks of the file over at least n storage devices that are part of a same storage device cluster, each cluster comprising a distinct set of storage devices, the at least n encoded data blocks of the file being distributed over the at least n storage devices of the storage device cluster so that each storage device cluster stores encoded data blocks from at least two different files, and each storage device of a storage device cluster stores encoded data blocks from two different files.

A method of repairing a failed storage device in a distributed data storage system in which data is stored according to the method of claim 1 and a stored file is split into k data blocks, characterized in that the method comprises the following steps: adding a replacement storage device to the storage device cluster to which the failed storage device belongs; receiving, by the replacement storage device, from any k+1 remaining storage devices in the storage device cluster, k+1 new random linear combinations generated from two encoded data blocks of two different files X and Y stored by each of the k+1 storage devices; combining the received new random linear combinations with one another, using algebraic operations, to obtain two linear combinations, i.e. two blocks, one relating only to X and the other relating only to Y; and storing the two linear combinations on the replacement storage device.

The method of claim 2, wherein the repair method comprises reintegrating into the storage device cluster a failed storage device that returns to the distributed data storage system.

A storage management device (700) for storing data files in a distributed data storage system, the distributed data storage system comprising storage devices interconnected in a network, characterized in that the storage management device comprises the following means: a data splitter (701) for splitting a data file into k data blocks and for creating at least n encoded data blocks from these k data blocks through random linear combination of the k data blocks; and a storage distributor (702) for storing the at least n encoded data blocks by spreading the at least n encoded data blocks of the file over at least n storage devices that are part of a same storage device cluster, each cluster comprising a distinct set of storage devices, the at least n encoded data blocks of the file being distributed over the at least n storage devices of the storage device cluster so that each storage device cluster stores encoded data blocks from at least two different files, and each storage device of a storage device cluster stores encoded data blocks from at least two different files.

A repair management device (710) for repairing a failed storage device in a distributed data storage system in which data is stored according to the method of claim 1 and a stored file is split into k data blocks, characterized in that the device comprises the following means: a replacer (711) for adding a replacement storage device to the storage device cluster to which the failed storage device belongs; a distributor (712) for distributing to the replacement storage device, from any k+1 remaining storage devices in the storage device cluster, k+1 new random linear combinations generated from two encoded data blocks of two different files X and Y stored by each of the k+1 storage devices; a combiner (716) for combining the received new random linear combinations with one another, using algebraic operations, to obtain two linear combinations, i.e. two blocks, one relating only to X and the other relating only to Y; and a data writer (717) for storing the two linear combinations on the replacement storage device.