CN107967124B - Distributed persistent memory storage system and method - Google Patents


Info

Publication number: CN107967124B (application CN201711344457.4A)
Authority: CN (China)
Prior art keywords: data, file, key, storage, persistent memory
Legal status (assumed by Google; not a legal conclusion): Active
Application number: CN201711344457.4A
Other languages: Chinese (zh)
Other versions: CN107967124A
Inventors: 刘鹏, 张真, 王昌淦, 章亮, 王义飞, 王小聪
Current Assignee (listing may be inaccurate): Nanjing Innovative Data Technologies Inc
Original Assignee: Nanjing Innovative Data Technologies Inc
Application filed by Nanjing Innovative Data Technologies Inc
Priority: CN201711344457.4A
Published as CN107967124A; application granted and published as CN107967124B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from the processing unit to the output unit, e.g. interface arrangements
    • G06F 3/06 — Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 — Interfaces specially adapted for storage systems
    • G06F 3/0602 — Interfaces specifically adapted to achieve a particular effect
    • G06F 3/0604 — Improving or facilitating administration, e.g. storage management
    • G06F 3/0607 — Facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G06F 3/0628 — Interfaces making use of a particular technique
    • G06F 3/0629 — Configuration or reconfiguration of storage systems
    • G06F 3/0668 — Interfaces adopting a particular infrastructure
    • G06F 3/067 — Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Abstract

A distributed persistent memory storage system and method is designed around persistent memory. It adopts a distributed storage approach in which data is scattered across different nodes, performs I/O on bare devices by managing raw disks directly, and stores the associated metadata in a key/value database in key/value form. The user's operations on the key/value database are abstracted into a corresponding interface, to which a small file system is directly attached; all metadata is loaded into memory, and the file system's data and log files are stored on the bare device through the block device, so the file system and the system can either share one bare device or each be assigned a different device. By combining this distributed storage approach and abandoning the local file system, the invention further reduces file-system overhead and fully exploits the performance advantage of persistent memory.

Description

Distributed persistent memory storage system and method
Technical Field
The invention belongs to the field of distributed big-data cloud storage, and particularly relates to a distributed persistent memory storage system and method.
Background
Information data is growing explosively in modern society, and this explosive growth causes multiple problems, such as data being processed too slowly; storage systems based on traditional storage technology increasingly fail to meet computer systems' requirements for performance and power consumption. However, as memory prices fall and memory capacity grows, and because memory read/write speed exceeds that of magnetic disks by more than an order of magnitude, it has become feasible to develop a distributed persistent memory storage system to handle mass data processing and storage.
A traditional cloud storage system usually adopts a centralized storage node to store all data, so the storage node becomes a bottleneck for system performance and a focal point for reliability and security concerns, and cannot meet the demands of large-scale storage applications. In addition, although using mechanical disks as the storage medium enables mass data storage at low cost, a duplicate log must be written before the data itself, doubling write amplification; with mass data this wastes enormous system resources, consumes a great deal of time, and sharply reduces working efficiency.
Disclosure of Invention
The technical problems to be solved by the invention are as follows. First, a traditional cloud storage system usually locates data by table lookup, which easily becomes a system bottleneck. Second, to support transactions, the interface of a conventional cloud storage system introduces a log mechanism: every write must first be written to a log (in the manner of the journaling file system XFS) and then to the local file system, so each piece of data is written twice; with large-scale sequential IO the disk's throughput is therefore only half of its physical capability. Third, a single IO must pass through several modules such as the kernel module, the bare device, and the storage engine; queue and thread switches occur between modules, and some modules copy memory while handling the IO, so the IO path is too long and overall performance suffers. Fourth, the single request queue of the traditional IO interface standard at the traditional block layer suffers from request-queue lock contention, hardware interrupts, and remote memory access, which limits scalability. Fifth, most conventional cloud storage systems are designed around mechanical hard disks, so the physical performance of persistent memory, especially its latency and IOPS, cannot be fully exploited.
Aiming at the defects in the prior art, the invention provides a distributed persistent memory storage system and method.
In order to achieve the purpose, the invention adopts the following technical scheme:
a distributed persistent memory storage system, comprising: a key/value database, an operation abstract interface, a file system and a bottommost block device; the key/value database stores pre-written logs, data object metadata, addressing data information and distributor metadata; in order to interface an abstract interface of operation, a small file system is realized, and the distribution and management of metadata, file space and disk space are realized; the distributor employs computational addressing algorithms to decide where the real data should be stored; when the data are stored, the data are directly written into the bare device through the distributor, the metadata are stored into a key/value database in a key/value mode, the small file system is directly connected with the small file system through a related operation abstract interface, and the metadata are stored on the bare device through the block device.
In order to optimize the technical scheme, the specific measures adopted further comprise:
the key/value database takes a hash table as a data structure.
The related operations on the key/value database are encapsulated into an interface, and the small file system docks with this interface directly, providing the key/value database with an encapsulation of the underlying system.
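As an illustration only (this sketch is not part of the patent text), the encapsulation described above can be pictured as a thin interface over a hash-table-backed store; the class and method names here are assumptions, and a real implementation would persist to the bare device rather than a Python dict.

```python
class KVStore:
    """In-memory stand-in for the hash-table key/value database.

    The file system layer calls only these methods, never the table
    directly -- mirroring the 'encapsulation of the underlying system'.
    """
    def __init__(self):
        self._table = {}          # hash table as the data structure

    def put(self, key, value):
        self._table[key] = value

    def get(self, key):
        return self._table.get(key)

    def delete(self, key):
        self._table.pop(key, None)
```

A caller stores metadata keyed by object name and reads it back through the same interface, so the storage backend can change without touching the file system code.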
The system adopts nonvolatile memory devices as the storage medium, namely NVM main memory and storage-class memory.
The small file system comprises a client and a server. The server stores data; the client receives and processes file-operation commands intercepted from the virtual file system (VFS); and the connection between client and server is implemented with Remote Procedure Call (RPC) protocol packets.
The computational addressing algorithm is as follows: stripe the file according to preset or required data and number the resulting objects, so each object obtains a unique identifier; hash the objects by identifier to ensure they are uniformly distributed across different virtual nodes; distribute and store the virtual nodes' data uniformly according to device weight, real-time node computing resources, and node network resources. During the calculation, the storage position of a data object is determined by the obtained cluster state map and the distribution policy.
In addition, a storage method for the distributed persistent memory storage system is also provided, comprising the following steps:
when the system performs a write operation, it first judges whether the I/O is aligned according to the minimum allocation unit. An aligned write request generates a logical data block and a binary-file storage container according to the actual size of the metadata; the region spanned by the data block is an integer multiple of the minimum allocation unit. If the interval was written before, the previous data block is recorded to facilitate subsequent space reclamation, and the data is then written into the binary-file storage container;
a non-aligned write request first searches for a reusable binary-file storage container according to the offset of the starting position. If one is found, an align-and-zero-pad operation is performed according to the block size, and the system then judges whether the container's free space can be used directly, distinguishing between a direct write and an overwrite;
when a direct write is executed, the system directly generates a logical data block and places it in the binary-file storage container;
when an overwrite is executed, the system aligns to the block size according to the offset and length of the starting position. If the covered region is exactly aligned to the block size, no data needs to be read from the bare device; if it is not aligned, the unaligned part is read out to assemble an aligned buffer, a new logical data block is generated, the original logical data block is adjusted, the region to be reclaimed is recorded, and the result is finally written into the binary-file storage container. In the overwrite case, data is never written directly to the bare device; it is written to the key/value database through the write-ahead log system;
if no reusable binary-file storage container is found, the system performs zero-padding alignment according to whether the offset and length of the starting position are aligned to the block size, and finally places the aligned starting offset and length as data blocks into a binary-file storage container.
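The first decision of the method above — aligned versus non-aligned — can be sketched as a simple check (an illustrative sketch, not patent text; the 4 KB minimum allocation unit is taken from the default block size mentioned later in the description):

```python
MIN_ALLOC = 4096  # assumed minimum allocation unit (4 KB default block size)

def classify_write(offset, length, min_alloc=MIN_ALLOC):
    """Return 'aligned' if the request starts and ends on allocation-unit
    boundaries; otherwise 'unaligned', which triggers the reusable-container
    lookup and zero-padding path described in the method."""
    if offset % min_alloc == 0 and length % min_alloc == 0:
        return "aligned"
    return "unaligned"
```

For example, a write of 8192 bytes at offset 0 takes the aligned path, while a write of 4096 bytes at offset 100 takes the non-aligned path.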
The invention has the beneficial effects that:
[Table comparing cost and performance of storage media — patent image BDA0001507781650000031, not reproduced]
As the above table shows, enterprises generally weigh cost when selecting a storage system. A conventional cloud storage system usually uses mechanical media such as disks and tape libraries as its storage medium; its cost is low, but its performance efficiency is correspondingly reduced. The distributed persistent memory storage system designed by the invention uses nonvolatile memory devices such as flash memory and SSDs as the storage medium; the cost is acceptable to most enterprises, making it possible to improve the system's overall performance at a reasonable cost.
Drawings
FIG. 1 is a block diagram of the overall architecture of the distributed persistent memory storage system of the present invention.
FIG. 2 is a schematic diagram of the computational addressing algorithm of the distributed persistent memory storage system of the present invention.
FIG. 3 is a block diagram of the mini-file system of the distributed persistent memory storage system of the present invention.
FIG. 4 is a diagram of an application program interface multi-queue structure of the mini-file system of the distributed persistent memory storage system of the present invention.
FIG. 5 is a process diagram of a distributed persistent memory storage system implementing a read I/O operation in accordance with the present invention.
FIG. 6 is a flow chart of a write I/O operation implemented by the distributed persistent memory storage system of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
Aiming at the defects of the traditional cloud storage system — data write amplification, an overly long IO path, insufficient support for high-performance hardware, and so on — the distributed persistent memory storage system shown in FIG. 1 is provided. It comprises a key/value database, an operation abstraction interface, a file system, and the bottom-level block device. The key/value database stores write-ahead logs, data-object metadata, addressing data, and the distributor's metadata (the distributor is responsible for deciding where the actual data should be stored). The operation abstraction interface provides the key/value database with an encapsulation of the underlying system. A small file system is implemented to dock with the operation abstraction interface, handling the allocation and management of metadata, file space, and disk space. Because the local file system is abandoned, I/O operations on the block-device file can be performed directly through the Linux system.
The system uses computational addressing instead of the table-lookup approach of traditional distributed storage. During storage, data is written directly to the bare device through the distributor, metadata is stored in the key/value database in key/value form, and the small file system is then docked directly through the related operation abstraction interface and stores the metadata on the bare device through the block device. This improves mass-data write speed, reduces resource waste and time consumption, achieves automatic load balancing across storage nodes, and maximizes overall system performance.
By scattering data across different storage nodes, the storage load is shared among them; storage locations are found through the computational addressing algorithm, which improves the system's reliability, availability, and access efficiency and makes it easy to expand.
The system abandons local file systems such as ext4/xfs and manages the bare device directly. Direct management of the bare device necessarily requires managing its space, so a corresponding distributor was developed; it uses the computational addressing algorithm to determine where the real data should be stored.
Metadata is needed to locate the mapping position of data, so metadata storage is critical. The system stores metadata in key/value form in a key/value database whose data structure is a hash table, improving metadata read speed. To make the key/value database convenient for users, the system encapsulates its related operations into an interface that docks directly with the small file system, providing the key/value database with an encapsulation of the underlying system. The system also introduces the small file system, which ensures that metadata can be better stored on the block device, and adopts a brand-new application program interface standard.
The system adopts nonvolatile memory devices, namely NVM main memory and storage-class memory, as the storage medium; their characteristic is that data is retained even after power loss.
In the computational addressing algorithm shown in FIG. 2, a file is first fragmented, i.e. striped, according to preset or required data; the resulting objects are numbered, and each object obtains a unique id. Next, the objects are hashed by id, ensuring they are uniformly distributed across different virtual nodes. Finally, the data of the virtual nodes containing the objects is uniformly distributed and stored according to the computational addressing algorithm together with device weights, real-time node computing resources, node network resources, and so on. During the calculation, the storage position of a data object is determined by the obtained cluster state map and the distribution policy.
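The three steps above — stripe, hash to a virtual node, place by weight — can be sketched as follows. This is an illustrative sketch, not patent text: the stripe size, virtual-node count, hash function (MD5), and the straw-style weighted draw are all assumptions standing in for the unspecified details of the algorithm.

```python
import hashlib

STRIPE = 4 * 1024 * 1024  # assumed stripe (object) size
VNODES = 128              # assumed number of virtual nodes

def stripe(file_id, size, stripe_size=STRIPE):
    """Step 1: split a file into numbered objects, each with a unique id."""
    count = (size + stripe_size - 1) // stripe_size
    return [f"{file_id}.{i}" for i in range(count)]

def vnode_of(obj_id, vnodes=VNODES):
    """Step 2: hash the object identifier onto a virtual node."""
    h = int(hashlib.md5(obj_id.encode()).hexdigest(), 16)
    return h % vnodes

def place(vnode, devices):
    """Step 3: weighted placement -- pick a device by a deterministic
    weighted hash draw. `devices` maps device name -> weight; a higher
    weight receives proportionally more data."""
    best, best_score = None, -1.0
    for name, weight in devices.items():
        h = int(hashlib.md5(f"{vnode}:{name}".encode()).hexdigest(), 16)
        score = weight * (h % 10**6)   # straw-style draw scaled by weight
        if score > best_score:
            best, best_score = name, score
    return best
```

Because every step is a pure function of the object id and the device map, any node can compute the same placement without consulting a lookup table, which is the point of computational addressing.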
As shown in FIG. 3, the mini file system developed for the system consists of a client and a server. The server stores data; the client receives and processes file-operation commands intercepted from the virtual file system (VFS); and the connection between client and server is implemented with Remote Procedure Call (RPC) protocol packets. The workflow of the mini file system is as follows:
First, the application calls the file-storage function of the VFS, and the VFS stores the file under the mini file system's folder. The VFS operation is passed to the mini file system's kernel module, which receives the VFS storage command and passes it to the client; the client is responsible for storing the data handed over by the kernel. The client adds the data-storage command to the write queue of the corresponding device (which may include the server) and performs the write at an appropriate time.
When the system reads data, the application only issues a VFS read command. The VFS tells the mini file system's kernel module that a file must be read; the kernel module, following its own scheme, tells the client which file to read; and the client fetches the file from some device (possibly the server, communicating via RPC packets) and sends it back to the application.
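The read path above — VFS, kernel module, client, server — can be pictured with an in-process stand-in for the RPC boundary. This is an illustrative sketch, not patent text; the class names and the dict-based request format are assumptions, and the direct method call stands in for a real RPC packet exchange.

```python
class Server:
    """Holds stored file data; in the real system this sits behind RPC."""
    def __init__(self):
        self.files = {}

    def handle(self, request):
        # stand-in for an RPC endpoint decoding a protocol packet
        op, name = request["op"], request["name"]
        if op == "read":
            return self.files.get(name)

class Client:
    """Receives commands the kernel module intercepted from the VFS."""
    def __init__(self, server):
        self.server = server           # real system: an RPC connection

    def read(self, name):
        # the client fetches the file from the (possibly remote) server
        # and hands it back up toward the application
        return self.server.handle({"op": "read", "name": name})
```

A read of a file the server does not hold simply returns nothing, which in the full system corresponds to the zero-fill behavior described later for unwritten data.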
As shown in FIG. 4, the system adopts a completely new interface standard for the mini file system's application program interface. In the IO stack, after submit_bio submits a bio request to the block layer, the generic_make_request and __generic_make_request functions provided by the block layer perform simple checks and processing, and the bio request is then submitted to the driver layer. There the make_request function registered by the interface-standard driver handles it: it adds the request to the request queue of the corresponding CPU, generates the command format required by the interface-standard protocol for the request, and prepares a scatter list for DMA for the command. Finally, the device is notified, in PIO mode, to fetch the newly submitted command from the request queue. After the device finishes the command, it first posts the completion status to the completion queue of the corresponding host CPU, then notifies the host of IO completion via an interrupt, and the host responds according to the command's completion state.
The distributed persistent memory storage system is specifically designed as follows:
and configuring a weight for each device when the cluster is constructed, and calculating the distribution of the data objects by adopting a calculation addressing algorithm through the weight of each device and real-time storage node load resources. The distribution of the objects is mainly determined by a cluster load state diagram and a distribution strategy, wherein the cluster load state diagram describes available resources, real-time load and a hierarchical structure of the system, the cluster load state diagram comprises the number of racks and storage node hard disks, and the distribution strategy specifies the storage strategy of the storage pool, and the storage strategy comprises copies, erasure codes and storage limits of the copies and the erasure codes, including whether the copies are distributed in different racks, different servers and the like. The set of data x to storage devices is calculated by these composite factors: ala (x) ═ e (devicel, device2, …, deviceN).
Since local file systems such as ext4/xfs are abandoned in favor of directly managing bare devices, the system necessarily has to manage the bare device's space; it uses a distributor whose computational addressing algorithm determines where real data should be stored, achieving finer-grained writes.
For metadata storage the system adopts a key/value database, so metadata is stored in the database in key/value form. A piece of metadata contains multiple logical data blocks, recorded with byte addressing. A logical data block is mapped to a binary-file storage container by the container's id, and the container is mapped to a region on the actual physical disk by the offset and length of its starting position.
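The two-level mapping just described — logical block → container id → physical disk region — can be sketched with two small records (illustrative only, not patent text; field names and the `container_offset` field are assumptions added to make the address translation concrete):

```python
from dataclasses import dataclass

@dataclass
class Container:
    """Binary-file storage container: a region on the physical disk."""
    cid: int
    disk_offset: int   # where the container's region starts on disk
    length: int        # size of the region

@dataclass
class LogicalBlock:
    """Logical data block inside a piece of metadata (byte-addressed)."""
    file_offset: int       # byte address within the file
    length: int
    cid: int               # id of the container holding the bytes
    container_offset: int  # position of the bytes inside that container

def resolve(block, containers):
    """Translate a logical data block to its physical disk offset."""
    c = containers[block.cid]
    assert block.container_offset + block.length <= c.length
    return c.disk_offset + block.container_offset
```

Looking up a block is therefore one key/value fetch (block → container id) plus one addition, which keeps the metadata path short.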
The system encapsulates the user's operations on the key/value database into an interface that connects the key/value database and the file system, and persists metadata in the log. When the system mounts the file system, replaying the log is enough to store all metadata on the bare device through the block device.
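The replay-on-mount behavior can be sketched as a minimal write-ahead log (illustrative only, not patent text; the list-backed log and the last-write-wins replay rule are assumptions standing in for the real persistent log format):

```python
class WAL:
    """Minimal write-ahead-log sketch: metadata updates are appended to
    the log first; on mount, replaying the log rebuilds the key/value
    state, so nothing else needs to be read back."""
    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))   # log write precedes any apply

    def replay(self):
        state = {}
        for key, value in self.entries:     # later entries win
            state[key] = value
        return state
```

Because every metadata update reaches the log before anything else, the state reconstructed by `replay()` is exactly the state at the last append, which is what makes the mount-time rebuild safe.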
To fully exploit the performance of current and future persistent memory, the system adopts a brand-new interface standard for the mini file system's application program interface: it defines a dedicated command set and optimized interface registers for persistent memory, and uses multiple submission queues and completion queues on the host side to provide parallelism and scalability.
First, for persistent memory with high random-access performance, the IO stack based on the new interface bypasses the IO scheduler at the block layer, eliminating the latency the scheduler adds through request merging, reordering, and similar operations. Second, in the interface driver, the client implements one pair of admin submission/completion queues and can support multiple pairs of IO submission and IO completion queues; the queues scale with the number of CPU cores, reducing queue-lock contention, improving cache hit rate and parallelism, fully utilizing hardware bandwidth, and providing good scalability.
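The per-core queue-pair idea can be sketched as follows (illustrative only, not patent text; the class shape and the synchronous `device_poll` stand-in for doorbell/interrupt handling are assumptions):

```python
from collections import deque

class MultiQueue:
    """One submission/completion queue pair per CPU core, so cores can
    submit commands without sharing a queue lock -- a simplified model of
    the multi-queue interface standard described above."""
    def __init__(self, cores):
        self.sq = [deque() for _ in range(cores)]   # submission queues
        self.cq = [deque() for _ in range(cores)]   # completion queues

    def submit(self, core, cmd):
        # each core appends only to its own queue: no cross-core lock
        self.sq[core].append(cmd)

    def device_poll(self):
        # the device drains each submission queue and posts the completion
        # status to the matching completion queue (interrupt omitted)
        for core, q in enumerate(self.sq):
            while q:
                cmd = q.popleft()
                self.cq[core].append(("done", cmd))
```

Since submissions from different cores never touch the same queue, adding cores adds queue pairs rather than contention, which is the scalability argument made in the text.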
The system exposes the persistent memory chip directly to the CPU, i.e. places it on the memory bus, so the CPU can address the memory storage chip with ordinary loads and stores. This lets the mini file system operate atomically and guarantees program execution order and file-system consistency. The architecture reduces access latency and exploits the byte-addressing characteristics of the memory storage chip.
In addition, when DRAM and the persistent memory have similar performance, random reads and writes combined with a large cache would inflate the cache and cause repeated reads and writes, hurting file-system performance; the mini file system therefore eliminates the DRAM-based cache and relies on the faster CPU cache, greatly improving file-system performance.
Persistent memory is chosen for the bare device. Compared with volatile memory, persistent memory offers high speed, high density, miniaturization, low power consumption, radiation resistance, and retention of data after power loss; it effectively improves the system's throughput and overall performance and ensures its reliability. The nonvolatile memory used in the system falls into two main categories:
The first category is NVM main memory, which can directly replace traditional DRAM or be combined with it into a hybrid main memory. Its advantage is that it is controlled by hardware: only the main-memory controller needs modest improvements tailored to NVM's characteristics, upper-layer applications see no change, and no system-level changes (operating system, file system, etc.) are required. The NVM main-memory structure chiefly exploits persistent memory's low static power consumption while meeting traditional DRAM's performance requirements and raising storage density as much as possible.
The second category is storage-class memory, a general term for storage devices positioned in the hierarchy between traditional DRAM main memory and HDD external storage. Compared with HDD external storage, storage-class memory has no moving parts, low latency, and high throughput; compared with DRAM main memory, it offers nonvolatility, low cost, low power consumption, and other advantages.
By combining the two categories of persistent memory, users can select different types for different usage scenarios to bring out the system's overall performance.
Because the system stores data on persistent memory according to computational addressing, a memory file system is used to make full use of the hardware; internal data transfers use RDMA over an InfiniBand network, and the network model is asynchronous, raising link concurrency and maximizing the speed of the memory chips.
As shown in FIG. 5, when the system performs a read operation, it searches for the relevant logical data blocks; if unwritten data is encountered during the search, that data is simply zero-filled.
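The zero-fill behavior on reads can be sketched directly (illustrative only, not patent text; representing written data as an offset→bytes map is an assumption):

```python
def read_with_zero_fill(blocks, offset, length):
    """Read `length` bytes starting at `offset`; any region with no
    written logical block comes back zero-filled, matching the read
    path in FIG. 5.
    `blocks` maps byte offset -> bytes actually written there."""
    out = bytearray(length)            # starts out all zeros
    for start, data in blocks.items():
        for i, byte in enumerate(data):
            pos = start + i - offset
            if 0 <= pos < length:      # copy only overlapping bytes
                out[pos] = byte
    return bytes(out)
```

A read over a completely unwritten range therefore returns all zeros without touching the bare device at all.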
As shown in FIG. 6, when the system performs a write operation, it is determined whether the I/O is aligned according to the minimum allocation unit.
An aligned write request generates a logical data block and a binary-file storage container according to the actual size of the metadata; the region spanned by the data block is an integer multiple of the minimum allocation unit. If the interval was written before, the previous data block is recorded to facilitate subsequent space reclamation, and the data is then written into the binary-file storage container.
A non-aligned write request first looks for a reusable binary-file storage container according to the offset of the starting position.
If a reusable binary-file storage container is found, zero padding to the block size is performed first, and the system then judges whether the container's free space can be used directly, distinguishing between a direct write and an overwrite.
When a direct write is executed, the system directly generates a logical data block and places it in the binary-file storage container. When an overwrite is executed, the system aligns to the block size (4 KB by default) according to the offset and length of the starting position. If the covered region is exactly aligned to the block size, no data needs to be read from the bare device; if it is not aligned, the unaligned part is read out and spliced into an aligned buffer, a new logical data block is generated, the original logical data block is adjusted, the portion to be reclaimed is recorded, and the result is finally written into the binary-file storage container. In the overwrite case, data is never written directly to the bare device; it is written to the key/value database through the write-ahead log system.
If no reusable binary-file storage container is found, the system performs zero-padding alignment according to whether the offset and length of the starting position are aligned to the block size (4 KB by default). The padding is therefore tied to the subsequent disk write, which can proceed in two ways: direct I/O, which requires the offset and buffer to be aligned; or standard I/O, which does not require alignment but writes to a buffer first and then synchronizes to disk, reducing write efficiency. Finally, the aligned starting offset and length are used as data blocks and placed into a binary-file storage container.
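The overwrite alignment rule in FIG. 6 — expand the covered region to block boundaries, and read back from the device only when the edges are unaligned — can be sketched as a small planning function (illustrative only, not patent text; the function name and return shape are assumptions, the 4 KB default is from the description):

```python
BLOCK = 4096  # default block size stated in the description (4 KB)

def overwrite_plan(offset, length, block=BLOCK):
    """Decide whether an overwrite needs a read-modify-write.
    Returns (aligned_start, aligned_end, needs_read): the block-aligned
    region to rewrite, and whether the unaligned head/tail must first be
    read back from the bare device to splice an aligned buffer."""
    head = offset % block
    tail = (offset + length) % block
    aligned_start = offset - head
    aligned_end = offset + length + ((block - tail) % block)
    needs_read = head != 0 or tail != 0
    return aligned_start, aligned_end, needs_read
```

An overwrite that lands exactly on block boundaries skips the read-back entirely, while one starting at, say, offset 100 forces both edges to be read and spliced first.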
The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the embodiment described; all technical solutions falling within the idea of the present invention belong to its scope of protection. It should be noted that those skilled in the art may make modifications and refinements without departing from the principle of the invention, and these likewise fall within the scope of protection.

Claims (6)

1. A distributed persistent memory storage system, comprising: a key/value database, an operation abstraction interface, a file system, and an underlying block device; wherein the key/value database stores write-ahead logs, data object metadata, addressing information, and allocator metadata; the file system is implemented against the operation abstraction interface and handles the allocation and management of metadata, file space, and disk space; the allocator employs a computational addressing algorithm to decide where data should be stored; when data is stored, the data is written directly to the raw device through the allocator, the metadata is stored in the key/value database in key/value form, and the system then connects directly to the file system through the relevant operation abstraction interface and stores data on the raw device through the block device; wherein the computational addressing algorithm specifically comprises: striping a file according to preset or required data to obtain separate objects and numbering them, so that each object obtains a unique identifier; hashing the objects according to their identifiers so that the objects are evenly distributed across different virtual nodes; distributing and storing the data of the virtual nodes containing the objects evenly according to device weights, real-time node computing resources, and node network resources; and, during the calculation, determining the storage location of a data object from the acquired cluster state map and the distribution policy.
2. A distributed persistent memory storage system as claimed in claim 1, wherein: the key/value database uses a hash table as its data structure.
3. A distributed persistent memory storage system as claimed in claim 1, wherein: the operations of the key/value database are encapsulated into an interface, and the interface connects directly to the file system, providing the key/value database with an encapsulation of the underlying system.
4. A distributed persistent memory storage system as claimed in claim 1, wherein: the system uses non-volatile memory devices as its storage medium, namely NVM main memory and storage-class memory.
5. A distributed persistent memory storage system as claimed in claim 1, wherein: the file system comprises a client and a server, the server is used to store data, the client receives and processes file operation commands intercepted from the virtual file system (VFS), and the connection between the client and the server is implemented by Remote Procedure Call (RPC) protocol packets.
6. A method of storing in a distributed persistent memory storage system according to any one of claims 1 to 5, comprising the steps of:
when the system performs a write operation, whether the I/O is aligned is judged against the minimum allocation unit; an aligned write request is processed by generating a logical data block and a binary file storage container according to the actual size of the metadata, the region spanned by the data block being an integer multiple of the minimum allocation unit; if the region has been written previously, the previous data block is recorded to facilitate subsequent space reclamation, and the data block is then written into the binary file storage container;
a non-aligned write request first searches, according to the offset of the starting position in the binary file, for a reusable binary file storage container; if one is found, an alignment and zero-padding operation is first performed according to the block size, and the free space of the binary file storage container is then examined to decide whether a direct write or an overwrite operation should be carried out;
when the direct write operation is executed, the system directly generates a logical data block and places it in the binary file storage container;
when the overwrite operation is executed, the system aligns to the block size according to the offset and length of the starting position; if the overwritten region is exactly aligned to the block size, the data does not need to be read from the raw device; if it is not aligned, the unaligned portion is read out to form an aligned buffer, a new logical data block is then generated, the original logical data block is adjusted, the region to be reclaimed is recorded, and the result is finally written into the binary file storage container; in the overwrite case, the data is not written directly to the raw device but is written to the key/value database through the write-ahead log system;
if no reusable binary file storage container is found, the system performs a zero-padding alignment operation depending on whether the offset and length of the starting position are aligned to the block size, and finally the aligned starting offset and length are taken as a data block, which is then placed into the binary file storage container.
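The computational addressing algorithm of claim 1 (striping into uniquely identified objects, hashing identifiers onto virtual nodes, and weight-based placement) can be sketched as follows. The function names, the SHA-256 hash, and the weight-only placement are illustrative assumptions; the claim's real-time compute and network criteria and the cluster state map are simplified to a static device-weight list.

```python
import hashlib

def stripe(file_id: str, data: bytes, stripe_size: int):
    """Split a file into numbered objects; each object gets a unique
    identifier derived from the file id and its stripe number."""
    return [(f"{file_id}.{i}", data[off:off + stripe_size])
            for i, off in enumerate(range(0, len(data), stripe_size))]

def virtual_node(object_id: str, vnode_count: int) -> int:
    """Hash the object identifier to spread objects evenly over
    virtual nodes."""
    digest = hashlib.sha256(object_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % vnode_count

def place(vnode: int, device_weights) -> int:
    """Map a virtual node to a device index in proportion to device
    weight (a stand-in for the weight/compute/network criteria)."""
    total = sum(device_weights)
    slot = vnode % total
    for dev, weight in enumerate(device_weights):
        if slot < weight:
            return dev
        slot -= weight
    raise AssertionError("unreachable: slot < total by construction")
```

Because placement is computed from the identifier alone, any node holding the same device-weight view can locate an object without consulting a central metadata server, which is the point of computational addressing.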
CN201711344457.4A 2017-12-14 2017-12-14 Distributed persistent memory storage system and method Active CN107967124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711344457.4A CN107967124B (en) 2017-12-14 2017-12-14 Distributed persistent memory storage system and method

Publications (2)

Publication Number Publication Date
CN107967124A CN107967124A (en) 2018-04-27
CN107967124B true CN107967124B (en) 2021-02-05

Family

ID=61995432

Country Status (1)

Country Link
CN (1) CN107967124B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012522321A (en) * 2009-03-30 2012-09-20 オラクル・アメリカ・インコーポレイテッド Data storage system and method for processing data access requests
CN104484130A (en) * 2014-12-04 2015-04-01 北京同有飞骥科技股份有限公司 Construction method of horizontal expansion storage system
CN105338118A (en) * 2015-11-30 2016-02-17 上海斐讯数据通信技术有限公司 Distributed storage system
WO2016053198A1 (en) * 2014-10-03 2016-04-07 Agency For Science, Technology And Research Distributed active hybrid storage system
CN106708425A (en) * 2015-11-13 2017-05-24 三星电子株式会社 Distributed multimode storage management
CN107239569A (en) * 2017-06-27 2017-10-10 郑州云海信息技术有限公司 A kind of distributed file system subtree storage method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant