CN104899156A - Large-scale social network service-oriented graph data storage and query method - Google Patents

Large-scale social network service-oriented graph data storage and query method

Info

Publication number
CN104899156A
CN104899156A (application CN201510229346.3A; granted as CN104899156B)
Authority
CN
China
Prior art keywords
block
data
edge
vertex
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510229346.3A
Other languages
Chinese (zh)
Other versions
CN104899156B (en)
Inventor
周薇
包秀国
马宏远
程工
冉攀峰
刘春阳
王卿
韩冀中
庞琳
李雄
贺敏
刘玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, National Computer Network and Information Security Management Center filed Critical Institute of Information Engineering of CAS
Priority to CN201510229346.3A priority Critical patent/CN104899156B/en
Publication of CN104899156A publication Critical patent/CN104899156A/en
Application granted granted Critical
Publication of CN104899156B publication Critical patent/CN104899156B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a graph data storage and query method for large-scale social networks. A data storage manager stores the received graph data in Key-Value form, with the vertex ID of the graph data as the Key and the vertex neighborhood as the Value, and stores the data of each vertex neighborhood as follows: the edges connected to the vertex are stored in timestamp order in fixed-size memory blocks, the blocks are linked into a doubly linked list, and the vertex's attribute information and index information are stored in a separate data structure. When the data storage manager receives a request to access a vertex v, it transfers vertex v and its k-order neighborhood to the requester; the requester caches the returned data locally, checks the local cache first on the next query, and sends an access request to the data storage manager only if the queried vertex is not found. The graph data storage and query method supports dynamically updated scenarios and is well suited to sparse data and random access.

Description

Graph data storage and query method for large-scale social networks
Technical field
The present invention relates to a graph data storage and query method for large-scale social networks, and belongs to the field of software technology.
Background technology
At present, the mainstream way of storing graph data is to preprocess the graph into records of edges and vertices and store them, as sequential data sets, in large files of a distributed file system. When the graph data are accessed, the large files storing the graph data are read by sequential scanning. For graph computing applications that iterate over many rounds, this organization cannot provide effective storage and access performance, so in-memory management of graph data has become an important trend for improving access performance, as in Trinity, Giraph and similar systems.
Neo4j is a graph database that adopts a Key-Value storage model whose most basic storage units are vertices and edges. For operations that fetch the neighborhood of a vertex, such as breadth-first traversal (BFS), its performance is clearly lower than that of systems whose basic storage unit is the vertex neighborhood. All Neo4j data are stored on disk, and a memory cache speeds up data access; the cache size can be tuned to obtain the best performance.
Trinity is a graph computing engine based on a memory cloud, designed by Microsoft Research Asia. It adopts memory-based Key-Value storage, where the key is the unique ID of a graph vertex and is used to locate the addressed cell, and the cell is a contiguous memory block of variable length that holds the adjacent-vertex information of the vertex's neighborhood. Each vertex corresponds to one cell, the cells of the graph are stored in trunks, and a trunk is a contiguous memory region no larger than 2 GB.
Redis also adopts Key-Value storage, but Redis can only handle graph data that fit in the cluster's memory.
Neo4j adopts a Key-Value storage model, but when the graph data are significantly larger than the memory cache, its performance drops markedly; when accessing remote data, Neo4j requests data on demand and cannot make full use of the bandwidth.
For Trinity, when data are updated frequently, insertions and deletions cause cells to grow or shrink. When a cell grows beyond its existing storage region, a larger region must be allocated and the cell moved from its original location to the new one; when a cell shrinks, the freed part of the original region is left vacant, producing many memory fragments. Cell growth and shrinkage therefore create a large amount of memory fragmentation and reduce memory utilization. Trinity uses memory compaction and pre-allocation of larger memory regions to combat external fragmentation, but with either method the movement of data within memory is unavoidable, and this operation is very time-consuming; under frequent graph updates, Trinity's performance is therefore unsatisfactory. In Trinity, when the whole graph is too big to fit in memory, some trunks are placed on disk. When a requested cell is not in memory, the whole target trunk must be swapped in; because graph data access is random, the locality of access is poor, and the cells of several consecutive requests may not lie in the trunk that was just swapped in, so further trunks must be swapped in. For remote data access, Trinity likewise requests data on demand.
Redis also adopts Key-Value storage; when the volume of graph data grows beyond the available memory, the only remedy is to add machine nodes. When the cluster grows, the data must be repartitioned and migrated. Data partitioning in Redis is left to the client, and the server program cannot proxy requests for remote graph data.
From the above work, the data access demands of graph computing and the characteristics of graph data pose challenges to graph data storage: traditional storage methods such as (distributed) file systems and Key-Value graph databases have difficulty supporting efficient graph data access and update, and therefore cannot effectively support graph computing. An efficient graph data storage system must be designed according to the access characteristics of graph data and the nature of the data themselves.
Summary of the invention
The object of the present invention is to provide a graph data storage and query method for large-scale social networks, addressing the massive scale, frequent updates, random access and sparsity of social network data.
The present invention supports random access to neighborhoods. A three-level graph data storage structure based on neighborhoods is designed: the basic storage unit is the vertex neighborhood, which is a doubly linked list of fixed-size memory Blocks containing all edges incident to the vertex, and each Block can be addressed directly.
The present invention adopts a zero-movement data update mechanism. A memory organization suited to frequent graph updates is designed: when the graph is updated, a Block pool is responsible for allocating and reclaiming Blocks. When allocation or reclamation makes a vertex neighborhood grow or shrink, only the head or tail of its doubly linked list is modified; there is no need to compact the storage space to eliminate external fragmentation, nor to allocate a larger space and move the data from the original location to the newly allocated one. This mechanism both increases memory utilization and avoids the compaction and movement operations faced by other solutions.
The present invention adopts an application-aware remote prefetch strategy. A single graph data request touches only a small amount of data, which leads to frequent remote I/O operations. With this strategy, while a remote data request is being processed, the data expected by subsequent computation are predicted, and the predicted data are returned to the client process together with the currently requested data and cached locally. This mechanism significantly reduces the number of I/O operations and makes data access predictable and effective. The present invention analyses common graph algorithms and, according to which vertices of the graph must be referenced in each update round, divides them into two classes: neighborhood algorithms and non-neighborhood algorithms. In a neighborhood algorithm, each round of iteration only needs the values of adjacent vertices to update a vertex's attribute value; in a non-neighborhood algorithm, updating a vertex's attribute value also needs the values of vertices outside its neighborhood.
The technical solution of the present invention is as follows:
A graph data storage method for large-scale social networks, comprising the steps of:
1) a data storage manager stores the received graph data in Key-Value form, with the vertex ID of the graph data as the Key and the vertex neighborhood as the Value;
2) the data of each vertex neighborhood are stored as follows: the edges connected to the vertex are stored in timestamp order in fixed-size memory Blocks, the occupied Blocks are linked into a Block doubly linked list, and the vertex's attribute information, together with index information pointing to the head and tail of the Block doubly linked list, is stored in a data structure Vertex.
Further, the memory Blocks are allocated and reclaimed by a Block pool manager responsible for storage resource allocation; each Block in the Block doubly linked list comprises three parts: the first part is a pointer to the previous Block in the list, the second part is a pointer to the next Block in the list, and the third part holds the edges stored in the current Block.
Further, in step 2), when a vertex has a new edge to be stored into its neighborhood, the new edge is ordered by timestamp and the Block at the head of the Block doubly linked list is checked: if it is not full, the new edge is added to that head Block; if the head Block is full, the Block pool manager allocates a new Block, adds it to the head of the Block doubly linked list, and the new edge is stored in the new Block.
Further, each memory Block is provided with a data field holding the timestamp interval of the edges it stores. When edges are deleted from a vertex neighborhood, an edge is chosen as the deletion threshold and processing starts from the Block at the tail of the Block doubly linked list: if the timestamp of that edge is smaller than the lower bound of the tail Block's timestamp interval, the deletion ends; if it is larger than the upper bound of the interval, the Block is removed from the tail of the Block doubly linked list; if it lies within the interval, the edges in the Block are scanned and the edges earlier than that timestamp are deleted.
Further, the data storage manager divides storage resources into two classes, coarse-grained storage resources and fine-grained storage resources, and a bounded buffer window is set up in the memory of the data storage manager. The coarse-grained storage resources are large files of a configured size and reside in memory, while the fine-grained storage resources are small files allocated on demand. The data storage manager loads fine-grained storage resources into the bounded buffer window as needed; when the window is full and a new fine-grained storage resource must be loaded, an existing fine-grained storage resource in the window is evicted according to a cache replacement algorithm and the new one is then loaded.
Further, both the coarse-grained storage resources and the fine-grained storage resources are divided into the fixed-size memory Blocks, and each memory Block has a unique Block ID.
Further, the data storage manager first allocates memory Blocks from the coarse-grained storage resources; when the coarse-grained storage resources are exhausted, it begins to allocate fine-grained storage resources dynamically on demand.
Further, the data storage manager periodically scans the graph data, deleting expired graph data from the coarse-grained storage resources and dumping them into the fine-grained storage resources.
A graph data query method: when the data storage manager receives an access request for a vertex v, the data storage manager transfers vertex v and its k-order neighborhood to the requester; the requester caches the returned data locally, checks the local cache first on the next query, and sends the access request to the data storage manager only if the queried vertex is not found there.
Further, the k-order neighborhood of vertex v is the set of all vertices reachable from vertex v within a shortest distance of k edges.
Compared with the prior art, the positive effects of the present invention are:
(1) Dynamic updates are supported. Social network users produce data frequently; in practical applications, data analysis imposes high timeliness requirements, and the new data generated by user behavior must be added continuously.
(2) Sparse data are handled well. In social networks the number of friends varies greatly between users: a small number of users (for example celebrity accounts) have very many friends, while most ordinary users have few. Statistically, social network data form an extremely sparse graph with a very small average out-degree.
(3) Random access is handled well. Unlike access to a sequential data set, graph data access generally starts from some vertex, then visits its adjacent vertices, then the adjacent vertices of those vertices, and so on, proceeding outward from the starting vertex until the whole graph has been traversed.
Description of the drawings
Fig. 1 shows data storage management in the method of the present invention;
Fig. 2 shows the storage model of a vertex neighborhood;
Fig. 3 compares the experimental results for local writes;
Fig. 4 compares the experimental results for local reads;
Fig. 5 compares memory and disk storage consumption;
Fig. 6 compares the experimental results for remote writes;
Fig. 7 compares the experimental results for remote reads;
Fig. 8 compares the experimental results for data updates;
Fig. 9 compares the performance with and without remote prefetching.
Embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings; the examples serve only to explain the present invention and are not intended to limit its scope.
The present invention can be divided into three parts: a zero-movement data organization strategy, a memory management strategy that adapts to data scale, and an application-aware remote prefetch strategy.
Zero-movement data organization strategy
Incremental graph update means inserting newly produced edges into the graph and deleting outdated old edges from it. The activity traces of social media users continuously produce fresh graph data that must be stored, while outdated graph data that no longer have value must be deleted.
Key-Value storage is adopted here, with the vertex ID as the Key and the vertex neighborhood as the Value. The data organization of a vertex neighborhood has the following features: 1) the edges adjacent to a vertex are stored in timestamp order in a Block doubly linked list, so that new edges can be inserted at the head of the list and old edges deleted from the tail; 2) to reduce the per-edge storage cost of the doubly linked list, the edges connected to a vertex are stored sequentially in fixed-size memory Blocks, and the Blocks storing the edges of the same vertex form the Block doubly linked list; Blocks are allocated and reclaimed by a dedicated Block pool manager responsible for storage resource allocation; 3) a vertex neighborhood consists of a data structure Vertex and the Block linked list, where Vertex is a user-defined C structure containing the vertex's attribute information and index information pointing to the head and tail of the Block doubly linked list.
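For illustration, the two structures described above could be declared in C roughly as follows; the field names, the Edge layout and the fixed per-Block capacity are assumptions made for this sketch and are not the literal layout used by the present invention.

```c
#include <stdint.h>

#define EDGES_PER_BLOCK 64            /* assumed fixed Block capacity */

/* One outgoing edge: destination vertex and the timestamp used for ordering. */
typedef struct Edge {
    uint64_t dst_id;                  /* ID of the adjacent vertex             */
    uint64_t timestamp;               /* creation time of the edge             */
} Edge;

/* Fixed-size memory Block; the Blocks of one vertex form a doubly linked list. */
typedef struct Block {
    struct Block *prev;               /* part 1: pointer to the previous Block */
    struct Block *next;               /* part 2: pointer to the next Block     */
    uint64_t ts_min, ts_max;          /* timestamp interval of the edges held  */
    uint32_t count;                   /* number of edges currently stored      */
    Edge edges[EDGES_PER_BLOCK];      /* part 3: the edges stored in the Block */
} Block;

/* Per-vertex structure: attribute info plus head/tail of the Block list. */
typedef struct Vertex {
    uint64_t id;                      /* vertex ID, used as the Key            */
    void    *attributes;              /* vertex attribute information          */
    Block   *head;                    /* index info: newest Block (list head)  */
    Block   *tail;                    /* index info: oldest Block (list tail)  */
} Vertex;
```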
When a new edge is inserted into a vertex neighborhood, the new edge is ordered by timestamp, and the Block at the head of the list is checked first: if it is not full, the new edge is added to that head Block; if the head Block is full and there are still new edges to insert, the Block pool manager allocates a new Block, adds it to the head of the list, and the remaining new edges fill the new Block. This process repeats until all new edges have been inserted. When aged edges are deleted from a vertex neighborhood, deletion starts from the Block at the tail of the Block doubly linked list; to delete old edges quickly, each Block has a dedicated data field holding the timestamp interval of the edges it stores. If the timestamp of the edge to be deleted is smaller than the lower bound of the interval, the deletion ends; if it is larger than the upper bound, the Block is removed from the tail of the list and reclaimed by the Block pool; if it lies within the interval, the edges in the Block are scanned, the edges older than that timestamp are deleted, and the operation ends.
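A minimal C sketch of these two procedures, building on the structures above, might look as follows; block_pool_alloc and block_pool_free stand in for the Block pool manager and are assumed names.

```c
/* Assumed Block pool interface (names are illustrative). */
Block *block_pool_alloc(void);
void   block_pool_free(Block *b);

/* Insert a new edge at the head of the vertex's Block doubly linked list. */
void insert_edge(Vertex *v, uint64_t dst_id, uint64_t ts)
{
    Block *head = v->head;
    if (head == NULL || head->count == EDGES_PER_BLOCK) {
        /* Head Block missing or full: take a fresh Block from the pool. */
        Block *b = block_pool_alloc();
        b->count = 0;
        b->ts_min = b->ts_max = ts;
        b->prev = NULL;
        b->next = head;
        if (head) head->prev = b;
        v->head = b;
        if (v->tail == NULL) v->tail = b;
        head = b;
    }
    head->edges[head->count].dst_id = dst_id;
    head->edges[head->count].timestamp = ts;
    head->count++;
    if (ts > head->ts_max) head->ts_max = ts;
    if (ts < head->ts_min) head->ts_min = ts;
}

/* Delete every edge older than 'threshold', starting from the list tail. */
void delete_aged_edges(Vertex *v, uint64_t threshold)
{
    while (v->tail != NULL) {
        Block *tail = v->tail;
        if (threshold < tail->ts_min) {
            return;                        /* nothing older: deletion ends     */
        } else if (threshold > tail->ts_max) {
            v->tail = tail->prev;          /* whole Block is aged: unlink it   */
            if (v->tail) v->tail->next = NULL; else v->head = NULL;
            block_pool_free(tail);         /* Block goes back to the pool      */
        } else {
            uint32_t kept = 0;             /* threshold falls inside interval: */
            for (uint32_t i = 0; i < tail->count; i++)   /* keep newer edges   */
                if (tail->edges[i].timestamp >= threshold)
                    tail->edges[kept++] = tail->edges[i];
            tail->count = kept;
            return;                        /* end of operation                 */
        }
    }
}
```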
Storing and organizing a vertex neighborhood as a Block doubly linked list of timestamp-ordered outgoing edges has several advantages. First, it avoids the frequent dynamic allocation of small pieces of memory and the excess metadata caused by storing each edge separately: graph data are updated frequently and each edge carries little data, so per-edge updates would cause frequent small allocations, the ratio of metadata to real data would grow, memory allocation efficiency and utilization would drop, and separately stored edges are also unfavorable for accessing a vertex neighborhood. Second, it avoids the problems of storing all outgoing edges of a vertex globally as a blob (a dynamically allocated contiguous memory region of variable length), which is unfavorable for dynamic graph updates: inserting new edges can exhaust the original storage space, so a sufficiently large new space must be opened up, the original data moved and the new edges inserted, while deleting old edges leaves a considerable amount of unusable internal fragmentation. Periodic compaction and data movement are then usually used to rearrange the graph data and eliminate fragments; although this global organization gives very high performance when accessing a vertex neighborhood, the compaction significantly affects the efficiency of random incremental graph updates.
In the present system, taking full account of the characteristics of graph data and of graph data access, a vertex neighborhood storage model is designed, as shown in Fig. 2. Each Block consists of three parts: the first part is a pointer to the previous Block, the second part is a pointer to the next Block, and the third part holds the edges stored in this Block. This model effectively supports graph data updates.
The graph data storage method presented here differs from the approaches used in industry in the following respects regarding data updates:
First, Redis and Neo4j store data using contiguous allocation; as data keep being updated and outgrow the allocated block, a larger contiguous memory block must be allocated, the original data moved into the new block, and the new data inserted, so frequent updates cause frequent data movement. Redis uses a pre-allocation mechanism, allocating more memory each time than is actually needed in order to absorb possible future growth, but this does not fundamentally solve the data movement problem. The present invention instead uses linked allocation: the Block pool manages the free memory Blocks, allocated Blocks are linked into a doubly linked list, and data updates take place at the ends of the list, so no data movement or copying is caused.
Second, Redis and Neo4j store the vertices and edges of the graph separately, whereas the method presented here directly stores vertex neighborhoods. To store graph data of the same scale, Redis and Neo4j must start many more network I/O operations: the number of I/O operations they start equals the number of vertices plus the number of edges, while the present invention starts one per vertex, and in social media graphs the number of edges is far larger than the number of vertices. Even when bulk transfer is used, Redis and Neo4j access data through SQL-like query languages, so the data to be written must be converted into batch query statements at the client and parsed and executed at the server, and this conversion and parsing hurt efficiency. The present invention uses the vertex and its neighborhood as the data organization unit; during remote access, the client buffers the data destined for the target machine, and when the buffer is full, or a timeout occurs, the batched data in the buffer are sent to the target machine.
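The client-side batching just described could be sketched in C as follows; send_to_target, the buffer size and the timeout are placeholders assumed for this sketch rather than the actual transport interface.

```c
#include <string.h>
#include <time.h>

#define WRITE_BUF_SIZE (4 * 1024 * 1024)   /* assumed 4 MB client buffer */
#define FLUSH_TIMEOUT_S 1                  /* assumed flush timeout      */

typedef struct WriteBuffer {
    char   data[WRITE_BUF_SIZE];
    size_t used;
    time_t last_flush;
} WriteBuffer;

/* Placeholder for the actual network send to the target machine. */
void send_to_target(const char *buf, size_t len);

/* Buffer one serialized neighborhood update; flush when full or timed out. */
void buffered_write(WriteBuffer *wb, const char *record, size_t len)
{
    if (wb->used + len > WRITE_BUF_SIZE ||
        time(NULL) - wb->last_flush >= FLUSH_TIMEOUT_S) {
        send_to_target(wb->data, wb->used);    /* ship the whole batch   */
        wb->used = 0;
        wb->last_flush = time(NULL);
    }
    memcpy(wb->data + wb->used, record, len);  /* append the new record  */
    wb->used += len;
}
```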
Finally, every data update in Neo4j is a transactional operation, and the data written during a transaction must be synchronized to the disk file; Redis uses the RDB mechanism, in which a background I/O thread writes the in-memory data of a time period to the disk file, and each such write blocks other updates. The present invention instead uses the mmap interface when updating: a mapping is established between a memory region and a disk file, the data are written to the memory region, and the operating system kernel is responsible for synchronizing the updates to the disk file.
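As an illustration of this mmap-based update path, a minimal sketch is given below; the file path, region size and function name are hypothetical, and only standard POSIX calls are used.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a pre-sized store file into memory; writes to the returned region are
 * later synchronized to the disk file by the operating system kernel. */
void *map_store_file(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return NULL; }
    void *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    return region == MAP_FAILED ? NULL : region;
}
```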
Taken together, these differences give the present invention several design advantages during data updates, allowing it to reach higher update performance.
Memory management strategy adapting to data scale
Graph computing involves highly random data access, and organizing data in memory improves random access efficiency. However, graph data can be massive, and the limited memory of a cluster cannot hold a large-scale graph. A memory management scheme that adapts to the data scale is therefore proposed: memory is used preferentially to organize the graph data, and when memory resources are exhausted, data are allowed to spill to disk. Concretely, the storage resources usable by the graph data management system are divided into two classes: coarse-grained storage resources and fine-grained storage resources.
A coarse-grained resource corresponds to a large pre-allocated file in the local Linux file system; its size is 2 GB, 4 GB or even larger, provided it does not exceed the available memory of a single cluster node. At run time the coarse-grained resources are fully loaded and reside in memory. A fine-grained resource is a small file of the local Linux file system allocated on demand; its size is 4 MB or 8 MB, and the number of small files keeps growing as the graph grows. At run time the coarse-grained resources and all the fine-grained resources cannot all be placed in memory, so a bounded buffer window is opened in memory and fine-grained resources are loaded into the window as needed. When the window is full and a new fine-grained resource must be loaded, an existing fine-grained resource in the window is chosen and evicted according to a cache replacement algorithm, and the new resource is then loaded.
Both classes of resources are divided into fixed-size Blocks, every Block has a unique Block ID, and Blocks can be accessed randomly by Block ID. The Block is the basic allocation unit of storage resources, and the Block pool manager is responsible for allocating and reclaiming Blocks. When the Block at the head of a vertex neighborhood's doubly linked list is full, a new Block must be allocated as the new list head for newly inserted edges; when old edges of a vertex neighborhood are deleted, the Blocks that become empty must be released. Every Block is either free or belongs to a specified vertex neighborhood.
When allocating resources, the storage resource can be selected dynamically according to the scale of the data. Allocation starts from the coarse-grained resources, so that data are organized in memory as far as possible. When the coarse-grained resources are exhausted, fine-grained resources start to be allocated dynamically on demand to hold the data spilled from memory. Because the edges in a vertex neighborhood are stored in timestamp order, the graph data can be scanned periodically, and expired graph data can be deleted from the coarse-grained resources and dumped into the fine-grained resources, so that the data kept in memory stay fresh.
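A sketch of this two-tier allocation order is shown below; the pool representation and the helper load_fine_grained_file are illustrative assumptions.

```c
#include <stddef.h>

/* Free lists for the two resource classes (illustrative representation). */
typedef struct BlockPool {
    Block **coarse_free; size_t coarse_count;  /* Blocks in resident large files  */
    Block **fine_free;   size_t fine_count;    /* Blocks in on-demand small files */
} BlockPool;

/* Assumed helper: load another fine-grained small file into the bounded
 * buffer window (evicting one by the cache replacement policy if needed)
 * and add its Blocks to the fine-grained free list; returns 0 on success. */
int load_fine_grained_file(BlockPool *pool);

/* Allocate preferentially from coarse-grained (in-memory) resources,
 * falling back to fine-grained resources once they are exhausted. */
Block *adaptive_alloc(BlockPool *pool)
{
    if (pool->coarse_count > 0)
        return pool->coarse_free[--pool->coarse_count];
    if (pool->fine_count == 0 && load_fine_grained_file(pool) != 0)
        return NULL;                           /* no resource could be loaded */
    return pool->fine_count > 0 ? pool->fine_free[--pool->fine_count] : NULL;
}
```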
With this memory allocation strategy adapting to data scale, memory is used to organize data as far as possible, which improves random data access, while the graph is allowed to spill from memory to disk as it grows, which solves the problem of managing massive graph data.
Application-aware remote prefetch strategy
In the present invention, to speed up remote data access, an application-aware remote prefetch mechanism is proposed. With this mechanism, the throughput of remote data access can be improved significantly.
Like access to sequentially stored data, graph data access exhibits obvious locality. Graph computing algorithms are generally based on graph traversal. Taking breadth-first traversal as an example, when vertex v is accessed, the probability that a vertex of the k-order neighborhood of v (the set of all vertices reachable from v within a shortest distance of k edges) will be accessed decreases as k increases. Therefore, when processing the access request for v, the system can predict, from vertex position information, the vertices most likely to be accessed in the near future, combine the data of vertex v with the data of those vertices, and transfer them to the requester. The requester caches the prefetched data locally; when it next needs remote data, it first checks the local cache and accesses it directly if the data are present, and only otherwise requests the data remotely.
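The requester-side query path could be sketched as follows; local_cache_lookup, local_cache_insert and remote_fetch_k_neighborhood are hypothetical names for the local cache and the remote request, introduced only for this sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical requester-side cache and remote-request interfaces. */
Vertex *local_cache_lookup(uint64_t vertex_id);
void    local_cache_insert(Vertex *v);

/* Remote call: returns vertex v plus the vertices of its k-order neighborhood
 * (every vertex reachable from v within shortest distance k); the number of
 * returned vertices is written to *n. */
Vertex **remote_fetch_k_neighborhood(uint64_t vertex_id, int k, size_t *n);

/* Query a vertex: check the local cache first, otherwise ask the data
 * storage manager, which also prefetches the k-order neighborhood. */
Vertex *query_vertex(uint64_t vertex_id, int k)
{
    Vertex *v = local_cache_lookup(vertex_id);
    if (v != NULL)
        return v;                          /* served from the local cache     */

    size_t n = 0;
    Vertex **batch = remote_fetch_k_neighborhood(vertex_id, k, &n);
    for (size_t i = 0; i < n; i++)
        local_cache_insert(batch[i]);      /* cache v and its prefetched
                                              neighborhood for later queries  */
    return local_cache_lookup(vertex_id);
}
```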
Many graph computing frameworks read all vertices of a subgraph of the graph sequentially; in that case sequential prefetching can be used. The prefetching scheme must match the data access characteristics of the graph computation; several prefetch strategies can be provided, and the user can choose the best one among them.
Example 1: data access
This experiment compares the local write performance of the present invention with that of Redis and Neo4j under different levels of concurrency; the reported write performance is the average over ten data sets, and in Fig. 3 the present invention is labeled NYNN.
As shown in Fig. 3, the local write performance of the present invention is clearly better than that of Redis and Neo4j; the design of the metadata, the data transfer format and the data write mechanism all affect write performance. The present invention partitions the graph data into wide, contiguous vertex ranges, so the first data write request can cache the metadata locally, and subsequent requests use the locally cached metadata for data addressing, which reduces the I/O overhead of addressing. When data are written, they are written into a local memory-mapped file. Redis and Neo4j both provide convenient query commands/languages: the user's write request is first packaged into a query command and sent to the server over the local network, and the server parses the command and completes the query.
This experiment compares the local read performance of the present invention with that of Redis and Neo4j under different levels of concurrency; the reported read performance is the average over ten data sets.
As shown in Fig. 4, for local reads the present invention reaches GB/s throughput while Redis and Neo4j reach 10 MB/s to 100 MB/s; the present invention can map the graph data directly into the address space of the client process through the mmap interface and operate on them there, whereas Redis and Neo4j request data from the server using query commands.
To reflect the storage efficiency of the graph data storage systems, the present invention compares the storage expansion ratio of several systems, i.e., the memory and disk space occupied when storing the same amount of data.
As shown in Fig. 5, comparing the disk usage of Redis, Neo4j and the present invention, it can clearly be observed that Neo4j, as a disk-based storage system, occupies the most disk space: when storing data it must store both attribute values and attribute names, and attribute values are stored as strings. Redis requires the least disk space because it compresses the data before persisting them. In memory usage, Redis uses the most memory because it loads all data into memory; Neo4j and the present invention can configure the amount of memory to use.
This experiment compares the remote write performance of the present invention with that of Redis and Neo4j under different levels of concurrency; the reported write performance is the average over ten data sets.
As shown in Fig. 6, for remote data writes the performance of the present invention is clearly higher than that of Neo4j and Redis. The data exchange of the present invention transmits byte streams in message form, whereas Redis and Neo4j use query commands or languages, which reduces data transfer efficiency. In addition, the back-end data writes of the present invention are processed by multiple threads to improve efficiency, whereas Redis uses single-threaded processing based on an event model.
This experiment compares the remote read performance of the present invention with that of Redis and Neo4j under different levels of concurrency; the reported read performance is the average over ten data sets.
As shown in Fig. 7, for remote reads the present invention uses a data prefetch mechanism and can transfer valid data in bulk, whereas Redis and Neo4j do not use prefetching.
Example 2: data update
This experiment compares the data update performance of the present invention with that of Redis and Neo4j; the reported performance is the average over ten data sets.
As shown in Fig. 8, for incremental data updates the present invention uses linked allocation, which eliminates the frequent data movement caused by insertions and deletions; Redis and Neo4j use sequential allocation, so as data are updated, data must be moved and relocated, and during this process no new queries can be handled.
Example 3: remote data access
This experiment tests the performance difference of the present invention with and without prefetching. The present invention provides several data prefetch mechanisms so that upper-layer applications can select the best prefetching mode for their own application background. This experiment uses BFS prefetching, and the algorithm used in the test counts the total number of edges in the graph data set. The idea of the algorithm is as follows: the vertices of the graph are partitioned into n shards, each shard is handled by one thread, and all threads access the graph data in the present invention in parallel. A thread puts the vertices to be accessed into a queue, then processes the vertex at the head of the queue, accesses its neighborhood and adds the number of its edges to the count; if the other endpoint of an edge lies in the shard belonging to the thread, that endpoint is put into the queue. The head of the queue is processed repeatedly until the queue is empty. When all threads have completed their vertex statistics, the results are summed and the algorithm ends. This is a BFS traversal algorithm, which makes good use of the BFS prefetch mechanism: the prefetched data have a very high hit rate in subsequent accesses.
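For reference, one thread of this edge-counting test could be sketched roughly as follows; the queue, visited-set and shard helpers are assumed names, and graph access goes through the hypothetical query_vertex shown earlier.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed per-thread helpers: a FIFO queue of vertex IDs, a visited set,
 * and the shard ownership function. */
typedef struct Queue Queue;
void     queue_push(Queue *q, uint64_t id);
uint64_t queue_pop(Queue *q);
bool     queue_empty(Queue *q);
bool     mark_visited(uint64_t id);         /* returns false if already visited */
int      owner_shard(uint64_t id);          /* shard that owns a vertex ID      */

/* One thread: BFS over its own shard, counting the edges it sees. */
uint64_t count_edges_in_shard(Queue *q, int my_shard, int k)
{
    uint64_t edges = 0;
    while (!queue_empty(q)) {
        uint64_t id = queue_pop(q);
        if (!mark_visited(id))
            continue;
        Vertex *v = query_vertex(id, k);    /* served by the BFS prefetch cache */
        for (Block *b = v->head; b != NULL; b = b->next) {
            edges += b->count;              /* count this vertex's edges        */
            for (uint32_t i = 0; i < b->count; i++) {
                uint64_t dst = b->edges[i].dst_id;
                if (owner_shard(dst) == my_shard)
                    queue_push(q, dst);     /* follow edges inside my shard     */
            }
        }
    }
    return edges;                           /* per-thread total, summed later   */
}
```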
As shown in Fig. 9, the present invention provides several data prefetch mechanisms, such as sequential prefetching, BFS prefetching and DFS prefetching. The upper-layer computing framework can choose the best prefetch mode to fetch graph data from the remote site in batches. During graph data access each read touches little data, but reads that are close in time are predictable; by accumulating many small reads into bulk transfers, the bandwidth can be fully used and the I/O start-up count reduced, so higher data throughput is obtained with lower concurrency.

Claims (10)

1. A graph data storage method for large-scale social networks, comprising the steps of:
1) a data storage manager stores the received graph data in Key-Value form, with the vertex ID of the graph data as the Key and the vertex neighborhood as the Value;
2) the data of each vertex neighborhood are stored as follows: the edges connected to the vertex are stored in timestamp order in fixed-size memory Blocks, the occupied Blocks are linked into a Block doubly linked list, and the vertex's attribute information, together with index information pointing to the head and tail of the Block doubly linked list, is stored in a data structure Vertex.
2. The method of claim 1, wherein the memory Blocks are allocated and reclaimed by a Block pool manager responsible for storage resource allocation; each Block in the Block doubly linked list comprises three parts: the first part is a pointer to the previous Block in the list, the second part is a pointer to the next Block in the list, and the third part holds the edges stored in the current Block.
3. The method of claim 2, wherein in step 2), when a vertex has a new edge to be stored into its neighborhood, the new edge is ordered by timestamp and the Block at the head of the Block doubly linked list is checked: if it is not full, the new edge is added to that head Block; if the head Block is full, the Block pool manager allocates a new memory Block and adds it to the head of the Block doubly linked list, and the new edge is stored in the new memory Block.
4. The method of claim 1, 2 or 3, wherein each memory Block is provided with a data field holding the timestamp interval of the edges it stores; when edges are deleted from a vertex neighborhood, an edge is chosen as the deletion threshold and processing starts from the Block at the tail of the Block doubly linked list: if the timestamp of that edge is smaller than the lower bound of the tail Block's timestamp interval, the deletion ends; if it is larger than the upper bound of the interval, the Block is removed from the tail of the Block doubly linked list; if it lies within the interval, the edges in the Block are scanned and the edges earlier than that timestamp are deleted.
5. The method of claim 1, wherein the data storage manager divides storage resources into two classes, coarse-grained storage resources and fine-grained storage resources, and a bounded buffer window is set up in the memory of the data storage manager; the coarse-grained storage resources are large files of a configured size and reside in memory, while the fine-grained storage resources are small files allocated on demand; the data storage manager loads fine-grained storage resources into the bounded buffer window as needed; when the window is full and a new fine-grained storage resource must be loaded, an existing fine-grained storage resource in the window is evicted according to a cache replacement algorithm and the new fine-grained storage resource is then loaded.
6. The method of claim 5, wherein both the coarse-grained storage resources and the fine-grained storage resources are divided into the fixed-size memory Blocks, and each memory Block has a unique Block ID.
7. The method of claim 5, wherein the data storage manager first allocates memory Blocks from the coarse-grained storage resources, and when the coarse-grained storage resources are exhausted, it begins to allocate fine-grained storage resources dynamically on demand.
8. The method of claim 5, wherein the data storage manager periodically scans the graph data, and expired graph data are deleted from the coarse-grained storage resources and dumped into the fine-grained storage resources.
9. A graph data query method for graph data stored by the method of claim 1, wherein, when the data storage manager receives an access request for a vertex v, the data storage manager transfers vertex v and its k-order neighborhood to the requester; the requester caches the returned data locally, checks the local cache first on the next query, and sends the access request to the data storage manager if the queried vertex is not found there.
10. The method of claim 9, wherein the k-order neighborhood of vertex v is the set of all vertices reachable from vertex v within a shortest distance of k edges.
CN201510229346.3A 2015-05-07 2015-05-07 Graph data storage and query method for large-scale social networks Active CN104899156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510229346.3A CN104899156B (en) 2015-05-07 2015-05-07 Graph data storage and query method for large-scale social networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510229346.3A CN104899156B (en) 2015-05-07 2015-05-07 Graph data storage and query method for large-scale social networks

Publications (2)

Publication Number Publication Date
CN104899156A true CN104899156A (en) 2015-09-09
CN104899156B CN104899156B (en) 2017-11-14

Family

ID=54031830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510229346.3A Active CN104899156B (en) 2015-05-07 2015-05-07 Graph data storage and query method for large-scale social networks

Country Status (1)

Country Link
CN (1) CN104899156B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049280A (en) * 2011-10-14 2013-04-17 浪潮乐金数字移动通信有限公司 Method for achieving key macro definition function and mobile terminal
CN104281816A (en) * 2014-10-14 2015-01-14 厦门智芯同创网络科技有限公司 Rainbow table parallel system design method and device based on MapReduce
CN104462328A (en) * 2014-12-02 2015-03-25 深圳中科讯联科技有限公司 Blended data management method and device based on Hash tables and dual-circulation linked list

Cited By (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653609A (en) * 2015-12-24 2016-06-08 中国建设银行股份有限公司 Memory-based data processing method and device
CN105653609B (en) * 2015-12-24 2019-08-09 中国建设银行股份有限公司 Data processing method memory-based and device
CN106919628A (en) * 2015-12-28 2017-07-04 阿里巴巴集团控股有限公司 A kind for the treatment of method and apparatus of diagram data
WO2017114164A1 (en) * 2015-12-28 2017-07-06 阿里巴巴集团控股有限公司 Graph data processing method and apparatus
CN105868025B (en) * 2016-03-30 2019-05-10 华中科技大学 A kind of system solving memory source keen competition in big data processing system
CN105868025A (en) * 2016-03-30 2016-08-17 华中科技大学 System for settling fierce competition of memory resources in big data processing system
WO2018099299A1 (en) * 2016-11-30 2018-06-07 华为技术有限公司 Graphic data processing method, device and system
US11256749B2 (en) 2016-11-30 2022-02-22 Huawei Technologies Co., Ltd. Graph data processing method and apparatus, and system
CN110168533B (en) * 2016-12-15 2023-08-08 微软技术许可有限责任公司 Caching of sub-graphs and integrating cached sub-graphs into graph query results
CN110168533A (en) * 2016-12-15 2019-08-23 微软技术许可有限责任公司 Caching to subgraph and the subgraph of caching is integrated into figure query result
US11748506B2 (en) 2017-02-27 2023-09-05 Microsoft Technology Licensing, Llc Access controlled graph query spanning
CN107220188A (en) * 2017-05-31 2017-09-29 莫倩 A kind of automatic adaptation cushion block replacement method
CN107220188B (en) * 2017-05-31 2020-10-27 中山大学 Self-adaptive buffer block replacement method
CN112236760A (en) * 2018-07-27 2021-01-15 浙江天猫技术有限公司 Method, system, computer readable storage medium and equipment for updating graph data
CN112236760B (en) * 2018-07-27 2024-06-07 浙江天猫技术有限公司 Graph data updating method, system, computer readable storage medium and equipment
CN109255055A (en) * 2018-08-06 2019-01-22 四川蜀天梦图数据科技有限公司 A kind of diagram data access method and device based on packet associated table
CN109255055B (en) * 2018-08-06 2020-10-30 四川蜀天梦图数据科技有限公司 Graph data access method and device based on grouping association table
WO2020073641A1 (en) * 2018-10-11 2020-04-16 国防科技大学 Data structure-oriented data prefetching method and device for graphics processing unit
US11520589B2 (en) 2018-10-11 2022-12-06 National University Of Defense Technology Data structure-aware prefetching method and device on graphics processing unit
CN109491988B (en) * 2018-11-05 2021-12-14 北京中安智达科技有限公司 Data real-time association method supporting full-scale updating
CN109491988A (en) * 2018-11-05 2019-03-19 北京中安智达科技有限公司 A kind of data real time correlation method for supporting full dose to update
CN109582808A (en) * 2018-11-22 2019-04-05 北京锐安科技有限公司 A kind of user information querying method, device, terminal device and storage medium
CN111274165A (en) * 2018-12-05 2020-06-12 核桃运算股份有限公司 Data management device, method and computer storage medium thereof
CN110377622A (en) * 2019-06-19 2019-10-25 深圳新度博望科技有限公司 Data capture method, data retrieval method and request responding method
CN111090653B (en) * 2019-12-20 2023-12-15 东软集团股份有限公司 Data caching method and device and related products
CN111090653A (en) * 2019-12-20 2020-05-01 东软集团股份有限公司 Data caching method and device and related products
CN111460197A (en) * 2020-01-21 2020-07-28 中国国土勘测规划院 Method for identifying vector elements of homeland plane intersection
CN111460197B (en) * 2020-01-21 2023-07-25 中国国土勘测规划院 Method for identifying vector elements of homeland planar intersection
CN111523000A (en) * 2020-04-23 2020-08-11 北京百度网讯科技有限公司 Method, device, equipment and storage medium for importing data
WO2021139230A1 (en) * 2020-07-28 2021-07-15 平安科技(深圳)有限公司 Method and apparatus for accelerated data access based on graph database
CN111858612A (en) * 2020-07-28 2020-10-30 平安科技(深圳)有限公司 Data accelerated access method and device based on graph database and storage medium
CN111858612B (en) * 2020-07-28 2023-04-18 平安科技(深圳)有限公司 Data accelerated access method and device based on graph database and storage medium
CN113127660A (en) * 2021-05-24 2021-07-16 成都四方伟业软件股份有限公司 Timing graph database storage method and device
CN113449152A (en) * 2021-06-24 2021-09-28 西安交通大学 Image data prefetcher and prefetching method
CN113449152B (en) * 2021-06-24 2023-01-10 西安交通大学 Image data prefetcher and prefetching method
CN113672590B (en) * 2021-07-22 2024-06-07 浙江大华技术股份有限公司 Data cleaning method, graph database device and computer readable storage medium
CN113672590A (en) * 2021-07-22 2021-11-19 浙江大华技术股份有限公司 Data cleaning method, graph database device and computer readable storage medium
CN113672610B (en) * 2021-10-21 2022-02-15 支付宝(杭州)信息技术有限公司 Graph database processing method and device
CN113672610A (en) * 2021-10-21 2021-11-19 支付宝(杭州)信息技术有限公司 Graph database processing method and device
WO2023066221A1 (en) * 2021-10-21 2023-04-27 支付宝(杭州)信息技术有限公司 Graph database processing
CN114138776A (en) * 2021-11-01 2022-03-04 杭州欧若数网科技有限公司 Method, system, apparatus and medium for graph structure and graph attribute separation design
CN113722520A (en) * 2021-11-02 2021-11-30 支付宝(杭州)信息技术有限公司 Graph data query method and device
CN113722520B (en) * 2021-11-02 2022-05-03 支付宝(杭州)信息技术有限公司 Graph data query method and device
CN113760971B (en) * 2021-11-09 2022-02-22 通联数据股份公司 Method, computing device and storage medium for retrieving data of a graph database
CN113760971A (en) * 2021-11-09 2021-12-07 通联数据股份公司 Method, computing device and storage medium for retrieving data of a graph database
CN113779286A (en) * 2021-11-11 2021-12-10 支付宝(杭州)信息技术有限公司 Method and device for managing graph data
WO2023083234A1 (en) * 2021-11-11 2023-05-19 支付宝(杭州)信息技术有限公司 Graph state data management
CN113779286B (en) * 2021-11-11 2022-02-08 支付宝(杭州)信息技术有限公司 Method and device for managing graph data
CN115188421A (en) * 2022-09-08 2022-10-14 杭州联川生物技术股份有限公司 Gene clustering data preprocessing method, device and medium based on high-throughput sequencing

Also Published As

Publication number Publication date
CN104899156B (en) 2017-11-14

Similar Documents

Publication Publication Date Title
CN104899156A (en) Large-scale social network service-oriented graph data storage and query method
US9582421B1 (en) Distributed multi-level caching for storage appliances
US9959279B2 (en) Multi-tier caching
US11392544B2 (en) System and method for leveraging key-value storage to efficiently store data and metadata in a distributed file system
US10289315B2 (en) Managing I/O operations of large data objects in a cache memory device by dividing into chunks
CN101556557B (en) Object file organization method based on object storage device
CN102158546B (en) Cluster file system and file service method thereof
CN107463447B (en) B + tree management method based on remote direct nonvolatile memory access
US8793466B2 (en) Efficient data object storage and retrieval
US10997153B2 (en) Transaction encoding and transaction persistence according to type of persistent storage
US11314689B2 (en) Method, apparatus, and computer program product for indexing a file
CN110663019A (en) File system for Shingled Magnetic Recording (SMR)
CN112000287B (en) IO request processing device, method, equipment and readable storage medium
CN113377868B (en) Offline storage system based on distributed KV database
CN100424699C (en) Attribute extensible object file system
US9307024B2 (en) Efficient storage of small random changes to data on disk
CN109376125A (en) A kind of metadata storing method, device, equipment and computer readable storage medium
CN107181773B (en) Data storage and data management method and device of distributed storage system
CN110968266B (en) Storage management method and system based on heat degree
CN111177019B (en) Memory allocation management method, device, equipment and storage medium
CN112148736A (en) Method, device and storage medium for caching data
CN111694765A (en) Mobile application feature-oriented multi-granularity space management method for nonvolatile memory file system
CN114490443A (en) Shared memory-based golang process internal caching method
CN106294189B (en) Memory defragmentation method and device
CN108664217A (en) A kind of caching method and system reducing the shake of solid-state disc storaging system write performance

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant