CN104899156A - Large-scale social network service-oriented graph data storage and query method - Google Patents

Large-scale social network service-oriented graph data storage and query method

Info

Publication number
CN104899156A
CN104899156A (application CN201510229346.3A; granted as CN104899156B)
Authority
CN
China
Prior art keywords
block
data
edge
vertex
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510229346.3A
Other languages
Chinese (zh)
Other versions
CN104899156B (en)
Inventor
周薇
包秀国
马宏远
程工
冉攀峰
刘春阳
王卿
韩冀中
庞琳
李雄
贺敏
刘玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, National Computer Network and Information Security Management Center filed Critical Institute of Information Engineering of CAS
Priority to CN201510229346.3A priority Critical patent/CN104899156B/en
Publication of CN104899156A publication Critical patent/CN104899156A/en
Application granted granted Critical
Publication of CN104899156B publication Critical patent/CN104899156B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a graph data storage and query method for large-scale social networks. A data storage manager stores the received graph data in Key-Value form, with the vertex ID of the graph data as the Key and the vertex neighborhood as the Value, and stores the data of each vertex neighborhood as follows: the edges connected to the vertex are stored in timestamp order in fixed-size memory blocks, the blocks are linked into a doubly linked list, and the vertex's attribute information and index information are stored in a separate data structure. When the data storage manager receives a request to access a vertex v, it transfers vertex v and its k-order neighborhood to the requester; the requester caches the returned data locally, checks the local cache first on the next query, and sends an access request to the data storage manager only if the queried vertex is not found. The graph data storage and query method supports dynamically updated scenarios and is well suited to sparse data and random access.

Description

Graph data storage and query method for large-scale social networks
Technical field
The present invention relates to a graph data storage and query method for large-scale social networks, and belongs to the field of software technology.
Background technology
At present, the mainstream way of storing graph data is to preprocess the graph into records of edges and vertices and store them, as sequential data sets, in large files of a distributed file system. When the graph data are accessed, the large files storing the graph data are read by sequential scanning. For graph computing applications that iterate over many rounds, this organization cannot provide effective storage and access performance, so in-memory management of graph data has become an important trend for improving access performance, as in Trinity, Giraph and similar systems.
Neo4j is a graph database that adopts a Key-Value storage model whose most basic storage units are vertices and edges. For operations that fetch the neighborhood of a vertex, such as breadth-first traversal (BFS), its performance is clearly lower than that of systems whose basic storage unit is the vertex neighborhood. All Neo4j data are stored on disk, and a memory cache speeds up data access; the cache size can be tuned to obtain the best performance.
Trinity is a graph computing engine based on a memory cloud, designed by Microsoft Research Asia. It adopts memory-based Key-Value storage, where the key is the unique ID of a graph vertex and is used to locate the addressed cell, and the cell is a contiguous memory block of variable length that holds the adjacent-vertex information of the vertex's neighborhood. Each vertex corresponds to one cell, the cells of the graph are stored in trunks, and a trunk is a contiguous memory region no larger than 2 GB.
Redis also adopts Key-Value storage, but Redis can only handle graph data that fit in the cluster's memory.
Neo4j adopts a Key-Value storage model, but when the graph data are significantly larger than the memory cache, its performance drops markedly; when accessing remote data, Neo4j requests data on demand and cannot make full use of the bandwidth.
For Trinity, when data are updated frequently, insertions and deletions cause cells to grow or shrink. When a cell grows beyond its existing storage region, a larger region must be allocated and the cell moved from its original location to the new one; when a cell shrinks, the freed part of the original region is left vacant, producing many memory fragments. Cell growth and shrinkage therefore create a large amount of memory fragmentation and reduce memory utilization. Trinity uses memory compaction and pre-allocation of larger memory regions to combat external fragmentation, but with either method the movement of data within memory is unavoidable, and this operation is very time-consuming; under frequent graph updates, Trinity's performance is therefore unsatisfactory. In Trinity, when the whole graph is too big to fit in memory, some trunks are placed on disk. When a requested cell is not in memory, the whole target trunk must be swapped in; because graph data access is random, the locality of access is poor, and the cells of several consecutive requests may not lie in the trunk that was just swapped in, so further trunks must be swapped in. For remote data access, Trinity likewise requests data on demand.
Redis also adopts Key-Value storage; when the volume of graph data grows beyond the available memory, the only remedy is to add machine nodes. When the cluster grows, the data must be repartitioned and migrated. Data partitioning in Redis is left to the client, and the server program cannot proxy requests for remote graph data.
From the above work, the data access demands of graph computing and the characteristics of graph data pose challenges to graph data storage: traditional storage methods such as (distributed) file systems and Key-Value graph databases have difficulty supporting efficient graph data access and update, and therefore cannot effectively support graph computing. An efficient graph data storage system must be designed according to the access characteristics of graph data and the nature of the data themselves.
Summary of the invention
The object of the present invention is to provide a graph data storage and query method for large-scale social networks, addressing the massive scale, frequent updates, random access and sparsity of social network data.
The present invention supports random access to neighborhoods. A three-level graph data storage structure based on neighborhoods is designed: the basic storage unit is the vertex neighborhood, which is a doubly linked list of fixed-size memory Blocks containing all edges incident to the vertex, and each Block can be addressed directly.
The present invention adopts a zero-movement data update mechanism. A memory organization suited to frequent graph updates is designed: when the graph is updated, a Block pool is responsible for allocating and reclaiming Blocks. When allocation or reclamation makes a vertex neighborhood grow or shrink, only the head or tail of its doubly linked list is modified; there is no need to compact the storage space to eliminate external fragmentation, nor to allocate a larger space and move the data from the original location to the newly allocated one. This mechanism both increases memory utilization and avoids the compaction and movement operations faced by other solutions.
The present invention adopts an application-aware remote prefetch strategy. A single graph data request touches only a small amount of data, which leads to frequent remote I/O operations. With this strategy, while a remote data request is being processed, the data expected by subsequent computation are predicted, and the predicted data are returned to the client process together with the currently requested data and cached locally. This mechanism significantly reduces the number of I/O operations and makes data access predictable and effective. The present invention analyses common graph algorithms and, according to which vertices of the graph must be referenced in each update round, divides them into two classes: neighborhood algorithms and non-neighborhood algorithms. In a neighborhood algorithm, each round of iteration only needs the values of adjacent vertices to update a vertex's attribute value; in a non-neighborhood algorithm, updating a vertex's attribute value also needs the values of vertices outside its neighborhood.
The technical solution of the present invention is as follows:
A graph data storage method for large-scale social networks, comprising the steps of:
1) a data storage manager stores the received graph data in Key-Value form, with the vertex ID of the graph data as the Key and the vertex neighborhood as the Value;
2) the data of each vertex neighborhood are stored as follows: the edges connected to the vertex are stored in timestamp order in fixed-size memory Blocks, the occupied Blocks are linked into a Block doubly linked list, and the vertex's attribute information, together with index information pointing to the head and tail of the Block doubly linked list, is stored in a data structure Vertex.
Further, the memory Blocks are allocated and reclaimed by a Block pool manager responsible for storage resource allocation; each Block in the Block doubly linked list comprises three parts: the first part is a pointer to the previous Block in the list, the second part is a pointer to the next Block in the list, and the third part holds the edges stored in the current Block.
Further, in step 2), when a vertex has a new edge to be stored into its neighborhood, the new edge is ordered by timestamp and the Block at the head of the Block doubly linked list is checked: if it is not full, the new edge is added to that head Block; if the head Block is full, the Block pool manager allocates a new Block, adds it to the head of the Block doubly linked list, and the new edge is stored in the new Block.
Further, each memory Block is provided with a data field holding the timestamp interval of the edges it stores. When edges are deleted from a vertex neighborhood, an edge is chosen as the deletion threshold and processing starts from the Block at the tail of the Block doubly linked list: if the timestamp of that edge is smaller than the lower bound of the tail Block's timestamp interval, the deletion ends; if it is larger than the upper bound of the interval, the Block is removed from the tail of the Block doubly linked list; if it lies within the interval, the edges in the Block are scanned and the edges earlier than that timestamp are deleted.
Further, the data storage manager divides storage resources into two classes, coarse-grained storage resources and fine-grained storage resources, and a bounded buffer window is set up in the memory of the data storage manager. The coarse-grained storage resources are large files of a configured size and reside in memory, while the fine-grained storage resources are small files allocated on demand. The data storage manager loads fine-grained storage resources into the bounded buffer window as needed; when the window is full and a new fine-grained storage resource must be loaded, an existing fine-grained storage resource in the window is evicted according to a cache replacement algorithm and the new one is then loaded.
Further, both the coarse-grained storage resources and the fine-grained storage resources are divided into the fixed-size memory Blocks, and each memory Block has a unique Block ID.
Further, the data storage manager first allocates memory Blocks from the coarse-grained storage resources; when the coarse-grained storage resources are exhausted, it begins to allocate fine-grained storage resources dynamically on demand.
Further, the data storage manager periodically scans the graph data, deleting expired graph data from the coarse-grained storage resources and dumping them into the fine-grained storage resources.
A graph data query method: when the data storage manager receives an access request for a vertex v, the data storage manager transfers vertex v and its k-order neighborhood to the requester; the requester caches the returned data locally, checks the local cache first on the next query, and sends the access request to the data storage manager only if the queried vertex is not found there.
Further, the k-order neighborhood of vertex v is the set of all vertices reachable from vertex v within a shortest distance of k edges.
Compared with the prior art, the positive effects of the present invention are:
(1) Dynamic updates are supported. Social network users produce data frequently; in practical applications, data analysis imposes high timeliness requirements, and the new data generated by user behavior must be added continuously.
(2) Sparse data are handled well. In social networks the number of friends varies greatly between users: a small number of users (for example celebrity accounts) have very many friends, while most ordinary users have few. Statistically, social network data form an extremely sparse graph with a very small average out-degree.
(3) Random access is handled well. Unlike access to a sequential data set, graph data access generally starts from some vertex, then visits its adjacent vertices, then the adjacent vertices of those vertices, and so on, proceeding outward from the starting vertex until the whole graph has been traversed.
Description of the drawings
Fig. 1 shows data storage management in the method of the present invention;
Fig. 2 shows the storage model of a vertex neighborhood;
Fig. 3 compares the experimental results for local writes;
Fig. 4 compares the experimental results for local reads;
Fig. 5 compares memory and disk storage consumption;
Fig. 6 compares the experimental results for remote writes;
Fig. 7 compares the experimental results for remote reads;
Fig. 8 compares the experimental results for data updates;
Fig. 9 compares the performance with and without remote prefetching.
Embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings; the examples serve only to explain the present invention and are not intended to limit its scope.
The present invention can be divided into three parts: a zero-movement data organization strategy, a memory management strategy that adapts to data scale, and an application-aware remote prefetch strategy.
Zero-movement data organization strategy
Incremental graph update means inserting newly produced edges into the graph and deleting outdated old edges from it. The activity traces of social media users continuously produce fresh graph data that must be stored, while outdated graph data that no longer have value must be deleted.
Key-Value storage is adopted here, with the vertex ID as the Key and the vertex neighborhood as the Value. The data organization of a vertex neighborhood has the following features: 1) the edges adjacent to a vertex are stored in timestamp order in a Block doubly linked list, so that new edges can be inserted at the head of the list and old edges deleted from the tail; 2) to reduce the per-edge storage cost of the doubly linked list, the edges connected to a vertex are stored sequentially in fixed-size memory Blocks, and the Blocks storing the edges of the same vertex form the Block doubly linked list; Blocks are allocated and reclaimed by a dedicated Block pool manager responsible for storage resource allocation; 3) a vertex neighborhood consists of a data structure Vertex and the Block linked list, where Vertex is a user-defined C structure containing the vertex's attribute information and index information pointing to the head and tail of the Block doubly linked list.
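For illustration, the two structures described above could be declared in C roughly as follows; the field names, the Edge layout and the fixed per-Block capacity are assumptions made for this sketch and are not the literal layout used by the present invention.

```c
#include <stdint.h>

#define EDGES_PER_BLOCK 64            /* assumed fixed Block capacity */

/* One outgoing edge: destination vertex and the timestamp used for ordering. */
typedef struct Edge {
    uint64_t dst_id;                  /* ID of the adjacent vertex             */
    uint64_t timestamp;               /* creation time of the edge             */
} Edge;

/* Fixed-size memory Block; the Blocks of one vertex form a doubly linked list. */
typedef struct Block {
    struct Block *prev;               /* part 1: pointer to the previous Block */
    struct Block *next;               /* part 2: pointer to the next Block     */
    uint64_t ts_min, ts_max;          /* timestamp interval of the edges held  */
    uint32_t count;                   /* number of edges currently stored      */
    Edge edges[EDGES_PER_BLOCK];      /* part 3: the edges stored in the Block */
} Block;

/* Per-vertex structure: attribute info plus head/tail of the Block list. */
typedef struct Vertex {
    uint64_t id;                      /* vertex ID, used as the Key            */
    void    *attributes;              /* vertex attribute information          */
    Block   *head;                    /* index info: newest Block (list head)  */
    Block   *tail;                    /* index info: oldest Block (list tail)  */
} Vertex;
```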
When a new edge is inserted into a vertex neighborhood, the new edge is ordered by timestamp, and the Block at the head of the list is checked first: if it is not full, the new edge is added to that head Block; if the head Block is full and there are still new edges to insert, the Block pool manager allocates a new Block, adds it to the head of the list, and the remaining new edges fill the new Block. This process repeats until all new edges have been inserted. When aged edges are deleted from a vertex neighborhood, deletion starts from the Block at the tail of the Block doubly linked list; to delete old edges quickly, each Block has a dedicated data field holding the timestamp interval of the edges it stores. If the timestamp of the edge to be deleted is smaller than the lower bound of the interval, the deletion ends; if it is larger than the upper bound, the Block is removed from the tail of the list and reclaimed by the Block pool; if it lies within the interval, the edges in the Block are scanned, the edges older than that timestamp are deleted, and the operation ends.
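A minimal C sketch of these two procedures, building on the structures above, might look as follows; block_pool_alloc and block_pool_free stand in for the Block pool manager and are assumed names.

```c
/* Assumed Block pool interface (names are illustrative). */
Block *block_pool_alloc(void);
void   block_pool_free(Block *b);

/* Insert a new edge at the head of the vertex's Block doubly linked list. */
void insert_edge(Vertex *v, uint64_t dst_id, uint64_t ts)
{
    Block *head = v->head;
    if (head == NULL || head->count == EDGES_PER_BLOCK) {
        /* Head Block missing or full: take a fresh Block from the pool. */
        Block *b = block_pool_alloc();
        b->count = 0;
        b->ts_min = b->ts_max = ts;
        b->prev = NULL;
        b->next = head;
        if (head) head->prev = b;
        v->head = b;
        if (v->tail == NULL) v->tail = b;
        head = b;
    }
    head->edges[head->count].dst_id = dst_id;
    head->edges[head->count].timestamp = ts;
    head->count++;
    if (ts > head->ts_max) head->ts_max = ts;
    if (ts < head->ts_min) head->ts_min = ts;
}

/* Delete every edge older than 'threshold', starting from the list tail. */
void delete_aged_edges(Vertex *v, uint64_t threshold)
{
    while (v->tail != NULL) {
        Block *tail = v->tail;
        if (threshold < tail->ts_min) {
            return;                        /* nothing older: deletion ends     */
        } else if (threshold > tail->ts_max) {
            v->tail = tail->prev;          /* whole Block is aged: unlink it   */
            if (v->tail) v->tail->next = NULL; else v->head = NULL;
            block_pool_free(tail);         /* Block goes back to the pool      */
        } else {
            uint32_t kept = 0;             /* threshold falls inside interval: */
            for (uint32_t i = 0; i < tail->count; i++)   /* keep newer edges   */
                if (tail->edges[i].timestamp >= threshold)
                    tail->edges[kept++] = tail->edges[i];
            tail->count = kept;
            return;                        /* end of operation                 */
        }
    }
}
```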
Storing and organizing a vertex neighborhood as a Block doubly linked list of timestamp-ordered outgoing edges has several advantages. First, it avoids the frequent dynamic allocation of small pieces of memory and the excess metadata caused by storing each edge separately: graph data are updated frequently and each edge carries little data, so per-edge updates would cause frequent small allocations, the ratio of metadata to real data would grow, memory allocation efficiency and utilization would drop, and separately stored edges are also unfavorable for accessing a vertex neighborhood. Second, it avoids the problems of storing all outgoing edges of a vertex globally as a blob (a dynamically allocated contiguous memory region of variable length), which is unfavorable for dynamic graph updates: inserting new edges can exhaust the original storage space, so a sufficiently large new space must be opened up, the original data moved and the new edges inserted, while deleting old edges leaves a considerable amount of unusable internal fragmentation. Periodic compaction and data movement are then usually used to rearrange the graph data and eliminate fragments; although this global organization gives very high performance when accessing a vertex neighborhood, the compaction significantly affects the efficiency of random incremental graph updates.
In the present system, taking full account of the characteristics of graph data and of graph data access, a vertex neighborhood storage model is designed, as shown in Fig. 2. Each Block consists of three parts: the first part is a pointer to the previous Block, the second part is a pointer to the next Block, and the third part holds the edges stored in this Block. This model effectively supports graph data updates.
The graph data storage method presented here differs from the approaches used in industry in the following respects regarding data updates:
First, Redis and Neo4j store data using contiguous allocation; as data keep being updated and outgrow the allocated block, a larger contiguous memory block must be allocated, the original data moved into the new block, and the new data inserted, so frequent updates cause frequent data movement. Redis uses a pre-allocation mechanism, allocating more memory each time than is actually needed in order to absorb possible future growth, but this does not fundamentally solve the data movement problem. The present invention instead uses linked allocation: the Block pool manages the free memory Blocks, allocated Blocks are linked into a doubly linked list, and data updates take place at the ends of the list, so no data movement or copying is caused.
Second, Redis and Neo4j store the vertices and edges of the graph separately, whereas the method presented here directly stores vertex neighborhoods. To store graph data of the same scale, Redis and Neo4j must start many more network I/O operations: the number of I/O operations they start equals the number of vertices plus the number of edges, while the present invention starts one per vertex, and in social media graphs the number of edges is far larger than the number of vertices. Even when bulk transfer is used, Redis and Neo4j access data through SQL-like query languages, so the data to be written must be converted into batch query statements at the client and parsed and executed at the server, and this conversion and parsing hurt efficiency. The present invention uses the vertex and its neighborhood as the data organization unit; during remote access, the client buffers the data destined for the target machine, and when the buffer is full, or a timeout occurs, the batched data in the buffer are sent to the target machine.
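The client-side batching just described could be sketched in C as follows; send_to_target, the buffer size and the timeout are placeholders assumed for this sketch rather than the actual transport interface.

```c
#include <string.h>
#include <time.h>

#define WRITE_BUF_SIZE (4 * 1024 * 1024)   /* assumed 4 MB client buffer */
#define FLUSH_TIMEOUT_S 1                  /* assumed flush timeout      */

typedef struct WriteBuffer {
    char   data[WRITE_BUF_SIZE];
    size_t used;
    time_t last_flush;
} WriteBuffer;

/* Placeholder for the actual network send to the target machine. */
void send_to_target(const char *buf, size_t len);

/* Buffer one serialized neighborhood update; flush when full or timed out. */
void buffered_write(WriteBuffer *wb, const char *record, size_t len)
{
    if (wb->used + len > WRITE_BUF_SIZE ||
        time(NULL) - wb->last_flush >= FLUSH_TIMEOUT_S) {
        send_to_target(wb->data, wb->used);    /* ship the whole batch   */
        wb->used = 0;
        wb->last_flush = time(NULL);
    }
    memcpy(wb->data + wb->used, record, len);  /* append the new record  */
    wb->used += len;
}
```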
Finally, every data update in Neo4j is a transactional operation, and the data written during a transaction must be synchronized to the disk file; Redis uses the RDB mechanism, in which a background I/O thread writes the in-memory data of a time period to the disk file, and each such write blocks other updates. The present invention instead uses the mmap interface when updating: a mapping is established between a memory region and a disk file, the data are written to the memory region, and the operating system kernel is responsible for synchronizing the updates to the disk file.
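As an illustration of this mmap-based update path, a minimal sketch is given below; the file path, region size and function name are hypothetical, and only standard POSIX calls are used.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a pre-sized store file into memory; writes to the returned region are
 * later synchronized to the disk file by the operating system kernel. */
void *map_store_file(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return NULL; }
    void *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    return region == MAP_FAILED ? NULL : region;
}
```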
Taken together, these differences give the present invention several design advantages during data updates, allowing it to reach higher update performance.
Memory management strategy adapting to data scale
Graph computing involves highly random data access, and organizing data in memory improves random access efficiency. However, graph data can be massive, and the limited memory of a cluster cannot hold a large-scale graph. A memory management scheme that adapts to the data scale is therefore proposed: memory is used preferentially to organize the graph data, and when memory resources are exhausted, data are allowed to spill to disk. Concretely, the storage resources usable by the graph data management system are divided into two classes: coarse-grained storage resources and fine-grained storage resources.
A coarse-grained resource corresponds to a large pre-allocated file in the local Linux file system; its size is 2 GB, 4 GB or even larger, provided it does not exceed the available memory of a single cluster node. At run time the coarse-grained resources are fully loaded and reside in memory. A fine-grained resource is a small file of the local Linux file system allocated on demand; its size is 4 MB or 8 MB, and the number of small files keeps growing as the graph grows. At run time the coarse-grained resources and all the fine-grained resources cannot all be placed in memory, so a bounded buffer window is opened in memory and fine-grained resources are loaded into the window as needed. When the window is full and a new fine-grained resource must be loaded, an existing fine-grained resource in the window is chosen and evicted according to a cache replacement algorithm, and the new resource is then loaded.
Both classes of resources are divided into fixed-size Blocks, every Block has a unique Block ID, and Blocks can be accessed randomly by Block ID. The Block is the basic allocation unit of storage resources, and the Block pool manager is responsible for allocating and reclaiming Blocks. When the Block at the head of a vertex neighborhood's doubly linked list is full, a new Block must be allocated as the new list head for newly inserted edges; when old edges of a vertex neighborhood are deleted, the Blocks that become empty must be released. Every Block is either free or belongs to a specified vertex neighborhood.
When allocating resources, the storage resource can be selected dynamically according to the scale of the data. Allocation starts from the coarse-grained resources, so that data are organized in memory as far as possible. When the coarse-grained resources are exhausted, fine-grained resources start to be allocated dynamically on demand to hold the data spilled from memory. Because the edges in a vertex neighborhood are stored in timestamp order, the graph data can be scanned periodically, and expired graph data can be deleted from the coarse-grained resources and dumped into the fine-grained resources, so that the data kept in memory stay fresh.
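A sketch of this two-tier allocation order is shown below; the pool representation and the helper load_fine_grained_file are illustrative assumptions.

```c
#include <stddef.h>

/* Free lists for the two resource classes (illustrative representation). */
typedef struct BlockPool {
    Block **coarse_free; size_t coarse_count;  /* Blocks in resident large files  */
    Block **fine_free;   size_t fine_count;    /* Blocks in on-demand small files */
} BlockPool;

/* Assumed helper: load another fine-grained small file into the bounded
 * buffer window (evicting one by the cache replacement policy if needed)
 * and add its Blocks to the fine-grained free list; returns 0 on success. */
int load_fine_grained_file(BlockPool *pool);

/* Allocate preferentially from coarse-grained (in-memory) resources,
 * falling back to fine-grained resources once they are exhausted. */
Block *adaptive_alloc(BlockPool *pool)
{
    if (pool->coarse_count > 0)
        return pool->coarse_free[--pool->coarse_count];
    if (pool->fine_count == 0 && load_fine_grained_file(pool) != 0)
        return NULL;                           /* no resource could be loaded */
    return pool->fine_count > 0 ? pool->fine_free[--pool->fine_count] : NULL;
}
```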
With this memory allocation strategy adapting to data scale, memory is used to organize data as far as possible, which improves random data access, while the graph is allowed to spill from memory to disk as it grows, which solves the problem of managing massive graph data.
Application-aware remote prefetch strategy
In the present invention, to speed up remote data access, an application-aware remote prefetch mechanism is proposed. With this mechanism, the throughput of remote data access can be improved significantly.
Like access to sequentially stored data, graph data access exhibits obvious locality. Graph computing algorithms are generally based on graph traversal. Taking breadth-first traversal as an example, when vertex v is accessed, the probability that a vertex of the k-order neighborhood of v (the set of all vertices reachable from v within a shortest distance of k edges) will be accessed decreases as k increases. Therefore, when processing the access request for v, the system can predict, from vertex position information, the vertices most likely to be accessed in the near future, combine the data of vertex v with the data of those vertices, and transfer them to the requester. The requester caches the prefetched data locally; when it next needs remote data, it first checks the local cache and accesses it directly if the data are present, and only otherwise requests the data remotely.
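The requester-side query path could be sketched as follows; local_cache_lookup, local_cache_insert and remote_fetch_k_neighborhood are hypothetical names for the local cache and the remote request, introduced only for this sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical requester-side cache and remote-request interfaces. */
Vertex *local_cache_lookup(uint64_t vertex_id);
void    local_cache_insert(Vertex *v);

/* Remote call: returns vertex v plus the vertices of its k-order neighborhood
 * (every vertex reachable from v within shortest distance k); the number of
 * returned vertices is written to *n. */
Vertex **remote_fetch_k_neighborhood(uint64_t vertex_id, int k, size_t *n);

/* Query a vertex: check the local cache first, otherwise ask the data
 * storage manager, which also prefetches the k-order neighborhood. */
Vertex *query_vertex(uint64_t vertex_id, int k)
{
    Vertex *v = local_cache_lookup(vertex_id);
    if (v != NULL)
        return v;                          /* served from the local cache     */

    size_t n = 0;
    Vertex **batch = remote_fetch_k_neighborhood(vertex_id, k, &n);
    for (size_t i = 0; i < n; i++)
        local_cache_insert(batch[i]);      /* cache v and its prefetched
                                              neighborhood for later queries  */
    return local_cache_lookup(vertex_id);
}
```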
Many graph computing frameworks read all vertices of a subgraph of the graph sequentially; in that case sequential prefetching can be used. The prefetching scheme must match the data access characteristics of the graph computation; several prefetch strategies can be provided, and the user can choose the best one among them.
Example 1: data access
This experiment compares the local write performance of the present invention with that of Redis and Neo4j under different levels of concurrency; the reported write performance is the average over ten data sets, and in Fig. 3 the present invention is labeled NYNN.
As shown in Fig. 3, the local write performance of the present invention is clearly better than that of Redis and Neo4j; the design of the metadata, the data transfer format and the data write mechanism all affect write performance. The present invention partitions the graph data into wide, contiguous vertex ranges, so the first data write request can cache the metadata locally, and subsequent requests use the locally cached metadata for data addressing, which reduces the I/O overhead of addressing. When data are written, they are written into a local memory-mapped file. Redis and Neo4j both provide convenient query commands/languages: the user's write request is first packaged into a query command and sent to the server over the local network, and the server parses the command and completes the query.
This experiment compares the local read performance of the present invention with that of Redis and Neo4j under different levels of concurrency; the reported read performance is the average over ten data sets.
As shown in Fig. 4, for local reads the present invention reaches GB/s throughput while Redis and Neo4j reach 10 MB/s to 100 MB/s; the present invention can map the graph data directly into the address space of the client process through the mmap interface and operate on them there, whereas Redis and Neo4j request data from the server using query commands.
To reflect the storage efficiency of the graph data storage systems, the present invention compares the storage expansion ratio of several systems, i.e., the memory and disk space occupied when storing the same amount of data.
As shown in Fig. 5, comparing the disk usage of Redis, Neo4j and the present invention, it can clearly be observed that Neo4j, as a disk-based storage system, occupies the most disk space: when storing data it must store both attribute values and attribute names, and attribute values are stored as strings. Redis requires the least disk space because it compresses the data before persisting them. In memory usage, Redis uses the most memory because it loads all data into memory; Neo4j and the present invention can configure the amount of memory to use.
This experiment compares the remote write performance of the present invention with that of Redis and Neo4j under different levels of concurrency; the reported write performance is the average over ten data sets.
As shown in Fig. 6, for remote data writes the performance of the present invention is clearly higher than that of Neo4j and Redis. The data exchange of the present invention transmits byte streams in message form, whereas Redis and Neo4j use query commands or languages, which reduces data transfer efficiency. In addition, the back-end data writes of the present invention are processed by multiple threads to improve efficiency, whereas Redis uses single-threaded processing based on an event model.
This experiment compares the remote read performance of the present invention with that of Redis and Neo4j under different levels of concurrency; the reported read performance is the average over ten data sets.
As shown in Fig. 7, for remote reads the present invention uses a data prefetch mechanism and can transfer valid data in bulk, whereas Redis and Neo4j do not use prefetching.
Example 2: data update
This experiment compares the data update performance of the present invention with that of Redis and Neo4j; the reported performance is the average over ten data sets.
As shown in Fig. 8, for incremental data updates the present invention uses linked allocation, which eliminates the frequent data movement caused by insertions and deletions; Redis and Neo4j use sequential allocation, so as data are updated, data must be moved and relocated, and during this process no new queries can be handled.
Example 3: remote data access
This experiment tests the performance difference of the present invention with and without prefetching. The present invention provides several data prefetch mechanisms so that upper-layer applications can select the best prefetching mode for their own application background. This experiment uses BFS prefetching, and the algorithm used in the test counts the total number of edges in the graph data set. The idea of the algorithm is as follows: the vertices of the graph are partitioned into n shards, each shard is handled by one thread, and all threads access the graph data in the present invention in parallel. A thread puts the vertices to be accessed into a queue, then processes the vertex at the head of the queue, accesses its neighborhood and adds the number of its edges to the count; if the other endpoint of an edge lies in the shard belonging to the thread, that endpoint is put into the queue. The head of the queue is processed repeatedly until the queue is empty. When all threads have completed their vertex statistics, the results are summed and the algorithm ends. This is a BFS traversal algorithm, which makes good use of the BFS prefetch mechanism: the prefetched data have a very high hit rate in subsequent accesses.
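For reference, one thread of this edge-counting test could be sketched roughly as follows; the queue, visited-set and shard helpers are assumed names, and graph access goes through the hypothetical query_vertex shown earlier.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed per-thread helpers: a FIFO queue of vertex IDs, a visited set,
 * and the shard ownership function. */
typedef struct Queue Queue;
void     queue_push(Queue *q, uint64_t id);
uint64_t queue_pop(Queue *q);
bool     queue_empty(Queue *q);
bool     mark_visited(uint64_t id);         /* returns false if already visited */
int      owner_shard(uint64_t id);          /* shard that owns a vertex ID      */

/* One thread: BFS over its own shard, counting the edges it sees. */
uint64_t count_edges_in_shard(Queue *q, int my_shard, int k)
{
    uint64_t edges = 0;
    while (!queue_empty(q)) {
        uint64_t id = queue_pop(q);
        if (!mark_visited(id))
            continue;
        Vertex *v = query_vertex(id, k);    /* served by the BFS prefetch cache */
        for (Block *b = v->head; b != NULL; b = b->next) {
            edges += b->count;              /* count this vertex's edges        */
            for (uint32_t i = 0; i < b->count; i++) {
                uint64_t dst = b->edges[i].dst_id;
                if (owner_shard(dst) == my_shard)
                    queue_push(q, dst);     /* follow edges inside my shard     */
            }
        }
    }
    return edges;                           /* per-thread total, summed later   */
}
```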
As shown in Fig. 9, the present invention provides several data prefetch mechanisms, such as sequential prefetching, BFS prefetching and DFS prefetching. The upper-layer computing framework can choose the best prefetch mode to fetch graph data from the remote site in batches. During graph data access each read touches little data, but reads that are close in time are predictable; by accumulating many small reads into bulk transfers, the bandwidth can be fully used and the I/O start-up count reduced, so higher data throughput is obtained with lower concurrency.

Claims (10)

1. A graph data storage method for large-scale social networks, comprising the steps of:
1) a data storage manager stores the received graph data in Key-Value form, with the vertex ID of the graph data as the Key and the vertex neighborhood as the Value;
2) the data of each vertex neighborhood are stored as follows: the edges connected to the vertex are stored in timestamp order in fixed-size memory Blocks, the occupied Blocks are linked into a Block doubly linked list, and the vertex's attribute information, together with index information pointing to the head and tail of the Block doubly linked list, is stored in a data structure Vertex.
2. The method of claim 1, wherein the memory Blocks are allocated and reclaimed by a Block pool manager responsible for storage resource allocation; each Block in the Block doubly linked list comprises three parts: the first part is a pointer to the previous Block in the list, the second part is a pointer to the next Block in the list, and the third part holds the edges stored in the current Block.
3. The method of claim 2, wherein in step 2), when a vertex has a new edge to be stored into its neighborhood, the new edge is ordered by timestamp and the Block at the head of the Block doubly linked list is checked: if it is not full, the new edge is added to that head Block; if the head Block is full, the Block pool manager allocates a new memory Block and adds it to the head of the Block doubly linked list, and the new edge is stored in the new memory Block.
4. The method of claim 1, 2 or 3, wherein each memory Block is provided with a data field holding the timestamp interval of the edges it stores; when edges are deleted from a vertex neighborhood, an edge is chosen as the deletion threshold and processing starts from the Block at the tail of the Block doubly linked list: if the timestamp of that edge is smaller than the lower bound of the tail Block's timestamp interval, the deletion ends; if it is larger than the upper bound of the interval, the Block is removed from the tail of the Block doubly linked list; if it lies within the interval, the edges in the Block are scanned and the edges earlier than that timestamp are deleted.
5. The method of claim 1, wherein the data storage manager divides storage resources into two classes, coarse-grained storage resources and fine-grained storage resources, and a bounded buffer window is set up in the memory of the data storage manager; the coarse-grained storage resources are large files of a configured size and reside in memory, while the fine-grained storage resources are small files allocated on demand; the data storage manager loads fine-grained storage resources into the bounded buffer window as needed; when the window is full and a new fine-grained storage resource must be loaded, an existing fine-grained storage resource in the window is evicted according to a cache replacement algorithm and the new fine-grained storage resource is then loaded.
6. The method of claim 5, wherein both the coarse-grained storage resources and the fine-grained storage resources are divided into the fixed-size memory Blocks, and each memory Block has a unique Block ID.
7. The method of claim 5, wherein the data storage manager first allocates memory Blocks from the coarse-grained storage resources, and when the coarse-grained storage resources are exhausted, it begins to allocate fine-grained storage resources dynamically on demand.
8. The method of claim 5, wherein the data storage manager periodically scans the graph data, and expired graph data are deleted from the coarse-grained storage resources and dumped into the fine-grained storage resources.
9. A graph data query method for graph data stored by the method of claim 1, wherein, when the data storage manager receives an access request for a vertex v, the data storage manager transfers vertex v and its k-order neighborhood to the requester; the requester caches the returned data locally, checks the local cache first on the next query, and sends the access request to the data storage manager if the queried vertex is not found there.
10. The method of claim 9, wherein the k-order neighborhood of vertex v is the set of all vertices reachable from vertex v within a shortest distance of k edges.
CN201510229346.3A 2015-05-07 2015-05-07 Graph data storage and query method for large-scale social networks Active CN104899156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510229346.3A CN104899156B (en) 2015-05-07 2015-05-07 Graph data storage and query method for large-scale social networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510229346.3A CN104899156B (en) 2015-05-07 2015-05-07 Graph data storage and query method for large-scale social networks

Publications (2)

Publication Number Publication Date
CN104899156A true CN104899156A (en) 2015-09-09
CN104899156B CN104899156B (en) 2017-11-14

Family

ID=54031830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510229346.3A Active CN104899156B (en) 2015-05-07 2015-05-07 Graph data storage and query method for large-scale social networks

Country Status (1)

Country Link
CN (1) CN104899156B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049280A (en) * 2011-10-14 2013-04-17 浪潮乐金数字移动通信有限公司 Method for achieving key macro definition function and mobile terminal
CN104281816A (en) * 2014-10-14 2015-01-14 厦门智芯同创网络科技有限公司 Rainbow table parallel system design method and device based on MapReduce
CN104462328A (en) * 2014-12-02 2015-03-25 深圳中科讯联科技有限公司 Blended data management method and device based on Hash tables and dual-circulation linked list

Cited By (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653609A (en) * 2015-12-24 2016-06-08 中国建设银行股份有限公司 Memory-based data processing method and device
CN105653609B (en) * 2015-12-24 2019-08-09 中国建设银行股份有限公司 Data processing method memory-based and device
CN106919628A (en) * 2015-12-28 2017-07-04 阿里巴巴集团控股有限公司 A kind for the treatment of method and apparatus of diagram data
WO2017114164A1 (en) * 2015-12-28 2017-07-06 阿里巴巴集团控股有限公司 Graph data processing method and apparatus
CN105868025B (en) * 2016-03-30 2019-05-10 华中科技大学 A kind of system solving memory source keen competition in big data processing system
CN105868025A (en) * 2016-03-30 2016-08-17 华中科技大学 System for settling fierce competition of memory resources in big data processing system
WO2018099299A1 (en) * 2016-11-30 2018-06-07 华为技术有限公司 Graphic data processing method, device and system
US11256749B2 (en) 2016-11-30 2022-02-22 Huawei Technologies Co., Ltd. Graph data processing method and apparatus, and system
CN110168533B (en) * 2016-12-15 2023-08-08 微软技术许可有限责任公司 Caching of sub-graphs and integrating cached sub-graphs into graph query results
CN110168533A (en) * 2016-12-15 2019-08-23 微软技术许可有限责任公司 Caching to subgraph and the subgraph of caching is integrated into figure query result
US11748506B2 (en) 2017-02-27 2023-09-05 Microsoft Technology Licensing, Llc Access controlled graph query spanning
CN107220188A (en) * 2017-05-31 2017-09-29 莫倩 A kind of automatic adaptation cushion block replacement method
CN107220188B (en) * 2017-05-31 2020-10-27 中山大学 Self-adaptive buffer block replacement method
CN112236760A (en) * 2018-07-27 2021-01-15 浙江天猫技术有限公司 Method, system, computer readable storage medium and equipment for updating graph data
CN112236760B (en) * 2018-07-27 2024-06-07 浙江天猫技术有限公司 Graph data updating method, system, computer readable storage medium and equipment
CN109255055A (en) * 2018-08-06 2019-01-22 四川蜀天梦图数据科技有限公司 A kind of diagram data access method and device based on packet associated table
CN109255055B (en) * 2018-08-06 2020-10-30 四川蜀天梦图数据科技有限公司 Graph data access method and device based on grouping association table
WO2020073641A1 (en) * 2018-10-11 2020-04-16 国防科技大学 Data structure-oriented data prefetching method and device for graphics processing unit
US11520589B2 (en) 2018-10-11 2022-12-06 National University Of Defense Technology Data structure-aware prefetching method and device on graphics processing unit
CN109491988B (en) * 2018-11-05 2021-12-14 北京中安智达科技有限公司 Data real-time association method supporting full-scale updating
CN109491988A (en) * 2018-11-05 2019-03-19 北京中安智达科技有限公司 A kind of data real time correlation method for supporting full dose to update
CN109582808A (en) * 2018-11-22 2019-04-05 北京锐安科技有限公司 A kind of user information querying method, device, terminal device and storage medium
CN111274165A (en) * 2018-12-05 2020-06-12 核桃运算股份有限公司 Data management device, method and computer storage medium thereof
CN110377622A (en) * 2019-06-19 2019-10-25 深圳新度博望科技有限公司 Data capture method, data retrieval method and request responding method
CN111090653B (en) * 2019-12-20 2023-12-15 东软集团股份有限公司 Data caching method and device and related products
CN111090653A (en) * 2019-12-20 2020-05-01 东软集团股份有限公司 Data caching method and device and related products
CN111460197A (en) * 2020-01-21 2020-07-28 中国国土勘测规划院 Method for identifying vector elements of homeland plane intersection
CN111460197B (en) * 2020-01-21 2023-07-25 中国国土勘测规划院 Method for identifying vector elements of homeland planar intersection
CN111523000A (en) * 2020-04-23 2020-08-11 北京百度网讯科技有限公司 Method, device, equipment and storage medium for importing data
WO2021139230A1 (en) * 2020-07-28 2021-07-15 平安科技(深圳)有限公司 Method and apparatus for accelerated data access based on graph database
CN111858612A (en) * 2020-07-28 2020-10-30 平安科技(深圳)有限公司 Data accelerated access method and device based on graph database and storage medium
CN111858612B (en) * 2020-07-28 2023-04-18 平安科技(深圳)有限公司 Data accelerated access method and device based on graph database and storage medium
CN113127660A (en) * 2021-05-24 2021-07-16 成都四方伟业软件股份有限公司 Timing graph database storage method and device
CN113449152A (en) * 2021-06-24 2021-09-28 西安交通大学 Image data prefetcher and prefetching method
CN113449152B (en) * 2021-06-24 2023-01-10 西安交通大学 Image data prefetcher and prefetching method
CN113672590B (en) * 2021-07-22 2024-06-07 浙江大华技术股份有限公司 Data cleaning method, graph database device and computer readable storage medium
CN113672590A (en) * 2021-07-22 2021-11-19 浙江大华技术股份有限公司 Data cleaning method, graph database device and computer readable storage medium
CN113672610B (en) * 2021-10-21 2022-02-15 支付宝(杭州)信息技术有限公司 Graph database processing method and device
CN113672610A (en) * 2021-10-21 2021-11-19 支付宝(杭州)信息技术有限公司 Graph database processing method and device
WO2023066221A1 (en) * 2021-10-21 2023-04-27 支付宝(杭州)信息技术有限公司 Graph database processing
CN114138776A (en) * 2021-11-01 2022-03-04 杭州欧若数网科技有限公司 Method, system, apparatus and medium for graph structure and graph attribute separation design
CN113722520A (en) * 2021-11-02 2021-11-30 支付宝(杭州)信息技术有限公司 Graph data query method and device
CN113722520B (en) * 2021-11-02 2022-05-03 支付宝(杭州)信息技术有限公司 Graph data query method and device
CN113760971B (en) * 2021-11-09 2022-02-22 通联数据股份公司 Method, computing device and storage medium for retrieving data of a graph database
CN113760971A (en) * 2021-11-09 2021-12-07 通联数据股份公司 Method, computing device and storage medium for retrieving data of a graph database
CN113779286A (en) * 2021-11-11 2021-12-10 支付宝(杭州)信息技术有限公司 Method and device for managing graph data
WO2023083234A1 (en) * 2021-11-11 2023-05-19 支付宝(杭州)信息技术有限公司 Graph state data management
CN113779286B (en) * 2021-11-11 2022-02-08 支付宝(杭州)信息技术有限公司 Method and device for managing graph data
CN115188421A (en) * 2022-09-08 2022-10-14 杭州联川生物技术股份有限公司 Gene clustering data preprocessing method, device and medium based on high-throughput sequencing

Also Published As

Publication number Publication date
CN104899156B (en) 2017-11-14

Similar Documents

Publication Publication Date Title
CN104899156A (en) Large-scale social network service-oriented graph data storage and query method
US9582421B1 (en) Distributed multi-level caching for storage appliances
US9959279B2 (en) Multi-tier caching
US11392544B2 (en) System and method for leveraging key-value storage to efficiently store data and metadata in a distributed file system
US10289315B2 (en) Managing I/O operations of large data objects in a cache memory device by dividing into chunks
CN101556557B (en) Object file organization method based on object storage device
CN102158546B (en) Cluster file system and file service method thereof
CN107463447B (en) B + tree management method based on remote direct nonvolatile memory access
US8793466B2 (en) Efficient data object storage and retrieval
US10997153B2 (en) Transaction encoding and transaction persistence according to type of persistent storage
US11314689B2 (en) Method, apparatus, and computer program product for indexing a file
CN110663019A (en) File system for Shingled Magnetic Recording (SMR)
CN112000287B (en) IO request processing device, method, equipment and readable storage medium
CN113377868B (en) Offline storage system based on distributed KV database
CN100424699C (en) Attribute extensible object file system
US9307024B2 (en) Efficient storage of small random changes to data on disk
CN109376125A (en) A kind of metadata storing method, device, equipment and computer readable storage medium
CN107181773B (en) Data storage and data management method and device of distributed storage system
CN110968266B (en) Storage management method and system based on heat degree
CN111177019B (en) Memory allocation management method, device, equipment and storage medium
CN112148736A (en) Method, device and storage medium for caching data
CN111694765A (en) Mobile application feature-oriented multi-granularity space management method for nonvolatile memory file system
CN114490443A (en) Shared memory-based golang process internal caching method
CN106294189B (en) Memory defragmentation method and device
CN108664217A (en) A kind of caching method and system reducing the shake of solid-state disc storaging system write performance

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant