
CN104573082A - Space small file data distribution storage method and system based on access log information

Info

Publication number: CN104573082A
Application number: CN201510042456.9A
Authority: CN
Inventors: 潘少明; 徐正全; 种衍文; 李红; 李明; 汤戈
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2015-01-28
Filing date: 2015-01-28
Publication date: 2015-04-29
Anticipated expiration: 2035-01-28
Also published as: CN104573082B

Abstract

The invention provides a space small file data distribution storage method and system based on access log information. The method includes: dividing a space small file data set into a frequently-accessed sub-set and a non-frequently-accessed sub-set, extracting the access sequence of the frequently-accessed space small file sub-set, calculating the association degree of each frequently-accessed space small file datum, and using the values of the association degrees to form an association matrix; performing magnitude conversion on each value in the association matrix, using an RCM sorting algorithm to rearrange the values, then outputting the values, using a local approximation search method to search for the optimal combination of the rearranged association matrix, using the optimal combination to perform distributed storage on the frequently-accessed space small data, and separately storing the non-frequently-accessed space small file data according to space position neighboring relations.

Description

Distributed storage method and system for spatial small-file data based on access log information

Technical field

The invention belongs to the technical field of distributed storage of spatial small-file data, and in particular relates to a new distributed storage method and system for spatial small-file data based on access log information.

Background technology

Storing massive spatial information and accessing it quickly have always been major problems for spatial information service systems. The data volume of a conventional spatial information service system is large: NASA's Earth Observing System, for example, collects about 2 TB of data per day. A reasonable storage layout that allows these data to be accessed quickly and in parallel is therefore critical. One important class of solutions stores the data in a distributed manner, so that concurrent access improves data access efficiency.

Typical distributed file storage systems at present include GFS (Google File System), HDFS (Hadoop Distributed File System) and Lustre. However, the storage performance improvements of these systems mainly concern the handling of large files. GFS, for example, divides a large file into fixed-length blocks (e.g. 64 MB) and stores the blocks on different storage devices to improve the concurrent access rate of the data (reference: Ghemawat S, Gobioff H, Shun-Tak L. The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03). Bolton Landing, New York: IEEE, 2003. 1–15). Another typical storage technology, RAID (Redundant Array of Independent Disks), likewise divides each large data file into several data blocks and stores them on different disks to improve concurrent access to the file.

Although the above distributed storage methods are effective for large files, small-file data cannot be divided further, so block-based storage is poorly suited to them. The common approach at present is simply to store each single file on a single storage server, which makes concurrent access to multiple small files difficult and yields low I/O efficiency.

Research shows that most current systems contain large numbers of small files. For example, of the 13 million files at the U.S. National Energy Research Scientific Computing Center, 99% are smaller than 64 MB and 44% are smaller than 64 KB (reference: Carns P, Lang S, Ross R, et al. Small-file access in parallel file systems [C]. Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009: 1-11).

In fact, spatial information service systems based on the pyramid model, such as Google Earth and World Wind, also store spatial data as small files. World Wind divides the Earth into tiles of different resolutions according to the pyramid model and saves each tile as one file; each tile is fixed at 512 × 512 pixels and each tile file is no larger than 1 MB (reference: Boschetti L, Roy D P, Justice C O. Using NASA's World Wind virtual globe for interactive internet visualization of the global MODIS burned area product. Int J Remote Sens, 2008, 29(11): 3067–3072). Google Earth likewise stores spatial data with a multi-resolution model, and each data file is no larger than 64 MB (reference: Sample J T, Loup E. Tile-base geospatial information system: principle and practices. New York: Springer, 2010. 23–200).

In short, current distributed storage methods for large files are difficult to apply to small-file data, and optimization for small-file data has concentrated on access optimization rather than storage optimization (access optimization is client-oriented, whereas storage optimization is server-oriented). Examples include reducing the execution time of data-intensive applications (reference: J. Kim, A. Chandra, and J. B. Weissman. Using Data Accessibility for Resource Selection in Large-Scale Distributed Systems. IEEE Trans. Parallel Distributed Systems, vol. 20, no. 6, pp. 788-801, June 2009) and reducing the overhead of small-file index information (reference: A. L. Chervenak, R. Schuler, M. Ripeanu, M. A. Amer, S. Bharathi, I. Foster, A. Iamnitchi, and C. Kesselman. The Globus Replica Location Service: Design and Experience. IEEE Trans. Parallel Distributed Systems, vol. 20, no. 9, pp. 1260-1272, Sept. 2009). In a distributed system, however, access latency depends not only on the access method but also on how the data are distributed across storage. The optimization problem for small-file data has therefore not been fundamentally solved.

Summary of the invention

In view of the above problems, the invention provides a distributed storage method and system for spatial small-file data based on access log information. The access log information of the spatial small-file data is used to analyze the relationships among the spatial small files, and the files are stored in a distributed manner accordingly, so as to improve the concurrent access rate of spatial small-file data.

The technical solution adopted by the distributed storage method and system of the invention is as follows:

A distributed storage method for spatial small-file data based on access log information, which, for any one spatial small-file data type, comprises the following steps:

Step 1: divide the spatial small-file data set into a frequently accessed subset and an infrequently accessed subset according to access frequency; this comprises the following sub-steps.

Step 1.1: obtain the access heat of each spatial small file, as follows.

Let the spatial small-file data set be F = {f_1, f_2, ..., f_N}, comprising the spatial small files f_1, f_2, ..., f_N, where N is the total number of spatial small files and the i-th spatial small file is denoted f_i, i = 1, 2, ..., N.

Let the access log sequence of spatial small files recorded successively in the access log information be R = {f_{a_1}, f_{a_2}, ..., f_{a_M}}, where A = (a_1, a_2, ..., a_M) is the access sequence vector of the spatial small files, a_t ∈ [1, N], the access index is t = 1, 2, ..., M, and M is the total number of accesses to all spatial small files in F.

Count the number of times λ_i that each spatial small file f_i occurs in the access log sequence R, and take λ_i as the access heat of f_i.

Step 1.2: extract the frequently accessed spatial small files according to access heat, as follows.

Input a preset discrimination parameter λ.

If the access heat of a spatial small file f_i in the data set F satisfies λ_i > λ, then f_i is a frequently accessed spatial small file; otherwise f_i is an infrequently accessed spatial small file.

Step 1.3: form a subset of the spatial small-file data set from the frequently accessed files obtained in step 1.2, as follows.

Let the subset formed by all frequently accessed spatial small files be F_1, where N_1 is the total number of frequently accessed spatial small files, and the i_1-th and j_1-th frequently accessed spatial small files are denoted f_{i_1} and f_{j_1} respectively, with i_1, j_1 ∈ [1, N_1].
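Step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `split_by_heat` and the representation of the log as a plain list of file indices are hypothetical.

```python
from collections import Counter

def split_by_heat(access_ids, all_ids, threshold):
    """Split file ids into (frequent, infrequent) using the access heat.

    access_ids: the access sequence vector A (file indices in log order).
    all_ids:    indices of every file in the data set F.
    threshold:  the preset discrimination parameter lambda.
    """
    heat = Counter(access_ids)  # lambda_i: occurrences of f_i in the log
    frequent = [i for i in all_ids if heat[i] > threshold]
    infrequent = [i for i in all_ids if heat[i] <= threshold]
    return frequent, infrequent

# Hypothetical log: files 1 and 2 are each accessed four times.
A = [1, 2, 1, 3, 2, 1, 4, 2, 1, 2]
frequent, infrequent = split_by_heat(A, [1, 2, 3, 4, 5], threshold=2)
print(frequent, infrequent)  # [1, 2] [3, 4, 5]
```

Files that never appear in the log have heat 0 and therefore fall into the infrequently accessed subset.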

Step 2: extract the access sequence of the frequently accessed subset, which comprises forming, from the access log information and in chronological order, the access sequence

R_{1} = {f_{a_{1_{1}}}, f_{a_{2_{1}}}, ..., f_{a_{M_{1}}}}, A_{1} = (a_{1_{1}}, a_{2_{1}}, ..., a_{M_{1}})

where A_1 is the access sequence vector of the frequently accessed spatial small files, the access index is t_1 = 1_1, 2_1, ..., M_1, and M_1 is the total number of accesses to all frequently accessed spatial small files in F_1.
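A possible reading of step 2 is a filter over the full access vector that keeps only accesses to frequently accessed files and renumbers them into [1, N_1]. The sketch below assumes the frequently accessed subset is given as a list of original file indices; the function name is hypothetical.

```python
def frequent_access_sequence(access_ids, frequent):
    """Build A_1: accesses to frequent files only, in original time order.

    Each kept access is renumbered to the 1-based position of the file
    inside `frequent`, so every element lies in [1, N_1].
    """
    index = {fid: k + 1 for k, fid in enumerate(frequent)}
    return [index[a] for a in access_ids if a in index]

A = [7, 3, 7, 9, 3, 5]           # hypothetical full access vector
A1 = frequent_access_sequence(A, frequent=[3, 7])
print(A1)  # [2, 1, 2, 1]
```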

Step 3: segment the access sequence of the frequently accessed subset, calculate the degree of association between each pair of frequently accessed spatial small files, and assemble these values into an association matrix; this comprises the following sub-steps.

Step 3.1: from the number of storage servers m and the size N_1 of the frequently accessed subset, calculate the segment length of the frequent access sequence, n = N_1 / m.

Step 3.2: segment the frequent access sequence according to the segment length, as follows.

Following the access order, divide the access sequence vector A_1 into sub-vectors of n elements each, written A_1 = (S_1, S_2, ..., S_l), where each sub-vector is S_k = (a_{k1}, a_{k2}, ..., a_{kn}), a_{kj} ∈ [1, N_1], 1 ≤ k ≤ l, 1 ≤ j ≤ n. The set of all sub-vectors of A_1 is denoted S, S = {S_k : k ∈ [1, l]}.

Step 3.3: calculate the degree of association between frequently accessed spatial small files, as follows.

Define the function R_{S_k}(i_1, j_1) over the set of all elements of S_k; this function indicates whether the frequently accessed spatial small files f_{i_1} and f_{j_1} are associated within an access period of length n.

Define the function R_S(i_1, j_1) as

R_{S} (i_{1}, j_{1}) = Σ_{k = 1}^{l} R_{S_{k}} (i_{1}, j_{1}), 1 \leq i_{1} \leq N_{1}, 1 \leq j_{1} \leq N_{1}

where R_S(i_1, j_1) is the total degree of association between f_{i_1} and f_{j_1} over S.

Step 3.4: assemble the pairwise degrees of association of the frequently accessed spatial small files into the association matrix R_S,

R_{S} = {(R_{S} (i_{1}, j_{1}))}_{N_{1} \times N_{1}}, 1 \leq i_{1} \leq N_{1}, 1 \leq j_{1} \leq N_{1}
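Step 3 can be illustrated as follows. The sketch makes three assumptions that the text does not fix: the per-segment function is treated as a 0/1 indicator that both files occur in the same segment, the diagonal is left at 0, and a trailing group shorter than n is dropped. Function names are hypothetical.

```python
def segment(seq, n):
    """Split A_1 into consecutive sub-vectors S_k of exactly n elements.

    Assumption: a trailing group with fewer than n elements is dropped.
    """
    return [seq[k:k + n] for k in range(0, len(seq) - n + 1, n)]

def association_matrix(seq, n_files, n):
    """R_S[i][j] = number of segments S_k that contain both file i and file j.

    Assumption: the per-segment function is a 0/1 indicator, and the
    diagonal (i == j) is left at 0.
    """
    R = [[0] * n_files for _ in range(n_files)]
    for S_k in segment(seq, n):
        members = sorted(set(S_k))        # set of all elements of S_k
        for i in members:
            for j in members:
                if i != j:
                    R[i - 1][j - 1] += 1  # file indices are 1-based
    return R

A1 = [1, 2, 3, 1, 2, 4, 3, 4]             # hypothetical A_1 with N_1 = 4
for row in association_matrix(A1, n_files=4, n=2):
    print(row)
```

With m = 2 servers and N_1 = 4 files the segment length is n = 2, so the segments are (1, 2), (3, 1), (2, 4) and (3, 4), and each co-occurring pair receives a count of 1.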

Step 4: apply a magnitude conversion to each element of the association matrix, rearrange the matrix with the RCM (Reverse Cuthill-McKee) ordering algorithm, and output the result.

Step 5: apply a local approximate search method to the rearranged association matrix, iterating until m optimal combinations have been found, and output them.

Step 6: store the frequently accessed spatial small files in a distributed manner according to the optimal combinations obtained in step 5, and store the infrequently accessed spatial small files separately according to their spatial neighbor relations.

Furthermore, step 4 comprises the following sub-steps.

Step 4.1: obtain the maximum element of the association matrix by traversing all element values to obtain the maximum value R_max.

Step 4.2: apply the magnitude conversion to the association matrix by traversing all element values and performing the operation R_S(i_1, j_1) = R_max - R_S(i_1, j_1).

Step 4.3: rearrange the association matrix with the standard RCM ordering algorithm.
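The three sub-steps of step 4 might look like this in Python. The ordering routine is a simplified Reverse Cuthill-McKee written for illustration: it starts each breadth-first traversal from a minimum-degree node rather than a pseudo-peripheral one, so it is an assumption-laden stand-in for the standard algorithm the text refers to. All function names are hypothetical.

```python
from collections import deque

def magnitude_convert(R):
    """Steps 4.1 and 4.2: replace every element by R_max minus the element."""
    r_max = max(max(row) for row in R)
    return [[r_max - v for v in row] for row in R]

def rcm_order(M):
    """Simplified Reverse Cuthill-McKee ordering of a symmetric matrix.

    Nonzero off-diagonal entries are treated as graph edges.
    """
    n = len(M)
    adj = [[j for j in range(n) if j != i and M[i][j] != 0] for i in range(n)]
    degree = [len(a) for a in adj]
    visited, order = [False] * n, []
    for start in sorted(range(n), key=lambda v: degree[v]):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:                       # breadth-first traversal
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: degree[u]):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    return order[::-1]                     # reversing gives the RCM order

def permute(M, order):
    """Step 4.3: apply the same permutation to rows and columns."""
    return [[M[i][j] for j in order] for i in order]

def bandwidth(M):
    """Largest |i - j| over the nonzero entries of M."""
    n = len(M)
    return max((abs(i - j) for i in range(n) for j in range(n) if M[i][j] != 0), default=0)

R = [[0, 0, 3, 0],
     [0, 0, 3, 3],
     [3, 3, 0, 0],
     [0, 3, 0, 0]]
Q = magnitude_convert(R)
order = rcm_order(Q)
P = permute(Q, order)
print(order, bandwidth(Q), bandwidth(P))  # [2, 3, 0, 1] 3 1
```

In this small example the nonzero entries of the converted matrix move from a bandwidth of 3 to a bandwidth of 1, which is the concentration near the diagonal that the rearrangement aims for.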

Furthermore, step 5 comprises the following sub-steps.

Step 5.1: initialize the current iteration count d = 1.

Step 5.2: use the local approximate search method to find one optimal combination, which comprises finding in the current matrix an n × n block whose matrix element values are maximal; the n corresponding files form one optimal combination. The first time step 5.2 is executed, the current matrix is the rearranged association matrix obtained in step 4; on subsequent executions, the current matrix is the matrix obtained in the previous iteration.

Step 5.3: after step 5.2 of the current iteration has found an optimal combination of n files, delete from the association matrix the elements corresponding to those n files, obtaining an (N_1 - dn) × (N_1 - dn) matrix.

Step 5.4: determine whether d = m - 1. If not, take the (N_1 - dn) × (N_1 - dn) matrix obtained in step 5.3 of the current iteration as the current matrix, set d = d + 1, and return to step 5.2 to search for the next optimal combination. If so, stop searching; m optimal combinations are obtained in total.
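The iteration in step 5 can be sketched as below. The text does not spell out the local approximate search itself, so this sketch substitutes a plain sliding-window scan over contiguous diagonal n × n blocks and picks the block with the largest element sum; that substitution, the function names, and the requirement that N_1 = m · n are all assumptions.

```python
def block_sum(P, start, n):
    """Sum of the n x n diagonal block whose top-left corner is (start, start)."""
    return sum(P[i][j] for i in range(start, start + n)
                       for j in range(start, start + n))

def find_combinations(P, files, n, m):
    """Return m groups of n file ids taken from the rearranged matrix P.

    `files[k]` is the file id of row/column k of P. The search for each
    block is a sliding window along the diagonal (an assumed stand-in for
    the local approximate search).
    """
    P = [row[:] for row in P]
    files = list(files)
    groups = []
    for d in range(1, m):                           # d = 1 .. m-1
        size = len(files)
        best = max(range(size - n + 1), key=lambda s: block_sum(P, s, n))
        groups.append(files[best:best + n])
        keep = [i for i in range(size) if not best <= i < best + n]
        P = [[P[i][j] for j in keep] for i in keep]  # (N1-dn) x (N1-dn)
        files = [files[i] for i in keep]
    groups.append(files)                            # last n x n block
    return groups

P = [[0, 5, 1, 1],
     [5, 0, 1, 1],
     [1, 1, 0, 2],
     [1, 1, 2, 0]]
print(find_combinations(P, ["a", "b", "c", "d"], n=2, m=2))
# [['a', 'b'], ['c', 'd']]
```

Because the matrix has already gone through the magnitude conversion, a block with large values groups files that are rarely accessed together, so files that tend to be requested together end up in different combinations.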

The invention correspondingly provides a distributed storage system for spatial small-file data based on access log information, comprising the following units.

A spatial small-file data set preprocessing unit (100), configured to divide the spatial small-file data set of any one spatial small-file data type into a frequently accessed subset and an infrequently accessed subset according to access frequency. The unit comprises the following modules. A spatial small-file access frequency statistics module (101), configured to obtain the access heat of each spatial small file.

A frequently accessed spatial small-file extraction module (102), configured to extract the frequently accessed spatial small files according to access heat, as follows.

Input a preset discrimination parameter λ.

A frequently accessed subset construction module (103), configured to form a subset of the spatial small-file data set from the frequently accessed files obtained by the extraction module (102).

A spatial small-file access vector acquisition unit (200), configured to extract the access sequence of the frequently accessed subset from the access log information, which comprises forming the access sequence R_1 in chronological order, where A_1 is the access sequence vector of the frequently accessed spatial small files, the access index is t_1 = 1_1, 2_1, ..., M_1, and M_1 is the total number of accesses to all frequently accessed spatial small files in F_1.

A spatial small-file access association matrix calculation unit (300), configured to segment the access sequence of the frequently accessed subset, calculate the degree of association between each pair of frequently accessed spatial small files, and assemble these values into an association matrix. The unit comprises the following modules. A frequent access sequence segment length calculation module (301), configured to calculate the segment length n = N_1 / m from the number of storage servers m and the size N_1 of the frequently accessed subset.

The number of storage servers m is supplied as an external input.

A frequent access sequence segmentation module (302), configured to segment the frequent access sequence according to the segment length.

A spatial small-file association degree calculation module (303), configured to calculate the degree of association between frequently accessed spatial small files, as follows.

Define the function R_{S_k}(i_1, j_1), and define the function R_S(i_1, j_1) as

R_{S} (i_{1}, j_{1}) = Σ_{k = 1}^{l} R_{S_{k}} (i_{1}, j_{1}), 1 \leq i_{1} \leq N_{1}, 1 \leq j_{1} \leq N_{1}

where R_S(i_1, j_1) is the total degree of association between f_{i_1} and f_{j_1} over S.

A spatial small-file association matrix generation module (304), configured to assemble the pairwise degrees of association of the frequently accessed spatial small files into the association matrix R_S,

R_{S} = {(R_{S} (i_{1}, j_{1}))}_{N_{1} \times N_{1}}, 1 \leq i_{1} \leq N_{1}, 1 \leq j_{1} \leq N_{1}

An association matrix conversion and rearrangement unit (400), configured to apply a magnitude conversion to each element of the association matrix, rearrange the matrix with the RCM ordering algorithm, and output the result.

An association matrix optimal combination search unit (500), configured to find the optimal combinations in the rearranged association matrix using the local approximate search method.

A spatial small-file distributed storage unit (600), configured to store the frequently accessed spatial small files in a distributed manner according to the optimal combinations obtained by the search unit (500), and to store the infrequently accessed spatial small files separately according to their spatial neighbor relations.

Furthermore, the association matrix conversion and rearrangement unit (400) comprises the following modules.

An association matrix maximum element acquisition module (401), configured to obtain the maximum element of the association matrix by traversing all element values to obtain the maximum value R_max.

An association matrix magnitude conversion module (402), configured to apply the magnitude conversion to the association matrix by traversing all element values and performing the operation R_S(i_1, j_1) = R_max - R_S(i_1, j_1).

An association matrix rearrangement module (403), configured to rearrange the association matrix with the standard RCM ordering algorithm.

Furthermore, the association matrix optimal combination search unit (500) comprises the following modules.

An initialization module, configured to initialize the current iteration count d = 1.

An optimal combination search module, configured to find one optimal combination using the local approximate search method, which comprises finding in the current matrix an n × n block whose matrix element values are maximal; the n corresponding files form one optimal combination. The first time the optimal combination search module operates, the current matrix is the rearranged association matrix obtained by the conversion and rearrangement unit (400); on subsequent operations, the current matrix is the matrix obtained in the previous iteration.

A matrix update module, configured to delete from the association matrix, after the optimal combination search module has found an optimal combination of n files in the current iteration, the elements corresponding to those n files, obtaining an (N_1 - dn) × (N_1 - dn) matrix.

A judgment and output module, configured to determine whether d = m - 1. If not, the (N_1 - dn) × (N_1 - dn) matrix obtained by the matrix update module in the current iteration is taken as the current matrix, d is set to d + 1, and the optimal combination search module is instructed to perform the next iteration and search for the next optimal combination. If so, the search stops; m optimal combinations are obtained in total.

The beneficial effects of the invention are as follows. Spatial small files are extremely numerous, but user access behaviour is aggregated: most requests concentrate on a small fraction of the files. The invention therefore classifies spatial small files by access heat, uses the access log information to calculate the degrees of association among the frequently accessed files, assembles these into an association matrix, and finds the optimal distributed storage combination scheme by the local approximate search method. Spatial small files of different heat are stored in a distributed manner under different schemes. In this way an optimized distributed storage of massive spatial small-file data is achieved with limited computational resources, which improves concurrent access performance and the service capability of the spatial information system. The invention thus reduces the overlap of spatial small-file accesses within a single server and so achieves a high parallel access rate for spatial small files across servers. It improves the performance of spatial small-file data services, reduces the amount of data to be computed, is more efficient, has good engineering practicability, and is applicable to the distributed storage of spatial small-file data in large-scale distributed environments.

Brief description of the drawings

Fig. 1 is a schematic diagram of the system architecture in an embodiment of the invention.

Fig. 2 is a schematic structural diagram of the spatial small-file data set preprocessing unit 100 in an embodiment of the invention.

Fig. 3 is a schematic structural diagram of the spatial small-file access association matrix calculation unit 300 in an embodiment of the invention.

Fig. 4 is a schematic structural diagram of the association matrix conversion and rearrangement unit 400 in an embodiment of the invention.

Fig. 5 is a flow chart of the method in an embodiment of the invention.

Detailed description of the embodiments

In a distributed environment, concurrent access to spatial small-file data is difficult to achieve through block-wise distributed storage. The relationships among the spatial small files therefore need to be analyzed so that, when spatial small files are accessed, the requested files are stored on different storage servers as far as possible. This maximizes parallel retrieval of the files and thereby improves the performance of the spatial information service system.

Because the number of spatial small files is huge, combinatorial optimization of the storage of large-scale spatial small-file data has high computational complexity and a large search time overhead. The spatial small files are therefore classified by access heat, and a different method is used for each heat class to obtain the best storage combination scheme.

The implementation of the technical solution of the invention is described in detail below.

The spatial small-file data of the invention comprise a spatial data type and a spatial coordinate position; each spatial small file is small and unsuitable for being further divided into parts stored on different servers to improve concurrent access efficiency. The access log information is the log information of client applications accessing spatial small-file data, recorded by the corresponding spatial information service system in order of occurrence, and includes the type and coordinates of each accessed spatial small file. The access log information is recorded by the spatial information service system during operation, in a form that includes but is not limited to files and databases.

The spatial small-file data comprise different types, including but not limited to SRTM30 (30 m global Shuttle Radar Topography Mission terrain data files) and SRTM90.

The distributed storage method and system for spatial small-file data based on access log information process each type of spatial small-file data separately, and the processing is identical for the different types.

As shown in Fig. 5, the technical solution adopted by the method of the invention is a distributed storage method for spatial small-file data based on access log information which, for any one spatial small-file data type, comprises the following steps:

(1) Extraction of the frequently accessed subset: divide the spatial small-file data set into a frequently accessed subset and an infrequently accessed subset according to access frequency. This comprises the following sub-steps.

1. Obtain the access heat of each spatial small file.

Let the spatial small-file data set be F = {f_1, f_2, ..., f_N}, comprising the spatial small files f_1, f_2, ..., f_N, where N is the total number of spatial small files and the i-th spatial small file is denoted f_i, i = 1, 2, ..., N.

Let the access log sequence of spatial small files recorded successively in the access log information be R = {f_{a_1}, f_{a_2}, ..., f_{a_M}}; the corresponding A = (a_1, a_2, ..., a_M) is called the access sequence vector of the spatial small files, with a_t ∈ [1, N] (access index t = 1, 2, ..., M), where M is the total number of accesses to all spatial small files in F.

Count the number of times λ_i that each f_i (f_i ∈ F) occurs in the access log sequence R; λ_i is then the access heat of the spatial small file f_i.

2. Extract the frequently accessed spatial small files according to access heat.

Input the preset discrimination parameter λ for frequently accessed spatial small files.

If the access heat of a spatial small file f_i in the data set F satisfies λ_i > λ, then f_i is a frequently accessed spatial small file; otherwise f_i is an infrequently accessed spatial small file.

3. Form a subset of the spatial small-file data set F from the frequently accessed files obtained in 2.

Let the subset formed by all frequently accessed spatial small files be F_1, where N_1 is the total number of frequently accessed spatial small files, and the i_1-th and j_1-th frequently accessed spatial small files are denoted f_{i_1} and f_{j_1} respectively, with i_1, j_1 ∈ [1, N_1].

Similarly, let the set of infrequently accessed spatial small files be F_2, where N_2 is the total number of infrequently accessed spatial small files, and N_1 + N_2 = N.

(2) Extraction of the access sequence of the frequently accessed subset: extract the access sequence of the frequently accessed subset from the access log information.

The access log information records the coordinates of the spatial data, and different coordinates identify different data. The coordinate information of the accessed spatial small files can therefore be extracted from the access log information in order of access time. In a specific implementation, the extraction method can be determined by the record format of the access log information. The coordinate information is the spatial longitude and latitude of the spatial small file.
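Since the record format is left to the implementation, the following sketch assumes a purely hypothetical CSV log with the columns `time`, `type`, `lon` and `lat`, where `time` is an ISO-8601 timestamp. It shows one way to recover the chronological sequence of accessed coordinates for a single data type.

```python
import csv
import io

def read_access_sequence(log_text, data_type):
    """Return the (lon, lat) keys accessed for one data type, in time order.

    Assumes a hypothetical CSV layout with columns time,type,lon,lat.
    """
    rows = csv.DictReader(io.StringIO(log_text))
    records = [(r["time"], (float(r["lon"]), float(r["lat"])))
               for r in rows if r["type"] == data_type]
    records.sort(key=lambda rec: rec[0])  # ISO-8601 strings sort by time
    return [key for _, key in records]

log = (
    "time,type,lon,lat\n"
    "2015-01-02T00:00:00,SRTM90,10.0,20.0\n"
    "2015-01-01T00:00:00,SRTM90,30.0,40.0\n"
    "2015-01-01T12:00:00,SRTM30,50.0,60.0\n"
)
print(read_access_sequence(log, "SRTM90"))  # [(30.0, 40.0), (10.0, 20.0)]
```

Each distinct coordinate pair can then be mapped to a file index to obtain the access sequence vector used in the later steps.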

The access sequence subset is extracted according to the frequently accessed subset, as follows.

From the spatial small files in the access log information, taken in order of access time, select those that are frequently accessed to form the access sequence R_1 of the frequently accessed subset. The corresponding A_1 is called the access sequence vector of the frequently accessed spatial small files, with access index t_1 = 1_1, 2_1, ..., M_1, where M_1 is the total number of accesses to all frequently accessed spatial small files in F_1.

(3) Association degree calculation and association matrix construction: segment the access sequence of the frequently accessed subset, calculate the degree of association between each pair of frequently accessed spatial small files, and assemble these values into an association matrix. This comprises the following sub-steps.

1. From the number of storage servers and the size N_1 of the frequently accessed subset, calculate the segment length n of the frequent access sequence.

The number of storage servers m can be supplied externally, for example through a system configuration file.

The segment length of the frequent access sequence is calculated by the formula n = N_1 / m.

2. Segment the frequent access sequence according to the segment length.

Following the access order of the frequently accessed spatial small files, divide the access sequence vector A_1 into sub-vectors of n elements each, written A_1 = (S_1, S_2, ..., S_l), where each sub-vector S_k = (a_{k1}, a_{k2}, ..., a_{kn}), a_{kj} ∈ [1, N_1], 1 ≤ k ≤ l, 1 ≤ j ≤ n, is a sub-vector of length n in A_1. The set of all access vectors of length n in A_1 is denoted S, i.e. the set of all sub-vectors of A_1 is S = {S_k : k ∈ [1, l]}.

3. Calculate the degree of association between frequently accessed spatial small files.

First calculate the degree of association between spatial small files within each segment. For each segment S_k, define the function R_{S_k}(i_1, j_1) over the set of all elements of S_k.

The function R_{S_k}(i_1, j_1) indicates whether the frequently accessed spatial small files f_{i_1} and f_{j_1} are associated within an access period of length n.

On this basis, define the function:

R_{S} (i_{1}, j_{1}) = Σ_{k = 1}^{l} R_{S_{k}} (i_{1}, j_{1}), 1 \leq i_{1} \leq N_{1}, 1 \leq j_{1} \leq N_{1}    (2)

Then R_S(i_1, j_1) is the total degree of association between f_{i_1} and f_{j_1} over S.

4. Assemble the pairwise degrees of association of the frequently accessed spatial small files into an association matrix.

Expressing the pairwise degrees of association of all N_1 frequently accessed spatial small files as a matrix gives the following association matrix R_S.

R_{S} = {(R_{S} (i_{1}, j_{1}))}_{N_{1} \times N_{1}}, 1 \leq i_{1} \leq N_{1}, 1 \leq j_{1} \leq N_{1}    (3)

(4) Association matrix conversion and rearrangement: apply a magnitude conversion to each element of the association matrix, rearrange the matrix with the RCM ordering algorithm, and output the result. This comprises the following sub-steps.

1. Obtain the maximum element of the association matrix.

Traverse all element values of the association matrix and obtain the maximum value R_max.

2. Apply the magnitude conversion to the association matrix.

Traverse all element values of the association matrix and perform the operation R_S(i_1, j_1) = R_max - R_S(i_1, j_1), which converts the magnitudes of the matrix elements.

3. Rearrange the association matrix with the standard RCM ordering algorithm.

The association matrix is rearranged with the standard RCM ordering algorithm, the goal being to concentrate the non-zero elements of the association matrix near the diagonal. The rearranged matrix is denoted P. The standard RCM ordering algorithm is prior art; for a specific implementation see Gibbs N E, Poole W G, Stockmeyer P K. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis, 1976, 13(2): 236-250.

(5) search of optimal storage distributed combination exports.

Incidence matrix after resetting (4) gained utilizes partial approximation search procedure to find best of breed to obtain the highest concurrent access rate to these subset space small documents data.Partial approximation search procedure is prior art, can list of references XIA Kai during concrete enforcement, Wen-zhan.Adaptive Genetic Algorithm Based on Local Search Mechanism Quickly Solving TSP.Journal of Zhejiang Institute of Science and Technology, 2014,31 (3).

Starting from the reordered association matrix obtained in (4), the local approximation search method is applied iteratively. Each run of the search yields one optimal combination of n files, so that m combinations are finally obtained, to be stored on m storage servers respectively. Each combination consists of n files, whose pairwise association-degree values correspond to an n × n block of the matrix. The procedure is as follows:

1. Initialize the current iteration count d = 1.

2. Use the local approximation search method to find one optimal combination: find an n × n block in the current matrix such that the matrix element values in this block are maximal; the n corresponding files form one optimal combination.

The first time 2. is executed, the current matrix is the reordered association matrix obtained in (4), of size N₁ × N₁. On later executions of 2., the current matrix is the (N₁ - (d-1)n) × (N₁ - (d-1)n) matrix produced by the previous iteration.

3. After the search in 2. of the current iteration has produced an optimal combination of n files, delete the elements of those n files from the association matrix, giving an (N₁ - dn) × (N₁ - dn) matrix. Shrinking the matrix that remains to be searched saves search time.

4. Check whether d = m-1. If not, set d = d+1, take the (N₁ - dn) × (N₁ - dn) matrix obtained in 3. of the current iteration (which, after d = d+1, is the (N₁ - (d-1)n) × (N₁ - (d-1)n) matrix) as the current matrix, and return to 2. to search for the next optimal combination in the next iteration. If so, stop searching: the current matrix is n × n, so the last optimal combination of n files is obtained directly. Together with the m-1 optimal combinations found by the loop, this gives m optimal combinations in total.
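The iteration in 1. to 4. can be sketched as follows. The text does not spell out the local approximation search itself, so this sketch swaps in a much simpler stand-in and should be read only as an illustration of the loop structure: it assumes the candidate blocks are contiguous n × n blocks on the diagonal of the current matrix and scores each block by the sum of its off-diagonal values. The names `best_diagonal_block` and `search_combinations` are hypothetical.

```python
def best_diagonal_block(matrix, n):
    """Return the start index of the contiguous n x n diagonal block
    with the largest sum of off-diagonal values (stand-in for step 2)."""
    best_start, best_score = 0, None
    for start in range(len(matrix) - n + 1):
        score = sum(matrix[i][j]
                    for i in range(start, start + n)
                    for j in range(start, start + n)
                    if i != j)
        if best_score is None or score > best_score:
            best_start, best_score = start, score
    return best_start


def search_combinations(p, m):
    """Split the rows/columns of the reordered matrix p into m groups of n."""
    size = len(p)
    if m < 1 or size % m != 0:
        raise ValueError("N1 must be a positive multiple of m")
    n = size // m
    labels = list(range(size))          # row indices of p still available
    current = [row[:] for row in p]     # current matrix, shrinks each pass
    combinations = []
    for _d in range(1, m):              # d = 1 .. m-1
        start = best_diagonal_block(current, n)
        chosen = range(start, start + n)
        combinations.append([labels[i] for i in chosen])
        # Step 3: delete the chosen rows and columns.
        keep = [i for i in range(len(current)) if i not in chosen]
        current = [[current[a][b] for b in keep] for a in keep]
        labels = [labels[i] for i in keep]
    combinations.append(labels)         # the remaining n x n matrix
    return combinations


p = [[3, 3, 1, 0],
     [3, 3, 0, 2],
     [1, 0, 3, 3],
     [0, 2, 3, 3]]
groups = search_combinations(p, 2)
```

Each returned group lists row indices of P; mapping them back to file identifiers requires the permutation produced by the reordering step.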

(6) Distributed storage of space small file data: the frequently accessed space small file data are stored in a distributed manner using the optimal combinations finally obtained in (5), and the non-frequently accessed space small file data are stored separately according to their spatial adjacency.

In the embodiment, the frequently accessed space small file data are stored in a distributed manner according to the optimal combinations obtained, so as to obtain the highest parallel access rate for these data.

The space small file data of an optimal distributed storage combination obtained in step (5) have a low association degree with one another (that is, their corresponding element values in the reordered association matrix are large after the magnitude conversion). All space small file data of one optimal combination can therefore be stored on one server, so that files on the same server are rarely requested concurrently (which in turn gives a high parallel access rate across different servers).

In the embodiment, the non-frequently accessed space small file data are stored separately according to their spatial correlation, based on the coordinate information of the space small file data.

For F₂ from step (1), the non-frequently accessed space small file data are stored on the servers following the principle that data adjacent in position are stored on different servers.

Spatial data access is characterized by the continuity of the spatial access path, so adjacent spatial data have a higher probability of being accessed at the same time. Storing them on different servers therefore reduces concurrent access to any one server and raises the parallel access rate.
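One simple way to realize the principle that adjacent data go to different servers is shown below. This is an assumption-laden sketch, not the rule used by the invention: it supposes each non-frequently accessed file is a tile with integer grid coordinates (row, column), and the hypothetical helper `server_for_tile` assigns server (row + column) mod m, which places horizontally and vertically adjacent tiles on different servers whenever m ≥ 2.

```python
def server_for_tile(row, col, m):
    """Hypothetical placement rule: tiles sharing an edge get different servers.

    Assumes integer tile coordinates and m >= 2 storage servers.
    """
    if m < 2:
        raise ValueError("at least two servers are needed to separate neighbours")
    return (row + col) % m


def place_tiles(tiles, m):
    """Map each (row, col) tile in `tiles` to a server index in 0..m-1."""
    return {tile: server_for_tile(tile[0], tile[1], m) for tile in tiles}


placement = place_tiles([(0, 0), (0, 1), (1, 0), (1, 1)], 3)
```

Diagonal neighbours are not always separated by this rule (for m = 2 they share a server), so a real deployment would choose the rule to match the adjacency relation it cares about.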

In a concrete implementation, the discriminant parameter for frequently accessed space small file data, the parameters of the RCM sorting algorithm for the association matrix, and the number of storage servers can be supplied as external input or preset by those skilled in the art.

See Fig. 1. The present invention correspondingly also provides a space small file data distributed storage system based on access log information, comprising the following units,

A space small file data set preprocessing unit (100), which divides the space small file data set of any one space small file data type into a frequently accessed subset and a non-frequently accessed subset according to differences in access frequency. See Fig. 2; it comprises the following modules: a space small file data access frequency statistics module (101), which obtains the access popularity of each space small file datum, implemented as follows,

A preset discriminant parameter λ is input,

A space small file data access association matrix computing unit (300), which uses segments of the access sequence of the frequently accessed space small file data subset to compute the association degree of each frequently accessed space small file datum, and forms an association matrix from the pairwise association-degree values. See Fig. 3; it comprises the following modules: a frequent access sequence segment length computing module (301), which computes the frequent access sequence segment length n = N₁/m from the number of storage servers m and the length N₁ of the frequently accessed space small file data subset;

Define the function

Define the function R_S(i₁, j₁),

R_{S}(i_{1}, j_{1}) = \sum_{k=1}^{l} R_{S_{k}}(i_{1}, j_{1}), \quad 1 \leq i_{1} \leq N_{1}, \; 1 \leq j_{1} \leq N_{1}

where R_S(i₁, j₁) denotes the total association degree, over S, of the pair indexed by i₁ and j₁;

R_{S} = \left( R_{S}(i_{1}, j_{1}) \right)_{N_{1} \times N_{1}}, \quad 1 \leq i_{1} \leq N_{1}, \; 1 \leq j_{1} \leq N_{1}
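The summation that defines R_S can be illustrated with a short sketch. It assumes the l per-segment matrices R_{S_k} have already been computed and are supplied as equally sized lists of lists; how each R_{S_k} is obtained is not reproduced here, and `total_association` is a hypothetical name.

```python
def total_association(segment_matrices):
    """Element-wise sum of the per-segment matrices R_{S_k}, k = 1..l."""
    if not segment_matrices:
        raise ValueError("at least one segment matrix is required")
    size = len(segment_matrices[0])
    total = [[0] * size for _ in range(size)]
    for segment in segment_matrices:
        for i in range(size):
            for j in range(size):
                total[i][j] += segment[i][j]
    return total


# Two illustrative 2 x 2 segment matrices (made-up values).
r_s = total_association([[[0, 1], [1, 0]],
                         [[0, 2], [2, 0]]])
```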

See Fig. 4; the association matrix conversion and reordering unit (400) further comprises the following modules,

The association matrix optimal combination search unit (500) comprises the following modules,

Each module can be implemented in the same way as the corresponding method step, which is not repeated here.

The specific embodiment described herein merely illustrates the spirit of the present invention. Those skilled in the art can make various modifications or additions to the described embodiment, or substitute similar approaches, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Claims

1. A space small file data distributed storage method based on access log information, characterized in that, for any one space small file data type, the following steps are performed:

A preset discriminant parameter λ is input,

Step 2, extract the access sequence of the frequently accessed space small file data subset, including forming the access sequence from the access log information in chronological order as the access sequence vector of the frequently accessed space small file data, with access sequence numbers t₁ = (1₁, 2₁, ..., M₁), where M₁ is the total number of accesses to all frequently accessed space small file data in F₁;

Define the function

Define the function R_S(i₁, j₁),

R_{S}(i_{1}, j_{1}) = \sum_{k=1}^{l} R_{S_{k}}(i_{1}, j_{1}), \quad 1 \leq i_{1} \leq N_{1}, \; 1 \leq j_{1} \leq N_{1}

where R_S(i₁, j₁) denotes the total association degree, over S, of the pair indexed by i₁ and j₁;

R_{S} = \left( R_{S}(i_{1}, j_{1}) \right)_{N_{1} \times N_{1}}, \quad 1 \leq i_{1} \leq N_{1}, \; 1 \leq j_{1} \leq N_{1}

Step 5, use the local approximation search method to find the optimal combinations for the reordered association matrix;

2. The space small file data distributed storage method based on access log information according to claim 1, characterized in that step 4 comprises the following sub-steps,

Step 4.3, reorder the association matrix with the standard RCM sorting algorithm.

3. The space small file data distributed storage method based on access log information according to claim 1 or 2, characterized in that step 5 comprises the following sub-steps,

Step 5.1, initialize the current iteration count d = 1;

4. A space small file data distributed storage system based on access log information, characterized in that it comprises the following units,

A preset discriminant parameter λ is input,

A space small file data access association matrix computing unit (300), which uses segments of the access sequence of the frequently accessed space small file data subset to compute the association degree of each frequently accessed space small file datum, and forms an association matrix from the pairwise association-degree values; it comprises the following modules,

A frequent access sequence segment length computing module (301), which computes the frequent access sequence segment length n = N₁/m from the number of storage servers m and the length N₁ of the frequently accessed space small file data subset;

Define the function

Define the function R_S(i₁, j₁),

R_{S}(i_{1}, j_{1}) = \sum_{k=1}^{l} R_{S_{k}}(i_{1}, j_{1}), \quad 1 \leq i_{1} \leq N_{1}, \; 1 \leq j_{1} \leq N_{1}

where R_S(i₁, j₁) denotes the total association degree, over S, of the pair indexed by i₁ and j₁;

R_{S} = \left( R_{S}(i_{1}, j_{1}) \right)_{N_{1} \times N_{1}}, \quad 1 \leq i_{1} \leq N_{1}, \; 1 \leq j_{1} \leq N_{1}

5. The space small file data distributed storage system based on access log information according to claim 4, characterized in that the association matrix conversion and reordering unit (400) comprises the following modules,

6. The space small file data distributed storage system based on access log information according to claim 4 or 5, characterized in that the association matrix optimal combination search unit (500) comprises the following modules,