CN104809210A - Weighted top-k query method for massive data under a distributed computing framework

Info

Publication number: CN104809210A
Application number: CN201510209691.0A
Authority: CN
Inventors: 何洁月; 罗浩
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2015-04-28
Filing date: 2015-04-28
Publication date: 2015-07-29
Anticipated expiration: 2035-04-28
Also published as: CN104809210B

Abstract

The invention discloses a top-k query optimization method for massive data under the Spark distributed computing framework. The method partitions a massive data set in advance, using a grid-like data partitioning scheme that divides the original data set into different data subsets. At query time it selects a small number of suitable data subsets to query in place of the whole data set, according to the weights the user assigns to each attribute of a data object and the query value k. Experimental results show that the method achieves relatively high query speed and good scalability. Compared with the traditional top-k query method and with the data partitioning method based on angle and distance, it increases query speed, so that the requested information is returned to the user within a short time.

Description

A weighted top-k query method for massive data under a distributed computing framework

Technical field

The present invention relates to a data query method, and in particular to a weighted top-k query method for massive data sets.

Background technology

A top-k query, also called a rank-aware query, is one of the most basic operations in a database and an important tool for data analysis. In business analysis in particular, often only the most useful data need attention rather than the whole data set.

A top-k query is defined as follows. Let D = {T_1, T_2, ..., T_n} denote the set of all data objects, where T_i is the i-th data object; each data object has d dimensions and is a point in space. For a top-k query Q(f, k), f is the scoring function and k is the number of results to return that satisfy the query. f is a weighted-sum function: for a data object T(t_1, t_2, ..., t_d) in the sample, the user assigns a weight vector W(w_1, w_2, ..., w_d) to the attributes of the data object, and the score of each data object is the weighted sum of its attribute values. The scoring function is:

f_{W} (T) = Σ_{i = 1}^{d} w_{i} * t_{i}
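The scoring function and the sort-everything baseline can be sketched in a few lines. This is a minimal single-machine illustration, not the Spark implementation described in the patent; the names `score` and `basic_top_k` and the sample data are assumptions introduced here.

```python
import heapq
from typing import Sequence

def score(weights: Sequence[float], obj: Sequence[float]) -> float:
    """Weighted-sum score f_W(T) = sum_i w_i * t_i."""
    if len(weights) != len(obj):
        raise ValueError("weights and object must have the same dimension d")
    return sum(w * t for w, t in zip(weights, obj))

def basic_top_k(data, weights, k):
    """Baseline: score every object and keep the k highest-scoring ones."""
    return heapq.nlargest(k, data, key=lambda obj: score(weights, obj))

# Hypothetical sample: four 2-dimensional objects, attribute 1 weighted more heavily.
data = [(1.0, 9.0), (5.0, 5.0), (8.0, 1.0), (2.0, 2.0)]
top2 = basic_top_k(data, (0.7, 0.3), 2)
```

The baseline touches every object on every query, which is the cost the partitioning scheme is meant to avoid.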

Because a top-k query only has to produce a result set of k elements, it suffices to sort a data set much smaller than the input data set, without processing all of the data. With the explosive growth of data scale in recent years, massive data pose great challenges for storage, management, and analysis. As a basic operation in data analysis, a top-k query must return its results quickly. For example, among the massive number of products on Taobao, a user assigns different weights to product attributes according to personal preference, and the system must quickly return the top k products that best satisfy the user's request.

Top-k queries over massive data, however, face two major challenges: first, the data scale reaches the TB or PB level, so traditional centralized data processing methods no longer apply; second, query results must be obtained both quickly and accurately for massive data sets.

Top-k queries in traditional centralized data systems hit a performance bottleneck on massive data and are therefore unsuitable for processing massive data sets. In traditional distributed environments, some studies improve query efficiency by caching query results, which does not fundamentally solve the top-k query problem for massive data. Others process the data with skyline queries and propose DiTo, a complete top-k processing framework, but again only for traditional distributed environments.

In recent years, the most basic solution to the top-k problem in cloud environments has been to sort all the data and return the first k results. This approach processes the original data set on every query, which causes redundant work and long query times, so it is not advisable. RanKloud computes, by statistical analysis at run time under the MapReduce framework, a threshold at which the query can terminate early; this method cannot guarantee k accurate results. Other work uses a caching mechanism that compares the similarity between a new query and cached queries and skips re-querying when the similarity is high; this speeds up queries, but the results are inexact. Top-k queries based on angle-and-distance data partitioning have also been proposed, but the coordinate transformation required by angle-based partitioning is complex and time-consuming, so that scheme is also unsuitable for massive data sets.

Summary of the invention

Purpose of the invention: to overcome the deficiencies of the prior art, the invention provides a weighted top-k query method for massive data based on a distributed computing framework, solving the technical problem that existing top-k queries cannot obtain results quickly, accurately, and conveniently when processing massive data.

Technical solution: to achieve the above purpose, the invention adopts the following technical solution:

First, the following 4 reasonable assumptions are made:

(1) Every attribute value of every data object is non-negative; negative values can be made non-negative by normalizing the data.

(2) The data set is relatively fixed, or the rate at which the data are updated is negligible relative to the whole data set over a given period. For example, although the product data of Jingdong (JD.com) are updated constantly, the product base is so large that the change within a given time period can be considered small. The method of the invention therefore does not apply to stream data processing.

(3) The data are uniformly distributed in space; for massive data sets this assumption holds in many scenarios.

(4) An input weight vector W satisfies Σ_{j = 1}^{d} w_{j} = 1; if it does not, it can be made to satisfy this by normalization.

On the basis of the above 4 assumptions, a weighted top-k query method for massive data under a distributed computing framework is proposed, comprising the following steps performed in order:

Step 1: establish the data space

First convert the attribute values of all data objects, each of which has d attributes, to non-negative values and normalize them. Establish a d-dimensional coordinate system whose axes correspond one-to-one to the attributes of the data objects, and place all data objects in the coordinate system to form the data space;
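Step 1 does not say which normalization is used. The sketch below assumes min-max scaling per attribute, which maps every value into [0, 1] and so also removes negative values; the function name `normalize` is hypothetical.

```python
def normalize(data):
    """Min-max scale each attribute to [0, 1] (assumed normalization, not from the source)."""
    d = len(data[0])
    lows = [min(row[j] for row in data) for j in range(d)]
    highs = [max(row[j] for row in data) for j in range(d)]
    scaled = []
    for row in data:
        scaled.append(tuple(
            # A constant attribute carries no ranking information; map it to 0.0.
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(d)
        ))
    return scaled

# Hypothetical sample with a negative value in attribute 1.
points = normalize([(-2.0, 10.0), (0.0, 20.0), (2.0, 30.0)])
```

Note that rescaling attributes changes the relative influence of the weights, so the same scaling has to be applied consistently at preprocessing and at query time.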

Step 2: data partitioning

Starting from the origin of the coordinate system, divide the whole coordinate system into m regions from the inside out. The value of m must not be too large, or the amount of computation increases; at current data scales m is generally taken as 3 to 5, a range that balances data scale against computation. As data scales grow further, m can be increased appropriately to reduce the total amount of data in each region. Number the regions 1, 2, ..., m from the outside in. The boundary of region 1 together with the coordinate axes encloses all data objects. Within any one region the maximum value of every attribute is the same, and every point on the outer boundary of a region has at least one coordinate equal to that maximum. Given that the maximum attribute value of region 1 is a_1, the maximum attribute value a_i of the i-th region, i = 1, 2, ..., m, is determined from a_1. With the regions divided in this way, assumption (3) implies that every region holds the same amount of data.
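The text states that, under the uniformity assumption, every region holds the same amount of data. One set of boundaries consistent with that statement, assumed here rather than quoted from the source, is a_i = a_1 * ((m - i + 1) / m)^(1/d), which gives every region the same volume. The sketch below computes these boundaries and assigns an object to its region; `region_bounds` and `region_of` are hypothetical names.

```python
def region_bounds(a1, m, d):
    """Return [a_1, ..., a_m] (decreasing), assuming equal-volume regions."""
    return [a1 * ((m - i + 1) / m) ** (1.0 / d) for i in range(1, m + 1)]

def region_of(obj, bounds):
    """Region number (1 = outermost) of an object, decided by its largest attribute.

    Region i holds the objects with a_{i+1} < max(obj) <= a_i; region m holds
    everything with max(obj) <= a_m.
    """
    peak = max(obj)
    region = 1
    for i, a_i in enumerate(bounds, start=1):
        if peak <= a_i:
            region = i  # bounds decrease, so the last match is the innermost fit
    return region

bounds = region_bounds(1.0, 3, 2)   # m = 3 regions in d = 2 dimensions
```

Because the region is decided by a single `max`, labelling a record during preprocessing costs only one pass over its attributes.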

For every region except the outermost, take as base point the point of that region at which the attribute on every axis is at its maximum. Mark off the part of the whole coordinate system in which every attribute value is greater than or equal to the corresponding attribute value of the base point. Number these parts 1, 2, ..., m-1 from the outside in; these newly marked-off parts serve as judgment zones.

According to the skyline principle, for any two points 1 and 2, if every attribute value of point 1 is less than that of point 2, then point 1 dominates point 2. On this basis, given two data objects T_1 and T_2 such that every attribute value of T_1 is greater than or equal to the corresponding attribute value of T_2, that is, T_1.t_i ≥ T_2.t_i, where t_i denotes the value of the i-th attribute, then for any input weight vector W(w_1, w_2, ..., w_d) the score of T_1 is at least the score of T_2, that is, f_W(T_1) ≥ f_W(T_2).
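The monotonicity claim follows directly from non-negative weights: each term w_i * t_i can only grow when t_i grows. A small check, with hypothetical helper names and sample values, illustrates it.

```python
def score(weights, obj):
    """Weighted-sum score f_W(T)."""
    return sum(w * t for w, t in zip(weights, obj))

def at_least_everywhere(t1, t2):
    """True if every attribute of t1 is >= the corresponding attribute of t2."""
    return all(x >= y for x, y in zip(t1, t2))

# Hypothetical objects: t1 is at least t2 on every attribute.
t1, t2 = (5.0, 7.0, 3.0), (4.0, 7.0, 1.0)
weight_sets = [(0.2, 0.3, 0.5), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
holds = all(score(w, t1) >= score(w, t2) for w in weight_sets)
```

The property depends on assumption (1) and on non-negative weights; a negative weight would reverse the inequality for that attribute.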

It follows that the score of any data object in a judgment zone is necessarily greater than the score of any data object in the regions lying inside the region to which that judgment zone belongs. The algorithm returns the k highest-scoring data objects, where k is the number of data objects in the result set. So once the number of data objects in some judgment zone is greater than or equal to k, the k results are necessarily found outside the regions that lie inside the region to which that judgment zone belongs. The judgment zones are therefore tested as follows:

In ascending order of number, test in turn whether N_i ≥ k holds, where N_i is the number of data objects in judgment zone i and k is the number of data objects in the result set returned by the algorithm. As soon as judgment zone i satisfies the inequality, stop testing and take regions 1 through i, counted from the outside in, as the search region.
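The judgment-zone test can be sketched as follows. Judgment zone i is taken to contain the objects whose every attribute is at least a_{i+1}, as described above. The function names are hypothetical, and the fallback of searching all m regions when no zone reaches k objects is an assumption, since the source does not describe that case.

```python
def judgment_counts(data, bounds):
    """N_1 .. N_{m-1}: number of objects with every attribute >= a_{i+1}.

    bounds = [a_1, ..., a_m] in decreasing order, so bounds[i] is a_{i+1}.
    """
    return [sum(1 for obj in data if min(obj) >= bounds[i])
            for i in range(1, len(bounds))]

def regions_to_search(data, bounds, k):
    """Region numbers (1 = outermost) that must be searched for a top-k query."""
    for i, n_i in enumerate(judgment_counts(data, bounds), start=1):
        if n_i >= k:                           # first zone with at least k objects
            return list(range(1, i + 1))
    return list(range(1, len(bounds) + 1))     # assumed fallback: search everything

# Hypothetical boundaries and data in 2 dimensions.
bounds = [1.0, 0.8, 0.5]
data = [(0.9, 0.85), (0.95, 0.9), (0.6, 0.7), (0.55, 0.9), (0.2, 0.1)]
counts = judgment_counts(data, bounds)
```

In a real deployment the counts N_i would be computed once during preprocessing and stored, so a query only compares k against m-1 integers.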

Further, in the present invention, the innermost region of the search region is subdivided. Let this region be numbered i; the subdivision method is as follows:

Divide this region into d+1 blocks, of which d blocks are candidate search areas and the remaining part, outside the candidate search areas, is the mandatory search area;

Number the candidate search areas n = 1, 2, ..., d. Any data point T_nj(t_n1, t_n2, ..., t_nd) in the n-th candidate search area, where t_nj denotes the attribute value of T_nj on axis j, satisfies the following 2 inequalities:

0 ≤ t_{nj} ≤ 2a_{i+1} - a_{i}, where 1 ≤ j ≤ d and j ≠ n (1)

a_{i} - a_{i+1} ≤ t_{nj} ≤ a_{i}, where j = n (2)

In the n-th candidate search area, the data point T_nj(t_n1, t_n2, ..., t_nd) whose attribute value on one axis is a_i and whose attribute values on all other axes are 2a_{i+1} - a_i is taken as the maximum boundary point of the n-th candidate search area;

For the maximum boundary point of each candidate search area, check whether any attribute weight w_j assigned by the user satisfies

\frac{w_{j}}{Σ_{j = 1}^{d} w_{j}} > 0.5:

If there is an attribute weight w_j satisfying the above condition for some maximum boundary point, the search range is narrowed to the outer i-1 regions, the mandatory search area of the i-th region, and the candidate search area to which that maximum boundary point belongs;

If no attribute weight w_j satisfies the above condition for any maximum boundary point, the search range is narrowed to the outer i-1 regions and the mandatory search area of the i-th region.
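The weight test needs only the user's weights, not the data. A minimal sketch follows; the function name `candidate_area_to_search` is hypothetical, and the weight vectors are the ones used in the experiments later in the document.

```python
def candidate_area_to_search(weights):
    """Return the 1-based index n of the candidate search area to include, or None.

    Area n is included when w_n / sum(w) > 0.5. At most one weight can hold
    more than half of the total, so the first match is the only match.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    for n, w in enumerate(weights, start=1):
        if w / total > 0.5:
            return n
    return None

biased_last = candidate_area_to_search((0.06, 0.06, 0.07, 0.8))
biased_first = candidate_area_to_search((0.56, 0.14, 0.25, 0.03))
uniform = candidate_area_to_search((0.25, 0.25, 0.25, 0.25))
```

Dividing by the sum makes the test independent of whether the caller has already normalized the weights.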

Provided that judgment zone i satisfies N_i ≥ k, the innermost region of the search region is divided by the subdivision method into a mandatory search area and candidate search areas, and the judgment rule then selects which candidate search areas, if any, to search, narrowing the search range further. By the argument above, the score at the maximum boundary point of each candidate search area is greater than or equal to the score of every other data point in that candidate search area. Therefore, if the score at the maximum boundary point of a candidate search area is less than the score at the base point of judgment zone i, that candidate search area need not be searched; otherwise it must be searched.

For illustration, take one candidate search area whose maximum boundary point has coordinates T(a_i, 2a_{i+1} - a_i, 2a_{i+1} - a_i, ..., 2a_{i+1} - a_i); the base point of the corresponding judgment zone has coordinates T(a_{i+1}, a_{i+1}, a_{i+1}, ..., a_{i+1}). Substituting these coordinates into the scoring function, if (a_i, 2a_{i+1} - a_i, 2a_{i+1} - a_i, ..., 2a_{i+1} - a_i) * W > (a_{i+1}, a_{i+1}, ..., a_{i+1}) * W, where W = (w_1, w_2, ..., w_d), the inequality can be rearranged into

2 (a_{i} - a_{i + 1}) w_{1} > (a_{i} - a_{i + 1}) * Σ_{j = 1}^{d} w_{j},

which gives

\frac{w_{1}}{Σ_{j = 1}^{d} w_{j}} > 0.5,

In that case the candidate search area corresponding to this maximum boundary point must be searched. Judging every candidate search area in the same way yields the unified condition \frac{w_{j}}{Σ_{j = 1}^{d} w_{j}} > 0.5. It further follows that if a candidate search area satisfies the condition, the attribute weight w_j that makes the inequality hold is necessarily the weight of the attribute whose value at that area's maximum boundary point is a_i. During the traversal it therefore suffices, for each candidate search area, to substitute into the condition the weight of the attribute whose value at the maximum boundary point is a_i; if the condition holds, that area must be searched. Once one attribute weight satisfying the condition has been found, the remaining candidate search areas need not be checked: because the attribute weights are normalized, two or more attribute weights cannot satisfy the inequality at the same time. When selecting a candidate search area, one can therefore check directly from the user's input whether the largest attribute weight satisfies the condition and, if it does, pick the candidate search area whose maximum boundary point has a_i as its j-th attribute value.
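The rearrangement can be checked numerically by comparing the score of a maximum boundary point with the score of the base point directly and confirming that the outcome matches the weight-share test. The helper names and the values a_i = 1.0 and a_{i+1} = 0.8 are assumptions chosen for illustration.

```python
def score(weights, obj):
    """Weighted-sum score f_W(T)."""
    return sum(w * t for w, t in zip(weights, obj))

def max_boundary_point(a_i, a_next, d, n):
    """Point with a_i on axis n (1-based) and 2*a_{i+1} - a_i on every other axis."""
    return tuple(a_i if j == n else 2 * a_next - a_i for j in range(1, d + 1))

def needs_search(weights, a_i, a_next, n):
    """Direct test: does the maximum boundary point outscore the base point?"""
    d = len(weights)
    corner = max_boundary_point(a_i, a_next, d, n)
    base = (a_next,) * d
    return score(weights, corner) > score(weights, base)

weights = (0.7, 0.2, 0.1)
direct = [needs_search(weights, 1.0, 0.8, n) for n in (1, 2, 3)]
shortcut = [w / sum(weights) > 0.5 for w in weights]
```

The direct comparison costs O(d) per candidate area, while the shortcut needs only one division per weight and no coordinates at all.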

Beneficial effect:

The weighted top-k query method for massive data under a distributed computing framework provided by the invention introduces a new grid-like data partitioning scheme. It determines a preliminary search region by comparing the amount of data in each judgment zone with the size k of the result set, which greatly reduces the search range; it then further narrows the innermost region of the search region, making the final search region smaller and improving query efficiency and speed.

According to statistics, when a data space holding 1,000,000,000 data objects is divided into m = 3 regions, the amount of data in the outermost judgment zone varies with the dimension d, that is, the number of attributes of each data object, as shown in the following table:

Table 1

As the table shows, when the dimension d is less than 8, the outermost judgment zone still contains 18 data objects. In practice the required result set is usually small, for example 10 results, so at most the data in the outermost region need to be queried. The search range is thus reduced by at least 2/3 and a large amount of irrelevant data is filtered out. The method of the invention therefore significantly improves top-k query performance on massive data and speeds up top-k queries on massive, higher-dimensional data sets.

Description of the drawings

Fig. 1 is a schematic diagram of the data partitioning method of the invention;

Fig. 2 is a schematic diagram of the data subdivision method of the invention;

Fig. 3 compares the query time of three data partitioning approaches as the data dimension varies, where DistImprove is the method of the invention, AngleDistTop_k is the data partitioning method based on angle and distance, and BasicTop_k queries the original data set without partitioning;

Fig. 4 compares query times for different user input weights in the 4-dimensional case;

Fig. 5 shows the speedup of the method of the invention with different numbers of cluster nodes;

Fig. 6 is a flow chart of the top-k query process in a specific implementation of the invention.

Embodiment

The invention is further described below with reference to the accompanying drawings.

The experiments were run on a 7-node Spark cluster. Spark is built on Hadoop and uses Hadoop's YARN resource manager and HDFS file storage system. Of the 7 nodes, the master node serves both as the Driver node and as a worker node, and the other 6 nodes are worker nodes. The basic configuration of the experimental environment is given in Table 2:

Table 2

Uniformly distributed data sets are used. Each record has 8 attributes, each an integer in the range [0, 1000]; 1,000,000,000 records are generated in total, about 40 GB of data. 4-dimensional and 6-dimensional data sets are generated in the same way as the 8-dimensional one: all are randomly generated and all contain 1,000,000,000 records.
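A small-scale stand-in for such a data set can be generated as follows. This is only a sketch for local experimentation: the function name, the seed, and the sample size of 1,000 records are assumptions, and the source does not describe the generator that was actually used.

```python
import random

def generate_records(n, d, low=0, high=1000, seed=0):
    """Generate n records of d integer attributes drawn uniformly from [low, high]."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    # randint includes both endpoints, matching the closed range [0, 1000].
    return [tuple(rng.randint(low, high) for _ in range(d)) for _ in range(n)]

sample = generate_records(1000, 8)
```

At the full scale of 1,000,000,000 records the data would be generated in parallel and written to distributed storage rather than held in a Python list.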

Unless otherwise stated, the experiments below use equal weights and k = 100, and each reported query time is the average of 10 runs. Because data preprocessing is done only once and does not recur with each query, the query times below do not include preprocessing time. For the 8-dimensional data, preprocessing with the method of the invention takes about 42 minutes.

As shown in Fig. 6, this embodiment consists of two main steps:

Step 1: data preprocessing. The original data set is partitioned into different Blocks according to the data partitioning method proposed here; each Block is labeled and then stored on HDFS. In Spark, HDFS storage consists mainly of the disks of the worker nodes. Queries follow the data partitioning scheme and query judgment rule set out in the claims. The partitioning proceeds as follows:

First step: overall partitioning from the inside out

Following the equal-area principle, the whole data space is divided from the inside out into m = 3 large regions. Fig. 1 illustrates the partitioning in two dimensions: the horizontal x-axis corresponds to attribute 1 and the vertical y-axis to attribute 2, and the solid lines in the figure divide the space into 3 equal parts, numbered 1, 2, 3 from the outside in. For every region except the outermost, take as base point the point of that region at which the attribute on every axis is at its maximum, and mark off the part of the whole coordinate system in which every attribute value is greater than or equal to the corresponding attribute value of the base point; label these parts A and B from the outside in. Test in turn whether n_a ≥ k or n_b ≥ k holds. If the amount of data n_a in square A satisfies n_a ≥ k, only the data in region 1 need to be queried; otherwise check whether the amount of data n_b in square B satisfies n_b ≥ k. For massive data, the amount of data in square A is usually much larger than k, so querying the data in region 1 alone yields the top-k result. To reduce the amount of data queried further, each large region is subdivided.

Second step: subdivision of each large region

Subregions 1 and 2 are each divided further. Large region 1, for example, can be divided as shown in Fig. 2 into three parts, (ABC), D, and E, where A, B, and C have equal areas; A, B, and C form the mandatory search area, and D and E are the candidate search areas.

The following also holds: the score f_W(T_1) of a data object T_1 is proportional to the length of the projection of the spatial point T_1 onto the line through the origin in the direction of the weight vector W, so the projection length can serve as a measure of the score.
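The proportionality is the usual relation between a dot product and a projection: the projection length of T onto the direction of W equals (W · T) / |W|, so score and projection length differ only by the constant factor |W|. A short check, with hypothetical names and sample points:

```python
import math

def score(weights, obj):
    """Weighted-sum score f_W(T), i.e. the dot product W . T."""
    return sum(w * t for w, t in zip(weights, obj))

def projection_length(weights, obj):
    """Length of the projection of obj onto the line through the origin along W."""
    norm = math.sqrt(sum(w * w for w in weights))
    return score(weights, obj) / norm

weights = (0.3, 0.7)
points = [(1.0, 9.0), (5.0, 5.0), (8.0, 1.0)]
ratios = [score(weights, p) / projection_length(weights, p) for p in points]
```

Since the factor is the same for every point, ranking by projection length and ranking by score give the same order.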

Therefore, assuming square A contains at least k data objects, whether D and E need to be searched can be decided as follows:

If the user's input weights satisfy w_1 = w_2 = 0.5, compare in the figure the distance from the origin to the projection onto the line of the maximum boundary point d of region D with the distance from the origin to the projection of the base point a of square A: the two are equal. Likewise, the distance from the origin to the projection of the maximum boundary point e of region E equals the distance from the origin to the projection of the base point a of square A. Regions D and E therefore need not be queried; querying regions A, B, and C alone yields the top-k result, and by the same reasoning the top-k results found in regions A, B, and C necessarily lie in the part to the upper right of the dashed line in Fig. 2;

By the same principle, if w_1 > w_2, region D need not be queried; if w_1 < w_2, region E need not be queried. The individual proofs are omitted here;

To sum up, following formula can be obtained:

\{\begin{matrix} search A, B, C, E & if \frac{w_{1}}{w_{1} + w_{2}} > 0.5 \\ search A, B, C, D & if \frac{w_{2}}{w_{1} + w_{2}} > 0.5 \\ search A, B, C & else \end{matrix}

Generalizing to a d-dimensional data space, let S_i, 1 ≤ i ≤ 3, denote one of the 3 large regions divided from the inside out; let S_ij, 1 ≤ j ≤ d, denote the j-th subregion of large region S_i, analogous to D or E; and let S_{i (d+1)} denote the subregion of S_i analogous to A, B, C. When the weight of the j-th attribute of the data object satisfies \frac{w_{j}}{Σ_{j = 1}^{d} w_{j}} > 0.5, only 2 subregions of large region S_i need to be queried, namely S_{i (d+1)} and S_ij; otherwise only subregion S_{i (d+1)} of S_i needs to be queried.

Step 2: query processing. For a user query f(W, k), part of the data set is selected on the Driver node according to the query input and then queried. In this embodiment k = 100 and all attribute weights are equal, so only the data region S_{1 (d+1)} needs to be queried.
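The effect of region-level pruning can be illustrated on a single machine by comparing a pruned query with a full scan. This sketch covers only the judgment-zone step, not the subdivision into mandatory and candidate search areas, and it does not use Spark. The function name `pruned_top_k`, the equal-volume boundaries, the full-scan fallback, and the random test data are all assumptions.

```python
import heapq
import random

def score(weights, obj):
    """Weighted-sum score f_W(T)."""
    return sum(w * t for w, t in zip(weights, obj))

def pruned_top_k(data, weights, k, bounds):
    """Top-k that skips inner regions once a judgment zone holds at least k objects.

    bounds = [a_1, ..., a_m] in decreasing order (region 1 is outermost).
    Returns (results, number of objects actually scored).
    """
    cutoff = None
    for a_next in bounds[1:]:                       # a_2, ..., a_m
        if sum(1 for t in data if min(t) >= a_next) >= k:
            cutoff = a_next                         # N_i >= k: stop testing
            break
    if cutoff is None:
        subset = data                               # assumed fallback: full scan
    else:
        # Keep the outer regions: objects whose largest attribute reaches a_{i+1}.
        subset = [t for t in data if max(t) >= cutoff]
    top = heapq.nlargest(k, subset, key=lambda t: score(weights, t))
    return top, len(subset)

rng = random.Random(42)
d, m, k = 3, 3, 5
data = [tuple(rng.random() for _ in range(d)) for _ in range(2000)]
bounds = [((m - i + 1) / m) ** (1.0 / d) for i in range(1, m + 1)]  # assumed, a_1 = 1
weights = (0.2, 0.3, 0.5)

pruned, scanned = pruned_top_k(data, weights, k, bounds)
baseline = heapq.nlargest(k, data, key=lambda t: score(weights, t))
```

Here the zone counts are recomputed on every call for brevity; storing them at preprocessing time, as the embodiment does with labeled Blocks, removes that pass from the query path.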

Fig. 3 compares the query time of three data partitioning approaches as the data dimension varies, where DistImprove is the method of the invention, AngleDistTop_k is the data partitioning method based on angle and distance, and BasicTop_k queries the original data set without partitioning. The experiment shows that the method of the invention is faster than the method based on angle and distance, improving query speed by about 15%, and that its query time grows steadily with the dimension without large fluctuations.

Because the user's weight input W = (w_1, w_2, ..., w_d) affects the size of the region that must be queried, Fig. 4 shows the query time for different weight values in 4 dimensions. The first class is strongly biased toward one attribute weight and comprises W_2 = (0.06, 0.06, 0.07, 0.8) and W_4 = (0.56, 0.14, 0.25, 0.03); the second class, W_1 = (0.25, 0.25, 0.25, 0.25), has equal weights; the third class, W_3 = (0.16, 0.32, 0.34, 0.18), has no pronounced bias. In the figure the query times of W_1 and W_3 are roughly equal, as are those of W_2 and W_4, and the query times of W_1 and W_3 are shorter than those of W_2 and W_4. The main reason is that on low-dimensional data sets different weights require different data blocks to be queried: W_2 and W_4 are strongly biased toward a single attribute weight, so additional data blocks must be queried, which makes their query times longer than those of W_1 and W_3.

Fig. 5 shows the scalability of the method of the invention as the speedup on the 8-dimensional data set with different numbers of nodes. The speedup is close to ideal: as the number of processors and workers doubles, execution speed also roughly doubles, so the data partitioning method of the invention scales well.

The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art can make improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A weighted top-k query method for massive data under a distributed computing framework, characterized by comprising the following steps performed in order:

Step 1: establish the data space

Step 2: data partitioning

Starting from the origin of the coordinate system, divide the whole coordinate system into m regions from the inside out, and number the regions 1, 2, ..., m from the outside in. The boundary of region 1 together with the coordinate axes encloses all data objects. Within any one region the maximum value of every attribute is the same, and every point on the outer boundary of a region has at least one coordinate equal to that maximum. Given that the maximum attribute value of region 1 is a_1, the maximum attribute value a_i of the i-th region, i = 1, 2, ..., m, is determined from a_1;

For every region except the outermost, take as base point the point of that region at which the attribute on every axis is at its maximum. Mark off the part of the whole coordinate system in which every attribute value is greater than or equal to the corresponding attribute value of the base point. Number these parts 1, 2, ..., m-1 from the outside in; these newly marked-off parts serve as judgment zones, which are tested as follows:

2. The weighted top-k query method for massive data under a distributed computing framework according to claim 1, characterized in that the innermost region of the search region is subdivided; this region is numbered i, and the subdivision method is as follows:

Number the candidate search areas n = 1, 2, ..., d. Any data point T_nj(t_n1, t_n2, ..., t_nd) in the n-th candidate search area, where t_nj denotes the attribute value of T_nj on axis j, satisfies the following 2 inequalities:

0 ≤ t_{nj} ≤ 2a_{i+1} - a_{i}, where 1 ≤ j ≤ d and j ≠ n (1)

a_{i} - a_{i+1} ≤ t_{nj} ≤ a_{i}, where j = n (2)

For the maximum boundary point of each candidate search area, check whether any attribute weight w_j assigned by the user satisfies

\frac{w_{j}}{Σ_{j = 1}^{d} w_{j}} > 0.5;