CN104809210A - Top-k query method based on massive data weighing under distributed computing framework - Google Patents


Info

Publication number
CN104809210A
Authority
CN
China
Prior art keywords
region
data
attribute
point
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510209691.0A
Other languages
Chinese (zh)
Other versions
CN104809210B (en)
Inventor
何洁月
罗浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201510209691.0A priority Critical patent/CN104809210B/en
Publication of CN104809210A publication Critical patent/CN104809210A/en
Application granted granted Critical
Publication of CN104809210B publication Critical patent/CN104809210B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24578Query processing with adaptation to user needs using ranking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a top-k query optimization method for massive data under the Spark distributed computing framework. The method partitions a massive data set in advance, mainly by a grid-like data partitioning scheme that divides the original data set into different data subsets. For each query, a small number of suitable data subsets is then selected to stand in for the whole data set, according to the weight the user assigns to each attribute of a data object and the query value k. Experimental results show that the method has a relatively high query speed and good scalability. Compared with the traditional top-k query method and with a data partitioning method based on angle and distance, the disclosed method increases query speed, so the requested information can be returned to the user within a short time.

Description

A weighted top-k query method for massive data under a distributed computing framework
Technical field
The present invention relates to a data query method, and in particular to a weighted top-k query method for massive data sets.
Background
The top-k query, also called the rank-aware query, is one of the most basic operations in a database and an important tool for data analysis. In business analysis in particular, one often needs to look only at the most useful data rather than at the whole data set.
The top-k query is defined as follows. Let $D = \{T_1, T_2, \ldots, T_n\}$ denote the set of all data objects, where $T_i$ is the i-th data object; each data object has d dimensions and is a point in space. For a top-k query $Q(f, k)$, f is the scoring function and k is the number of results to return. Here f is a weighted-sum function: for a data object $T(t_1, t_2, \ldots, t_d)$, the user assigns a weight vector $W(w_1, w_2, \ldots, w_d)$ to the attributes of the object, and the score of the object is the weighted sum of its attribute values, that is:
$$f_W(T) = \sum_{i=1}^{d} w_i \, t_i$$
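The scoring function and the baseline "score everything, keep the best k" query can be sketched as follows. This is a minimal single-machine illustration, not the patented method; the function names `score` and `naive_top_k` and the sample data are assumptions introduced here.

```python
import heapq
from typing import List, Sequence


def score(weights: Sequence[float], obj: Sequence[float]) -> float:
    """Weighted-sum score f_W(T) = sum_i w_i * t_i."""
    if len(weights) != len(obj):
        raise ValueError("weights and object must have the same dimension d")
    return sum(w * t for w, t in zip(weights, obj))


def naive_top_k(data: Sequence[Sequence[float]],
                weights: Sequence[float], k: int) -> List[Sequence[float]]:
    """Baseline: score every object and keep the k highest-scoring ones."""
    return heapq.nlargest(k, data, key=lambda obj: score(weights, obj))


# Illustrative two-attribute data set (values are made up).
data = [(3.0, 1.0), (1.0, 4.0), (2.0, 1.0), (0.5, 0.5)]
top2 = naive_top_k(data, (0.5, 0.5), 2)
```

The baseline touches every object on every query, which is the cost the partitioning scheme in this document tries to avoid.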
Because a top-k query ultimately returns a result set of only k elements, it should suffice to sort a set of data much smaller than the input data set, without processing all the data. In recent years the explosive growth of data has made storing, managing and analyzing massive data a great challenge. As a basic operation in data analysis, the top-k query must return results quickly. For example, among the massive number of products on Taobao, a user assigns different weights to product attributes according to personal preference, and the system quickly returns the top k products that best meet the user's needs.
Top-k queries over massive data face two major challenges: first, the data reaches the TB or PB scale, so traditional centralized data processing no longer applies; second, query results must be obtained both quickly and accurately for massive data sets.
Top-k queries in traditional centralized data systems hit a performance bottleneck on massive data and are therefore unsuitable for massive data sets. In traditional distributed environments, some studies improve query efficiency by caching query results, but this does not fundamentally solve the massive-data top-k query problem. Others use Skyline queries for data processing and propose DiTo, a complete top-k processing framework, but only for traditional distributed environments.
In recent years, the most basic solution to the top-k problem in cloud environments has been to sort all the data and return the first k results. Every query then has to process the raw data set, which causes redundant work and long query times, so this approach is inadvisable. RanKloud uses statistical analysis at run time to compute a threshold for early query termination under the MapReduce framework, but this method cannot guarantee k accurate results. Other work uses a caching mechanism to compare a new query with cached queries and skips the query when the similarity is high; this speeds up the query but makes the results inexact. A top-k query based on angle-and-distance data partitioning has also been proposed, but with angle-based partitioning the coordinate transformation of the data is complex and time-consuming, so it too is unsuitable for massive data sets.
Summary of the invention
Purpose of the invention: To overcome the deficiencies of the prior art, the invention provides a weighted top-k query method for massive data based on a distributed computing framework, solving the technical problem that existing top-k queries cannot obtain results quickly, accurately and conveniently when processing massive data.
Technical solution: To achieve the above purpose, the invention adopts the following technical solution.
First, the following 4 reasonable assumptions are made:
(1) Every attribute value of every data object is non-negative; negative values can be made non-negative by normalizing the data.
(2) The data set is relatively fixed, or updates are negligible relative to the whole data set within a given period. For example, although the product data of JD.com is updated constantly, given the huge number of products the change within a given period can be considered small. The method is therefore not applicable to stream data processing.
(3) The data is uniformly distributed in space; for massive data sets this assumption holds in many scenarios.
(4) An input weight vector W satisfies $\sum_{i=1}^{d} w_i = 1$; if it does not, this can be achieved by normalization.
On the basis of these 4 assumptions, a weighted top-k query method for massive data under a distributed computing framework is proposed, comprising the following steps performed in order:
Step 1: establish the data space
First, the attribute values of all data objects, each having d attributes, are converted to non-negative values and normalized. A d-dimensional coordinate system is established whose axes correspond one-to-one to the attributes of a data object, and all data objects are placed in the coordinate system to form the data space.
Step 2: data partitioning
Starting from the origin of the coordinate system, the whole coordinate system is divided from inside to outside into m regions. The value of m should not be too large, otherwise the amount of computation increases; at current data scales m is generally 3 to 5, which balances data scale against computation. As data scale grows further, m can be increased appropriately to reduce the amount of data in each region. The regions are numbered 1, 2, ..., m from outside to inside, and the boundary of region 1 together with the coordinate axes encloses all data objects. Within any one region the maximum value of every attribute is the same, and every point on the outer boundary of a region has at least one coordinate equal to that maximum. If the maximum attribute value of region 1 is $a_1$, the maximum attribute value of region i is chosen so that all regions have equal volume, that is, $a_i = a_1 \sqrt[d]{\frac{m-i+1}{m}}$, i = 1, 2, ..., m. With the regions divided in this way, it follows from assumption (3) that every region holds the same amount of data.
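A minimal sketch of this partitioning, assuming the regions are nested hypercube shells of equal volume, which is what equal data volume under assumption (3) requires. The closed form $a_i = a_1 ((m-i+1)/m)^{1/d}$ used below is derived from that equal-volume condition, and the function names `region_boundaries` and `region_of` are hypothetical.

```python
from typing import List, Sequence


def region_boundaries(a1: float, m: int, d: int) -> List[float]:
    """Return [a_1, ..., a_m], outermost first, for m equal-volume shells in d dimensions."""
    return [a1 * ((m - i + 1) / m) ** (1.0 / d) for i in range(1, m + 1)]


def region_of(point: Sequence[float], bounds: Sequence[float]) -> int:
    """1-based region number (1 = outermost) of a point with non-negative coordinates.

    A point lies in region i when its largest coordinate is in (a_{i+1}, a_i].
    """
    peak = max(point)
    if peak > bounds[0]:
        raise ValueError("point lies outside the data space")
    region = 1
    for i, a in enumerate(bounds, start=1):
        if peak <= a:
            region = i  # keep the innermost shell that still contains the point
    return region


bounds = region_boundaries(1.0, 3, 2)  # two attributes, m = 3 regions
```

Because only the largest coordinate of a point decides its region, assigning a record to a region during preprocessing costs a single pass over its attributes.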
For every region except the outermost, the point that belongs to the region and whose coordinate on every axis is the region's maximum is taken as the base point. The part of the whole coordinate system in which every attribute value is greater than or equal to the corresponding attribute value of the base point is marked off. These newly marked-off regions are numbered 1, 2, ..., m-1 from outside to inside and are called judgment regions.
By the Skyline principle, for any two points 1 and 2, if every attribute value of point 1 is smaller than that of point 2, then point 1 dominates point 2. By the same kind of reasoning, given two data objects $T_1$ and $T_2$, if every attribute value of $T_1$ is greater than or equal to the corresponding attribute value of $T_2$, that is, $T_1.t_i \ge T_2.t_i$ where $t_i$ is the value of the i-th attribute, then for any given input weight vector $W(w_1, w_2, \ldots, w_d)$ the score of $T_1$ is at least the score of $T_2$, that is, $f_W(T_1) \ge f_W(T_2)$.
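The monotonicity claim can be checked in one line. The sketch below assumes that every weight is non-negative, which the text does not state explicitly but which the argument needs:

```latex
f_W(T_1) - f_W(T_2)
  = \sum_{i=1}^{d} w_i \,\bigl(T_1.t_i - T_2.t_i\bigr) \;\ge\; 0
  \qquad \text{whenever } w_i \ge 0 \text{ and } T_1.t_i \ge T_2.t_i \text{ for all } i .
```

Each term of the sum is a product of two non-negative factors, so the sum cannot be negative.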
It follows that the score of every data object in judgment region i is at least the score of any data object in the regions lying inside region i, that is, in regions i+1 to m. The algorithm returns the k highest-scoring objects, where k is the number of data objects in the returned result set. So once the number of data objects in judgment region i is greater than or equal to k, the k results cannot come from regions i+1 to m and must come from regions 1 to i. The judgment regions are therefore tested as follows:
In increasing order of number, test in turn whether $N_i \ge k$ holds, where $N_i$ is the number of data objects in judgment region i and k is the number of data objects in the result set returned by the algorithm. As soon as judgment region i satisfies the inequality, the test stops, and regions 1 to i, counted from outside to inside, are searched as the search area.
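The judgment step can be sketched as below. The sketch assumes that judgment region i consists of all points whose every coordinate is at least $a_{i+1}$, the base point of the next region inwards, and that all m regions are searched when no judgment region holds k objects. That fallback and the names `judgment_counts` and `regions_to_search` are assumptions introduced here.

```python
from typing import List, Sequence


def judgment_counts(data: Sequence[Sequence[float]],
                    bounds: Sequence[float]) -> List[int]:
    """N_i for i = 1..m-1: number of objects whose every attribute is >= a_{i+1}."""
    return [sum(1 for p in data if min(p) >= base) for base in bounds[1:]]


def regions_to_search(counts: Sequence[int], k: int, m: int) -> List[int]:
    """Regions 1..i for the first judgment region i with N_i >= k.

    Falls back to all m regions when no judgment region is large enough
    (an assumption; the source does not describe this case).
    """
    for i, n in enumerate(counts, start=1):
        if n >= k:
            return list(range(1, i + 1))
    return list(range(1, m + 1))


# Illustrative boundaries a_1..a_3 and a tiny data set (values are made up).
bounds = [1.0, 0.8, 0.5]
data = [(0.9, 0.85), (0.95, 0.9), (0.6, 0.7), (0.55, 0.9), (0.1, 0.2)]
counts = judgment_counts(data, bounds)
```

In a deployment the counts $N_i$ would be computed once during preprocessing, so the per-query cost of this step is a handful of integer comparisons.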
Further, in the invention, the innermost region of the search area, numbered i, is subdivided as follows:
The region is divided into d+1 blocks. Of these, d blocks are candidate search regions, and the remainder of the region is the mandatory search region.
The candidate search regions are numbered n = 1, 2, ..., d. Any data point $T_{nj}(t_{n1}, t_{n2}, \ldots, t_{nd})$ in the n-th candidate search region, where $t_{nj}$ is the attribute value of the point on axis j, satisfies the following 2 inequalities:
$0 \le t_{nj} \le 2a_{i+1} - a_i$, where $1 \le j \le d$ and $j \ne n$ (1)
$a_i - a_{i+1} \le t_{nj} \le a_i$, where $j = n$ (2)
In the n-th candidate search region, the data point $T_{nj}(t_{n1}, t_{n2}, \ldots, t_{nd})$ whose attribute value on one axis is $a_i$ and whose attribute values on all other axes are $2a_{i+1} - a_i$ is taken as the maximum boundary point of that candidate search region.
For the maximum boundary point of each candidate search region, check whether the attribute weights assigned by the user include a weight $w_j$ that satisfies $\frac{w_j}{\sum_{j=1}^{d} w_j} > 0.5$:
If such an attribute weight $w_j$ exists, the search area is narrowed to the i-1 regions outside region i, the mandatory search region of region i, and the candidate search region to which that maximum boundary point belongs.
If no such attribute weight $w_j$ exists, the search area is narrowed to the i-1 regions outside region i and the mandatory search region of region i.
Provided the judgment region satisfies $N_i \ge k$, the innermost region of the search area is split by the subdivision method into a mandatory search region and candidate search regions, and only the appropriate candidate region is selected for retrieval, which narrows the search range further. By the argument above, the score at the maximum boundary point of a candidate search region is at least the score of any other data point in that region. Hence, if the score at the maximum boundary point of a candidate search region does not exceed the score at the base point of judgment region $N_i$, that candidate region need not be retrieved; otherwise it must be.
For illustration, take the candidate search region whose maximum boundary point is $T(a_i, 2a_{i+1}-a_i, 2a_{i+1}-a_i, \ldots, 2a_{i+1}-a_i)$; the base point of the corresponding judgment region $N_i$ is $T(a_{i+1}, a_{i+1}, \ldots, a_{i+1})$. Substituting these coordinates into the scoring function, the candidate region must be retrieved if $(a_i, 2a_{i+1}-a_i, \ldots, 2a_{i+1}-a_i) \cdot W > (a_{i+1}, a_{i+1}, \ldots, a_{i+1}) \cdot W$, where $W = (w_1, w_2, \ldots, w_d)$. This inequality rearranges to $2(a_i - a_{i+1})\, w_1 > (a_i - a_{i+1}) \sum_{j=1}^{d} w_j$, which gives $\frac{w_1}{\sum_{j=1}^{d} w_j} > 0.5$. Applying the same test to every candidate region gives the unified expression $\frac{w_j}{\sum_{j=1}^{d} w_j} > 0.5$. It follows further that if a candidate region satisfies the condition, the attribute weight $w_j$ that makes the inequality hold must be the weight of the attribute whose value at the region's maximum boundary point is $a_i$. During the traversal it therefore suffices to substitute, for each candidate region, the weight of the attribute whose value at the maximum boundary point is $a_i$; if the inequality holds, that candidate region must be retrieved. As soon as one attribute weight that satisfies the inequality is found, the remaining candidate regions need not be checked: because the attribute weights are normalized, two or more weights cannot satisfy the inequality at the same time. When selecting a candidate region, one can therefore check directly whether the largest weight supplied by the user satisfies the inequality and, if it does, retrieve the candidate region whose maximum boundary point has the value $a_i$ on the j-th attribute.
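Since at most one normalized weight can exceed half of the total, the selection reduces to testing the largest weight. A minimal sketch, with the hypothetical function name `candidate_region_to_search`; the sample weight vectors are those used for Fig. 4 in the embodiment:

```python
from typing import Optional, Sequence


def candidate_region_to_search(weights: Sequence[float]) -> Optional[int]:
    """Return the 1-based number n of the only candidate region that must be
    retrieved, or None when no weight satisfies w_n / sum(w) > 0.5."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Only the largest weight can exceed half of the total.
    n = max(range(len(weights)), key=lambda j: weights[j])
    return n + 1 if weights[n] / total > 0.5 else None


w1 = (0.25, 0.25, 0.25, 0.25)   # equal weights
w2 = (0.06, 0.06, 0.07, 0.8)    # strongly biased towards attribute 4
w4 = (0.56, 0.14, 0.25, 0.03)   # strongly biased towards attribute 1
```

The comparison is strict, so a weight of exactly half the total (for example equal weights in two dimensions) selects no candidate region.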
Beneficial effects:
The weighted top-k query method for massive data under a distributed computing framework provided by the invention proposes a new grid-like data partitioning scheme. It first determines a preliminary search area by comparing the amount of data in the judgment regions with the number k of results in the result set, which greatly narrows the search range. It then further narrows the innermost region of the search area, so that the final search area is smaller and query efficiency and speed improve.
According to statistics, when a data space holding 1 billion data objects is divided into m=3 regions, the amount of data in the outermost judgment region varies with the dimension d, that is, the number of attributes of each data object, as shown in the following table:
Table 1
As the table shows, when the dimension d is less than 8 the outermost judgment region still contains 18 data objects. In practice the required result set is usually small, for example 10 results, so at most the data in the outermost region needs to be queried. The search range is thus reduced by at least 2/3 and a large amount of irrelevant data is filtered out. The method therefore significantly improves top-k query performance on massive data and speeds up top-k queries on massive, higher-dimensional data sets.
Description of the drawings
Fig. 1 is a schematic diagram of the data partitioning method of the invention;
Fig. 2 is a schematic diagram of the data subdivision method of the invention;
Fig. 3 compares query time against data dimension for three data partitioning schemes, where DistImprove is the method of the invention, AngleDistTop_k is the partitioning method based on angle and distance, and BasicTop_k queries the raw data set without partitioning;
Fig. 4 compares query time for different user input weights with 4 dimensions;
Fig. 5 shows the speed-up ratio of the method of the invention on different numbers of cluster nodes;
Fig. 6 is a flow chart of the top-k query process in a specific implementation of the invention.
Detailed description of embodiments
The invention is further described below with reference to the drawings.
The experiments were run on a Spark cluster of 7 nodes. Spark was deployed on Hadoop, using Hadoop's YARN resource manager and the HDFS file storage system. Of the 7 nodes, the master node served as both the Driver node and a worker node, and the other 6 nodes were worker nodes. The basic configuration of the experimental environment is shown in Table 2:
Table 2
Uniformly distributed data sets were used. Each record has 8 attributes, each an integer in the range [0, 1000]; 1 billion records were generated in total, about 40 GB of data. 4-dimensional and 6-dimensional data sets were generated in the same way as the 8-dimensional one: each is randomly generated and holds 1 billion records.
Unless otherwise stated, the experiments below use equal weights and k=100, and each reported result is the average over 10 runs of the query. Data preprocessing is done only once, so individual queries need not account for it, and the query times below exclude preprocessing time. Preprocessing the 8-dimensional data takes about 42 minutes with the method of the invention.
As shown in Fig. 6, this embodiment has two major steps:
Step 1: data preprocessing. The raw data set is partitioned into Blocks using the data partitioning method proposed here; each Block is labeled and then stored on HDFS. In Spark, HDFS storage is made up mainly of the disks of the worker nodes. Queries follow the data partitioning scheme and the query judgment rules set out in the claims. The partitioning is as follows:
First step: overall partitioning from inside to outside
Following the equal-area principle, the whole data space is divided from inside to outside into m=3 large regions. Fig. 1 illustrates the partitioning in two dimensions: the horizontal x axis corresponds to attribute 1, the vertical y axis corresponds to attribute 2, and the solid lines divide the space into 3 equal parts, numbered 1, 2, 3 from outside to inside. For every region except the outermost, the point that belongs to the region and whose coordinate on every axis is the region's maximum is taken as the base point, and the part of the whole coordinate system in which every attribute value is greater than or equal to the corresponding attribute value of the base point is marked off; these parts are labeled A and B from outside to inside. Whether $n_A \ge k$ or $n_B \ge k$ holds is tested in turn: if the data volume of square A satisfies $n_A \ge k$, only the data in region 1 needs to be queried; otherwise the data volume of square B is checked against $n_B \ge k$. For massive data the data volume of square A is generally much larger than k, so the top-k result can be obtained from region 1 alone; each large region is therefore subdivided to reduce the amount of data queried still further.
Second step: subdivision of each large region
Each of the large regions 1 and 2 is subdivided further. Large region 1, for example, is divided as shown in Fig. 2 into three regions, (ABC), D and E, where A, B and C have equal areas. A, B and C form the mandatory search region, and D and E are the candidate search regions.
The following fact also holds: for a data object $T_1$ with score $f_W(T_1) = \sum_{i=1}^{d} w_i t_i$, the score $f_W(T_1)$ is proportional to the length of the projection of the spatial point $T_1$ onto the straight line through the origin in the direction of the weight vector W, so the projection length can be used as a measure of the score.
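The link between score and projection length is the ordinary dot-product identity. Writing $\lVert W \rVert$ for the Euclidean length of the weight vector (notation introduced here, not used in the source):

```latex
f_W(T_1) \;=\; \sum_{i=1}^{d} w_i\, t_i \;=\; W \cdot T_1
        \;=\; \lVert W \rVert \;\underbrace{\frac{W \cdot T_1}{\lVert W \rVert}}_{\text{projection length of } T_1 \text{ onto } W}
```

For a fixed query the factor $\lVert W \rVert$ is constant, so ranking points by score and ranking them by projection length onto the direction of W give the same order.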
Therefore, assuming that square A contains at least k data objects, whether D and E need to be retrieved is decided as follows:
If the user's weights are $w_1 = w_2 = 0.5$, compare in the figure two distances measured along the straight line in the direction of the weight vector: the distance from the origin to the projection of the maximum boundary point d of region D, and the distance from the origin to the projection of the base point a of square A. The two are equal. Likewise, the distance from the origin to the projection of the maximum boundary point e of region E onto this line equals the distance from the origin to the projection of the base point a of square A. Regions D and E therefore need not be queried; querying regions A, B and C alone yields the top-k result. By the same reasoning, the top-k results found in regions A, B and C must lie in the part above and to the right of the dashed line in Fig. 2.
By the same principle, if $w_1 > w_2$, region D need not be queried; if $w_1 < w_2$, region E need not be queried. The proofs are omitted here.
In summary, the following rule is obtained:
$$\begin{cases} \text{search } A, B, C, E & \text{if } \dfrac{w_1}{w_1 + w_2} > 0.5 \\ \text{search } A, B, C, D & \text{if } \dfrac{w_2}{w_1 + w_2} > 0.5 \\ \text{search } A, B, C & \text{otherwise} \end{cases}$$
Generalizing to a d-dimensional data space, let $S_i$, $1 \le i \le 3$, denote one of the 3 large regions divided from inside to outside; let $S_{ij}$, $1 \le j \le d$, denote the j-th subregion of large region $S_i$ that is analogous to D or E; and let $S_{i(d+1)}$ denote the subregion of large region $S_i$ that is analogous to A, B, C. When the weight of the j-th attribute of a data object satisfies $\frac{w_j}{\sum_{j=1}^{d} w_j} > 0.5$, only 2 subregions of large region $S_i$ need to be queried, namely $S_{i(d+1)}$ and $S_{ij}$; otherwise only $S_{i(d+1)}$ needs to be queried in large region $S_i$.
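Putting the two decisions together, the set of stored blocks a query has to read can be sketched as follows. Blocks are identified here by hypothetical `(i, j)` pairs standing for $S_{ij}$, with `j = d + 1` for the mandatory subregion; reading every block of the regions outside region i follows the narrowing rule of the technical solution, and the function name `blocks_to_query` is an assumption.

```python
from typing import List, Sequence, Tuple


def blocks_to_query(i: int, weights: Sequence[float]) -> List[Tuple[int, int]]:
    """Blocks (region, subregion) to read when judgment region i was the first
    with N_i >= k. Subregion d + 1 is the mandatory block of a region."""
    d = len(weights)
    total = sum(weights)
    # Regions 1..i-1 lie outside region i and are read in full.
    blocks = [(r, j) for r in range(1, i) for j in range(1, d + 2)]
    blocks.append((i, d + 1))  # mandatory search region of region i
    for j, w in enumerate(weights, start=1):
        if w / total > 0.5:    # at most one weight can satisfy this
            blocks.append((i, j))
            break
    return blocks
```

With equal weights in four dimensions and `i = 1`, the sketch returns only `(1, 5)`, which corresponds to the embodiment's statement that only $S_{1(d+1)}$ is queried.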
Step 2: query processing. For a user query f(W, k), the Driver node selects, according to the query input, the part of the data set to be queried. In this embodiment k=100 and all attribute weights of a data object are equal, so only the data region $S_{1(d+1)}$ needs to be queried.
Fig. 3 compares query time against data dimension for three data partitioning schemes, where DistImprove is the method of the invention, AngleDistTop_k is the partitioning method based on angle and distance, and BasicTop_k queries the raw data set without partitioning. The experiment shows that the method of the invention is faster than the partitioning method based on angle and distance, by about 15%, and that its query time grows steadily with the dimension, without large fluctuations.
The user's weight input $W = (w_1, w_2, \ldots, w_d)$ affects the size of the data region queried. Fig. 4 shows the query time for different weight values in 4 dimensions. The first class of weights is strongly biased towards one attribute and includes $W_2 = (0.06, 0.06, 0.07, 0.8)$ and $W_4 = (0.56, 0.14, 0.25, 0.03)$; the second class, $W_1 = (0.25, 0.25, 0.25, 0.25)$, is equal weights; the third class, $W_3 = (0.16, 0.32, 0.34, 0.18)$, is weights with no pronounced bias. In the figure the query times of $W_1$ and $W_3$ are roughly equal, as are those of $W_2$ and $W_4$, and the query times of $W_1$ and $W_3$ are shorter than those of $W_2$ and $W_4$. The main reason is that on low-dimensional data sets different weights require different data blocks to be queried: $W_2$ and $W_4$ are strongly biased towards one attribute weight, so additional data blocks must be queried, and their query times exceed those of $W_1$ and $W_3$.
Fig. 5 shows the scalability of the method of the invention as the speed-up ratio on the 8-dimensional data set for different numbers of nodes. The speed-up ratio is close to the ideal: when the number of processors and workers doubles, execution speed also doubles. The data partitioning method of the invention therefore scales well.
The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also fall within the scope of protection of the invention.

Claims (2)

1. A weighted top-k query method for massive data under a distributed computing framework, characterized by comprising the following steps performed in order:
Step 1: establish the data space
First, the attribute values of all data objects, each having d attributes, are converted to non-negative values and normalized; a d-dimensional coordinate system is established whose axes correspond one-to-one to the attributes of a data object, and all data objects are placed in the coordinate system to form the data space;
Step 2: data partitioning
Starting from the origin of the coordinate system, the whole coordinate system is divided from inside to outside into m regions; the regions are numbered 1, 2, ..., m from outside to inside, and the boundary of region 1 together with the coordinate axes encloses all data objects; within any one region the maximum value of every attribute is the same, and every point on the outer boundary of a region has at least one coordinate equal to that maximum; if the maximum attribute value of region 1 is $a_1$, the maximum attribute value of region i is $a_i = a_1 \sqrt[d]{\frac{m-i+1}{m}}$, i = 1, 2, ..., m;
For every region except the outermost, the point that belongs to the region and whose coordinate on every axis is the region's maximum is taken as the base point; the part of the whole coordinate system in which every attribute value is greater than or equal to the corresponding attribute value of the base point is marked off; these newly marked-off regions are numbered 1, 2, ..., m-1 from outside to inside and, as judgment regions, are tested as follows:
In increasing order of number, test in turn whether $N_i \ge k$ holds, where $N_i$ is the number of data objects in judgment region i and k is the number of data objects in the result set returned by the algorithm; as soon as judgment region i satisfies the inequality, the test stops, and regions 1 to i, counted from outside to inside, are searched as the search area.
2. The weighted top-k query method for massive data under a distributed computing framework according to claim 1, characterized in that the innermost region of the search area, numbered i, is subdivided as follows:
The region is divided into d+1 blocks, of which d blocks are candidate search regions and the remainder of the region is the mandatory search region;
The candidate search regions are numbered n = 1, 2, ..., d; any data point $T_{nj}(t_{n1}, t_{n2}, \ldots, t_{nd})$ in the n-th candidate search region, where $t_{nj}$ is the attribute value of the point on axis j, satisfies the following 2 inequalities:
$0 \le t_{nj} \le 2a_{i+1} - a_i$, where $1 \le j \le d$ and $j \ne n$ (1)
$a_i - a_{i+1} \le t_{nj} \le a_i$, where $j = n$ (2)
In the n-th candidate search region, the data point $T_{nj}(t_{n1}, t_{n2}, \ldots, t_{nd})$ whose attribute value on one axis is $a_i$ and whose attribute values on all other axes are $2a_{i+1} - a_i$ is taken as the maximum boundary point of that candidate search region;
For the maximum boundary point of each candidate search region, check whether the attribute weights assigned by the user include a weight $w_j$ that satisfies $\frac{w_j}{\sum_{j=1}^{d} w_j} > 0.5$;
If such an attribute weight $w_j$ exists, the search area is narrowed to the i-1 regions outside region i, the mandatory search region of region i, and the candidate search region to which that maximum boundary point belongs;
If no such attribute weight $w_j$ exists, the search area is narrowed to the i-1 regions outside region i and the mandatory search region of region i.
CN201510209691.0A 2015-04-28 2015-04-28 A weighted top-k query method for massive data under a distributed computing framework Expired - Fee Related CN104809210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510209691.0A CN104809210B (en) 2015-04-28 2015-04-28 A weighted top-k query method for massive data under a distributed computing framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510209691.0A CN104809210B (en) 2015-04-28 2015-04-28 One kind is based on magnanimity data weighting top k querying methods under distributed computing framework

Publications (2)

Publication Number Publication Date
CN104809210A (en) 2015-07-29
CN104809210B (en) 2017-12-26

Family

ID=53694032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510209691.0A Expired - Fee Related CN104809210B (en) 2015-04-28 2015-04-28 Top-k query method based on massive data weighing under distributed computing framework

Country Status (1)

Country Link
CN (1) CN104809210B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777091A (en) * 2016-12-14 2017-05-31 大连大学 Skyline double-filtering retrieval system based on multiple medical factors in a mobile O2O environment
CN106777095A (en) * 2016-12-14 2017-05-31 大连交通大学 Skyline double-filtering retrieval method based on multiple medical factors in a mobile O2O environment
CN108491541A (en) * 2018-04-03 2018-09-04 哈工大大数据(哈尔滨)智能科技有限公司 Conjunctive query method and system applied to distributed multidimensional databases
CN110245022A (en) * 2019-06-21 2019-09-17 齐鲁工业大学 Parallel Skyline processing method and system for massive data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314521A (en) * 2011-10-25 2012-01-11 中国人民解放军国防科学技术大学 Distributed parallel Skyline query method based on cloud computing environment
CN103177130A (en) * 2013-04-25 2013-06-26 苏州大学 Continuous query method and continuous query system for K-Skyband on distributed data stream

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DIMITRIOS SKOUTAS et al.: "Top-k Dominant Web Services Under Multi-Criteria Matching", 《EDBT’09 PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON EXTENDING DATABASE TECHNOLOGY:ADVANCES IN DATABASE》 *
MAN LUNG YIU et al.: "Multi-dimensional top-k dominating queries", 《THE VLDB JOURNAL》 *
张彬 et al.: "Top-k reverse Skyline query algorithm in metric spaces", 《Journal of Computer Research and Development》 *

Also Published As

Publication number Publication date
CN104809210B (en) 2017-12-26

Similar Documents

Publication Publication Date Title
CN103455531B (en) Parallel indexing method supporting real-time biased queries on high-dimensional data
CN103927346B (en) Query connection method on basis of data volumes
CN104809210A (en) Top-k query method based on massive data weighing under distributed computing framework
CN104063376A (en) Multi-dimensional grouping operation method and system
CN102012936B (en) Massive data aggregation method and system based on cloud computing platform
CN107798346A (en) Quick track similarity matching method based on Frechet distance threshold
CN104391908B (en) Multi-keyword indexing method on graphs based on locality-sensitive hashing
CN104778277A (en) RDF (Resource Description Framework) data distributed storage and query method based on Redis
CN102169491B (en) Dynamic detection method for multi-data concentrated and repeated records
CN104899326A (en) Image retrieval method based on binary multi-index Hash technology
CN105159971A (en) Cloud platform data retrieval method
CN105183792A (en) Distributed fast text classification method based on locality sensitive hashing
Qian et al. Grid-based Data Stream Clustering for Intrusion Detection.
CN109446293B (en) Parallel high-dimensional neighbor query method
CN107656989A (en) Data-distribution-aware nearest neighbor query method in cloud storage systems
Xu et al. Balancing reducer workload for skewed data using sampling-based partitioning
CN104794237A (en) Web page information processing method and device
Yang et al. Top k probabilistic skyline queries on uncertain data
CN105354243B (en) Parallelized frequent probabilistic subgraph search method based on merge clustering
CN105808631A (en) Data dependence based multi-index Hash algorithm
Ashok et al. Improved performance of unsupervised method by renovated K-means
Li et al. A new R-tree spatial index based on space grid coordinate division
Huang et al. Pisa: An index for aggregating big time series data
TWI770477B (en) Information processing device, storage medium, program product and information processing method
Qi et al. PreKar: A learned performance predictor for knowledge graph stores

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171226

CF01 Termination of patent right due to non-payment of annual fee