CN104937593A - System and method for database searching - Google Patents

Publication number
CN104937593A
Authority
CN
China
Prior art keywords
threshold
data acquisition
key
pattern
filter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201480005413.7A
Other languages
Chinese (zh)
Inventor
亚历山大·罗沙可夫斯基
谢尔盖·各勒夫可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN104937593A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24554Unary operations; Data partitioning operations
    • G06F16/24557Efficient disk access during query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In one embodiment, a method for searching a database includes receiving, by a processor from a user, a message indicating a query (184), where the query comprises a pattern, and determining, by the processor, a first threshold in accordance with a data set of the database (186). The method also includes comparing, by the processor, the pattern to a first key of the data set to produce a comparison (188), and determining, by the processor, whether to jump to a second key of the data set or scan to a third key of the data set.

Description

System and method for database searching
This application claims priority to U.S. Non-Provisional Patent Application No. 14/184582, filed on February 19, 2014 and entitled "System and Method for Database Searching", and to U.S. Provisional Patent Application No. 61/766299, filed on February 19, 2013 and entitled "System and Method for a Fast Key Pattern Search for a Multidimensional Database Index", both of which are incorporated herein by reference.
Technical field
The present invention relates to systems and methods for databases, and in particular to a system and method for database searching.
Background
Modern data warehouses typically contain trillions of records, each with multiple attributes. Business intelligence tasks, such as analytic queries, online analytical processing (OLAP), and data mining, should obtain answers to ad hoc analytic queries on the data relatively quickly. Because of the data volume, extra indexes are problematic, and such queries are answered with a full scan of the data. Even when the data is distributed across a cluster, a full scan may take a long time. Conventional relational data warehouse technology is often combined with, or replaced by, non-relational distributed processing systems. Scalability and performance requirements are critical for business intelligence applications.
Summary of the invention
An embodiment method for searching a database includes receiving, by a processor from a user, a message indicating a query, where the query includes a pattern, and determining, by the processor, a first threshold in accordance with a data set of the database. The method also includes comparing, by the processor, the pattern to a first key of the data set to produce a comparison, and determining, by the processor, whether to jump to a second key of the data set or scan to a third key of the data set in accordance with the comparison and the first threshold. This includes jumping to the second key of the data set when the absolute value of the comparison is greater than the first threshold, and scanning to the third key of the data set when the absolute value of the comparison is less than or equal to the first threshold, where the first key and the third key are consecutive.
Another embodiment method for searching a database includes receiving, by a processor from a user, a message indicating a query, where the query includes a pattern, and comparing, by the processor, the pattern to a first key of a data set of the database to produce a comparison. The method also includes determining, by the processor, whether to jump or to scan sequentially in accordance with the comparison, and recording, by the processor, results in accordance with the comparison to produce recorded results. Additionally, the method includes transmitting, by the processor to the user, the recorded results.
An embodiment computer includes a processor and a database, where the database includes a multidimensional database index. The computer also includes a computer-readable storage medium storing programming for execution by the processor. The programming includes instructions to receive a message from a user, where the message indicates a query and the query includes a pattern, and to determine a first threshold in accordance with a data set of the database. The programming also includes instructions to compare the pattern to a first key of the data set to produce a comparison. Additionally, the programming includes instructions to determine whether to jump to a second key of the data set or scan to a third key of the data set in accordance with the comparison and the first threshold. This includes jumping to the second key of the data set when the absolute value of the comparison is greater than the first threshold, and scanning to the third key of the data set when the absolute value of the comparison is less than or equal to the first threshold, where the first key and the third key are consecutive.
The foregoing has outlined rather broadly the features of embodiments of the present invention so that the detailed description that follows may be better understood. Additional features and advantages of embodiments of the invention are described hereinafter and form the subject of the claims of the invention. Those skilled in the art will appreciate that the conception and specific embodiments disclosed may readily be used as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.
Brief description of the drawings
For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows examples of gz-curves;
Fig. 2 shows a mask projecting a coordinate vector;
Fig. 3 shows an example structure of the solution locus of a point pattern search problem (PSP);
Fig. 4 shows an example structure of the solution locus of a range PSP;
Fig. 5 shows a flowchart of an embodiment method of database searching;
Fig. 6 shows a graph of query times of the crawler and grasshopper strategies for several filter combinations;
Fig. 7 shows a graph of query times of the crawler and grasshopper strategies for different data stores;
Fig. 8 shows a graph of query times of the crawler and grasshopper strategies for a call detail record (CDR) data set;
Fig. 9 shows a graph of query times of the crawler and grasshopper strategies for a Transaction Processing Performance Council decision support (TPC-DS) data set;
Fig. 10 shows another graph of query times of the crawler and grasshopper strategies for the TPC-DS data set; and
Fig. 11 shows a block diagram of an embodiment of a general-purpose computer system.
Unless otherwise indicated, corresponding numerals and symbols in the different figures generally refer to corresponding parts. The figures are drawn to clearly illustrate the relevant aspects of the embodiments and are not necessarily drawn to scale.
Detailed description
It should be understood at the outset that, although an illustrative implementation of one or more embodiments is provided below, the disclosed systems and/or methods may be implemented using any number of techniques, whether currently known or in existence. The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.
A multidimensional database used in data warehousing converts customer data into key-value pairs by using dictionaries and a particular key composition. The key-value pairs can be stored in key order. The multidimensional space of all possible keys is equipped with a space-filling curve, so that each possible key corresponds to a point on the curve. These points are parameterized by very large numbers. Queries with point, range, or set filters on any attribute of the customer data, and combinations of such queries on these attributes, can be converted into a pattern search problem on compound keys.
An embodiment performs fast subset filtering on an ordered set of integers that represent compound keys. An embodiment can be used to accelerate ad hoc analytic queries against a data warehouse without extra indexes. Embodiments can be used with point, range, and set constraints on any combination of multiple attributes. A point filter is an equality constraint, a range filter is an interval constraint, and a set filter is a subset constraint. The embodiments combine skipping over large portions of irrelevant keys with crawling through keys in order. Whether to jump can be decided adaptively according to characteristics of the underlying data store.
An ad hoc OLAP query is a query that can place multiple filters on some of the participating variables and can aggregate measure values. OLAP embodiments can use dictionaries to encode dimension attribute values with consecutive integers. For ordered attributes with surrogates, the surrogates can be integers. For unordered attributes, the integers may be consecutive. For ordered attributes, the order can be preserved.
The Cartesian product of the dimension attribute domains forms the compound key space. The vector dependency F then maps compound keys to vectors of measures. Multidimensional database technology is based on equipping the compound key space with a space-filling curve, so that each element of the space corresponds to a single point on the curve and vice versa. There are multiple ways to choose this curve. In one example, a generalized z-curve (gz-curve) is used. In a gz-curve, each point on the curve is encoded by an integer derived from the values of the components of the compound key. Any query against the multidimensional database with point, range, or set filters is converted into a pattern search problem on the gz-curve.
In the OLAP field, there is a vector functional dependency:
$F: (D_1, \ldots, D_N) \rightarrow (M_1, \ldots, M_M).$
The independent variables $D_i$ are dimensions (dimension attributes), and the dependent variables $M_i$ are measures. This dependency is augmented by other functional dependencies that involve the dimension attributes. The dependent attributes are higher-level dimension attributes. They induce groupings in the domains of the attributes on which they depend, which leads to aggregation operations on the measures. The dependencies can form a directed acyclic graph (DAG).
Dimension attributes can be encoded with integers. If an attribute is integer-valued, it can be used without additional encoding. If an attribute is not integer-valued, an encoding dictionary is created. For naturally ordered attributes, the encoding preserves the order. A dense encoding by consecutive integers can be used. Alternatively, a dense encoding is not used. In one example, the cardinality of each dimension attribute is a power of 2.
The dimension attributes are then represented in encoded form, for example as integers or byte arrays. The dictionaries and dependencies can provide constant lookup time.
The dimension attributes of interest can participate in forming the compound key. Including higher-level attributes in the key adds sparsity to the model, but removes the need for joins at query time.
Using the compound key encoding, the data of the functional dependency F is converted to key-value form. The storage component is responsible for persisting the data in key-value form and for retrieving the relevant key-value pairs at query time.
A simple query against the multidimensional database constrains some attributes to subsets of their values and requests the data that satisfies the constraints. In one example, there is one class of constraint on a dimension attribute, for example a point, range, or set constraint on the attribute domain.
When queries arrive, the system looks up the attribute values involved in the constraints in the dictionaries and converts them to integers. These integers are then used to form constraints on the compound keys. The constraints are passed to the storage component to retrieve the relevant key-value pairs, which are then aggregated and stored. Finally, the dictionaries are used again to convert the final result back to the original attribute domains.
However the compound key is produced, it provides an integer encoding of the points in the key search space, which is the Cartesian product of the encoded attribute domains. The compound key therefore provides an integer parameterization of a space-filling curve in the search space. In one example, the binary representation of the integer compound key is built from the bits of the participating component keys in such a way that the bit order of each component is preserved. This process results in keys of fixed length.
The shape of a gz-curve depends on the way the bits of the components are composed into the compound key. Fig. 1 shows some shapes for two variables, where the bits of the horizontal and vertical dimensions are labeled x and y, respectively. Curve 150 is the traditional isotropic z-curve with bit order yxyxyx. Curve 152 is an odometer curve with bit order yyyxxx, which corresponds to sorting the keys by y and then by x. The odometer curve strongly favors one dimension over the other. A query with a filter on the dimension in the high-order bit positions is answered by a single contiguous segment of the curve, but for a filter on the dimension in the low-order bit positions the answers are scattered along the curve. Curve 154 shows bit order yyxyxx, and curve 156 shows bit order xxyyyx. In Fig. 1, the shaded areas are examples of fundamental regions of order 4.
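As an illustration of how a bit order determines the compound key, the following Python sketch builds attribute masks from a bit-order string such as yxyxyx and composes a key from component values. The function names (masks_from_order, deposit, extract, compose) are hypothetical and do not come from the patent, and reading the order string most significant bit first is an assumption.

```python
def masks_from_order(order):
    """Map each attribute name to the key bits it owns.

    `order` lists, most significant bit first (an assumption), which
    attribute owns each bit of the compound key, e.g. "yxyxyx".
    """
    n = len(order)
    masks = {}
    for pos, name in enumerate(order):
        masks[name] = masks.get(name, 0) | (1 << (n - 1 - pos))
    return masks


def deposit(value, mask):
    """Spread the low bits of `value` over the set bits of `mask`, keeping their order."""
    key, bit = 0, 0
    while mask:
        low = mask & -mask              # lowest set bit still unused
        if (value >> bit) & 1:
            key |= low
        mask ^= low
        bit += 1
    return key


def extract(key, mask):
    """Inverse of deposit: gather the bits of `key` selected by `mask`."""
    value, bit = 0, 0
    while mask:
        low = mask & -mask
        if key & low:
            value |= 1 << bit
        mask ^= low
        bit += 1
    return value


def compose(values, order):
    """Build the compound key for a dict of component values."""
    key = 0
    for name, mask in masks_from_order(order).items():
        key |= deposit(values[name], mask)
    return key
```

With the odometer order yyyxxx, sorting points by compose(...) is the same as sorting them by y and then by x, which is the behavior described for curve 152.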
The example methods apply to any combination of point, range, and set filters on any subset of the dimensions, not necessarily the full set. These methods can improve the performance of ad hoc OLAP queries on any basic key-value storage system that keeps data in compound key order and supports certain simple operations. When the storage system supports the required operations efficiently, significant performance gains are possible. The methods rely on only a few characteristics of the store, such as the ratio of sequential access cost to random access cost. Given a query, a specific threshold is computed; if the query meets an obstacle that exceeds this threshold while crawling, it jumps. The threshold can be determined algebraically and interpreted geometrically.
One example has little knowledge of the storage system beyond its ability to jump. A jump may skip keys within the same storage unit, or it may land in a different storage unit. Other information provided by the data store can be used, such as the boundaries of partitions that divide the gz-curve into key intervals. These partitions may correspond to, for example, pages in a UB-tree or HBase regions. Partitions can be hierarchical and storage-specific. The method can then decide whether to examine the contents of a region or to skip the region. Partitions, such as HBase regions, can be processed in parallel. In addition, the dimensionality of the problem can be reduced within a partition. The method can then operate directly on the reduced, factorized keys without restoring the original keys.
Point constraints on one or more attributes fix bit patterns in the key, so the query problem becomes a fixed pattern search problem (PSP) on the key set. Range and set constraints lead to more complex patterns. If n is the total number of bits in the compound key, then the space S of all keys is the n-dimensional linear space over the residues modulo 2.
The bits form an ordered basis $e_1, \ldots, e_n$ of S, and the elements of S are ordered lexicographically by their coefficients, which agrees with the integer ordering.
In one example, bit-masking operators are used on the set of integers whose binary representation has at most n binary digits. In another example, a set of functions extracts calendar parts from a calendar, where the calendar represents date and time. In other examples, multiresolution measuring systems, such as logs, are used. Alternatively, time series, Fourier transforms, or wavelet transforms are used.
A mask is an operator that projects S onto a d-dimensional coordinate linear subspace. Given d basis vectors $e_{i_1}, \ldots, e_{i_d}$, the operator m masks out the remaining n-d coordinates. S(m) denotes the subspace onto which mask m projects. Two or more masks are disjoint if the subspaces onto which they project are pairwise disjoint, that is, if they have no basis elements in common.
For a gz-curve, a mask $m_D$ corresponds to each dimension attribute D. The mask defines the bit positions of the attribute in the compound key. Applying the mask to a compound key yields the contribution of D. Masks that correspond to different dimension attributes are disjoint.
Let A be the subset of S that represents the compound keys of the fact data in the multidimensional database. Any query against the multidimensional database with a point filter D = p on attribute D is converted into the pattern search problem (P): find all x ∈ A such that $x \,\&\, m_D = p$. A query with point filters $D_i = p_i$ on multiple attributes $D_i$ is converted into the similar problem of finding the solutions of $x \,\&\, m = p$, where m is the union of the attribute masks and p is the union of the corresponding patterns.
Any mask on S can be regarded as corresponding to some virtual attribute. A query with multiple point filters is therefore equivalent to a query with a single point filter on a suitable virtual attribute.
A query with a range filter D ∈ [a, b] is also converted into a PSP (R): find all x ∈ A such that $x \,\&\, m_D \in [a, b]$. Unlike point queries, however, merging two or more such queries into a single similar expression for some virtual attribute may be problematic. An element may satisfy several patterns simultaneously.
Set queries are converted in a similar way. Given $E = \{a_1, \ldots, a_n\}$, the filter D ∈ E corresponds to the PSP (S): find all x ∈ A such that $x \,\&\, m_D \in E$. Filters on multiple attributes can be merged into a query with a similar expression for some virtual attribute. Because the resulting constraint set is the Cartesian product of the per-attribute constraints, its cardinality may be too large for practical purposes, so multi-pattern search may be used instead. Any solution of a set search problem also satisfies the range constraint $x \,\&\, m_D \in [\min(E), \max(E)]$.
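The three problem classes (P), (R), and (S) can be written as simple predicates on masked keys. The sketch below is a brute-force illustration only. It assumes a 6-bit key with the odometer layout yyyxxx, so that both masks are contiguous, and it assumes that patterns and bounds are already shifted into the bit positions of the mask. The names point, in_range, and in_set are hypothetical.

```python
M_Y = 0b111000   # assumed layout yyyxxx: y owns the three high bits
M_X = 0b000111   # x owns the three low bits


def point(keys, mask, p):
    """(P): all keys x with x & m == p."""
    return [k for k in keys if (k & mask) == p]


def in_range(keys, mask, a, b):
    """(R): all keys x with a <= x & m <= b."""
    return [k for k in keys if a <= (k & mask) <= b]


def in_set(keys, mask, allowed):
    """(S): all keys x with x & m in E."""
    return [k for k in keys if (k & mask) in allowed]


A = list(range(0, 64, 3))                    # toy key set standing in for A

y_is_2 = point(A, M_Y, 2 << 3)               # single point filter y = 2
# Two point filters merge into one filter on a virtual attribute:
# m is the union of the masks and p is the union of the patterns.
y2_x5 = point(A, M_Y | M_X, (2 << 3) | 5)
x_2_to_5 = in_range(A, M_X, 2, 5)            # range filter on x
y_in_1_6 = in_set(A, M_Y, {1 << 3, 6 << 3})  # set filter on y
```

Every key returned by the set filter also lies in the range [min(E), max(E)] on the same mask, which is the restriction stated above.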
The brute-force solution of a pattern search problem on a set A checks the pattern constraint on each element of A in full-scan fashion. A solution of the pattern search problem is called efficient if, on average, it is faster than the brute-force solution, although it need not be faster in every case. The average is taken over a set of random pattern constraints on a fixed combination of attributes, for any suitable number of constraints, and then over all such combinations. Under this definition, an efficient algorithm is allowed to be worse than a full scan on some patterns, but it is not allowed to be worse than a full scan on average.
A set is factorizable if it can be expressed as the Cartesian product of at least two subsets of S. Besides S itself, the set of all elements that satisfy a constraint of class (P) is factorizable. For constraints of class (R) and class (S), examples of factorizable subsets include intervals with a common prefix and sets with a common pattern.
Let $\{S_j\}$ be a partition of S into factorizable subsets, where each subset has its own factors. The induced partition $A_j = A \cap S_j$ of any set A is then also called factorizable.
The example grasshopper method provides additional advantages when processing factorizable partitions, especially partitions into key intervals or into sets with a common pattern. The example methods may be especially effective when the underlying store implements prefix or common-pattern compression.
The example grasshopper method combines sequential crawling with skipping over large portions of irrelevant keys, and so avoids a full scan of the data. To be applicable to any underlying data structure, responsibilities are separated between the pattern search, the data store, and the pattern matcher that the search uses.
A key-value store holds key-value pairs whose keys are elements of A. A data store is basic if it supports the Get, Scan, and Seek operations. Get takes a key x ∈ A and retrieves the corresponding value. Scan takes a key x ∈ A and retrieves the next key in A. Seek takes a key x ∈ S and retrieves the next key in A that is greater than or equal to x. The cost of obtaining statistics of A, such as its cardinality and its first and last keys, is negligible. A partitioned data store can provide scoping rules for each element of the partition and treats each element as a basic data store.
The matcher assists the pattern search and provides the Match, Mismatch, and Hint operations. For x ∈ S, Match tells whether x satisfies the given pattern constraint. For x ∈ S, Mismatch returns 0 if x satisfies the given pattern constraint; otherwise it returns the signed position of the highest-order mismatched bit, where the sign indicates whether the mismatched part lies above or below the pattern. For an element x ∈ S with mismatch y, Hint suggests the next element h ∈ S, h > x, that can theoretically satisfy the pattern constraint.
The store and the matcher have different roles. The store knows everything about the set A but nothing about masks and patterns. The matcher, on the other hand, knows everything about masks and patterns but nothing about the set A. Variants of the matcher exist for different embodiments.
To answer a query against the data store, the crawler, frog, and grasshopper methods collect the data that matches the given set of patterns and put the matches in a bag. The crawler, the frog, and the grasshopper each have a matcher. Using the matcher, they compute the theoretical bounding interval $[PSP_{min}, PSP_{max}]$ of the query on S and intersect it with the interval [min(A), max(A)] to obtain the actual bounding interval [a, b].
The crawler scans in order. Example crawler pseudocode:
bag = ∅; x = a;
while x ≤ b {
    if Match(x), add (x, Get(x)) to bag;
    x = Scan(x);
}
The frog jumps as soon as it can. Example frog pseudocode:
bag = ∅; x = a;
while x ≤ b {
    y = Mismatch(x);
    if y = 0, add (x, Get(x)) to bag, x = Scan(x);
    else x = Seek(Hint(x, y));
}
The grasshopper jumps only when the absolute value of the mismatch is greater than a threshold t; when the absolute value of the mismatch does not exceed t, the grasshopper crawls. Example grasshopper pseudocode:
bag = ∅; x = a;
while x ≤ b {
    y = Mismatch(x);
    if y = 0, add (x, Get(x)) to bag, x = Scan(x);
    else if |y| ≤ t, x = Scan(x);
    else x = Seek(Hint(x, y));
}
When the Hint operation has nothing to suggest, it returns ∞ and the corresponding loop terminates.
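The three strategies differ only in the threshold, so a single loop can illustrate all of them. The Python sketch below is a toy under several assumptions that are not in the patent: the store is an in-memory sorted list, the matcher handles only a single fixed pattern on one mask, bit positions are counted from 1 at the least significant bit, None stands for ∞, and the loop starts from Seek(a) because a need not be a key of A. The class and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


class SortedStore:
    """Toy basic store: the keys of A are held in sorted order."""

    def __init__(self, pairs):
        self.values = dict(pairs)
        self.keys = sorted(self.values)

    def get(self, x):
        return self.values[x]

    def scan(self, x):                       # next key of A after x
        i = bisect_right(self.keys, x)
        return self.keys[i] if i < len(self.keys) else None

    def seek(self, x):                       # next key of A that is >= x
        i = bisect_left(self.keys, x)
        return self.keys[i] if i < len(self.keys) else None


class PointMatcher:
    """Matcher for the fixed pattern x & mask == pattern on n-bit keys."""

    def __init__(self, n, mask, pattern):
        self.n, self.mask, self.pattern = n, mask, pattern

    def mismatch(self, x):
        diff = (x ^ self.pattern) & self.mask
        if diff == 0:
            return 0
        j = diff.bit_length()                # highest mismatched bit, 1-based
        return j if (x >> (j - 1)) & 1 else -j

    def hint(self, x, y):
        k = abs(y) - 1                       # 0-based index of mismatched bit
        if y > 0:                            # x is above the pattern at bit k:
            k += 1                           # carry into the lowest free zero bit
            while k < self.n and ((self.mask >> k) & 1 or (x >> k) & 1):
                k += 1
            if k >= self.n:
                return None                  # nothing to suggest
        # Keep the bits above k, set bit k, and use the pattern below k.
        return (((x >> k) | 1) << k) | (self.pattern & ((1 << k) - 1))


def search(store, matcher, a, b, t):
    """t = 0 acts as the frog, t = n as the crawler, values between as the grasshopper."""
    bag, seeks = {}, 0
    x = store.seek(a)
    while x is not None and x <= b:
        y = matcher.mismatch(x)
        if y == 0:
            bag[x] = store.get(x)
            x = store.scan(x)
        elif abs(y) <= t:
            x = store.scan(x)                # small obstacle: crawl
        else:
            h = matcher.hint(x, y)           # large obstacle: jump
            if h is None:
                break
            x = store.seek(h)
            seeks += 1
    return bag, seeks
```

For example, search(store, PointMatcher(6, 0b011000, 0b010000), 0, 63, t) returns the same bag for every t between 0 and 6; only the mix of Scan and Seek calls changes.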
A cost model can be developed for these scanning methods. All three methods perform the same number of Scan and Get operations on the elements x that match the PSP constraint, so these operations can be excluded from the cost estimate. For elements that do not solve the PSP, the crawler performs Match and Scan operations; the frog performs Mismatch, Hint, and Seek operations; and the grasshopper sometimes performs Match and Scan operations and sometimes performs Mismatch, Hint, and Seek operations. Assuming that the time required by matcher operations is negligible compared with the operations of the data store, the cost of the crawler is $N_0 \cdot \mathrm{Cost}(\mathrm{Scan})$, the cost of the frog is $N_1 \cdot \mathrm{Cost}(\mathrm{Seek})$, and the cost of the grasshopper is $N_2 \cdot \mathrm{Cost}(\mathrm{Seek}) + N_3 \cdot \mathrm{Cost}(\mathrm{Scan})$, where $N_0$ is the number of mismatching elements, $N_1$ is the number of frog jumps, $N_2$ is the number of grasshopper jumps, and $N_3$ is the number of grasshopper crawls.
R can be defined as:
$R = \frac{\mathrm{Cost}(\mathrm{Scan})}{\mathrm{Cost}(\mathrm{Seek})}.$
R is a property of the data store and is determined experimentally.
If $N_1 < N_0 \cdot R$, the frog finishes before the crawler. The term $N_0$ can be estimated from the corresponding selectivity distributions of the values of the participating attributes. A rough estimate of $N_0$, however, is:
$\mathrm{card}(A) \cdot (1 - 2^{d-n}).$
Therefore, the frog is better if:
$N_1 < R \cdot \mathrm{card}(A) \cdot (1 - 2^{d-n}).$
The right-hand side of this inequality does not depend on the geometry of the mask, that is, on the way the attributes participate in the key composition. In contrast, $N_1$ depends heavily on the mask.
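The comparison between the strategies reduces to simple arithmetic once the operation counts and unit costs are known. The following sketch restates the cost formulas above; the function names are hypothetical, and the unit costs in the usage example are made-up illustrative values, not measurements of any real store.

```python
def scan_seek_ratio(cost_scan, cost_seek):
    """R = Cost(Scan) / Cost(Seek), a property of the data store."""
    return cost_scan / cost_seek


def crawler_cost(n0, cost_scan):
    return n0 * cost_scan


def frog_cost(n1, cost_seek):
    return n1 * cost_seek


def grasshopper_cost(n2, n3, cost_scan, cost_seek):
    return n2 * cost_seek + n3 * cost_scan


def frog_beats_crawler(n0, n1, r):
    """Equivalent to frog_cost < crawler_cost when r = Cost(Scan) / Cost(Seek)."""
    return n1 < n0 * r


# Illustrative (assumed) unit costs: one Seek costs as much as 20 Scans.
r = scan_seek_ratio(cost_scan=1.0, cost_seek=20.0)
few_jumps = frog_beats_crawler(n0=1000, n1=40, r=r)    # 800 vs 1000 cost units
many_jumps = frog_beats_crawler(n0=1000, n1=60, r=r)   # 1200 vs 1000 cost units
```

Under these assumed costs, the frog wins only while it needs fewer than 50 jumps to pass the 1000 mismatching elements.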
If the grasshopper determines in advance that the frog is certain to beat the crawler, it can set the threshold to t = 0 and act as the frog. There are cases, however, in which the frog is certain to lose to the crawler. For example, if the mask consists of only the first bit, every second point of S solves the PSP. The matcher cannot suggest anything better than jumping exactly to the next point, which is a losing strategy.
If the grasshopper determines that the crawler always wins, it can set the threshold to t = n, which prevents any jumps. This strategy, however, does not meet the efficiency criterion.
For a mask m, there is a complementary mask, or co-mask, ~m that projects onto the remaining n-d coordinates. Any n-dimensional vector x can be recovered from its projections onto S(m) and S(~m), that is:
$x = m(x) \mid {\sim}m(x).$
The definition extends readily to a set of mutually complementary masks $m_1, m_2, \ldots$ whose ranges $S(m_i)$ are orthogonal subspaces.
Let mask m project onto the bits $e_{i_1}, \ldots, e_{i_d}$ of S(m), listed in ascending order. Then tail(m) is defined as $i_1 - 1$ and head(m) is defined as $i_d$. A mask m is contiguous if it projects onto adjacent bit positions or, equivalently, if head(m) = tail(m) + d.
Fig. 2 illustrates some of these terms. Point 100 represents a coordinate vector written from right to left. Mask m projects onto coordinates 4, 6, 9, 10, and 11. Coordinates 1 to 3 form the tail, coordinates 12 to 15 form the head, and coordinates 5, 7, and 8 form the holes.
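The tail, head, and holes of a mask can be computed directly from its integer representation. The sketch below reproduces the example of Fig. 2, assuming that coordinate 1 is the least significant bit; the function names are hypothetical.

```python
def mask_from_coordinates(coords):
    """Build an integer mask from 1-based coordinates (coordinate 1 = lowest bit)."""
    mask = 0
    for c in coords:
        mask |= 1 << (c - 1)
    return mask


def tail(mask):
    """i_1 - 1: the number of coordinates below the lowest coordinate of the mask."""
    return (mask & -mask).bit_length() - 1


def head(mask):
    """i_d: the highest coordinate of the mask."""
    return mask.bit_length()


def dimension(mask):
    """d: the number of coordinates the mask projects onto."""
    return bin(mask).count("1")


def holes(mask):
    """Coordinates between the tail and the head that the mask does not cover."""
    return [c for c in range(tail(mask) + 1, head(mask) + 1)
            if not (mask >> (c - 1)) & 1]


def is_contiguous(mask):
    return head(mask) == tail(mask) + dimension(mask)


m = mask_from_coordinates([4, 6, 9, 10, 11])   # the mask of Fig. 2
```

For this mask, tail(m) is 3, head(m) is 11, and holes(m) is [5, 7, 8], so the mask is not contiguous.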
For a mask m and an element $e_i$ of the basis of S, $m_{>i}$, $m_{=i}$, and $m_{<i}$ are the projections of m onto the basis vectors $e_{i+1}, \ldots, e_n$, onto $e_i$, and onto $e_1, \ldots, e_{i-1}$, respectively. Therefore:
$m = m_{>i} \mid m_{=i} \mid m_{<i}.$
A similar relation holds for the projected spaces. One or more of the projections may be empty. Similarly, a pattern p can be decomposed as:
$p = p_{>i} \mid p_{=i} \mid p_{<i}.$
The element of a subspace T whose coordinates are all 0 (all 1) is denoted $0_T$ ($1_T$). $0_m$ stands for $0_{S(m)}$.
A partial order on the set of masks is defined as:
$m_1 > m_2 \iff \mathrm{tail}(m_1) \geq \mathrm{head}(m_2).$
Among the partitions of a mask m into contiguous masks, the canonical partition of m is the one with the smallest number of masks. Its masks are listed in descending order, from the high bit positions to the low bit positions.
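One way to obtain the canonical partition is to split the mask into its maximal runs of adjacent bits, taken from the highest bit downwards. The following sketch does this for integer masks and checks the partial order defined above; the function names are hypothetical and bit positions are assumed to start at the least significant bit.

```python
def tail(mask):
    return (mask & -mask).bit_length() - 1


def head(mask):
    return mask.bit_length()


def canonical_partition(mask):
    """Split a mask into maximal runs of adjacent bits, highest run first."""
    parts = []
    while mask:
        top = 1 << (mask.bit_length() - 1)
        part = 0
        while top and mask & top:      # extend the run downwards
            part |= top
            top >>= 1
        mask ^= part                   # remove the run from the mask
        parts.append(part)
    return parts


def greater(m1, m2):
    """Partial order on masks: m1 > m2 iff tail(m1) >= head(m2)."""
    return tail(m1) >= head(m2)


# Mask on coordinates 4, 6, 9, 10, and 11 (bit 0 is coordinate 1).
m = 0b11100101000
parts = canonical_partition(m)
```

Here parts contains three contiguous masks, covering coordinates 9 to 11, coordinate 6, and coordinate 4, and each mask is greater than the next one in the partial order.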
The smallest element of S that matches a fixed pattern p has the form $PSP_{min} = 0_{\sim m} \mid p$, and the largest element has the form $PSP_{max} = 1_{\sim m} \mid p$. These elements form the bounding interval of the fixed pattern search problem. Although both depend on p, their difference, spread(m, PSP), does not depend on p. That is, $\mathrm{spread}(m, PSP) = 1_{\sim m} \mid 0_m$.
For a range or set pattern with smallest element a and largest element b, $PSP_{min} = 0_{\sim m} \mid a$ and $PSP_{max} = 1_{\sim m} \mid b$. For a contiguous mask, the spread depends only on the difference b - a. This is not true in general, however.
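The bounding interval follows directly from the mask and its co-mask. The sketch below computes $PSP_{min}$, $PSP_{max}$, and the spread for n-bit integer keys and then intersects the theoretical interval with [min(A), max(A)]. The function names are hypothetical, and patterns are assumed to be already placed in the bit positions of the mask.

```python
def co_mask(n, mask):
    """~m restricted to n bits."""
    return ((1 << n) - 1) & ~mask


def bounding_interval(n, mask, a, b=None):
    """Return (PSP_min, PSP_max) = (0_{~m} | a, 1_{~m} | b); a point pattern has b = a."""
    b = a if b is None else b
    return a, co_mask(n, mask) | b


def spread(n, mask):
    """1_{~m} | 0_m, the difference PSP_max - PSP_min of a fixed pattern."""
    return co_mask(n, mask)


def actual_interval(psp_min, psp_max, keys):
    """Intersect the theoretical interval with [min(A), max(A)]."""
    return max(psp_min, min(keys)), min(psp_max, max(keys))


n, m = 6, 0b011000
lo1, hi1 = bounding_interval(n, m, 0b010000)             # point pattern
lo2, hi2 = bounding_interval(n, m, 0b001000)             # another point pattern
lo3, hi3 = bounding_interval(n, m, 0b001000, 0b010000)   # range pattern [a, b]
```

Both point patterns give intervals of the same width, 0b100111, which illustrates that the spread of a fixed pattern does not depend on p.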
As described above, the gz-curve is used as a space-filling curve for the Cartesian product T of N integer attribute domains. The cardinality of each domain $D_i$ is a power of 2, and the way the domain participates in forming the elements of the gz-curve is described by the domain mask $m_{D_i}$.
As the gz-curve passes through the space T, a fundamental region $T_r$ of order r is a rectangular box of volume $2^r$ that corresponds to an interval of the gz-curve, where r = 0, ..., n. Each $T_0$ region is a single point, and $T_n = T$. The ends of the intervals are aligned with the corresponding powers of 2. Each fundamental region contains fundamental regions of lower order. All regions of a given order r are copies of each other, and the shape of the gz-curve within them is identical. Their number in T is $2^{n+1-r}$. Where no confusion arises, the corresponding intervals on the gz-curve are also called fundamental.
The solution locus of a PSP on the gz-curve consists of specific intervals, or clusters, which in some cases degenerate to points. Gaps are the spaces between clusters, not including the spaces at the ends of the curve. Quantities that characterize the locus of a PSP include the number of clusters, the cluster lengths, the total gap length, and the individual gap lengths.
In this example, m is any mask projecting d dimension, { m iit is its standard subregion.So, x & m=p is a PSP.So, the track of PSP is by total length spread (m, PSP)-2 n-dthe length that separates of space be length 2 tail (m)interval 2 n-d-tail (m).Interval gap lengths is part summation:
&Sigma; i &GreaterEqual; j &lsqb; 2 h e a d ( m i ) - 2 t a i l ( m i ) &rsqb; .
For the continuous mask with d bit, fundamental region T head (m)interior 2 tail (m)2 of size din individual adjacent interval, only an interval meets the given fixed mode constraint with mask m, and that interval is also fundamental region T tail (m).Also have 2 n-head (m)this type of region individual.Therefore, the track of the point met on the gz curve of constraint is length is 2 head (m)– 2 tail (m)the length that separates of space be 2 tail (m)2 n-head (m)individual bunch.Fig. 3 shows Figure 160 of the structure of the track of diagram point PSP.
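The cluster structure for a contiguous mask can be checked by brute force on a small key space. The following Java sketch is illustrative only and is not taken from the patent: the class and method names are hypothetical, keys are assumed to fit in a `long`, and head(m) is assumed to equal tail(m) + d for a contiguous mask, as the counts in the paragraph above imply.

```java
import java.util.ArrayList;
import java.util.List;

final class PointLocus {
    /** Lengths of the maximal runs (clusters) of keys x in [0, 2^n) with (x & mask) == pattern. */
    static List<Integer> clusterLengths(int n, long mask, long pattern) {
        List<Integer> lengths = new ArrayList<>();
        int run = 0;
        for (long x = 0; x < (1L << n); x++) {
            if ((x & mask) == pattern) {
                run++;
            } else if (run > 0) {
                lengths.add(run);
                run = 0;
            }
        }
        if (run > 0) {
            lengths.add(run);
        }
        return lengths;
    }

    /** Lengths of the gaps between consecutive clusters; the ends of the curve are not counted. */
    static List<Integer> gapLengths(int n, long mask, long pattern) {
        List<Integer> gaps = new ArrayList<>();
        long lastMatch = -1;
        for (long x = 0; x < (1L << n); x++) {
            if ((x & mask) != pattern) {
                continue;
            }
            if (lastMatch >= 0 && x - lastMatch > 1) {
                gaps.add((int) (x - lastMatch - 1));
            }
            lastMatch = x;
        }
        return gaps;
    }
}
```

For n = 8 and a mask on bits 4 and 5 (tail = 4, head = 6 under the stated assumption), the enumeration gives 2^(8-6) = 4 clusters of length 2^4 = 16 separated by gaps of length 2^6 - 2^4 = 48, in line with the formulas above.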
The general case follows by repeating this basic argument within each fundamental region $T_{\mathrm{tail}(m)}$. For the next component of the mask, the previously identified matching clusters are considered in place of $S$, and the gaps at the region boundaries are taken into account.
In one example, $m$ is a mask projecting onto $d$ dimensions, and $p$ is an element of $S(m)$. For a subset $A$ of $S$, the fixed-pattern search problem $\mathrm{PSP}(m, p)$ is to find all elements $x \in A$ such that $x \mathbin{\&} m = p$. Any partition $\{m_i\}$ of the mask induces a partition $\{p_i\}$ of the pattern. An element of $A$ matches $p$ on $m$ if and only if, for each $i$, it matches $p_i$ on $m_i$.
For the mismatch operation, the matcher checks the problems $\mathrm{PSP}(m_i, p_i)$ one at a time. If $x \mathbin{\&} m_i \neq p_i$, let $e_j$ be the highest-order bit at which $x$ does not match the pattern $p_i$. If $x \mathbin{\&} m_i > p_i$, the matcher returns $j$, and if $x \mathbin{\&} m_i < p_i$, the matcher returns $-j$. If $x \mathbin{\&} m_i = p_i$, the matcher proceeds to $\mathrm{PSP}(m_{i+1}, p_{i+1})$, and so on. If no mismatch is detected, the matcher returns 0.
$I$ is the identity mask on $S$, that is, the mask that projects onto the whole of $S$.
For the hint operation, given an element $x \in S$ and a mismatch at position $j$, if the mismatch is negative, the matcher returns:
$$\mathrm{hint}(x, j) = x_{I_{>j}} \mid 1_{I_{=j}} \mid p_{m_{<j}} \mid 0_{\sim m_{<j}}$$
The highest bit position changed is $j$. Geometrically, this means that $x$ belongs to a fundamental region $T_{j-1}$ that does not intersect the locus of the PSP. Changing this bit places the result in the next such region. Since no bit above $j$ is changed, the result lies in the same fundamental region $T_j$ that contains $x$.
If the mismatch is positive, the geometric meaning of the operation is similar, but the next fundamental region $T_{j-1}$ that intersects the PSP locus lies in a different fundamental region, of higher order than the one containing $x$. To find this region, the matcher finds the growth point $g$, which is the lowest position above $j$ of an unset (0) bit in $(x \mathbin{\&} {\sim}m)_{>j}$, that is, among the positions of $x$ not covered by the mask. If no such position exists, the search terminates and $\infty$ is returned. Otherwise, the value $\mathrm{hint}(x, g)$ is returned.
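The mismatch and hint operations for a point PSP can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the patent's implementation: keys are held in a `long` with n at most 62 rather than in byte arrays, the whole mask is checked at once instead of component by component, bit positions are 1-based so that a signed position is unambiguous, and `PointMatcher`, `NO_HINT`, and the method names are hypothetical. The pattern is assumed to have no bits outside the mask.

```java
final class PointMatcher {
    /** Sentinel standing in for "infinity": no matching key greater than x exists. */
    static final long NO_HINT = -1L;

    /** Signed 1-based position of the highest bit where x disagrees with the pattern, or 0 on a match. */
    static int mismatch(long x, long mask, long pattern) {
        long diff = (x & mask) ^ pattern;
        if (diff == 0) {
            return 0;
        }
        int j = 64 - Long.numberOfLeadingZeros(diff);
        boolean xBitSet = ((x >>> (j - 1)) & 1L) == 1L;
        return xBitSet ? j : -j; // x above the pattern: positive, x below it: negative
    }

    /** Smallest n-bit key greater than x that matches the pattern; expects a nonzero mismatch. */
    static long hint(long x, long mask, long pattern, int n, int mismatch) {
        int j = -mismatch;
        if (mismatch > 0) {
            // Growth point: lowest position above the mismatch, outside the mask, where x has a 0 bit.
            j = 0;
            for (int g = mismatch + 1; g <= n; g++) {
                long bit = 1L << (g - 1);
                if ((mask & bit) == 0 && (x & bit) == 0) {
                    j = g;
                    break;
                }
            }
            if (j == 0) {
                return NO_HINT;
            }
        }
        long below = (1L << (j - 1)) - 1;      // all positions below j
        long atOrBelow = (below << 1) | 1L;    // position j and everything below it
        // Keep x above j, set bit j, copy the pattern below j, zero the remaining low bits.
        return (x & ~atOrBelow) | (1L << (j - 1)) | (pattern & below);
    }
}
```

With an 8-bit key, a mask on positions 5 and 6, and the pattern that has only position 6 set, the key 21 gives a negative mismatch at position 6 and a hint of 32, while the key 53 gives a positive mismatch at position 5, grows at position 7, and yields a hint of 96.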
The number $N_1$ of frog jumps can be estimated. A jump is made only when a mismatch is detected, that is, when $x$ belongs to some gap. After the jump, the frog lands in the next cluster. Therefore, the number of jumps cannot exceed the number of gaps, $2^{n-d-\mathrm{tail}(m)} - 1$. If this number is less than $R \cdot \mathrm{card}(A) \cdot (1 - 2^{d-n})$, the frog finishes before the crawler. This holds for some masks, for example for a contiguous headless mask, because then $n = d + \mathrm{tail}(m)$. In this example:
$$R > R_1(m, A) = \frac{2^{n-d-\mathrm{tail}(m)} - 1}{\mathrm{card}(A) \cdot (1 - 2^{d-n})}.$$
In another estimate, $A$ is assumed to be uniformly distributed in $S$. $d_A = \mathrm{card}(A)/\mathrm{card}(S)$ is the average density of $A$. The expected number of points in all the gaps is $d_A \cdot (\mathrm{spread}(m, \mathrm{PSP}) - 2^{n-d})$. This can be rewritten as:
$$\mathrm{card}(A) \cdot \frac{2^n - \overline{m} - 2^{n-d}}{2^n} = \mathrm{card}(A) \cdot \left(1 - 2^{-d} - 2^{-n}\,\overline{m}\right).$$
Therefore, the estimate for $N_1$ can be rewritten as:
$$\left(1 - 2^{-d} - 2^{-n}\,\overline{m}\right) < R \cdot (1 - 2^{d-n}),$$
or as:
$$R > R_2(m) = \frac{1 - 2^{-d} - 2^{-n}\,\overline{m}}{1 - 2^{d-n}}.$$
The scan-to-seek ratio $R$ is less than 1, but so is $R_2(m)$. The minimum value of $\overline{m}$ is attained when $m$ is the contiguous tailless mask projecting onto the $d$ lowest bit positions. Therefore,
$$\min(\overline{m}) = 2^d - 1,$$
and
$$R_2(m) \leq \frac{1 - 2^{-d} - 2^{-n}(2^d - 1)}{1 - 2^{d-n}} = \frac{(1 - 2^{d-n})(1 - 2^{-d})}{1 - 2^{d-n}} = 1 - 2^{-d} < 1.$$
In one example, $m$ is a mask projecting onto $d$ dimensions, or bits, of $S$, and $A$ is a nonempty subset of $S$. $R_1(m, A)$ is defined as:
$$R_1(m, A) = \frac{2^{n-d-\mathrm{tail}(m)} - 1}{\mathrm{card}(A) \cdot (1 - 2^{d-n})}.$$
In addition, $R_2(m)$ is defined as:
$$R_2(m) = \frac{1 - 2^{-d} - 2^{-n}\,\overline{m}}{1 - 2^{d-n}}.$$
If the scan-to-seek ratio $R$ of the data store satisfies the estimate $R > \min(R_1(m, A), R_2(m))$, the frog strategy may outperform the crawler strategy. In one example, the grasshopper method checks this condition. If the condition holds, the threshold is set to 0, and the grasshopper follows the frog.
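The two estimates are simple enough to evaluate directly. The sketch below is a hypothetical helper, not part of the described system; it uses `double` arithmetic for readability, and it assumes that the barred symbol in $R_2$ is the integer value of the mask, passed here as `maskValue`.

```java
final class FrogEstimate {
    /** R1(m, A) = (2^(n-d-tail) - 1) / (card(A) * (1 - 2^(d-n))). */
    static double r1(int n, int d, int tail, double cardA) {
        return (Math.pow(2, n - d - tail) - 1) / (cardA * (1 - Math.pow(2, d - n)));
    }

    /** R2(m) = (1 - 2^(-d) - 2^(-n) * maskValue) / (1 - 2^(d-n)). */
    static double r2(int n, int d, double maskValue) {
        return (1 - Math.pow(2, -d) - Math.pow(2, -n) * maskValue) / (1 - Math.pow(2, d - n));
    }

    /** True when the scan-to-seek ratio r exceeds the smaller of the two estimates. */
    static boolean frogMayWin(double r, double r1, double r2) {
        return r > Math.min(r1, r2);
    }
}
```

For n = 8 and d = 2, a tailless mask (value 3) gives $R_2 = 1 - 2^{-2} = 0.75$, the upper bound derived above, and a headless mask (tail = 6) gives $R_1 = 0$, so the frog condition holds for any positive ratio.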
However, the grasshopper has other ways to outperform the crawler. A jump is made only when the matcher detects a nonzero mismatch, which happens when the current element $x \in A$ lies in a gap between clusters of the PSP locus. The grasshopper method determines whether the gap is large enough to contain a sufficient number of elements of $A$, so that the grasshopper skips them when it jumps to the next cluster. If the gap is large enough to contain $X$ elements, then on encountering the first element in the gap the grasshopper skips $X - 1$ elements, whereas the crawler visits each of them. Comparing the crawler and grasshopper strategies amounts to comparing the cost of $X \cdot N_2$ scans with the cost of $N_2$ seeks. The grasshopper wins when $X > 1/R$. Given the scan-to-seek ratio $R$, the grasshopper determines whether there are gaps long enough to contain at least $X$ elements.
If the elements of $A$ are uniformly distributed, the average number of elements of $A$ in a gap of length $L$ is estimated as $d_A \cdot L$. Therefore, the gap length should be greater than $1/(d_A \cdot R)$. Starting from the last element of the partition $\{m_i\}$, the partial sums of the individual gap lengths are evaluated until the sum exceeds $1/(d_A \cdot R)$. If this happens at element $m_j$, the threshold is set to $t = \mathrm{tail}(m_j)$. If the sum never becomes large enough, the threshold is set to $n$, and the grasshopper method acts as a crawler. $d_A$ and $R$ can be computed in advance.
In one example, $m$ is a mask with standard partition $\{m_i\}$ projecting onto $d$ dimensions, or bits, of $S$, and $A$ is a nonempty subset of $S$. In addition, $R$ is the scan-to-seek ratio of the data store, and $j_0$ is the minimum value of $j$ for which the partial sum exceeds
$$\frac{2^n}{\mathrm{card}(A) \cdot R}.$$
If the value $j_0$ exists, the grasshopper method with threshold $t = \mathrm{tail}(m_{j_0})$ may outperform the crawler method.
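The threshold selection described above can be sketched as follows. This is an illustrative Java version under several assumptions that are not in the source: the mask fits in a `long` with n at most 62, head is the 1-based position of a component's highest bit, tail is the number of key bits below the component, and the names `GrasshopperThreshold`, `standardPartition`, and `threshold` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class GrasshopperThreshold {
    /** Maximal contiguous runs of set bits in the mask, highest first, each as {head, tail}. */
    static List<int[]> standardPartition(long mask) {
        List<int[]> runs = new ArrayList<>();
        int b = 62;
        while (b >= 0) {
            if (((mask >>> b) & 1L) == 0) {
                b--;
                continue;
            }
            int head = b + 1;
            while (b >= 0 && ((mask >>> b) & 1L) == 1L) {
                b--;
            }
            runs.add(new int[] {head, b + 1});
        }
        return runs;
    }

    /**
     * Accumulates gap lengths 2^head - 2^tail from the lowest component upward and returns
     * the tail of the first component at which the sum exceeds 2^n / (card(A) * R).
     * Returns n when the sum never gets that large, which corresponds to pure crawling.
     */
    static int threshold(long mask, int n, double cardA, double r) {
        double needed = Math.pow(2, n) / (cardA * r);
        List<int[]> runs = standardPartition(mask);
        double sum = 0;
        for (int i = runs.size() - 1; i >= 0; i--) {
            int[] run = runs.get(i);
            sum += Math.pow(2, run[0]) - Math.pow(2, run[1]);
            if (sum > needed) {
                return run[1];
            }
        }
        return n;
    }
}
```

For an 8-bit key and a mask covering bits 1 to 2 and 4 to 5, the partial sums are 6 and 54. With 64 keys and R = 0.5 the required gap length is 8, so the threshold is the tail of the upper component, 4; with only 4 keys the required length is 128 and the method falls back to crawling.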
If the matcher returns a negative mismatch value $-y$ for $x$, then $x$ belongs to a gap within a fundamental region $T_y$, and the next cluster of the PSP locus lies in the same fundamental region. If the mismatch is positive, $x$ lies in a larger gap between two fundamental regions of order higher than $y$. Therefore, the grasshopper should operate with two different thresholds and jump more readily after encountering a positive mismatch.
In another embodiment, the scanning phase can be enhanced by finding the position of the last point of the cluster of the PSP locus to which the current element belongs. The method then blindly accepts every element encountered before this end point. This means checking an inequality instead of checking a match; the cost of the two operations is roughly the same. When an element that does not satisfy the inequality is encountered, it is still verified against the pattern. Computing the end of a cluster is simple but carries an extra cost. Its efficiency depends on how the storage interface is implemented.
One embodiment method handles the partitioned case. If the data are partitioned and the partitions are scanned in parallel, the grasshopper method gains an additional benefit from determining thresholds specific to each partition.
When a partition is factorizable, there is a common pattern shared by all of its elements. In one example, the data are partitioned by intervals. If an interval $L$ in $S$ is factorizable, there is a common prefix pattern $P$ and a corresponding prefix mask $M_L$ projecting onto $d_L$ dimensions, such that $L = P \mid L'$, where $L'$ is an interval in the $(n - d_L)$-dimensional space. Some stores use prefix compression to keep a single copy of the prefix and only $n - d_L$ bits of each key. If such a store also provides access to the truncated keys, the resulting dimension reduction increases efficiency. If this access is unavailable, the store performs multiple memory allocations and copies to assemble the full-length keys.
Computing the prefix from the boundaries of $L$ is straightforward, and an additional reduction is possible. For example, consider:
$$S' = S(m) \cap S(M_L),$$
which is computed by a mask operation. If $S'$ is nonempty, then $m'$ is the corresponding intersection mask. If $p_{m'} \neq P_{m'}$, the whole interval $L$ lies outside the PSP locus, as a clean mismatch, and can be skipped. Otherwise, if $m' = m$:
$$S(M_L) \Subset S(m),$$
and therefore the whole interval $L$ lies inside the PSP locus, as a clean match, so its points are added to the result without inspection. Otherwise, the mask in the PSP can be replaced by $m'' = m \setminus m'$, and the pattern by $p_{m''}$. When the threshold is computed, the dimension $n$ can be reduced by the dimension of $S(M_L)$.
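The per-interval reduction can be sketched with plain bit operations. The Java fragment below is a hypothetical illustration that assumes keys fit in a `long` with n at most 62 and that the partition interval is given by its inclusive bounds `lo` and `hi`; the `Outcome` names and the method names are not from the source.

```java
final class IntervalReduction {
    enum Outcome { SKIP, ACCEPT_ALL, REDUCED }

    /** Mask of the leading bits shared by every key in [lo, hi], within an n-bit key. */
    static long prefixMask(long lo, long hi, int n) {
        int differing = 64 - Long.numberOfLeadingZeros(lo ^ hi); // low bits that may vary
        long all = (1L << n) - 1;
        return all & ~((1L << differing) - 1);
    }

    /** Classifies the interval [lo, hi] against the point constraint (x & mask) == pattern. */
    static Outcome classify(long mask, long pattern, long lo, long hi, int n) {
        long prefixMask = prefixMask(lo, hi, n);
        long prefix = lo & prefixMask;
        long common = mask & prefixMask;           // intersection mask m'
        if ((pattern & common) != (prefix & common)) {
            return Outcome.SKIP;                   // clean mismatch: no key in the interval matches
        }
        if (common == mask) {
            return Outcome.ACCEPT_ALL;             // clean match: every key in the interval matches
        }
        return Outcome.REDUCED;                    // only the bits in remainingMask need checking
    }

    /** Mask bits that still have to be checked inside the interval. */
    static long remainingMask(long mask, long lo, long hi, int n) {
        return mask & ~prefixMask(lo, hi, n);
    }
}
```

With the 8-bit mask on bits 4 and 5 and the pattern that sets only bit 5, the interval [32, 47] is accepted as a whole, [64, 79] is skipped, and [32, 63] is reduced to a check of bit 4 alone.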
In another embodiment, there is a range constraint. In one example, the pattern constraint is of class (R):
$$x \mathbin{\&} m \in [a, b].$$
First, the embodiment method checks whether $a = b$. If $a = b$, the constraint is a point constraint; if $a \neq b$, it is a range constraint. Next, it is determined whether the interval is factorizable. The longest common prefix $p$ of $a$ and $b$ is computed. If this common prefix exists, then:
$$[a, b] = [p \mid a', p \mid b'] = p \mid [a', b'],$$
and all points of this interval share the same prefix $p$. This induces a decomposition of the mask $m$ into a prefix mask and a suffix mask:
$$m = m_{\mathrm{prefix}} \mid m_{\mathrm{suffix}}$$
The original PSP is converted into a system of two PSPs:
$$x \mathbin{\&} m_{\mathrm{prefix}} = p,$$
and:
$$x \mathbin{\&} m_{\mathrm{suffix}} \in [a', b'].$$
In one example, the locus of the original PSP is a subset of the locus of the first of these problems, whose configuration is known. Therefore, the approach for the partitioned case can be used. For range-specific techniques, only intervals $[a, b]$ that are not factorizable need to be considered.
For such an interval, the elements $a$ and $b$ differ in the highest bit position, where they have 0 and 1 respectively; otherwise they would have a common prefix. If all bits of $a$ are 0 and all bits of $b$ are 1, the interval is complete. For a complete interval, all elements of $A$ are solutions.
If an interval is factorizable and its suffix interval $[a', b']$ is complete, the interval is suffix-complete. For example, the interval [12, 15] is suffix-complete because [12, 15] = 12 | [0, 3], but the interval [11, 14] is not because [11, 14] = 8 | [3, 6]. For a suffix-complete interval, the original range PSP is converted into a point PSP.
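The factorization of a range into a common prefix and a suffix interval can be sketched with an XOR of the bounds. This Java helper is illustrative only; it assumes nonnegative bounds that fit comfortably in a `long`, and the names `RangeFactorization`, `factor`, and `isSuffixComplete` are hypothetical. Whether a prefix "exists" for a d-bit projection corresponds here to the returned suffix width being smaller than d.

```java
final class RangeFactorization {
    /** Returns {prefix, aSuffix, bSuffix, suffixBits} for the range [a, b]. */
    static long[] factor(long a, long b) {
        int suffixBits = 64 - Long.numberOfLeadingZeros(a ^ b); // bits below the common prefix
        long suffixMask = (1L << suffixBits) - 1;
        return new long[] {a & ~suffixMask, a & suffixMask, b & suffixMask, suffixBits};
    }

    /** True when the suffix interval covers every value of the suffix bits. */
    static boolean isSuffixComplete(long a, long b) {
        long[] f = factor(a, b);
        long suffixMask = (1L << f[3]) - 1;
        return f[1] == 0 && f[2] == suffixMask;
    }
}
```

Applied to the examples in the text, `factor(12, 15)` yields the prefix 12 with suffix interval [0, 3], which is complete, while `factor(11, 14)` yields the prefix 8 with suffix interval [3, 6], which is not.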
If the interval is incomplete and not factorizable, it may sweep almost the whole corresponding $d$-dimensional subspace, in which case the PSP locus is almost the whole space $S$. The smaller the interval, the closer it is to the point case, and the more opportunities there are to jump over larger gaps.
The locus of a point PSP consists of intervals of equal length with gaps between them. This is not the case for a range constraint.
In the range example, $m$ is an arbitrary mask projecting onto $d$ dimensions, and $\{m_i\}$ is its standard partition. $x \mathbin{\&} m \in [a, b]$ is a range PSP, $r$ is the cardinality of $[a, b]$, and $r_i$ is the cardinality of $[a_{m_i}, b_{m_i}]$. The locus of the PSP generally consists of clusters of different lengths, separated by gaps of total length:
$$\mathrm{spread}(m, \mathrm{PSP}) - r \cdot 2^{n-d}$$
The spread can be computed as:
$$b \mid 1_{\sim m} - a \mid 0_{\sim m} + 1.$$
The individual gap lengths are the partial sums:
$$\sum_{i \geq j} \left[ 2^{\mathrm{head}(m_i)} - r_i \cdot 2^{\mathrm{tail}(m_i)} \right].$$
If the mask $m$ is contiguous, then, as with a point query, within each fundamental region $T_{\mathrm{head}(m)}$ in $S$ the locus of the PSP is a single interval of size $r \cdot 2^{\mathrm{tail}(m)}$, where $r = b - a + 1$ is the length of the interval. Therefore, the gap between intervals is:
$$2^{\mathrm{head}(m)} - r \cdot 2^{\mathrm{tail}(m)}$$
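As a worked illustration of these formulas, consider a hypothetical configuration that does not come from the source: an 8-bit key ($n = 8$), a contiguous mask on three bits with $\mathrm{tail}(m) = 2$ and $\mathrm{head}(m) = 5$ (so $d = 3$), and the range $[a, b] = [2, 4]$, so $r = 3$.

```latex
\begin{align*}
\text{cluster length} &= r \cdot 2^{\mathrm{tail}(m)} = 3 \cdot 2^{2} = 12,\\
\text{gap length} &= 2^{\mathrm{head}(m)} - r \cdot 2^{\mathrm{tail}(m)} = 32 - 12 = 20,\\
\text{number of clusters} &= 2^{\,n-\mathrm{head}(m)} = 2^{3} = 8,\\
\mathrm{spread}(m,\mathrm{PSP}) &= (b \mid 1_{\sim m}) - (a \mid 0_{\sim m}) + 1 = 243 - 8 + 1 = 236,\\
\text{total gap length} &= \mathrm{spread}(m,\mathrm{PSP}) - r \cdot 2^{\,n-d} = 236 - 96 = 140 = 7 \cdot 20 .
\end{align*}
```

The eight clusters are separated by seven gaps of length 20, so the total gap length computed from the spread agrees with the per-gap formula.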
In the non-contiguous case, unlike the point query, the partial PSPs for the individual mask components are not independent: the second PSP depends on the state of the first. For example, consider PSP1:
$$x \mathbin{\&} m_1 \in [a_{m_1}, b_{m_1}].$$
In this example, if $x \mathbin{\&} m_1$ lies strictly between $a_{m_1}$ and $b_{m_1}$, then $x$ certainly solves the original PSP, regardless of the PSP for the second mask. If $x \mathbin{\&} m_1 < a_{m_1}$ or $x \mathbin{\&} m_1 > b_{m_1}$, then $x$ is certainly not a solution. If $x \mathbin{\&} m_1 = a_{m_1}$ or $x \mathbin{\&} m_1 = b_{m_1}$, then $x$ is a solution provided it solves the second PSP, PSP2, which is either:
$$x \mathbin{\&} m_2 \in [a_{m_2}, 1_{m_2}],$$
denoted PSP2(a), or
$$x \mathbin{\&} m_2 \in [0_{m_2}, b_{m_2}],$$
denoted PSP2(b). Either or both of the corresponding intervals may degenerate to a point.
In one example, $r_2(a)$ and $r_2(b)$ are the lengths of the intervals of PSP2(a) and PSP2(b), respectively. In each fundamental region $T_{\mathrm{head}(m_1)}$, the locus of PSP1 is a single interval of length $r_1 \cdot 2^{\mathrm{tail}(m_1)}$, the first-order interval. The locus of the original PSP within the fundamental region is contained in this interval and includes an interval of length:
$$(r_1 - 2) \cdot 2^{\mathrm{tail}(m_1)}$$
which corresponds to the interior part:
$$(a_{m_1}, b_{m_1}) = [a_{m_1} + 1, b_{m_1} - 1]$$
if the latter is nonempty. Within the first-order interval but outside this interior part, there are two series of second-order intervals, one interval in each fundamental region $T_{\mathrm{head}(m_2)}$, corresponding to PSP2(a) and PSP2(b) and lying to the left and to the right of the interior part, respectively. One second-order interval in each series adjoins the interior part of the first-order interval on its respective side, so the total number of intervals in this fundamental region is at most:
$$2 \cdot \left(2^{\mathrm{tail}(m_1) - \mathrm{head}(m_2)} - 1\right).$$
This is illustrated by locus configuration 170 in FIG. 4.
If the mask has three components, the picture changes in a similar manner. Each second-order interval has its own interior part that belongs to the PSP locus, and there are two series of third-order intervals in the space between the ends of the second-order interval and its interior part.
For the mismatch operation, the matcher checks the problems $\mathrm{PSP}(m_i, [a_i, b_i])$ one at a time. If $x \mathbin{\&} m_i \in (a_i, b_i)$, the matcher returns 0, indicating a match. If $x \mathbin{\&} m_i$ lies outside $[a_i, b_i]$, let $e_j$ be the highest-order bit at which $x$ does not match the pattern $[a_i, b_i]$. If $x \mathbin{\&} m_i > b_i$, the matcher returns $j$, and if $x \mathbin{\&} m_i < a_i$, the matcher returns $-j$. If $x \mathbin{\&} m_i = a_i$ or $x \mathbin{\&} m_i = b_i$, the matcher proceeds to $\mathrm{PSP}(m_{i+1}, [a_{i+1}, b_{i+1}])$, where the interval is $[a_{m_{i+1}}, 1_{m_{i+1}}]$ or $[0_{m_{i+1}}, b_{m_{i+1}}]$, respectively.
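The component-by-component range check can be sketched as a small state machine that remembers whether the key is still pinned to the lower bound, the upper bound, or neither. The Java sketch below is a simplification, not the patent's matcher: it returns only a boolean rather than a signed mismatch position, it also tolerates equal leading components of the bounds, and the representation of mask components as `{shift, width}` pairs, listed from the highest component to the lowest, is a hypothetical choice.

```java
final class RangeMatcher {
    /**
     * comps[i] = {shift, width} of the i-th mask component, highest component first;
     * a[i] and b[i] are the projections of the range bounds onto that component.
     */
    static boolean matches(long x, int[][] comps, long[] a, long[] b) {
        boolean onLow = true;   // every component so far equals the lower bound
        boolean onHigh = true;  // every component so far equals the upper bound
        for (int i = 0; i < comps.length; i++) {
            long full = (1L << comps[i][1]) - 1;
            long v = (x >>> comps[i][0]) & full;
            long lo = onLow ? a[i] : 0L;     // interval [a_i, 1...1] or [0, b_i] as in the text
            long hi = onHigh ? b[i] : full;
            if (v < lo || v > hi) {
                return false;                // outside the interval: mismatch
            }
            onLow = onLow && v == a[i];
            onHigh = onHigh && v == b[i];
            if (!onLow && !onHigh) {
                return true;                 // strictly inside: later components cannot matter
            }
        }
        return true;
    }
}
```

For a mask with components on bits 4 to 5 and bits 1 to 2 and the projected range [6, 9], the bounds split into a = (1, 2) and b = (2, 1); a key whose upper component is 1 must then have a lower component of at least 2, and a key whose upper component is 2 must have a lower component of at most 1.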
In one example, $I$ is the identity mask on $S$, that is, the mask that projects onto the whole of $S$.
For the hint operation, given an element $x \in S$ and a mismatch position $j$, if the mismatch is negative, the matcher first computes a preliminary hint $h_1$ as follows:
$$x_{I_{>j}} \mid 1_{I_{=j}} \mid 0_{\sim m_{<j}}$$
If the preliminary hint is not in $[a, b]$, where $[a, b]$ depends on the order of $x$, the hint is corrected to:
$$h_1 \mid a_{m_{<j}}$$
The highest bit position changed is $j$.
If the mismatch is positive, the matcher determines the growth point $g$, which is the lowest position above $j$ of an unset (0) bit in $(x \mathbin{\&} {\sim}m)_{>j}$. If no such position exists, the search ends and $\infty$ is returned. Otherwise, the hint is computed as above using $g$ instead of $j$.
The partitioned case is handled for range queries much as for point queries, except that suitable range bounds are computed for each interval of the partition.
In another embodiment, a set query is performed. There is a pattern constraint of class (S): $x \mathbin{\&} m \in E$, where $E$ is some set. In one example, the set is sorted. First, the spread of $E$ is checked to determine whether it equals the cardinality of $E$. If it does, $E$ is a range. This also eliminates single-element sets.
Next, it is determined whether the set is factorizable. The maximal common pattern $p$ of all elements of $E$ is computed. If this common pattern exists, then $E = p \mid E'$, with the corresponding decomposition of the mask, $m = m_{\mathrm{common}} \mid m_{\mathrm{residual}}$. The original PSP is converted into a system of two PSPs, $x \mathbin{\&} m_{\mathrm{common}} = p$ and $x \mathbin{\&} m_{\mathrm{residual}} \in E'$. For the first problem, the locus configuration is known, and the locus of the original PSP is a subset of it, so the considerations for point queries apply. For set-specific techniques, only sets $E$ that are not factorizable are considered.
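Both preliminary checks for a set constraint reduce to a few integer operations. The following Java sketch is a hypothetical illustration: it assumes that E is given as a nonempty sorted array of distinct d-bit values, and it interprets the "common pattern" as the bit positions on which all elements agree, which may or may not coincide with the source's exact definition.

```java
final class SetPattern {
    /** True when the sorted, distinct values form a contiguous range (spread equals cardinality). */
    static boolean isRange(long[] sorted) {
        return sorted[sorted.length - 1] - sorted[0] + 1 == sorted.length;
    }

    /** Mask of the bit positions, among the low d bits, on which all elements agree. */
    static long commonMask(long[] elements, int d) {
        long and = ~0L;
        long or = 0L;
        for (long v : elements) {
            and &= v;
            or |= v;
        }
        long all = (1L << d) - 1;
        return ~(and ^ or) & all; // a bit differs somewhere exactly when AND and OR disagree
    }

    /** Values of the shared bits; zero outside the common mask. */
    static long commonPattern(long[] elements, int d) {
        return elements[0] & commonMask(elements, d);
    }
}
```

For the 4-bit set {8, 10, 13}, the spread is 6 while the cardinality is 3, so the set is not a range; all elements share the top bit, which gives a common mask and pattern of 8 and leaves the residual set {0, 2, 5} on the remaining three bits.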
Because a set consists of multiple individual points, the locus of a set PSP is the union of the loci of the corresponding point PSPs. This suggests that all clusters of the PSP locus have the same size and that their total number differs from that of a point query by a factor of $\mathrm{card}(E)$. As with range queries, if the set covers almost the whole space $S(m)$, the solution space may be large. In addition, depending on the distances between the set elements, the sizes of individual gaps vary widely.
However, because the set is entirely contained in the range $[\min(E), \max(E)]$, the estimates of the gaps outside it, within the appropriate fundamental regions, are similar to those for a range query. If these gaps are not large enough to justify jumps, there is the option of looking for sufficiently large gaps that correspond to the spaces between set elements.
The matcher does not decompose the locus of the set PSP into a union of point PSP loci. Instead, the matcher decomposes the set PSP into similar partial set PSPs along the components of the mask partition.
As with range queries, each subsequent PSP depends on the state of the previous one. For the first PSP, with constraint $x \mathbin{\&} m_1 \in E_1$, if $x \mathbin{\&} m_1 \notin E_1$, the search is interrupted immediately as a clean mismatch. If $y = x \mathbin{\&} m_1 \in E_1$, further search is reduced to the subset $E_2(y)$ of $E$ that consists of the elements having $y$ as a prefix. The next PSP is then $x \mathbin{\&} m_2 \in E_2(y)$. The matcher tracks all of these elements and, optionally, the adjacent element just below them, in order to determine the correct mismatch position. With each subsequent PSP, the cardinality of $E_i$ decreases rapidly.
Similarly, when providing a hint, the matcher finds the appropriate least element to which it can move from the current position.
As with range PSPs, partitioning by intervals adds a new aspect: on a particular interval, a set PSP may turn into a range or point PSP.
In other embodiments, multiple simultaneous constraints are processed. Because the locus of simultaneous PSPs is the intersection of the loci of the individual PSPs, the gap lengths add up. Therefore, a single threshold may be set.
When there are multiple pattern constraints of various types, the matcher starts by reducing the pattern constraints. The fixed patterns obtained from the point filters and from the factorization of the range and set queries are combined into a single fixed pattern. PSPs whose residual intervals are complete are eliminated. The matcher is then left with one point PSP and/or multiple range and set PSPs. The matcher employs a separate matcher for each PSP and lets them compete for the highest mismatch position. Given the mismatch, the hint is computed so as to satisfy all constraints simultaneously. If the mismatch is negative, the preliminary hint is computed and, if necessary, corrected by each individual matcher. If the mismatch is positive, all matchers compete for the lowest growth position, and processing then continues as for a negative mismatch.
In one example, the grasshopper strategy is tested with various thresholds, including threshold 0, which corresponds to the frog strategy, and is compared with the crawler strategy. With a lower threshold, the number of seeks increases at the cost of shorter jumps.
FIG. 5 shows a flowchart 180 of a method of database searching. First, in step 182, an index is created. Data dictionaries are created for all relevant dimension attributes, and every record is converted into a composite key-value pair. The pairs are sorted by key, where a key is a very large integer represented by a fixed-length array of bytes. Z-ordering may be used to form the keys. The key-value pairs are stored in key order. In one example, the composite key is encoded from the integers obtained from the component dictionaries. In one example, the bits are interleaved, so that a bit of one attribute may be adjacent to a bit of the next attribute.
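Composing a key from dictionary-encoded attribute values amounts to depositing each value's bits at the positions given by that attribute's domain mask. The sketch below is a minimal Java illustration that uses a `long` instead of the byte-array keys described in the text; the class `CompositeKey` and its methods are hypothetical names, and the masks are assumed to be pairwise disjoint.

```java
final class CompositeKey {
    /** Spreads the low bits of value over the set bits of mask, lowest bit first. */
    static long deposit(long value, long mask) {
        long result = 0;
        for (long m = mask; m != 0; m &= m - 1) {
            long lowest = m & -m;           // lowest remaining mask bit
            if ((value & 1L) != 0) {
                result |= lowest;
            }
            value >>>= 1;
        }
        return result;
    }

    /** Inverse of deposit: gathers the bits of key at the mask positions into a compact value. */
    static long extract(long key, long mask) {
        long value = 0;
        int out = 0;
        for (long m = mask; m != 0; m &= m - 1) {
            if ((key & (m & -m)) != 0) {
                value |= 1L << out;
            }
            out++;
        }
        return value;
    }

    /** Builds a composite key from attribute values and their disjoint domain masks. */
    static long compose(long[] values, long[] masks) {
        long key = 0;
        for (int i = 0; i < values.length; i++) {
            key |= deposit(values[i], masks[i]);
        }
        return key;
    }
}
```

With two 2-bit attributes interleaved bit by bit (masks `0b0101` and `0b1010`), the values (2, 1) compose to the key 6, and extracting with either mask recovers the original value. Other mask layouts, such as concatenation, give other gz-curve compositions with the same code.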
Then, in step 184, the database receives a query from a user. The query may have a point, range, or set filter on any attribute of the user's data, or a combination of several such filters.
Then, in step 186, the database establishes a threshold. The threshold may depend on the scan-to-seek ratio and on the density of the data set for which the threshold is computed. In one example, the threshold is determined separately for each query; the thresholds for point, range, and set queries may differ. Alternatively, the threshold is based on the data set, independent of the individual query. In one example, the threshold is set to 0, and the method always jumps. In another example, the threshold is set to $n$, and the method always crawls. Alternatively, the threshold is set to an integer between 0 and $n$. Separate thresholds can be set for positive and negative mismatches.
In step 188, the database determines whether a key matches the index. One example operates directly on the keys rather than on their components. A pattern search is performed. A generalized z-curve may be used, in which small z shapes within a two-dimensional square form a larger z. The values of the composite key lie in a rectangular space, with vertical and horizontal lines in the rectangle. If the key matches the index, the system proceeds to step 190 to record the match. Then, in step 196, the database continues searching sequentially for the next match and returns to step 188. If the key does not match the index, the system proceeds to step 192.
In step 192, the database determines the mismatch, which may be positive or negative. The mismatch is the highest-order bit at which the key disagrees with the index.
Then, in step 194, the database determines whether the mismatch from step 192 is greater than the threshold established in step 186. If the mismatch is less than or equal to the threshold, the database continues searching sequentially in step 196. If the mismatch is greater than the threshold, the database jumps in step 198. When the projection of the current element onto the range space misses the range, the next match candidate is the nearest element whose projection is the start of the range. This jump is shorter, however, than a jump in a point pattern search. In fact, if the range covers almost the whole range space, nearly all points in a given region satisfy the range.
In step 198, the database jumps. The jump may be made based on the last mismatched key. After the jump, the database returns to step 188 to determine whether the key matches the index.
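The loop formed by steps 188 to 198 can be sketched for a point filter over an in-memory sorted set. This Java example is a hedged illustration rather than the tested implementation: it uses `java.util.NavigableSet` in place of a key-value store, holds keys in a `long` (n at most 62) instead of byte arrays, compares the absolute mismatch position with a single threshold, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

final class GrasshopperScan {
    /** Signed 1-based position of the highest disagreeing bit, or 0 on a match (step 192). */
    static int mismatch(long x, long mask, long pattern) {
        long diff = (x & mask) ^ pattern;
        if (diff == 0) {
            return 0;
        }
        int j = 64 - Long.numberOfLeadingZeros(diff);
        return ((x >>> (j - 1)) & 1L) == 1L ? j : -j;
    }

    /** Smallest matching n-bit key greater than x, or -1 if there is none. */
    static long hint(long x, long mask, long pattern, int n, int mm) {
        int j = -mm;
        if (mm > 0) {
            j = 0;
            for (int g = mm + 1; g <= n && j == 0; g++) {
                long bit = 1L << (g - 1);
                if ((mask & bit) == 0 && (x & bit) == 0) {
                    j = g; // growth point
                }
            }
            if (j == 0) {
                return -1L;
            }
        }
        long below = (1L << (j - 1)) - 1;
        return (x & ~((below << 1) | 1L)) | (1L << (j - 1)) | (pattern & below);
    }

    /** Returns all keys with (key & mask) == pattern, crawling or jumping per the threshold. */
    static List<Long> search(NavigableSet<Long> keys, long mask, long pattern, int n, int threshold) {
        List<Long> matches = new ArrayList<>();
        Long x = keys.isEmpty() ? null : keys.first();
        while (x != null) {
            int mm = mismatch(x, mask, pattern);
            if (mm == 0) {
                matches.add(x);                 // step 190: record the match
                x = keys.higher(x);             // step 196: continue sequentially
            } else if (Math.abs(mm) <= threshold) {
                x = keys.higher(x);             // step 196: gap too small, keep crawling
            } else {
                long h = hint(x, mask, pattern, n, mm);
                x = (h < 0) ? null : keys.ceiling(h); // step 198: seek to the hint
            }
        }
        return matches;
    }
}
```

With threshold 0 the loop always jumps (the frog), and with threshold n it always crawls; every threshold returns the same result set, and only the mix of sequential steps and seeks changes.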
An example matcher was written in Java, with keys represented as byte arrays. Unsigned big-integer arithmetic and bitwise operations were implemented. An application programming interface (API) is used to create schemas and query filters. A pluggable storage adapter interface allows testing with different data stores. Both distributed and in-memory scenarios were tested. For the in-memory scenario, the data store adapters are based on a B+ tree and on MVStore, the B+ tree-based key-value store behind the open-source H2 database. For the big data tests, Apache HBase, an open-source distributed key-value store from the Hadoop family, is used. In this adapter, the grasshopper algorithm is invoked through the HBase coprocessor mechanism. HBase partitions the data into key ranges, or regions, and each region is assigned to a region server node. The coprocessor mechanism facilitates access to each region, which in turn facilitates the partition-based grasshopper strategy. Another coprocessor tracks the statistics of each region.
For the in-memory tests, a laptop computer with an i5 central processing unit (CPU) and 16 GB of random access memory (RAM), running the 64-bit Windows 7 operating system, is used. The data are generated randomly at run time or read from a file. The schema imitates call detail records (CDR) like those produced in telecommunications. There are 16 dimension attributes with cardinalities ranging from 2 to $2^{14}$. The total composite key length is 116 bits, which yields 15-byte keys. A data set containing 100 million records is used. The maximum Java heap size is 12 GB. The in-memory tests run single-threaded.
For example, the distributed storage tests use the following configuration: 128 regions on 12 region server nodes of HBase version 0.94, running on Hadoop installed on a commodity Linux cluster. Several data sets are used, including one with the CDR schema and 150 million records, one with 10 attributes and 1.46 billion records, and a Transaction Processing Performance Council Decision Support (TPC-DS) benchmark data set with 5 attributes and 550 million records. Queries are expressed in Structured Query Language (SQL) as SELECT COUNT(1) FROM dataset WHERE filter, where the filter is a point, range, or set constraint on some dimension attributes of the data set. The query filter values are generated randomly at run time. For the in-memory scenario, an exhaustive set of queries is run with combinations of up to three attribute filters with point, range, and set constraints. For the big data scenario, the attributes are selected randomly. Each query is run 10 times with the crawler strategy and with the grasshopper strategy at different thresholds. The fastest and slowest runs are discarded, and the remaining runs are averaged for each strategy. The averages over all combinations are then computed.
A key composition strategy in which the queried attribute is the leading one produces very low latency; in the other cases, the grasshopper merely crawls. Overall, the grasshopper strategy is effective. For ad hoc queries, interleaving single bits in descending order of attribute cardinality produces better results. In many cases, ad hoc queries on every attribute are accelerated. An improvement over the full scan is obtained for any type of gz-curve composition, though not necessarily for every attribute.
As the number of dimensions grows, the number of attributes that can benefit from the grasshopper technique may be limited to those whose mask heads are sufficiently high. The number of useful bits in the key is roughly $t = \log_2(\mathrm{card}(A) \cdot R)$. Therefore, setting the threshold to $n - t$ is close to the optimal choice. These $t$ bits can be distributed among the most popular attributes.
In one example, the grasshopper achieves the best results. For the in-memory data sets, the frog is on average 3 to 5 times slower than the crawler, while the grasshopper is faster than the crawler. For the distributed data sets, both the frog and the grasshopper outperform the crawler by several orders of magnitude. At the optimal threshold, the grasshopper is 6.5% faster than the frog on the CDR data set and 13% faster than the frog on the TPC-DS data set. For the data set of 1.45 billion records, the two strategies coincide, with a threshold of 0.
With a suitably chosen threshold, the grasshopper performs no worse than the crawler.
FIG. 6 shows a graph of query times, in ms, for crawler 202 and grasshopper 204, using TreeMap as the data store. The filter combinations comprise point (P), range (R), and set (S) constraints on a data set of 100 million records with 16 dimensions. The results are measured over exhaustive combinations. The frog (not shown) is on average 4.3 times slower than the crawler. The more constraints there are, the larger the performance gain of the grasshopper strategy.
FIG. 7 compares query times for crawler 212 and grasshopper 214 across in-memory data stores, all of which benefit from the grasshopper strategy. Query times, in ms, are shown for TreeMap, MVStore, and a basic B+ tree as data stores, with single-point filters on a data set of 100 million records with 16 dimensions. The results are measured over exhaustive combinations. The frog (not shown) is at least 3.8 times slower than the crawler.
For the in-memory tests, the theoretically computed threshold for grasshopper jumps is optimal in most cases. For the in-memory data stores, the measured scan-to-seek ratio $R$ ranges from 0.35 to 0.8. For the CDR data set of 100 million records, the threshold is close to 95. Therefore, 21 (116 - 95) key bits are useful, and all 16 dimensions can benefit from the grasshopper strategy.
In one example, the optimal threshold for the CDR data set of 150 million records is 64, which leaves 52 useful bits.
In HBase, region data are internally divided into blocks. Skipping blocks is advantageous, but block statistics are not available from a coprocessor. Searching within these blocks is sequential, so a seek operation is slow unless it skips whole blocks.
The time to complete a query is determined by the slowest node. If the data are not uniformly distributed, the result is hard to predict. FIG. 8 shows the per-region test times of the two strategies on the CDR data set, with results for crawler 232 and grasshopper 234. It shows the query time per HBase region, in ms, using HBase as the data store, for randomly combined single-point filters on the data set with 16 dimensions and 150 million records. The frog (not shown) is on average 6.5% slower than the grasshopper.
FIG. 9 shows the results for crawler 242 and grasshopper 244 on the TPC-DS data set. The graph shows query times, in ms on a logarithmic scale, using HBase as the data store, for single-point and multi-point filters on 5 dimensions of the TPC-DS data set with 550 million records. The frog (not shown) is on average 13% slower than the grasshopper.
FIG. 10 shows the results for crawler 222 and grasshopper 224 on the data set of 1.45 billion records. The graph shows query times, in ms on a logarithmic scale, using HBase as the data store, for single-point and multi-point filters on 10 dimensions of the data set of 1.46 billion rows. The threshold is 0, so the grasshopper and frog strategies coincide.
Figure 11 shows a block diagram of processing system 270, which can be used to implement the devices and methods disclosed herein. A particular device may use all of the components shown or only a subset of them, and the level of integration may vary from device to device. In addition, a device may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, and so on. The processing system may comprise a processing unit equipped with one or more input devices, such as a microphone, mouse, touchscreen, keypad, keyboard, and the like. Processing system 270 may also be equipped with one or more output devices, such as a speaker, a printer, a display, and the like. The processing unit may include central processing unit (CPU) 274, memory 276, mass storage device 278, video adapter 280, and I/O interface 288, all connected to a bus.
The bus may be one or more of any of several types of bus architecture, including a memory bus or memory controller, a peripheral bus, a video bus, and the like. CPU 274 may comprise any type of electronic data processor. Memory 276 may comprise any type of system memory, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, and the like. In an embodiment, the memory may include ROM for use at boot-up and DRAM for storing programs and data used while executing programs.
Mass storage device 278 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus. Mass storage device 278 may comprise one or more of the following: a solid-state drive, a hard disk drive, a magnetic disk drive, an optical disk drive, and the like. The mass storage device may also include a hardware data compression circuit board or the like.
Video adapter 280 and I/O interface 288 provide interfaces for coupling external input and output devices to the processing unit. As illustrated, examples of input and output devices include a display coupled to the video adapter and a mouse/keyboard/printer coupled to the I/O interface. Other devices may be coupled to the processing unit, and additional or fewer interface cards may be used. For example, a serial interface card (not shown) may be used to provide a serial interface for a printer.
The processing unit also includes one or more network interfaces 284, which may comprise wired links, such as an Ethernet cable, and/or wireless links to access nodes or different networks. Network interface 284 allows the processing unit to communicate with remote units via the networks. For example, the network interface may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the processing unit is coupled to a local-area network or a wide-area network for data processing and for communication with remote devices, such as other processing units, the Internet, remote storage facilities, and the like.
Although several embodiments have been provided in the present invention, it should be understood that the disclosed systems and methods may be embodied in many other specific forms without departing from the spirit or scope of the present invention. The present examples are to be regarded as illustrative and not restrictive, and the invention is not limited to the details given herein. For example, the various elements or components may be combined or integrated in another system, or certain features may be omitted or not implemented.
In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present invention. Other items shown or discussed as coupled, directly coupled, or communicating with each other may be indirectly coupled or communicating through some interface, device, or intermediate component, whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations can be ascertained by those skilled in the art and made without departing from the spirit and scope disclosed herein.

Claims (22)

1. A method for searching a database, characterized in that the method comprises:
receiving, by a processor, a message from a user indicating a query, wherein the query comprises a pattern;
determining, by the processor, a first threshold according to a data set of the database;
comparing, by the processor, the pattern with a first key of the data set to produce a comparison; and
determining, by the processor, according to the comparison and the first threshold, whether to jump to a second key of the data set or to scan to a third key of the data set, comprising
when the absolute value of the comparison is greater than the first threshold, jumping to the second key of the data set, and
when the absolute value of the comparison is less than or equal to the first threshold, scanning to the third key of the data set, wherein the first key and the third key are consecutive.
2. The method according to claim 1, characterized by further comprising generating an index according to the data set.
3. The method according to claim 2, characterized in that generating the index comprises converting a plurality of records of the data set into a plurality of composite key-value pairs.
4. The method according to claim 1, characterized in that determining the first threshold comprises determining the first threshold according to a seek-to-scan ratio of the data set and a density of the data set.
5. The method according to claim 1, characterized in that determining the first threshold comprises determining the first threshold according to the query.
6. The method according to claim 1, characterized in that the query comprises a point filter.
7. The method according to claim 1, characterized in that the query comprises a range filter.
8. The method according to claim 1, characterized in that the query comprises a set filter.
9. The method according to claim 1, characterized in that the query comprises a first point filter and a second point filter; a first range filter and a second range filter; a first set filter and a second set filter; the first point filter and the first range filter; the first point filter and the first set filter; or the first range filter and the first set filter.
10. The method according to claim 1, characterized in that the data set is partitioned.
11. The method according to claim 1, characterized in that the data set is not partitioned.
12. The method according to claim 1, characterized in that comparing the pattern with the first key of the data set comprises reducing the number of bits of the pattern.
13. The method according to claim 1, characterized in that determining the first threshold comprises setting the first threshold to 1.
14. The method according to claim 1, characterized in that determining the first threshold comprises setting the first threshold to an integer greater than 1.
15. The method according to claim 1, characterized by further comprising determining a second threshold according to the data set, wherein determining whether to jump or scan comprises comparing the comparison with the first threshold when the comparison is negative, and comparing the comparison with the second threshold when the comparison is positive.
16. A method for searching a database, characterized in that the method comprises:
receiving, by a processor, a message from a user indicating a query, wherein the query comprises a pattern;
comparing, by the processor, the pattern with a first key of a data set of the database to produce a comparison;
recording, by the processor, a result according to the comparison to produce recorded results;
determining, by the processor, according to the comparison, whether to jump or to scan sequentially; and
sending, by the processor, the recorded results to the user.
17. The method according to claim 16, characterized in that the pattern comprises a point filter.
18. The method according to claim 16, characterized in that the pattern comprises a range filter.
19. The method according to claim 16, characterized in that the pattern comprises a set filter.
20. The method according to claim 16, characterized in that the pattern comprises a first point filter and a second point filter; a first range filter and a second range filter; a first set filter and a second set filter; the first point filter and the first range filter; the first point filter and the first set filter; or the first range filter and the first set filter.
21. The method according to claim 16, characterized in that determining whether to jump or scan comprises comparing the comparison with a threshold.
22. A computer, characterized by comprising:
a processor;
a database comprising a multi-dimensional database index; and
a computer-readable storage medium storing a program for execution by the processor, the program comprising instructions to:
receive a message from a user, wherein the message indicates a query and the query comprises a pattern,
determine a first threshold according to a data set of the database,
compare the pattern with a first key of the data set to produce a comparison, and
determine, according to the comparison and the first threshold, whether to jump to a second key of the data set or to scan to a third key in the data set, comprising
when the absolute value of the comparison is greater than the first threshold, jumping to the second key of the data set, and
when the absolute value of the comparison is less than or equal to the first threshold, scanning to the third key of the data set, wherein the first key and the third key are consecutive.
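The jump-or-scan decision recited in the independent claims can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes integer keys held in a sorted in-memory list, a point filter expressed as a bit mask plus required bit values, and a comparison defined as the signed position of the highest mismatching filtered bit. The names `compare`, `next_match`, and `search` are hypothetical, and a real data store would replace the `bisect` call with its own seek operation.

```python
from bisect import bisect_left

def compare(key, mask, pattern):
    """Signed 1-based position of the highest filtered bit where key and
    pattern differ; 0 means the key matches. Negative: key bit is 0 where
    the pattern needs 1. Positive: key bit is 1 where the pattern needs 0.
    (This encoding is an assumption made for illustration.)"""
    diff = (key ^ pattern) & mask
    if diff == 0:
        return 0
    pos = diff.bit_length()
    return pos if (key >> (pos - 1)) & 1 else -pos

def next_match(key, mask, pattern, width):
    """Smallest value >= key whose filtered bits equal pattern, or None."""
    c = compare(key, mask, pattern)
    if c == 0:
        return key
    j = abs(c) - 1                      # 0-based index of the mismatching bit
    if c < 0:
        low = (1 << (j + 1)) - 1        # bit j and everything below it
        return (key & ~low) | (pattern & low)
    # Key is already past the pattern at bit j: carry into the lowest
    # unfiltered zero bit above j, then reset everything below that bit.
    for i in range(j + 1, width):
        if not (mask >> i) & 1 and not (key >> i) & 1:
            above = key & ~((1 << (i + 1)) - 1)
            return above | (1 << i) | (pattern & ((1 << i) - 1))
    return None                         # no later key can match

def search(keys, mask, pattern, width, threshold):
    """Walk sorted keys, jumping when |comparison| > threshold, else scanning."""
    results, jumps, scans, i = [], 0, 0, 0
    while i < len(keys):
        c = compare(keys[i], mask, pattern)
        if c == 0:
            results.append(keys[i])     # record the matching key
        if abs(c) > threshold:
            target = next_match(keys[i], mask, pattern, width)
            if target is None:
                break
            i = bisect_left(keys, target, i + 1)   # jump (seek) forward
            jumps += 1
        else:
            i += 1                      # scan to the next consecutive key
            scans += 1
    return results, jumps, scans
```

In this sketch a threshold of 0 jumps on every mismatch, while a threshold at least as large as the key width never jumps and degenerates to a plain sequential scan. Values in between trade seek cost against scan cost, which is the kind of tuning the claims tie to properties of the data set.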
CN201480005413.7A 2013-02-19 2014-02-19 System and method for database searching Pending CN104937593A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201361766299P 2013-02-19 2013-02-19
US61/766,299 2013-02-19
PCT/US2014/017220 WO2014143514A1 (en) 2013-02-19 2014-02-19 System and method for database searching

Publications (1)

Publication Number Publication Date
CN104937593A true CN104937593A (en) 2015-09-23

Family

ID=51352065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480005413.7A Pending CN104937593A (en) 2013-02-19 2014-02-19 System and method for database searching

Country Status (4)

Country Link
US (1) US20140236960A1 (en)
EP (1) EP2948890A4 (en)
CN (1) CN104937593A (en)
WO (1) WO2014143514A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239576A (en) * 2014-10-09 2014-12-24 浪潮(北京)电子信息产业有限公司 Method and device for searching for all lines in column values of HBase list
CN104537003B (en) * 2014-12-16 2018-01-09 北京中交兴路车联网科技有限公司 A kind of general high-performance data wiring method of Hbase databases
CN104699839B (en) * 2015-03-31 2021-03-02 北京奇艺世纪科技有限公司 File searching method and device
CN106933833B (en) * 2015-12-30 2020-04-07 中国科学院沈阳自动化研究所 Method for quickly querying position information based on spatial index technology
CN105930441B (en) * 2016-04-18 2019-04-26 华信咨询设计研究院有限公司 A kind of radio monitoring data query method
CN107577680B (en) * 2016-07-05 2021-04-09 北京嘀嘀无限科技发展有限公司 Real-time full-text retrieval system based on HBase big data and implementation method thereof
CN107391765A (en) * 2017-09-01 2017-11-24 云南电网有限责任公司电力科学研究院 A kind of power network natural calamity data warehouse model implementation method
US10747783B2 (en) * 2017-12-14 2020-08-18 Ebay Inc. Database access using a z-curve
CN109284434A (en) * 2018-09-12 2019-01-29 东莞数汇大数据有限公司 Web page contents crawling method, system and storage medium based on R language
CN109299106B (en) * 2018-10-31 2020-09-22 中国联合网络通信集团有限公司 Data query method and device
JP7131314B2 (en) * 2018-11-09 2022-09-06 富士通株式会社 Information management program, information management method, information management device, information processing program, information processing method, and information processing device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6513028B1 (en) * 1999-06-25 2003-01-28 International Business Machines Corporation Method, system, and program for searching a list of entries when search criteria is provided for less than all of the fields in an entry
CN102306176A (en) * 2011-08-25 2012-01-04 浙江鸿程计算机***有限公司 On-line analytical processing (OLAP) keyword query method based on intrinsic characteristic of data warehouse
US20120215801A1 (en) * 2000-04-07 2012-08-23 Washington University Method and Apparatus for Adjustable Data Matching
CN102663114A (en) * 2012-04-17 2012-09-12 中国人民大学 Database inquiry processing method facing concurrency OLAP (On Line Analytical Processing)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4891781A (en) * 1987-03-04 1990-01-02 Cylink Corporation Modulo arithmetic processor chip
US5924088A (en) * 1997-02-28 1999-07-13 Oracle Corporation Index selection for an index access path
US6353821B1 (en) * 1999-12-23 2002-03-05 Bull Hn Information Systems Inc. Method and data processing system for detecting patterns in SQL to allow optimized use of multi-column indexes
US20020002550A1 (en) * 2000-02-10 2002-01-03 Berman Andrew P. Process for enabling flexible and fast content-based retrieval
GB2359641B (en) * 2000-02-25 2002-02-13 Siroyan Ltd Mapping circuitry and method
US6931418B1 (en) * 2001-03-26 2005-08-16 Steven M. Barnes Method and system for partial-order analysis of multi-dimensional data
DE60300019D1 (en) * 2003-02-18 2004-09-09 Tropf Hermann Database and method for organizing the data elements
TW201006175A (en) * 2008-07-31 2010-02-01 Ibm Method, apparatus, and computer program product for testing a network system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALEXANDER RUSSAKOVSKY: "Hopping over Big Data:Accelerating Ad-hoc OLAP Queries with Grasshopper Algorithms", 《COMPUTER SCIENCE》 *
FRANK RAMSAK等: "Integrating the UB-Tree into Database System Kernel", 《PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON VERY LARGE DATABASE,MORGAN KAUFMANN PUBLISHERS INC》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110398370A (en) * 2019-08-20 2019-11-01 贵州大学 A kind of Method for Bearing Fault Diagnosis based on HTS-CNN model
CN110398370B (en) * 2019-08-20 2021-02-05 贵州大学 Bearing fault diagnosis method based on HTS-CNN model

Also Published As

Publication number Publication date
EP2948890A4 (en) 2016-04-06
EP2948890A1 (en) 2015-12-02
US20140236960A1 (en) 2014-08-21
WO2014143514A1 (en) 2014-09-18

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150923