CN102214176A

CN102214176A - Method for splitting and join of huge dimension table

Info

Publication number: CN102214176A
Application number: CN2010101427190A
Authority: CN
Inventors: 甘亮; 李爱平; 贾焰; 韩伟红; 刘健; 金鑫
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2010-04-02
Filing date: 2010-04-02
Publication date: 2011-10-12
Anticipated expiration: 2030-04-02
Also published as: CN102214176B

Abstract

The invention provides a method for splitting a huge dimension table, comprising: extracting, from each entry in the huge dimension table, the attribute value of one hierarchy level of the dimension together with the value range of the join key corresponding to that attribute value; saving all attribute values of that hierarchy level, together with the join-key value ranges corresponding to them, in a sub-table; and repeating these steps until the information of every hierarchy level in the huge dimension table has been saved in its corresponding sub-table. The invention further provides a method for joining the huge dimension table. With this method the dimension table is compressed, and during a table join the appropriate compressed sub-table is loaded into the memory reserved for the dimension table. Because the sub-table is smaller than the original dimension table, it can stay resident in memory, which avoids a large number of unnecessary disk I/O (input/output) operations.

Description

Method for splitting and joining a huge dimension table

Technical field

The present invention relates to the fields of databases and data analysis, and in particular to a method for splitting and joining a huge dimension table.

Background technology

Data processing is an important research direction in computer science. According to the form in which the data exists, data processing is divided into the processing of static data and the processing of dynamic data (i.e., data streams). Static data processing is data-centric: the whole data set is stored on a large, relatively stable, centralized storage medium and stands ready to accept user data requests (i.e., "queries") that arrive at random at any time. Over the life cycle of the data set, most of the data remains unchanged; what changes frequently is the queries that users may submit at any time. Static data processing is used in many applications, such as database management systems (DBMS), information retrieval systems, and data warehouses. In some applications, however, such as network management systems, securities exchange systems, telecommunication systems, and financial transaction systems, the data itself is highly fluid while user queries are relatively stable, so data processing in these applications is no longer the processing of static data but the processing of dynamic data. In dynamic data processing, the objects to be processed are online, continuous, high-speed data streams. Because storage space is limited, the received data cannot all be kept in storage, yet the streams must be processed without interruption or delay in order to obtain results in real time. The processing approach used for static data is therefore unsuitable for dynamic data processing, which requires new data structures and computation methods.

Traditional relational database systems are mainly oriented toward basic, day-to-day transaction processing, such as banking transactions, and are therefore also called online transaction processing (On-Line Transaction Processing, OLTP) systems. Their support for extracting information useful for business decision analysis (i.e., analytical processing) from existing mass data, however, has never been satisfactory. For this reason E. F. Codd, the father of the relational database, proposed OLAP (On-Line Analytical Processing) in 1993. OLAP is a class of software technology that lets analysts, managers, and executives gain deeper insight into data through fast, consistent, interactive access to information that has been transformed from raw data, that users can genuinely understand, and that truly reflects the dimensional characteristics of the enterprise, viewed from multiple angles. The goal of OLAP is to satisfy decision support or the specific query and reporting requirements of a multidimensional environment. Its technical core is the concept of the "dimension", so OLAP can also be described as a collection of multidimensional data analysis tools. Compared with the OLTP used in traditional relational databases, OLAP is mainly used in data warehouses; it supports complex analytical operations, emphasizes decision support, and provides intuitive, easy-to-understand query results.

Data stream processing and OLAP were originally two separate concepts, but in the field of real-time multidimensional data analysis, for example real-time analysis of network security monitoring data or of bank transaction records, the two have become closely combined. Data streams change rapidly, are massive, and are potentially unbounded, and online analytical processing requires a large number of operations on the data, which compromises the real-time performance of stream processing. In the prior art, the data stream cube (StreamCube) has therefore been proposed to speed up queries and to resolve the conflict between the volume of the stream and the need for real-time results. A data stream cube is a data cube (Cube) built over data stream data; it consists of a number of predefined stream aggregate query results. A data cube, in turn, is a data structure that supports fast analysis of data and allows data to be modeled and viewed in multiple dimensions.

In the prior art, building a data stream cube mainly comprises the following steps: joining the received data stream with the dimension tables; running aggregate queries on the join result; and storing the results of the aggregate queries. The data stream must be joined with the dimension tables because data stream data has a single level and a single granularity; joining the stream tuples with the dimension tables yields detailed, multi-level, multi-granularity attribute information. Because the stream-table join is a necessary step in building a data stream cube, improving its efficiency helps improve the efficiency of cube generation.

There is the method for attachment of several data stream table in the prior art, connects (Hash join), nested loop connection (Nested-Loop Join) and ordering merging as Hash and be connected (Sort-Merge join).These existing methods have range of application separately, but also have defective separately.In the table of data flow system connected, the table of income internal memory was shown for tieing up as hash connecting method, when the dimension table limits greater than internal memory, needed to tie up the remaining data of showing in the reading disk repeatedly, and the I/O expense is excessive.When streaming rate acquires a certain degree, may make data in the data stream fail timely connection processing and be dropped, cause net result incorrect, maybe the result that can only obtain being similar to.The scale of dimension table is big more, and this problem is outstanding all the more.

Summary of the invention

The object of the present invention is to overcome the high I/O overhead and poor real-time performance of existing stream-table join methods by providing a method for joining a huge dimension table.

To achieve the above object, the invention provides a method for splitting a huge dimension table, comprising:

Step 1): extracting, from each entry of the huge dimension table, the attribute value of one hierarchy level of the dimension and the value range of the join key corresponding to that attribute value;

Step 2): saving all attribute values of that hierarchy level, together with the join-key value range corresponding to each attribute value, in a sub-table;

Step 3): repeating the above steps until the information of every hierarchy level of the dimension in the huge dimension table has been saved in its corresponding sub-table.

In the above technical solution, the method further comprises, before step 1), a step of sorting the entries of the huge dimension table by the value of the join key field.

In the above technical solution, the sorting is in ascending order of the value of the join key field.

The present invention also provides the method for attachment of a kind of super large dimension table, comprising:

Step 1), the cutting method that adopts described super large dimension table are divided into a plurality of sublists with described super large dimension table according to the level of dimension, comprise the attribute information of a certain level in the described super large dimension table or some level in the described sublist;

Step 2), the sublist that is generated for the compression back is set up index;

Step 3), after receiving user's query requests, call corresponding sublist according to described index, realize that the table of data in the data stream connects.

In the technique scheme, described step 3) comprises:

Step 3-1), the data stream tuple in the data stream is according to user's query requests query steps 2) index created, call corresponding sublist, from described sublist, read corresponding attribute field;

Step 3-2), the attribute field that will be read from the sublist of each dimension merges, and obtains connecting the result.

In the technique scheme, in described step 2) in, the index of setting up for described sublist is a kind of in B+Tree index, B-tree index, the binary tree index.

The advantages of the invention are as follows:

The invention compresses the dimension table and, at join time, loads the appropriate compressed sub-table into the memory reserved for the dimension table. Because a sub-table is smaller than the original dimension table, it can stay resident in memory, which avoids a large number of unnecessary disk I/O operations.

Description of drawings

Fig. 1 is a flowchart of the huge dimension table join method of the present invention;

Fig. 2 is an example of the data stream Eventlog;

Fig. 3 is a schematic diagram of the fact table and dimension tables for the data stream Eventlog shown in Fig. 2;

Fig. 4(a) is an example of a huge dimension table;

Fig. 4(b) to Fig. 4(d) are examples of the sub-tables obtained by splitting the huge dimension table shown in Fig. 4(a);

Fig. 5 is a schematic diagram of the B+tree index built for the sub-table of the attribute city in Fig. 4(b).

Embodiment

Before the embodiments of the present invention are described in detail, the concepts involved in the invention are explained.

Dimension: a particular angle from which people observe data, i.e., a class of attributes considered when examining a problem. The set of attributes of one class constitutes a dimension, such as the time dimension or the geographic dimension.

Level of a dimension: a particular angle of observation (i.e., a dimension) may be described at different degrees of detail. The time dimension, for example, includes levels such as day, month, quarter, and year.

Member of a dimension: a value of a dimension, which describes the position of a data item within that dimension. For example, "a given day of a given month of a given year" describes a position on the time dimension.

Dimension table: the form in which a dimension is represented in a relational database, namely as a data table.

Measure: a value of the multidimensional array, such as the sales volume in (January 2000, Shanghai, notebook computer, sales volume).

Fact table: a table that contains the measures and the foreign keys associated with the dimension tables.

With these concepts explained, the invention is described below with reference to the drawings and specific embodiments.

Analysis of the stream-table join process shows that the main obstacle to improving its efficiency is the resource limits of the computer system that performs the stream processing, namely limited CPU processing power and limited memory capacity. CPU processing power is a limit because the CPU is not fast enough to process stream tuples in time as they arrive at high speed. Memory capacity is a limit because the computer cannot place all of the stream tuples, which arrive in large numbers, into the available memory. The present invention proposes a table join method that addresses these resource limits.

Before the implementation steps of the method are described in detail, the data structures involved in the invention are explained.

As mentioned in the background section, the objects of the table join operation are the data stream and the dimension table. The data stream is described first.

In the field of real-time multidimensional data analysis, the data stream supplies all the information that a data query needs. The basic unit of a data stream is called a data stream tuple, and data stream tuples of the same type contain essentially the same types of data. For example, in a data stream Eventlog sent by a network management system, every data stream tuple contains the attributes ID, SrcIP, DstIP, EvenTypeID, and InOutID, i.e., an identifier, a source address, a destination address, an event type identifier, and a device ingress/egress identifier. In the example of the data stream Eventlog given in Fig. 2, each row represents one data stream tuple, and r[1], r[2], r[3], and r[4] can be used to denote its four attributes SrcIP, DstIP, EvenTypeID, and InOutID respectively. Each attribute of a data stream tuple generally carries rich information. If the information in the data stream tuples is not processed in some way, it is difficult to find the data a user needs quickly within a massive, dynamically changing data set. Therefore, after the computer system receives the stream data, it must process the large amount of information contained in the stream so that later lookups can be fast. Taking the data stream Eventlog again as the example: the attributes SrcIP, DstIP, EvenTypeID, and InOutID each carry rich information, and this information belongs to the respective class of each attribute. When the data stream is processed, four dimension tables, SrcIPaddress, DstIPaddress, Event, and Inout, are therefore used to hold the information of the four attributes SrcIP, DstIP, EvenTypeID, and InOutID respectively, while the fact table contains the foreign keys associated with these four dimension tables. Fig. 3 shows a schematic diagram of the fact table and dimension tables generated for the data stream Eventlog. As the figure shows, each dimension table gives the specific attribute information contained in its dimension.

The above describes the data stream. The dimension table was mentioned in that description, but not in detail. Because it is the other object of the table join operation, the concept and content of the dimension table are described in detail below for ease of understanding.

As stated earlier, a dimension table is the form in which a dimension, i.e., a set of attributes of the same class in OLAP, is represented in a relational database. As the definition of the dimension table indicates, a dimension table holds attributes of the same class, and a hierarchical relationship usually exists among these attributes. Take the SrcIP attribute of the data stream Eventlog as an example. A SrcIP address usually implies the city, province, and country in which the IP address is located. Country, province, and city clearly form a hierarchy by geographic extent, so the city, province, and country attribute information of an IP address is hierarchical as well. Fig. 4(a) gives an example of the SrcIP dimension table of Fig. 3. As the example shows, this dimension table contains attribute information such as IP, city, province, and country, and the attributes city, province, and country are hierarchically related.

As the earlier description of data stream processing shows, once a data stream has been processed, most of its information is stored in the dimension tables, so a dimension table inevitably grows as more stream data is received. From another angle, the size of a dimension table also depends on the number of levels of the attributes it contains: the more attribute levels a dimension table has, the larger it is. As a dimension table grows, so does the memory it occupies when it is loaded. It is therefore necessary to reduce the size of each individual dimension table, to avoid the repeated loading of the dimension table that limited memory capacity would otherwise cause.

One possible way to make an individual dimension table smaller without losing any of its original information is to split it: the attributes of one or several levels of the original dimension table are grouped into a sub-table, so that the original dimension table is divided into several sub-tables, each made up of the attribute information of the same level. In a preferred implementation, each sub-table contains the attribute information of exactly one level of the original dimension table. The splitting process is illustrated below using the dimension table SrcIPaddress shown in Fig. 4(a). This dimension table contains a number of data items, each preceded by an ID that identifies it. Each data item contains several attributes, including IP, city, province, and country; as mentioned above, the attributes city, province, and country are hierarchically related. The splitting process is as follows. First, the dimension table is sorted in ascending order of the value of the join key (the join key is the key that joins the fact table to the dimension table) and read entry by entry. In a dimension table of IP addresses, the ip field is normally the join key between the dimension table and the fact table, so the table is sorted in ascending order of IP address. In Fig. 4(a), the data items of the dimension table are already sorted in ascending order of IP address. Next, for each value v1 of the attribute at a given level l_i, the value range [start, end] of the corresponding join key is computed from the original dimension table. For example, Fig. 4(a) shows that for the attribute city at the city level, the range of IP addresses whose value is C1 is [1, 8]. Finally, the value v1 of the level-l_i attribute and its range [start, end] are placed in the sub-table S as one tuple. These operations are repeated until all data items of the original dimension table have been processed. Splitting the dimension table of Fig. 4(a) in this way yields Fig. 4(b), Fig. 4(c), and Fig. 4(d). For example, in Fig. 4(a) the items whose city value is C1 run from IP value 1 to IP value 8, so in Fig. 4(b) the entry for city C1 has IP_Start 1 and IP_End 8. Similarly, the entry for city C2 has IP_Start 9 and IP_End 12; C3 has IP_Start 13 and IP_End 15; C4 has IP_Start 16 and IP_End 19; C5 has IP_Start 20 and IP_End 23; and C6 has IP_Start 24 and IP_End 29.
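The splitting procedure described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the function names `split_level` and `split_dimension_table` are hypothetical, the city boundaries follow the Fig. 4(b) values quoted above, and the province and country labels (P1 to P3, N1) are invented for the example because the text does not give them. The sketch also assumes that, after sorting by join key, all entries sharing a level value are adjacent.

```python
from itertools import groupby

def split_level(rows, key, level):
    """Collapse a dimension table into one sub-table for a single level.

    rows  : list of dicts, one per dimension-table entry
    key   : name of the join-key column (e.g. "ip")
    level : name of the hierarchy-level column (e.g. "city")
    Returns a list of (value, start, end) tuples.
    """
    ordered = sorted(rows, key=lambda r: r[key])  # ascending sort on the join key
    sub_table = []
    # After sorting, consecutive rows that share a level value form one run.
    for value, run in groupby(ordered, key=lambda r: r[level]):
        run = list(run)
        sub_table.append((value, run[0][key], run[-1][key]))
    return sub_table

def split_dimension_table(rows, key, levels):
    """Repeat the per-level split for every hierarchy level."""
    return {level: split_level(rows, key, level) for level in levels}

# City boundaries follow the Fig. 4(b) values quoted in the text;
# the province and country labels are made up for illustration.
CITY_RANGES = [("C1", 1, 8), ("C2", 9, 12), ("C3", 13, 15),
               ("C4", 16, 19), ("C5", 20, 23), ("C6", 24, 29)]
PROVINCE_OF = {"C1": "P1", "C2": "P1", "C3": "P2",
               "C4": "P2", "C5": "P3", "C6": "P3"}

dim_table = [
    {"ip": ip, "city": city, "province": PROVINCE_OF[city], "country": "N1"}
    for city, start, end in CITY_RANGES
    for ip in range(start, end + 1)
]

sub_tables = split_dimension_table(dim_table, "ip", ["city", "province", "country"])
print(sub_tables["city"])
```

In this sketch the 29-entry table collapses to 6 city tuples, 3 province tuples, and 1 country tuple. If a level value occurred in two non-adjacent key ranges, the sketch would simply emit one tuple per range.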

The above is a detailed description of the data structures involved in the table join operation. It follows that if every huge dimension table is split in this way, then during a table join a smaller sub-table of the original huge dimension table can be loaded according to the requirements of the query, which reduces the likelihood that the dimension table is repeatedly swapped in and out of memory. The implementation steps of the method are described below.

Step 100): First, the dimension table is split to obtain several smaller sub-tables. The splitting method has been described in detail above; each resulting sub-table contains the attribute information of one or several levels of the original dimension table.

Step 200): Next, an index is built for each sub-table generated by the compression. Several types of index can be used for a compressed table, such as a B+Tree index, a B-tree index, or a binary tree index. Taking the B+Tree index as an example, a B+Tree index is built for the compressed dimension table of each level l_i. Each record contains the two field values start and end together with the other concept-hierarchy fields. A dimension table with k concept levels requires k RB+Trees to be built so that they can be loaded into memory at join time.

Fig. 5 is a schematic diagram of the B+tree index built for the sub-table of the attribute city in Fig. 4(b). The B+tree index in the figure contains two kinds of nodes: internal nodes and leaf nodes. Leaf nodes are the nodes in the bottom layer and store the tuple data. All other nodes are internal nodes, which hold the comparison values and pointers. The number of objects M per node in the B+tree is set by the user; in the figure it is set to 3, i.e., each node holds at most 3 and at least 2 objects. The ellipses stand for tuples that do not appear in the city sub-table but could. P1, P2, and P3 denote pointers. The index allows the data a user is looking for to be found quickly.
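To illustrate how an index over a sub-table answers a join-key lookup, the following Python sketch uses a sorted array with binary search in place of a B+tree. This substitution is deliberate and is an assumption of the example: it reproduces the lookup behaviour (find the range that contains a key) but not the node structure shown in Fig. 5. The class name `RangeIndex` is hypothetical.

```python
from bisect import bisect_right

class RangeIndex:
    """Sorted-array stand-in for the per-level index (not a real B+tree)."""

    def __init__(self, sub_table):
        # sub_table: iterable of (value, start, end) with non-overlapping ranges
        self.entries = sorted(sub_table, key=lambda e: e[1])
        self.starts = [e[1] for e in self.entries]

    def lookup(self, key):
        """Return the level value whose [start, end] contains key, else None."""
        pos = bisect_right(self.starts, key) - 1  # last range starting at or before key
        if pos < 0:
            return None
        value, start, end = self.entries[pos]
        return value if key <= end else None

# City sub-table with the Fig. 4(b) values quoted in the text.
city_index = RangeIndex([("C1", 1, 8), ("C2", 9, 12), ("C3", 13, 15),
                         ("C4", 16, 19), ("C5", 20, 23), ("C6", 24, 29)])
print(city_index.lookup(10))   # C2
print(city_index.lookup(30))   # None
```

A lookup costs one binary search over the start values, so its cost depends on the number of ranges in the sub-table rather than on the number of entries in the original dimension table.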

Step 300): Finally, after a user's query request is received, the table join is performed using the index built in the preceding step.

Those skilled in the art will appreciate that the table join operation is closely tied to the query operation: a table join is usually needed whenever a user issues a query, so the user's query request must be understood before the join is performed. A query request sent by a user should contain at least dimension information, join information, and working layer information. The dimension information describes the dimensions selected by the query; the join information describes the fields that join the data stream to the dimension tables; and the working layer (Work Layer) is the lowest level l_i of a dimension d_i that is queried in the Group-by aggregation operator. For example, the following SQL statement Q1 expresses a query against the data stream Eventlog mentioned above:

Select SrcProvince, DstCountry, count(*)
From Eventlog e, SrcIPaddress ip1, DstIPaddress ip2
Where e.srcip = ip1.srcip and e.dstip = ip2.dstip
Group by SrcProvince, DstCountry

The SQL statement Q1 above contains the following information: the data stream fact table Eventlog; the dimension information SrcIPaddress and DstIPaddress; the join information e.srcip=ip1.srcip and e.dstip=ip2.dstip; the measure count(*); and the working layer information SrcProvince and DstCountry. SrcProvince is the working layer of the SrcIPaddress dimension, and DstCountry is the working layer of the DstIPaddress dimension.
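The information carried by Q1 can be held in a small descriptor before the join is planned. The Python sketch below is one possible representation and is an assumption of this example; the class and field names (`DimensionSpec`, `QueryRequest`, `stream_field`, and so on) are hypothetical and do not come from the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DimensionSpec:
    table: str          # dimension table named in the FROM clause
    stream_field: str   # join field on the stream side
    dim_field: str      # join field on the dimension side
    working_layer: str  # level named in the GROUP BY clause

@dataclass(frozen=True)
class QueryRequest:
    fact_table: str
    dimensions: tuple
    measure: str

# Q1 expressed as a descriptor: two dimensions, each with its working layer.
Q1 = QueryRequest(
    fact_table="Eventlog",
    dimensions=(
        DimensionSpec("SrcIPaddress", "srcip", "srcip", "SrcProvince"),
        DimensionSpec("DstIPaddress", "dstip", "dstip", "DstCountry"),
    ),
    measure="count(*)",
)

for d in Q1.dimensions:
    print(d.table, "->", d.working_layer)
```

Keeping the working layer alongside each dimension makes it straightforward to pick the one sub-table per dimension that the join actually needs.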

The table join operation is described in detail below for the above query statement, with reference to the tables shown in Fig. 3 and Fig. 4.

Step 301): According to the query request, the index trees created in the previous step are searched to obtain the attribute fields of the corresponding dimension tables. Specifically, when a tuple r of the data stream DS arrives, its join attribute value r[d_i] in each of the m dimensions (d_i denotes the i-th dimension, 1≤i≤m) is used to search rb[i, WL_i] (rb denotes an index tree and WL_i the i-th working layer), which yields the attribute field record in dimension d_i. As explained above, the dimension table has been divided into several sub-tables, each containing the attribute information of one or several levels of the original dimension table. When rb[i, WL_i] is searched, the corresponding sub-table can therefore be selected and loaded into memory according to the level of the working layer being searched, instead of the whole dimension table being loaded. For example, query statement Q1 above needs the information of the working layer SrcProvince in the SrcIPaddress dimension, so the sub-table of Fig. 4(c) is loaded directly rather than the whole dimension table shown in Fig. 4(a).
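The per-tuple probe of step 301 can be sketched in Python as follows. The sketch is an illustration under assumptions: a binary search over a sorted list stands in for the index tree rb[i, WL_i], the sub-table contents (P1 to P3, N1) are invented for the example, and the names `probe`, `lookup_tuple`, `SUB_TABLES`, and `PLAN_Q1` are hypothetical.

```python
from bisect import bisect_right

def probe(sub_table, key):
    """Find the level value whose [start, end] range contains key.

    sub_table must be sorted by start; a binary search stands in for the
    index tree rb[i, WL_i]. The list of starts is rebuilt on every call
    for brevity; a real index would be built once.
    """
    starts = [start for _, start, _ in sub_table]
    pos = bisect_right(starts, key) - 1
    if pos >= 0 and key <= sub_table[pos][2]:
        return sub_table[pos][0]
    return None

# Hypothetical sub-tables for the two working layers of Q1.
SUB_TABLES = {
    ("SrcIPaddress", "SrcProvince"): [("P1", 1, 12), ("P2", 13, 19), ("P3", 20, 29)],
    ("DstIPaddress", "DstCountry"): [("N1", 1, 29)],
}

def lookup_tuple(r, plan):
    """plan: list of (stream field, dimension, working layer) triples."""
    return [probe(SUB_TABLES[(dim, layer)], r[field]) for field, dim, layer in plan]

PLAN_Q1 = [("SrcIP", "SrcIPaddress", "SrcProvince"),
           ("DstIP", "DstIPaddress", "DstCountry")]

# One stream tuple with the Eventlog attributes; the values are made up.
r = {"ID": 1, "SrcIP": 14, "DstIP": 3, "EvenTypeID": 7, "InOutID": 0}
print(lookup_tuple(r, PLAN_Q1))
```

Only the province sub-table and the country sub-table are touched; the city-level data never has to be in memory for this query.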

This also shows that query performance suffers if a sub-table contains attributes of two or more levels when the original huge dimension table is split. For example, if an attribute of level L1 and an attribute of another level L2 share one sub-table, the split can follow only the L1 level or the L2 level. Suppose the split follows L1 while the table join uses L2: performance will then be worse than with a dedicated sub-table for the L2 level.
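The size penalty of a shared sub-table can be made concrete with a small Python sketch. It assumes a sub-table that is split at the city level (L1) but also carries the province (L2), and compares it with the dedicated province sub-table obtained by merging adjacent ranges. The province labels and the helper name `merge_ranges` are hypothetical.

```python
def merge_ranges(shared, level_pos):
    """Collapse adjacent rows of a shared sub-table that agree on one level."""
    merged = []
    for row in shared:
        value, start, end = row[level_pos], row[-2], row[-1]
        if merged and merged[-1][0] == value and merged[-1][2] + 1 == start:
            # Same value and contiguous key range: extend the previous tuple.
            merged[-1] = (value, merged[-1][1], end)
        else:
            merged.append((value, start, end))
    return merged

# Shared sub-table split at the city level (L1), carrying province (L2) too.
# Rows are (city, province, start, end); province labels are hypothetical.
shared = [("C1", "P1", 1, 8), ("C2", "P1", 9, 12), ("C3", "P2", 13, 15),
          ("C4", "P2", 16, 19), ("C5", "P3", 20, 23), ("C6", "P3", 24, 29)]

dedicated_province = merge_ranges(shared, level_pos=1)
print(len(shared), len(dedicated_province))
```

With these made-up values, a province-level join against the shared sub-table has to search 6 ranges, whereas the dedicated province sub-table holds only 3.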

Step 302): The query results of the individual dimensions from step 301) are merged to obtain a join result t, which is written to the join result set T. This join process is well known to those skilled in the art and is not described further here.
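A minimal Python sketch of step 302 is given below. It merges the per-dimension lookups of each stream tuple into a join result t, appends it to the result set T, and computes the count(*) of Q1 per group. Skipping tuples with an unmatched dimension (inner-join behaviour) is an assumption of this example, since the text does not say how unmatched tuples are handled; the function name `join_and_count` is hypothetical.

```python
from collections import Counter

def join_and_count(per_tuple_lookups):
    """Merge per-dimension lookups into join results and count each group.

    per_tuple_lookups: iterable of lists, one list per stream tuple, holding
    the attribute value found for each dimension (None if no match).
    """
    result_set = []          # T: all join results
    for values in per_tuple_lookups:
        if None in values:   # assumption: unmatched tuples are skipped (inner join)
            continue
        result_set.append(tuple(values))   # t: one join result
    return result_set, Counter(result_set)

# Made-up (SrcProvince, DstCountry) lookups for four stream tuples.
lookups = [["P2", "N1"], ["P1", "N1"], ["P2", "N1"], [None, "N1"]]
T, counts = join_and_count(lookups)
print(counts[("P2", "N1")])
```

The counter plays the role of the Group-by aggregation over the working layers SrcProvince and DstCountry.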

The above describes the application of the huge dimension table join method of the present invention in a network management system. In other embodiments, the method can equally be applied in other fields, such as financial transaction systems, e-commerce systems, and telecommunication systems. In a financial transaction system (stocks, futures, banking), for example, real-time transaction data is analyzed. This data is divided into several dimensions: trader, transaction type, transaction location, transaction amount, and so on. The trader dimension table has the levels trader ID, customer type, city, province; the transaction type dimension has the levels transaction type ID, transaction type, transaction type .... The number of traders in the trader dimension table reaches tens or even hundreds of millions, which makes it an obvious huge dimension table. Likewise, the number of natural persons in the trader dimension tables of e-commerce and telecommunication systems is on the order of tens to hundreds of millions, so these are huge dimension tables too. All of these dimension tables share the properties of the IP dimension table: many entries and many levels. The method of the present invention can therefore be applied to them to make joins against huge dimension tables more efficient.

As the above description shows, the method of the invention divides a huge dimension table into several dimension tables according to the levels of its attributes, and during a query selects, as needed, one or more of the resulting tables on which to perform the query operation. Because each table produced by the division is much smaller than the huge dimension table, the method on the one hand overcomes the memory capacity limit and avoids disk I/O (when memory cannot hold the huge dimension table, the disk must be read frequently to fetch the parts of the dimension table that are not in memory). On the other hand, compressing the dimension table reduces the probe time of the join (this document uses the Nested-Loop Join, whose probe time grows with the size of the dimension table, so shrinking the dimension table reduces the probe time and ultimately the join time).

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope all fall within the scope of the claims of the present invention.

Claims

1. A method for splitting a huge dimension table, comprising:

2. The method for splitting a huge dimension table according to claim 1, characterized in that the method further comprises, before step 1), a step of sorting the entries of the huge dimension table by the value of the join key field.

3. The method for splitting a huge dimension table according to claim 2, characterized in that the sorting is in ascending order of the value of the join key field.

4. A method for joining a huge dimension table, comprising:

Step 1): using the method for splitting a huge dimension table according to any one of claims 1 to 3 to divide the huge dimension table into a plurality of sub-tables according to the hierarchy levels of the dimension, each sub-table containing the attribute information of one or several levels of the huge dimension table;

5. The method for joining a huge dimension table according to claim 4, characterized in that step 3) comprises:

6. The method for joining a huge dimension table according to claim 4, characterized in that the index built for a sub-table in step 2) is one of a B+Tree index, a B-tree index, and a binary tree index.