CN107341240A

CN107341240A - A processing method for online joins over skewed data streams

Info

Publication number: CN107341240A
Application number: CN201710542086.4A
Authority: CN
Inventors: 孟小峰; 王春凯
Original assignee: Renmin University of China
Current assignee: Renmin University of China
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2017-11-10
Anticipated expiration: 2037-07-05
Also published as: CN107341240B

Abstract

The invention relates to a processing method for online joins over skewed data streams. The steps are as follows. Data stream R and data stream S are partitioned into tuples by a key-based hash function; each tuple is assigned to and stored on a node on its own side, and is simultaneously sent to the processing units on the opposite side to perform the online join. The load statistics of the nodes on each side of the bipartite-graph join model are monitored periodically at a preset time interval, collected, and sent to a pre-built data stream controller. If the data stream controller detects that some processing units exceed the threshold of the load-balancing factor, a migration strategy is formulated dynamically according to heuristic rules. Before data migration, newly arriving stream data is buffered in Kafka and the join of the new data is postponed; the data stream and the join state information are then migrated according to the migration strategy, and the routing table is updated synchronously. Finally, the data buffered in Kafka and the newly arriving data are sent on, and the subsequent online join operations are completed.

Description

A processing method for online joins over skewed data streams

Technical field

The present invention relates to a data processing method, and in particular to a processing method for online joins over skewed data streams.

Background

Join models based on a complete bipartite graph can generally support join operations over distributed data streams. Such a model is memory-efficient, elastic, and scalable. However, it cannot allocate query nodes dynamically, and its grouping parameters must be set manually. Its efficiency is especially low for full-history join queries over skewed data.

Summary of the invention

In view of the above problems, the object of the invention is to provide a processing method for online joins over skewed data streams. The method handles joins over skewed data effectively, further improves the throughput of a distributed data stream management system, and reduces the computation cost in a cloud environment.

To achieve the above object, the invention adopts the following technical solution: a processing method for online joins over skewed data streams, characterized by comprising the following steps:

1) Data stream R and data stream S are randomly partitioned into n nodes R1, R2, ..., Rn and m nodes S1, S2, ..., Sm respectively; each node is stored in one of the n or m processing units, and data streams R and S are located on the two sides of the bipartite-graph join model. R and S are partitioned into tuples by a key-based hash function; each tuple of R or S is assigned to and stored on a node on its own side, and is simultaneously sent to the processing units on the opposite side to perform the online join.

2) The load statistics of the nodes on each side of the bipartite-graph join model are monitored periodically at a preset time interval, collected, and sent to a pre-built data stream controller. If the data stream controller detects that some processing units exceed the threshold of the load-balancing factor, a migration strategy is formulated dynamically according to heuristic rules.

3) Before data migration, newly arriving stream data is buffered in a high-throughput distributed publish-subscribe messaging system, and the join of the new data is postponed. The data stream and the join state information are then migrated according to the migration strategy, and the routing table is updated synchronously.

4) The data buffered in Kafka and the newly arriving data continue to be sent on, completing the subsequent online join operations.

In step 2), the heuristic rules are set as follows:

2.1) Rule H1: for a processing unit that needs to move data out, if moving out the tuples of a loaded key directly satisfies the imbalance-factor threshold, the move-out is carried out directly and the migrated key is recorded in the routing table.

2.2) Rule H2: for a processing unit that needs to move data out, if the imbalance-factor threshold is still not satisfied after the tuples of some keys have been moved out, the key with the larger number of tuples is split, part of the split data is moved out, and the migrated key is recorded in the routing table.

2.3) Rule H3: for a processing unit that needs to move data in, if a key exists in the routing table, the tuples of that key are preferentially merged back into the processing unit to which the hash function maps the key, and the record in the routing table is cleared.

Based on the heuristic rules, basic algorithms are defined for moving tuples out and for moving tuples in. The basic algorithm for moving tuples out is: first, determine the key range of the tuples to be moved out in the move-out set, and determine the processing units that will receive the tuples; then, for each key to be moved out, carry out the move-out according to heuristic rules H1 and H2 and update the routing table; finally, determine the migration plan. The basic algorithm for moving tuples in is: first, determine the key range of the tuples to be moved in in the move-in set, and determine the processing units from which the tuples will be moved out; then, for each key to be moved in, carry out the move-in according to heuristic rule H3 and update the routing table; finally, determine the migration plan.

In step 2), the migration strategy is formulated dynamically on the basis of three costs defined according to the migration type: (1) network cost C_network: when data is split, tuples with the same key are distributed over different processing units, and the join incurs a cost for replicating data; (2) migration cost C_migration: the cost of moving tuples from one processing unit to another; (3) routing cost C_routing: the cost, after data migration, of recording the mapping between keys and processing units and maintaining the migration routes.

In step 2), at time t, data migration within one side uses the ISM algorithm, whose process is as follows: first, compute the load L_t(pu) of each processing unit at time t and calculate the average load; then, for each processing unit that needs to move data out, call the move-out algorithm; finally, for each processing unit that needs to move data in, call the move-in algorithm.

In step 2), logical migration between the nodes of the two sides uses the S2SM algorithm, whose process is as follows: first, compute the load L_t(pu) of each processing unit, and compute three averages at time t: the average load of the side with m processing units, the average load of the side with n processing units, and the average load of the whole cluster; then, according to the threshold, determine which side moves tuples out and which side moves tuples in; finally, on the move-out side, identify the processing units that need to move data out and call the move-out algorithm, and on the move-in side, identify the processing units that need to move data in and call the move-in algorithm.

By adopting the above technical solution, the invention has the following advantages. Through a strategy of logically repartitioning query nodes and a state migration algorithm, the invention dynamically balances the load of the join algorithm, supporting join queries over full historical data and adaptive resource management. It handles joins over skewed data effectively, further improves the throughput of the distributed data stream management system, and reduces the computation cost in a cloud environment.

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall flow of the invention;

Fig. 2a is a schematic diagram of throughput in an embodiment of the invention;

Fig. 2b is a schematic diagram of latency in an embodiment of the invention;

Fig. 3a is a schematic diagram of the communication cost of the TPC-H equi-join Q3 under the three methods DB, JB and JB6 in an embodiment of the invention;

Fig. 3b is a schematic diagram of the communication cost of the TPC-H equi-join Q5 under the three methods DB, JB and JB6 in an embodiment of the invention;

Fig. 3c is a schematic diagram of the communication cost of the range query (Band) over TPC-H data under the three methods DB, JB and JB6 in an embodiment of the invention.

Detailed description

The invention is described in detail below with reference to the accompanying drawings and embodiments.

As shown in Fig. 1, the invention provides a processing method for online joins over skewed data streams, comprising the following steps:

1) Data stream R and data stream S are randomly partitioned into n nodes R1, R2, ..., Rn and m nodes S1, S2, ..., Sm respectively; each node is stored in one of the n or m processing units, and data streams R and S are located on the two sides of the bipartite-graph join model. R and S are partitioned into tuples by a key-based hash function; each tuple of R or S is assigned to and stored on a node on its own side, and is simultaneously sent to the processing units on the opposite side to perform the online join.
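Step 1) can be illustrated with a minimal in-memory sketch. It assumes an equi-join, so a tuple only needs to probe the single opposite-side unit that its key hashes to; the names `ProcessingUnit`, `route` and `on_arrival` are hypothetical and do not come from the patent.

```python
from collections import defaultdict


class ProcessingUnit:
    """One processing unit: stores tuples of its own stream, indexed by join key."""

    def __init__(self, name):
        self.name = name
        self.store = defaultdict(list)  # key -> stored payloads

    def insert(self, key, payload):
        self.store[key].append(payload)

    def probe(self, key):
        return list(self.store.get(key, []))


def route(key, units):
    """Key-based hash partitioning: the same key always maps to the same unit."""
    return units[hash(key) % len(units)]


def on_arrival(key, payload, own_side, opposite_side):
    """Store the tuple on its own side, then probe the opposite side for matches."""
    route(key, own_side).insert(key, payload)
    return [(payload, other) for other in route(key, opposite_side).probe(key)]


r_side = [ProcessingUnit(f"R{i}") for i in range(1, 4)]  # n = 3 (assumed)
s_side = [ProcessingUnit(f"S{i}") for i in range(1, 3)]  # m = 2 (assumed)

results = []
results += on_arrival(7, "r-a", r_side, s_side)  # no S tuple with key 7 yet
results += on_arrival(7, "s-x", s_side, r_side)  # joins with r-a
results += on_arrival(7, "r-b", r_side, s_side)  # joins with s-x
```

Because every tuple is stored before it probes the other side, each matching pair is produced exactly once, by whichever of the two tuples arrives later.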

2) The load statistics of the nodes on each side of the bipartite-graph join model are monitored periodically at a preset time interval, collected, and sent to a pre-built data stream controller. If the data stream controller detects that some processing units exceed the threshold of the load-balancing factor, a migration strategy is formulated dynamically according to heuristic rules.

3) Before data migration, newly arriving stream data is buffered in Kafka, a high-throughput distributed publish-subscribe messaging system, and the join of the new data is postponed. The data stream and the join state information are then migrated according to the migration strategy, and the routing table is updated synchronously.

4) The data buffered in Kafka and the newly arriving data continue to be sent on, completing the subsequent online join operations.
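The pause-and-replay behaviour of steps 3) and 4) can be sketched as follows. This is only an illustration: a `collections.deque` stands in for the Kafka buffer, and the class `BufferedJoin`, its methods and the unit label `"hash-default"` are hypothetical.

```python
from collections import deque


class BufferedJoin:
    """Toy join front end that postpones new tuples while state is migrated."""

    def __init__(self):
        self.routing_table = {}   # key -> unit that overrides the hash mapping
        self.pending = deque()    # in-memory stand-in for the Kafka buffer
        self.migrating = False
        self.joined = []          # (key, payload, unit) in processing order

    def on_tuple(self, key, payload):
        if self.migrating:
            self.pending.append((key, payload))  # step 3: postpone the join
        else:
            self._join(key, payload)

    def _join(self, key, payload):
        unit = self.routing_table.get(key, "hash-default")
        self.joined.append((key, payload, unit))

    def begin_migration(self):
        self.migrating = True

    def finish_migration(self, routing_updates):
        self.routing_table.update(routing_updates)  # update the routing table
        self.migrating = False
        while self.pending:                         # step 4: replay in arrival order
            self._join(*self.pending.popleft())


op = BufferedJoin()
op.on_tuple(1, "a")
op.begin_migration()
op.on_tuple(1, "b")
op.on_tuple(2, "c")
op.finish_migration({1: "pu3"})
op.on_tuple(1, "d")
```

Replaying the buffer only after the routing table has been updated means that postponed tuples are joined against the state at its new location.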

In step 2) above, suppose that at time t the load of some processing unit exceeds the upper bound of the imbalance-factor threshold; this unit is labeled pu_max, i.e. L_t(pu_max) > (1+θ_max), where L_t(pu_max) is the total load of processing unit pu_max at time t and θ_max is the bound of the maximum imbalance factor. Alternatively, the load of some processing unit falls below the lower bound of the imbalance-factor threshold; this unit is labeled pu_min, i.e. L_t(pu_min) < (1-θ_max), where L_t(pu_min) is the total load of processing unit pu_min at time t. To keep every processing unit of the bipartite-graph join model balanced while migrating as little data as possible, the invention sets the following heuristic rules:
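The threshold test can be sketched as below. One assumption is made explicit: the text compares L_t(pu) with 1±θ_max directly, and the sketch reads this as a load expressed relative to the average load of the units, so it scales the bounds by the mean. The function name `classify_units` is hypothetical.

```python
def classify_units(loads, theta_max):
    """Split units into overloaded and underloaded sets.

    Assumption: the bounds (1 + theta_max) and (1 - theta_max) are applied
    to the load relative to the mean load of all units.
    """
    mean = sum(loads.values()) / len(loads)
    overloaded = sorted(pu for pu, load in loads.items()
                        if load > (1 + theta_max) * mean)
    underloaded = sorted(pu for pu, load in loads.items()
                         if load < (1 - theta_max) * mean)
    return overloaded, underloaded


loads = {"pu1": 180, "pu2": 100, "pu3": 90, "pu4": 30}  # mean load is 100
over, under = classify_units(loads, theta_max=0.5)      # bounds are 150 and 50
```

Here `pu1` plays the role of pu_max and `pu4` the role of pu_min; the other two units lie within the bounds and are left alone.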

2.1) Rule H1: for a processing unit that needs to move data out, if moving out the tuples of a loaded key directly satisfies the imbalance-factor threshold, the move-out is carried out directly and the migrated key is recorded in the routing table.

2.2) Rule H2: for a processing unit that needs to move data out, if the imbalance-factor threshold is still not satisfied after the tuples of some keys have been moved out, the key with the larger number of tuples is split, part of the split data is moved out, and the migrated key is recorded in the routing table.

2.3) Rule H3: for a processing unit that needs to move data in, if a key exists in the routing table, the tuples of that key are preferentially merged back into the processing unit to which the hash function maps the key, and the record in the routing table is cleared.
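One possible reading of rules H1 and H2 is sketched below for a single overloaded unit. The interpretation is an assumption: H1 is taken to mean that moving one whole key brings the load inside the bounds, and H2 to mean splitting the heaviest key and moving just enough of it. The function `plan_move_out` and the routing values `"whole"` and `"split"` are hypothetical.

```python
import math


def plan_move_out(key_counts, upper, lower):
    """Return (moves, routing_entries) for one overloaded processing unit.

    key_counts maps each key to its tuple count on this unit; moves is a
    list of (key, tuples_to_move). This is an interpretation of H1/H2,
    not the patent's exact procedure.
    """
    load = sum(key_counts.values())
    # H1: a single key whose complete migration leaves the load in [lower, upper].
    for key, count in sorted(key_counts.items(), key=lambda kv: kv[1]):
        if lower <= load - count <= upper:
            return [(key, count)], {key: "whole"}
    # H2: otherwise split the heaviest key and move only the excess load.
    key, count = max(key_counts.items(), key=lambda kv: kv[1])
    part = min(count, math.ceil(load - upper))
    return [(key, part)], {key: "split"}


h1_plan = plan_move_out({"a": 50, "b": 30, "c": 20}, upper=75, lower=25)
h2_plan = plan_move_out({"hot": 190, "x": 10}, upper=120, lower=80)
```

In the second call no whole key fits: moving `x` leaves 190, moving `hot` leaves 10, so the hot key is split and 80 of its tuples are moved, which brings the load down to the upper bound.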

Based on the above three rules, basic algorithms are given for moving tuples out (MoveOut) and for moving tuples in (MoveIn). The basic algorithm for moving tuples out is:

First, determine the key range of the tuples to be moved out in the move-out set, and determine the processing units that will receive the tuples;

Then, for each key to be moved out, carry out the move-out according to heuristic rules H1 and H2, and update the routing table;

Finally, determine the migration plan.

The basic algorithm for moving tuples in is:

First, determine the key range of the tuples to be moved in in the move-in set, and determine the processing units from which the tuples will be moved out;

Then, for each key to be moved in, carry out the move-in according to heuristic rule H3, and update the routing table;

Finally, determine the migration plan.
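The move-in side, driven by rule H3, might look like the sketch below. It assumes the routing table maps each migrated key to the unit currently holding it, and that the receiving unit has a spare capacity measured in tuples; `plan_move_in`, `hash_home`, `key_sizes` and `capacity` are hypothetical names.

```python
def plan_move_in(target, routing_table, hash_home, key_sizes, capacity):
    """H3 sketch: pull migrated keys back to their hash-mapped unit first.

    Returns the migration plan as (key, source, destination) triples and
    clears the routing-table record of every key that returns home.
    """
    plan = []
    for key in sorted(routing_table):
        if hash_home(key) == target and key_sizes[key] <= capacity:
            plan.append((key, routing_table[key], target))
            capacity -= key_sizes[key]
    for key, _, _ in plan:
        del routing_table[key]  # the key is back at its hash position
    return plan


units = ["pu0", "pu1", "pu2"]
routing_table = {3: "pu1", 4: "pu2", 6: "pu2"}  # keys that were migrated earlier
key_sizes = {3: 40, 4: 10, 6: 70}
plan = plan_move_in("pu0", routing_table, lambda k: units[k % 3], key_sizes, 50)
```

Key 3 hashes to `pu0` and fits, so it returns home and its record is removed. Key 6 also hashes to `pu0` but exceeds the remaining capacity, and key 4 belongs to another unit, so both records stay.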

In step 2) above, the migration strategy is formulated dynamically on the basis of three costs defined according to the migration type:

(1) Network cost C_network: when data is split, tuples with the same key are distributed over different processing units, and the join incurs a cost for replicating data.

(2) Migration cost C_migration: the cost of moving tuples from one processing unit to another.

(3) Routing cost C_routing: the cost, after data migration, of recording the mapping between keys and processing units and maintaining the migration routes.

At time t, let F = {f1, f2, f3, ...} be the set of all migration functions. Depending on the migration strategy, the cost C_t(f_i) of each migration function f_i (i = 1, 2, ...) is:

C_t(f_i) = α × C_network(f_i) + β × C_migration(f_i) + γ × C_routing(f_i),

where α, β and γ are the weights of the three migration costs and α + β + γ = 1. Because replicating data across multiple nodes causes data to be transmitted several times, the network cost has the greatest influence on C_t(f_i), followed by the cost of migrating data between nodes, and finally the maintenance cost of the routing table. Based on empirical values from experiments, α, β and γ are set to 0.5, 0.3 and 0.2 respectively.

The optimization objective is to minimize the migration cost subject to the load-balancing condition being satisfied, where pu denotes a processing unit and PU denotes the set of all processing units.
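The weighted cost can be sketched as follows, using the weights 0.5, 0.3 and 0.2 given in the text. The class `MigrationOption`, the two candidate options and their cost figures are invented for illustration, and both candidates are assumed to restore load balance.

```python
from dataclasses import dataclass

ALPHA, BETA, GAMMA = 0.5, 0.3, 0.2  # weights from the text; they sum to 1


@dataclass(frozen=True)
class MigrationOption:
    """Hypothetical container for the three cost components of one f_i."""
    name: str
    network: float
    migration: float
    routing: float


def total_cost(f, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """C_t(f_i) as the weighted sum of network, migration and routing cost."""
    return alpha * f.network + beta * f.migration + gamma * f.routing


def cheapest(options):
    """Pick the lowest-cost option among candidates that restore balance."""
    return min(options, key=total_cost)


options = [
    MigrationOption("move-whole-key", network=0.0, migration=40.0, routing=1.0),
    MigrationOption("split-hot-key", network=30.0, migration=20.0, routing=1.0),
]
best = cheapest(options)
```

With these made-up figures the whole-key move costs about 12.2 and the split costs about 21.2: splitting moves fewer tuples but pays the heavily weighted network cost for replication.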

In step 2) above, at time t, to keep the processing units on one side of the bipartite-graph join model balanced while migrating as little data as possible, data migration within one side uses the ISM algorithm, whose process is as follows:

First, compute the load L_t(pu) of each processing unit at time t, and calculate the average load;

Then, for each processing unit that needs to move data out, call the MoveOut algorithm;

Finally, for each processing unit that needs to move data in, call the MoveIn algorithm.
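The overall shape of one ISM round can be sketched as below. The move-out and move-in phases are replaced by a simple greedy stand-in (shed the lightest key to the least-loaded unit), which is an assumption and not the MoveOut and MoveIn algorithms themselves; the upper bound is also assumed to be relative to the mean load. The function name `ism_round` is hypothetical.

```python
def ism_round(unit_keys, theta_max):
    """One round of single-side migration over {unit: {key: tuple_count}}.

    Returns the migration plan as (key, source, destination) triples and
    updates unit_keys in place. Greedy stand-in for MoveOut / MoveIn.
    """
    loads = {pu: sum(keys.values()) for pu, keys in unit_keys.items()}
    mean = sum(loads.values()) / len(loads)
    upper = (1 + theta_max) * mean
    plan = []
    for src in sorted(loads, key=loads.get, reverse=True):
        # Move-out phase: shed the lightest keys while the unit is overloaded.
        while loads[src] > upper and len(unit_keys[src]) > 1:
            key = min(unit_keys[src], key=unit_keys[src].get)
            dst = min(loads, key=loads.get)  # move-in phase: least-loaded unit
            count = unit_keys[src].pop(key)
            unit_keys[dst][key] = unit_keys[dst].get(key, 0) + count
            loads[src] -= count
            loads[dst] += count
            plan.append((key, src, dst))
    return plan


unit_keys = {
    "pu1": {"a": 60, "b": 50, "c": 40},
    "pu2": {"d": 30},
    "pu3": {"e": 30},
}
plan = ism_round(unit_keys, theta_max=0.2)  # mean load 70, upper bound 84
```

After the round the loads are 60, 70 and 80, all below the upper bound of 84, at the price of two key migrations.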

In step 2) above, dynamic changes in data rates can cause a large gap between the amounts of data on the two sides of the bipartite-graph join model, which seriously affects the throughput of the distributed data stream management system. Logical migration between the nodes of the two sides uses the S2SM algorithm, whose process is as follows:

First, compute the load L_t(pu) of each processing unit, and compute three averages at time t: the average load of the side with m processing units, the average load of the side with n processing units, and the average load of the whole cluster;

Then, according to the threshold, determine which side moves tuples out and which side moves tuples in;

Finally, on the move-out side, identify the processing units that need to move data out and call the MoveOut algorithm; on the move-in side, identify the processing units that need to move data in and call the MoveIn algorithm.
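The first two steps of S2SM, computing the three averages and choosing the direction, can be sketched as below. The decision rule is an assumption: a side is treated as the move-out side when its average load exceeds (1 + θ_max) times the cluster average. The function name `s2sm_direction` is hypothetical.

```python
def s2sm_direction(r_loads, s_loads, theta_max):
    """Return (move_out_side, move_in_side), or None if the sides are balanced.

    r_loads and s_loads are the per-unit loads of the two sides. The
    threshold test against the cluster average is an assumed reading.
    """
    avg_r = sum(r_loads) / len(r_loads)  # side with n processing units
    avg_s = sum(s_loads) / len(s_loads)  # side with m processing units
    avg_all = (sum(r_loads) + sum(s_loads)) / (len(r_loads) + len(s_loads))
    if avg_r > (1 + theta_max) * avg_all:
        return ("R", "S")
    if avg_s > (1 + theta_max) * avg_all:
        return ("S", "R")
    return None


skewed = s2sm_direction([90, 110, 100], [20, 40], theta_max=0.2)
balanced = s2sm_direction([50, 50], [50, 50], theta_max=0.2)
```

In the skewed case the R side averages 100 against a cluster average of 72, so R is chosen as the move-out side and S as the move-in side.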

Example:

Query tasks. Three query tasks are chosen in total: two are the equi-joins Q3 and Q5 provided by TPC-H, and one is a range query (Band). The Band query is specified as:

SELECT * FROM LINEITEM L1, LINEITEM L2
WHERE ABS(L1.orderkey - L2.orderkey) <= 1
AND (L1.shipmode = 'TRUCK' AND L2.shipinstruct = 'NONE')
AND L1.Quantity > 48
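To make the Band predicate concrete, the following sketch evaluates it with a nested loop over a few invented rows. The rows and the function `band_join` are illustrative only; a real LINEITEM table has many more columns.

```python
def band_join(lineitems):
    """Nested-loop evaluation of the Band predicate over a list of dicts."""
    out = []
    for l1 in lineitems:
        # Conditions that involve L1 only.
        if l1["shipmode"] != "TRUCK" or l1["quantity"] <= 48:
            continue
        for l2 in lineitems:
            # Condition on L2 plus the band condition on the order keys.
            if (l2["shipinstruct"] == "NONE"
                    and abs(l1["orderkey"] - l2["orderkey"]) <= 1):
                out.append((l1["orderkey"], l2["orderkey"]))
    return out


rows = [
    {"orderkey": 1, "shipmode": "TRUCK", "shipinstruct": "NONE", "quantity": 50},
    {"orderkey": 2, "shipmode": "MAIL", "shipinstruct": "NONE", "quantity": 49},
    {"orderkey": 4, "shipmode": "TRUCK", "shipinstruct": "NONE", "quantity": 10},
    {"orderkey": 5, "shipmode": "TRUCK", "shipinstruct": "COLLECT COD", "quantity": 49},
]
pairs = band_join(rows)
```

Unlike an equi-join, a row here can match rows whose keys differ by one, so a matching tuple may hash to a different processing unit than the probing tuple.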

Comparison models. Three algorithms are used to compare query performance: DB, JB and JB6. DB is the algorithm proposed by the invention, i.e. the dynamic bipartite-graph join model. JB assigns the cluster nodes evenly to the two sides of the bipartite graph. JB6 assigns the nodes evenly and then divides the nodes within each side of the bipartite graph into 6 subgroups with random routing.

Using a 10 GB data set with skew z = 1, the throughput and latency of the three models are compared on the three query tasks. As shown in Fig. 2a, DB achieves the highest throughput because it dynamically adjusts the load of the processing units. JB must broadcast to all nodes of one side, which generates heavy traffic, so its throughput is the lowest. As for processing latency, Fig. 2b shows that the latency of DB is lower than that of both JB and JB6.

As shown in Fig. 3 a~Fig. 3 c, the disposition of different inclination data streams is tackled for checking DB models, passes through change Zipf distribution situations, test the network service cost of three kinds of models.By comparison diagram 3a, Fig. 3 b and Fig. 3 c, it is found that JB needs to do The whole network broadcast operation, its communication cost highest；JB6 only does subnet broadcast operation, and its communication cost is minimum；It is different to tackle Gradient, the throughput of system is improved, DB needs to do data migration operation, and its communication cost is slightly above JB6.Further, since Q5 The data flow being related to is most, and its communication cost is higher than Q3 and Band.

The above embodiments merely illustrate the invention, and each step may be varied. Any improvement or equivalent transformation of individual steps made on the basis of the technical solution and according to the principle of the invention shall not be excluded from the scope of protection of the invention.

Claims

1. A processing method for online joins over skewed data streams, characterized by comprising the following steps:

1) data stream R and data stream S are randomly partitioned into n nodes R1, R2, ..., Rn and m nodes S1, S2, ..., Sm respectively; each node is stored in one of the n or m processing units, and data streams R and S are located on the two sides of the bipartite-graph join model; R and S are partitioned into tuples by a key-based hash function; each tuple of R or S is assigned to and stored on a node on its own side, and is simultaneously sent to the processing units on the opposite side to perform the online join;

2) the load statistics of the nodes on each side of the bipartite-graph join model are monitored periodically at a preset time interval, collected, and sent to a pre-built data stream controller; if the data stream controller detects that some processing units exceed the threshold of the load-balancing factor, a migration strategy is formulated dynamically according to heuristic rules;

3) before data migration, newly arriving stream data is buffered in a high-throughput distributed publish-subscribe messaging system, and the join of the new data is postponed; the data stream and the join state information are then migrated according to the migration strategy, and the routing table is updated synchronously;

4) the data buffered in Kafka and the newly arriving data continue to be sent on, completing the subsequent online join operations.

2. The processing method for online joins over skewed data streams according to claim 1, characterized in that, in step 2), the heuristic rules are set as follows:

2.1) rule H1: for a processing unit that needs to move data out, if moving out the tuples of a loaded key directly satisfies the imbalance-factor threshold, the move-out is carried out directly and the migrated key is recorded in the routing table;

2.2) rule H2: for a processing unit that needs to move data out, if the imbalance-factor threshold is still not satisfied after the tuples of some keys have been moved out, the key with the larger number of tuples is split, part of the split data is moved out, and the migrated key is recorded in the routing table;

2.3) rule H3: for a processing unit that needs to move data in, if a key exists in the routing table, the tuples of that key are preferentially merged back into the processing unit to which the hash function maps the key, and the record in the routing table is cleared.

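3. The processing method for online joins over skewed data streams according to claim 2, characterized in that, based on the heuristic rules, basic algorithms are defined for moving tuples out and for moving tuples in, and the basic algorithm for moving tuples out is:

first, determine the key range of the tuples to be moved out in the move-out set, and determine the processing units that will receive the tuples;

then, for each key to be moved out, carry out the move-out according to heuristic rules H1 and H2, and update the routing table;

finally, determine the migration plan;

the basic algorithm for moving tuples in is:

first, determine the key range of the tuples to be moved in in the move-in set, and determine the processing units from which the tuples will be moved out;

then, for each key to be moved in, carry out the move-in according to heuristic rule H3, and update the routing table;

finally, determine the migration plan.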

4. The processing method for online joins over skewed data streams according to claim 3, characterized in that, in step 2), the migration strategy is formulated dynamically on the basis of three costs defined according to the migration type:

(1) network cost C_network: when data is split, tuples with the same key are distributed over different processing units, and the join incurs a cost for replicating data;

(2) migration cost C_migration: the cost of moving tuples from one processing unit to another;

(3) routing cost C_routing: the cost, after data migration, of recording the mapping between keys and processing units and maintaining the migration routes.

5. The processing method for online joins over skewed data streams according to claim 3, characterized in that, in step 2), at time t, data migration within one side uses the ISM algorithm, whose process is as follows:

first, compute the load L_t(pu) of each processing unit at time t, and calculate the average load;

then, for each processing unit that needs to move data out, call the move-out algorithm;

finally, for each processing unit that needs to move data in, call the move-in algorithm.

6. The processing method for online joins over skewed data streams according to claim 3, characterized in that, in step 2), logical migration between the nodes of the two sides uses the S2SM algorithm, whose process is as follows:

first, compute the load L_t(pu) of each processing unit, and compute three averages at time t: the average load of the side with m processing units, the average load of the side with n processing units, and the average load of the whole cluster;

then, according to the threshold, determine which side moves tuples out and which side moves tuples in;

finally, on the move-out side, identify the processing units that need to move data out and call the move-out algorithm; on the move-in side, identify the processing units that need to move data in and call the move-in algorithm.