CN104794186A

CN104794186A - Collecting method for training samples of database load response time predicting model

Info

Publication number: CN104794186A
Application number: CN201510171679.5A
Authority: CN
Inventors: 牛保宁; 张锦文
Original assignee: Taiyuan University of Technology
Current assignee: Taiyuan University of Technology
Priority date: 2015-04-13
Filing date: 2015-04-13
Publication date: 2015-07-22
Anticipated expiration: 2035-04-13
Also published as: CN104794186B

Abstract

The invention relates to a method for collecting training samples for a database load response time prediction model, and belongs to clustering-based sample collection methods. The method comprises the following steps: 1, response data of the database are obtained while each load runs in isolation; 2, response data of the database are obtained while the loads run in pairs; 3, the change in average page read time is calculated; 4, the full sample space is clustered according to the change in average page read time; 5, a sample selection table is filled; 6, training samples are generated. The method reduces the number of samples the statistical model requires, lowering the modeling cost while preserving model accuracy.

Description

Method for collecting training samples for a database load response time prediction model

Technical field

The invention belongs to clustering-based sample collection methods and is a training sample collection method applied to database load response time prediction models.

Background technology

In current parallel database systems, predicting load response time is extremely important: it helps database administrators tune database parameters and schedule concurrent loads sensibly.

However, because the interaction mechanism between concurrent database loads is very complex, building a traditional analytical model is complicated and its predictions are poor. Existing work therefore mainly builds statistical models to predict load response time; that is, a statistical model is built in three steps: sample collection, model training (regression), and model evaluation. The main references in this area are:

[1] Duggan J, Cetintemel U, Papaemmanouil O, et al. Performance Prediction for Concurrent Database Workloads[C]//Proc. of 2011 ACM SIGMOD Conference (SIGMOD'2011). Athens, Greece, 2011: 337-348.

[2] Ahmad M, Aboulnaga A, Babu S, et al. Modeling and Exploiting Query Interaction in Database Systems[C]//Proc. of the 17th Conference on Information and Knowledge Management (CIKM'2008). Napa Valley, US, 2008: 183-192.

[3] Ahmad M, Aboulnaga A, Babu S, et al. QShuffler: Getting the Query Mix Right[C]//Proc. of the 24th International Conference on Data Engineering (ICDE'2008). Cancun, Mexico, 2008: 1415-1417.

[4] Ahmad M, Duan S, Aboulnaga A, et al. Predicting Completion Times of Batch Query Workloads Using Interaction-aware Models and Simulation[C]//Proc. of the 14th International Conference on Extending Database Technology (EDBT'2011). Uppsala, Sweden, 2011: 449-460.

[5] Ahmad M, Duan S, Aboulnaga A, et al. Interaction-aware Scheduling of Report Generation Workloads[J]. The VLDB Journal, 2011, 20(4): 589-615.

[6] Sheikh M B, Minhas U F, Khan O Z, et al. A Bayesian Approach to Online Performance Modeling for Database Appliances Using Gaussian Models[C]//Proc. of the 8th International Conference on Autonomic Computing (ICAC'2011). Karlsruhe, Germany, 2011: 121-130.

However, the sampling methods used with the above statistical models do not take the interaction between loads into account; samples are obtained only by targeted sampling or random sampling of the full sample space. As the volume of data in the database grows, load running time increases; if training samples are not selected carefully, model training takes longer and the cost of building the model becomes very large.

Summary of the invention

To reduce the cost of building the model and shorten the time needed to build it, the invention provides a method for collecting training samples that reduces model-building cost without noticeably reducing the prediction accuracy of the model.

Technical solution of the invention: a method for collecting training samples for a database load response time prediction model, comprising the following:

1. Obtain the response data of the database while each load runs in isolation;

That is, while each load q runs in isolation, obtain its response time, CPU time, number of logical reads, and BAL value. BAL is the Buffer Access Latency defined in [1]; it denotes the average time the database system takes to complete one physical read, and is referred to in this invention as the average page read time. The Buffer Access Latency value is taken from Duggan J, Cetintemel U, Papaemmanouil O, et al. Performance Prediction for Concurrent Database Workloads//Proc. of 2011 ACM SIGMOD Conference (SIGMOD'2011). Athens, Greece, 2011: 337-348.

A load q denotes an executable database load generated from load template C_q.

A load template is generated from a parameterized database query or update statement; different query or update statements are treated as different load templates. Loads generated from the same load template with different parameters are treated as the same load.
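The distinction between a load template and the loads it generates can be illustrated with a short Python sketch. The SQL text, the parameter name `region`, and the helper names are hypothetical and are not taken from the invention; the sketch only shows that two statements with different parameter values still count as the same load when they come from the same template.

```python
from string import Template

# Hypothetical load template C_q: a query statement with one parameter.
TEMPLATE_Q = Template("SELECT COUNT(*) FROM orders WHERE region = '$region'")


def generate_load(template: Template, **params: str) -> str:
    """Return an executable statement (a load) generated from a load template."""
    return template.substitute(**params)


def same_load(template_a: Template, template_b: Template) -> bool:
    """Loads count as the same load when they come from the same template,
    regardless of the parameter values used to generate them."""
    return template_a.template == template_b.template


load_a = generate_load(TEMPLATE_Q, region="north")
load_b = generate_load(TEMPLATE_Q, region="south")
```

Here `load_a` and `load_b` are different statements, but both would be recorded under the single load q of template C_q.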

2. Obtain the response data of the database while loads run in pairs. That is, while a first load q_i and a second load q_j run as a pair, obtain the response time, CPU time, number of logical reads, and BAL value of each. The first load q_i and the second load q_j are generated from two different load templates (the first load template C_qi and the second load template C_qj), respectively.

3. Calculate the change in average page read time;

The change in average page read time is defined by ΔT_{q_s} = T_{q_s} - T_q, where T_{q_s} denotes the BAL value of a load q (generated from load template C_q) in sample s, and T_q denotes the BAL value of load q running in isolation.
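The definition can be written as a one-line function. The following Python sketch uses made-up BAL values (the unit, milliseconds per physical read, is also an assumption) purely to show the arithmetic.

```python
def delta_t(bal_in_sample: float, bal_isolated: float) -> float:
    """Change in average page read time: delta T_{q_s} = T_{q_s} - T_q."""
    return bal_in_sample - bal_isolated


# Hypothetical BAL values, in milliseconds per physical read.
T_q = 2.0     # load q running in isolation
T_q_s = 3.5   # load q running inside sample s
change = delta_t(T_q_s, T_q)
```

A positive value means that load q reads pages more slowly inside the sample than it does alone.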

The change in average page read time also satisfies the following formula:

[Formula not reproduced in the source text.]

where ΔT_{q/c_ij} denotes the BAL value of load q when load q runs as a pair with another load c_ij, c_ij being the load generated from query template C_ci in sample s_j; and ΔT_{q/c_i} denotes the BAL value of load q when load q runs as a pair with another load c_i, c_i being the load generated from query template C_ci in sample s.

The ΔT_{q/c} values obtained from the paired runs are used to calculate ΔT_{q_s} of a load q at a higher MPL (Multi Programming Level, the maximum degree of concurrency of the database system, i.e., the number of loads that can run simultaneously). ΔT_{q_s} is then calculated by the following formula:

[Formula not reproduced in the source text.]

4. Cluster the full sample space according to the change in average page read time;

For each load type q, at a given MPL (Multi Programming Level), all of its ΔT_{q_s} values are clustered. The clustering method is the K-means algorithm with Euclidean distance as the metric. The number of clusters is MPL*2.
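A minimal sketch of this clustering step in Python follows. It is a plain Lloyd-style K-means on scalar values (in one dimension the Euclidean distance is the absolute difference), not the implementation used by the inventors; the function name, the seed, the iteration cap, and the sample ΔT values are all assumptions made for illustration.

```python
import random


def kmeans_1d(values, k, iterations=100, seed=0):
    """K-means on scalar values. Returns (labels, centers)."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))            # no more clusters than distinct values
    centers = random.Random(seed).sample(distinct, k)

    def assign(current_centers):
        # Assignment step: index of the nearest center for every value.
        return [min(range(k), key=lambda c: abs(v - current_centers[c]))
                for v in values]

    for _ in range(iterations):
        labels = assign(centers)
        # Update step: mean of each cluster (an empty cluster keeps its center).
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return assign(centers), centers


mpl = 2
delta_ts = [0.1, 0.12, 0.9, 0.95, 2.0, 2.1, 3.7, 3.8]  # hypothetical values
labels, centers = kmeans_1d(delta_ts, k=mpl * 2)
```

Each entry of `labels` is the class of the corresponding ΔT value; with MPL=2 the sketch requests MPL*2 = 4 clusters.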

5. Fill the sample selection table;

6. Generate the training samples.

The invention reduces the number of samples the statistical model requires, lowering model-building cost while preserving model accuracy.

Embodiment

Embodiment: suppose five load types q₁, q₂, q₃, q₄, q₅ are given; the MPL is 4, meaning that four loads can run simultaneously in the database; the current sample is s₀(q₁, q₂, q₃, q₄). Here q₁, q₂, q₃, q₄, q₅ are generated from five query templates C_q1, C_q2, C_q3, C_q4, C_q5, respectively. The database system is IBM DB2, version 9.5.

1. Obtain the response data while each load runs in isolation; the response data comprise the response time, CPU time, number of logical reads, and BAL value T_q;

Run loads q₁, q₂, q₃, q₄, q₅ in isolation and obtain the response time, CPU time, number of logical reads, and isolated-run BAL value of each. The data are obtained with the DB2 snapshot monitor command: "db2 get snapshot for dynamic sql on database".

2. Obtain the response data while loads run in pairs; combine q₁, q₂, q₃, q₄, q₅ into pairs and, for every pairwise combination (10 paired runs), obtain the paired-run response time, paired-run CPU time, paired-run number of logical reads, and paired-run BAL value T_{q/c}. The data are again obtained with the DB2 snapshot monitor command.
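The count of 10 paired runs follows from choosing 2 of the 5 load types. The Python sketch below enumerates them; the string labels and variable names are illustrative only. Since each paired run yields a BAL value for both of its loads, the sketch also lists the 20 ordered (q, c) combinations for which a T_{q/c} value is recorded.

```python
from itertools import combinations, permutations

load_types = ["q1", "q2", "q3", "q4", "q5"]

# Each unordered pair is one paired run of two different loads.
paired_runs = list(combinations(load_types, 2))

# Each paired run gives one BAL value per load: T_{q/c} for load q paired with c.
measured = list(permutations(load_types, 2))
```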

3. Calculate the change in average page read time

The range of ΔT_{q1_s0} is calculated as follows:

The current sample is s₀(q₁, q₂, q₃, q₄), with MPL=4. The MPL one level below the current MPL is 3; the samples at that level that can be generated from s₀ and that contain load q₁ are s₁(q₁, q₂, q₃), s₂(q₁, q₂, q₄), and s₃(q₁, q₃, q₄).
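The three lower-level samples can be enumerated mechanically. The following Python sketch (function and variable names are illustrative) lists every sub-sample of s₀ that is one MPL level lower and still contains the load of interest.

```python
from itertools import combinations


def lower_level_samples(sample, load):
    """All sub-samples one MPL level below `sample` that still contain `load`."""
    size = len(sample) - 1
    return [s for s in combinations(sample, size) if load in s]


s0 = ("q1", "q2", "q3", "q4")
subs = lower_level_samples(s0, "q1")
```

For s₀ and q₁ this yields the samples s₁, s₂, and s₃ named in the text; the fourth size-3 subset, (q₂, q₃, q₄), is excluded because it does not contain q₁.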

Then:

[Formula not reproduced in the source text.]

and:

[Formula not reproduced in the source text.]

The calculated value of ΔT_{q1_s0} can thus be given by the following formula:

[Formula not reproduced in the source text.]

This yields the calculated value of ΔT_{q1_s0}, which represents the change in average page read time of load q₁ in sample s₀.

The changes in average page read time of the other three load types can be obtained in the same way.

4. Cluster the full sample space according to the change in average page read time

For MPL=4, all samples that contain q₁ are s₀(q₁, q₂, q₃, q₄), s₄(q₁, q₂, q₄, q₅), s₅(q₁, q₃, q₄, q₅), and s₆(q₁, q₂, q₃, q₅).
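The four samples can be reproduced by generating the full sample space and filtering it. The Python sketch below assumes, as in this embodiment, that a sample contains each load type at most once; the function names are illustrative.

```python
from itertools import combinations


def generate_sample_space(load_types, mpl):
    """Full sample space: every combination of `mpl` distinct load types."""
    return list(combinations(load_types, mpl))


def samples_containing(space, load):
    """Samples of the space that include the given load type."""
    return [s for s in space if load in s]


space = generate_sample_space(["q1", "q2", "q3", "q4", "q5"], 4)
with_q1 = samples_containing(space, "q1")
```

The full space has five samples; only (q₂, q₃, q₄, q₅) lacks q₁, leaving the four samples s₀, s₄, s₅, and s₆.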

ΔT_{q1_s0}, ΔT_{q1_s4}, ΔT_{q1_s5}, and ΔT_{q1_s6} are calculated for the respective samples, and K-means clustering is then applied to these four values.

In an actual production environment there are more than 20 load types and the MPL lies between 30 and 200, so for each load type q at a given MPL there are many samples that contain q. K-means clustering is applied to the set of ΔT_{q_s} values, and the number of clusters is generally chosen as MPL*2.
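How quickly the sample space grows can be estimated by counting combinations. The Python sketch below is only a rough illustration: the text does not state whether a production sample may contain the same load type several times, so the `allow_repeats` switch is an assumption (with 20 load types and an MPL of 30 or more, a sample of distinct load types would be impossible).

```python
from math import comb


def sample_space_size(num_types: int, mpl: int, allow_repeats: bool) -> int:
    """Number of samples of `mpl` loads drawn from `num_types` load types."""
    if allow_repeats:
        # Multisets of size mpl: C(num_types + mpl - 1, mpl).
        return comb(num_types + mpl - 1, mpl)
    # Sets of distinct load types: C(num_types, mpl), zero when mpl > num_types.
    return comb(num_types, mpl)


embodiment = sample_space_size(5, 4, allow_repeats=False)
production = sample_space_size(20, 30, allow_repeats=True)
```

The embodiment's space holds five samples, while the production-scale count is many orders of magnitude larger, which is why sample selection matters.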

5. Fill the sample selection table

For each sample s selected from the clusters, every load it contains carries a number indicating its class.

For example, for s₀(q₁, q₂, q₃, q₄), one possible classification result is K_s0(3,1,7,4), meaning that ΔT_{q1_s0} belongs to class 3 in the full sample space, ΔT_{q2_s0} to class 1, ΔT_{q3_s0} to class 7, and ΔT_{q4_s0} to class 4.
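One way to hold such a classification result in code is a tuple of class numbers in the order of the sample's loads. The Python sketch below assembles K_s0 from per-load clustering output; the dictionary layout and the function name are assumptions made for illustration, and the class numbers are the ones from the example above.

```python
def classification_result(sample, labels_by_load):
    """Build K_s: the cluster class of each load of `sample`, in sample order."""
    return tuple(labels_by_load[q][sample] for q in sample)


s0 = ("q1", "q2", "q3", "q4")

# Hypothetical clustering output: load type -> {sample -> cluster class}.
labels_by_load = {
    "q1": {s0: 3},
    "q2": {s0: 1},
    "q3": {s0: 7},
    "q4": {s0: 4},
}

K_s0 = classification_result(s0, labels_by_load)
```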

Each sample s has a corresponding classification result K_s.

Clustering yields the following table:

[Table not reproduced in the source text.]

According to the above classification results, the following sample selection table is filled:

[Table not reproduced in the source text.]

Here, because the example contains few load types, the sample table has some vacancies. In actual production, some positions conflict, so that some positions cannot be filled. In that case the method falls back to random selection and combines samples for the unfilled positions.

6. Generate the training samples

The samples in the sample selection table obtained in step 5 are the required model training samples.

The invention provides the following filling algorithm:

Input: load templates C; MPL = M

Output: selected sample set SampleSeled

/* Generate the full sample space SampleSpace */
SampleSpace = GenerateSampleSpace(M, C);

For S_j ∈ SampleSpace
    /* Compute ΔT_{q_s} of each load type in each sample */
    ComputeDIF_BAL(S_j);
End For

For i = 1 to |C|, S_j ∈ SampleSpace
    /* Cluster all ΔT_{qi_Sj} values of each load type q_i; the number of clusters is M*2 */
    Kmeans(q_i, ΔT_{qi_Sj}, M*2);
End For

For S_j ∈ SampleSpace
    /* Compute the fill mutual-exclusion count Mu of each sample. The Mu value of a
       sample s is the number of other samples in SampleSpace that can no longer be
       inserted once s has been inserted first */
    ComputeMutual(S_j);
End For

/* Sort the sample space in ascending order of the Mu value of each sample */
Sort(Mu_j);

/* Initialize the maximum number of filled samples */
MaxInsNum = 1;

/* K is the number of fill rounds */
For i = 1 to K
    /* Insert sample S_i first */
    InsertS(S_i);
    InsertNum = 1;
    /* Insert, in order, every other sample that can still be inserted */
    For m = i+1 to |SampleSpace|
        /* Check whether S_m can be inserted */
        If (IsInsertS(S_m))
            InsertS(S_m);
            InsertNum++;
        End If
    End For
    /* If this round fills more samples than the best result so far, save the current filling */
    If (InsertNum > MaxInsNum)
        MaxInsNum = InsertNum;
        RecordInsertS();
    End If
End For

/* For the positions that remain unfilled, combine samples randomly */
RandomInsertS();
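The greedy part of the filling algorithm can be sketched in Python under one stated assumption: the sample selection table is taken to have one cell per (load type, cluster class), a sample occupies the cells given by its classification result K_s, and a sample can be inserted only if none of its cells is already taken. The source does not show the table layout, so this reading, the function names, and every K_s value other than K_s0 are hypothetical. The random fallback (RandomInsertS) is omitted.

```python
def cells(sample, classes):
    """Cells a sample would occupy: one (load type, cluster class) per load."""
    return set(zip(sample, classes))


def greedy_fill(samples, rounds):
    """samples maps a sample (tuple of load types) to its K_s tuple.
    Returns the largest conflict-free list of samples found in `rounds` rounds."""
    occupied = {s: cells(s, k) for s, k in samples.items()}
    # Mu of s: number of other samples that conflict with s (share a cell).
    mu = {s: sum(1 for t in occupied if t != s and occupied[s] & occupied[t])
          for s in occupied}
    order = sorted(occupied, key=lambda s: mu[s])    # ascending Mu
    best = []
    for i in range(min(rounds, len(order))):
        chosen, used = [order[i]], set(occupied[order[i]])
        for s in order[i + 1:]:
            if not (occupied[s] & used):             # plays the role of IsInsertS
                chosen.append(s)
                used |= occupied[s]
        if len(chosen) > len(best):                  # keep the fullest filling
            best = chosen
    return best


samples = {
    ("q1", "q2", "q3", "q4"): (3, 1, 7, 4),   # K_s0 from the example
    ("q1", "q2", "q4", "q5"): (3, 2, 5, 1),   # hypothetical; clashes on (q1, 3)
    ("q1", "q3", "q4", "q5"): (1, 2, 6, 2),   # hypothetical
    ("q1", "q2", "q3", "q5"): (2, 3, 1, 3),   # hypothetical
}
selected = greedy_fill(samples, rounds=4)
```

In this toy input the first two samples both place q₁ in class 3, so only one of them can be selected, and the remaining position would be left to the random step.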

Claims

1. A method for collecting training samples for a database load response time prediction model, comprising the following steps:

(1) obtaining the response data of the database while each load runs in isolation;

(2) obtaining the response data of the database while loads run in pairs;

(3) calculating the change in average page read time;

the change in average page read time being defined by ΔT_{q_s} = T_{q_s} - T_q, where T_{q_s} denotes the BAL value of load q in sample s, and T_q denotes the BAL value of load q running in isolation;

and the change in average page read time satisfying the following formula:

[Formula not reproduced in the source text.]

the ΔT_{q/c} values obtained from the paired runs being used to calculate ΔT_{q_s} of a load q at a higher MPL (the maximum degree of concurrency of the database system), ΔT_{q_s} then being calculated by the following formula:

[Formula not reproduced in the source text.]

(4) clustering the full sample space according to the change in average page read time;

(5) filling the sample selection table;

(6) generating the training samples.