CN104794186A - Collecting method for training samples of database load response time predicting model - Google Patents

Collecting method for training samples of database load response time predicting model

Info

Publication number
CN104794186A
Authority
CN
China
Prior art keywords
load
sample
database
average page
represent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510171679.5A
Other languages
Chinese (zh)
Other versions
CN104794186B (en)
Inventor
牛保宁
张锦文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201510171679.5A priority Critical patent/CN104794186B/en
Publication of CN104794186A publication Critical patent/CN104794186A/en
Application granted granted Critical
Publication of CN104794186B publication Critical patent/CN104794186B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for collecting training samples for a database load response time prediction model; it is a clustering-based sample collection method. The method comprises the following steps: 1. obtaining the response data of the database when each load runs alone; 2. obtaining the response data of the database when loads run in pairs; 3. calculating the change in average page read time; 4. clustering the full sample space according to the change in average page read time; 5. filling in a sample selection table; 6. generating the training samples. The method reduces the number of samples needed by the statistical model, lowering modeling cost while maintaining model accuracy.

Description

Method for collecting training samples for a database load response time prediction model
Technical field
The invention is a clustering-based sample collection method, applied to collecting training samples for a database load response time prediction model.
Background
In current parallel database systems, predicting load response time is very important: it helps database administrators tune database parameters and schedule concurrent loads sensibly.
However, because the interaction mechanism between concurrent database loads is very complex, building a traditional analytical model is complicated and its predictions are poor. Existing work therefore mainly builds statistical models to predict load response time. A complete statistical model is built in three steps: sample collection, model training (regression), and model evaluation. The main documents in this area are:
[1] Duggan J, Cetintemel U, Papaemmanouil O, et al. Performance Prediction for Concurrent Database Workloads [C] // Proc. of the 2011 ACM SIGMOD Conference (SIGMOD'2011). Athens, Greece, 2011: 337-348.
[2] Ahmad M, Aboulnaga A, Babu S, et al. Modeling and Exploiting Query Interaction in Database Systems [C] // Proc. of the 17th Conference on Information and Knowledge Management (CIKM'2008). Napa Valley, US, 2008: 183-192.
[3] Ahmad M, Aboulnaga A, Babu S, et al. QShuffler: Getting the Query Mix Right [C] // Proc. of the 24th International Conference on Data Engineering (ICDE'2008). Cancun, Mexico, 2008: 1415-1417.
[4] Ahmad M, Duan S, Aboulnaga A, et al. Predicting Completion Times of Batch Query Workloads Using Interaction-aware Models and Simulation [C] // Proc. of the 14th International Conference on Extending Database Technology (EDBT'2011). Uppsala, Sweden, 2011: 449-460.
[5] Ahmad M, Duan S, Aboulnaga A, et al. Interaction-aware Scheduling of Report Generation Workloads [J]. The VLDB Journal, 2011, 20(4): 589-615.
[6] Sheikh M B, Minhas U F, Khan O Z, et al. A Bayesian Approach to Online Performance Modeling for Database Appliances Using Gaussian Models [C] // Proc. of the 8th International Conference on Autonomic Computing (ICAC'2011). Karlsruhe, Germany, 2011: 121-130.
However, the sampling methods used with these statistical models do not take the interactions between loads into account; they obtain samples only by sampling specific points of the full sample space or by sampling it at random. As the amount of data in the database grows, load running times increase. If the training samples are not selected carefully, model training takes longer and the cost of building the model becomes very high.
Summary of the invention
To reduce the cost and time of building the model, the invention provides a method for collecting training samples that lowers model-building cost without noticeably reducing prediction accuracy.
Technical solution of the invention: a method for collecting training samples for a database load response time prediction model, comprising the following steps:
1. Obtain the response data of each database load when it runs alone.
That is, when each load q runs alone, obtain its response time, CPU time, number of logical reads, and BAL value. BAL is the Buffer Access Latency defined in [1]; it represents the average time the database system takes to complete one physical read, and is referred to here as the average page read time. The Buffer Access Latency value comes from Duggan J, Cetintemel U, Papaemmanouil O, et al. Performance Prediction for Concurrent Database Workloads // Proc. of the 2011 ACM SIGMOD Conference (SIGMOD'2011). Athens, Greece, 2011: 337-348.
Load q denotes an executable database load generated by load template C_q.
A load template is generated from a parameterized database query or update statement; different query or update statements are treated as different load templates. Loads generated from the same load template with different parameters are treated as the same load.
2. Obtain the response data when database loads run in pairs. That is, when a first load q_i and a second load q_j run as a pair, obtain the response time, CPU time, number of logical reads, and BAL value of each. The first load q_i and the second load q_j are generated by two different load templates (the first load template C_qi and the second load template C_qj, respectively).
3. Calculate the change in average page read time.
The change in average page read time is defined by ΔT_q_s = T_q_s - T_q, where T_q_s is the BAL value of a load q (generated by load template C_q) in sample s, and T_q is the BAL value of load q when it runs alone.
The change in average page read time also satisfies the following formula:
where ΔT_q/cij denotes the BAL value of load q when load q and another load c_ij run as a pair, c_ij being the load in sample s_j generated by query template C_ci; and ΔT_q/ci denotes the BAL value of load q when load q and another load c_i run as a pair, c_i being the load in sample s generated by query template C_ci.
The ΔT_q/c values obtained from the paired runs are used to calculate ΔT_q_s for a load q at higher MPL levels (MPL, Multi Programming Level: the maximum degree of parallelism of the database system, i.e. the number of loads that can run at the same time). ΔT_q_s is then calculated by the following formula:
4. Cluster the full sample space according to the change in average page read time.
For each load type q, at a given MPL (Multi Programming Level) level, cluster all of its ΔT_q_s values. The clustering method is the K-means algorithm, with Euclidean distance as the metric. The number of clusters is MPL*2.
5. Fill in the sample selection table.
6. Generate the training samples.
The invention reduces the number of samples needed by the statistical model, lowering model-building cost while maintaining model accuracy.
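The definition used in step 3 can be illustrated with a minimal Python sketch. The function name and the numeric BAL values are hypothetical and serve only to show the subtraction ΔT_q_s = T_q_s - T_q; they are not measurements from the patent.

```python
def delta_bal(bal_in_sample: float, bal_alone: float) -> float:
    """Return delta T_q_s = T_q_s - T_q for one load q in one sample s."""
    return bal_in_sample - bal_alone


# Hypothetical BAL values (average time per physical read, arbitrary units).
t_q_alone = 2.0      # T_q: load q running alone
t_q_in_sample = 3.5  # T_q_s: load q running concurrently inside sample s

change = delta_bal(t_q_in_sample, t_q_alone)
print(change)
```

A positive value means that running inside the sample slowed down the page reads of load q relative to running alone.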
Embodiment
Embodiment: suppose five load types q_1, q_2, q_3, q_4, q_5 are given and the MPL level is 4, meaning that four loads can run in the database at the same time; the current sample is s_0(q_1, q_2, q_3, q_4). The loads q_1, q_2, q_3, q_4, q_5 are generated by the five query templates C_q1, C_q2, C_q3, C_q4, C_q5 respectively. The database system is IBM DB2, version 9.5.
1. Obtain the response data of each load when it runs alone. The response data comprise the response time, CPU time, number of logical reads, and BAL value T_q.
Run the loads q_1, q_2, q_3, q_4, q_5 individually and obtain the response time, CPU time, number of logical reads, and standalone BAL value of each. The data are obtained with the DB2 snapshot monitor command "db2 get snapshot for dynamic sql on database", where "database" stands for the database alias.
2. Obtain the response data when loads run in pairs. Combine q_1, q_2, q_3, q_4, q_5 pairwise and, for every pairwise combination (10 pairs of loads), obtain the paired response time, paired CPU time, paired number of logical reads, and paired BAL value T_q/c. The data are again obtained with the DB2 snapshot monitor command.
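The count of 10 paired runs follows from choosing 2 loads out of 5. A short Python sketch, using the load names from the embodiment as plain strings (an illustrative encoding, not part of the patent), enumerates them:

```python
from itertools import combinations

# The five load types of the embodiment, represented by name only.
loads = ["q1", "q2", "q3", "q4", "q5"]

# Every unordered pair of distinct loads that has to be run together once.
pairs = list(combinations(loads, 2))

for first, second in pairs:
    print(first, second)
print(len(pairs))
```

Each pair yields two measurements of interest, the BAL value of the first load in the presence of the second and vice versa.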
3. Calculate the change in average page read time.
The range of ΔT_q1_s0 is calculated by the following formulas:
The current sample is s_0(q_1, q_2, q_3, q_4), with MPL = 4. The MPL value one level below the current MPL is 3; the samples at that level that contain load q_1 and can be generated from s_0 are s_1(q_1, q_2, q_3), s_2(q_1, q_2, q_4), and s_3(q_1, q_3, q_4).
Then:
and:
The value of ΔT_q1_s0 can thus be given by the following formula:
This yields the value of ΔT_q1_s0, which represents the change in average page read time of load q_1 in sample s_0.
The changes in average page read time of the other three loads can be obtained in the same way.
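The enumeration of samples used in this step can be sketched in Python. The helper name `samples_containing` is hypothetical, and the sketch covers only the listing of samples at a given MPL level, not the formula for ΔT_q1_s0.

```python
from itertools import combinations


def samples_containing(loads, size, q):
    """All combinations of `size` loads drawn from `loads` that include load `q`."""
    return [s for s in combinations(loads, size) if q in s]


s0 = ("q1", "q2", "q3", "q4")

# Samples one MPL level below s0 (MPL = 3) that still contain q1.
lower_level = samples_containing(s0, len(s0) - 1, "q1")
print(lower_level)

# All MPL = 4 samples containing q1 when five load types are available.
all_loads = ("q1", "q2", "q3", "q4", "q5")
same_level = samples_containing(all_loads, 4, "q1")
print(len(same_level))
```

The first call reproduces s_1, s_2 and s_3 of the embodiment; the second yields the four MPL = 4 samples that contain q_1.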
4. Cluster the full sample space according to the change in average page read time.
For MPL = 4, the samples containing q_1 are s_0(q_1, q_2, q_3, q_4), s_4(q_1, q_2, q_4, q_5), s_5(q_1, q_3, q_4, q_5), and s_6(q_1, q_2, q_3, q_5).
Calculate ΔT_q1_s0, ΔT_q1_s4, ΔT_q1_s5, and ΔT_q1_s6 for these samples, then apply K-means clustering to the four values.
In a real production environment there are more than 20 load types and the MPL level is between 30 and 200, so for each load type q at a given MPL level there are many samples containing q. K-means clustering is applied to the set of ΔT_q_s values, and the number of clusters is generally chosen as MPL*2.
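The clustering step can be sketched as a one-dimensional K-means (Lloyd's algorithm) in plain Python. The patent specifies K-means with Euclidean distance and MPL*2 clusters but not the initialization; the evenly spaced initial centroids, the cap of the cluster count at the number of distinct values, and all numeric ΔT values below are assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster scalar values with Lloyd's algorithm; returns (labels, centroids)."""
    if not values:
        raise ValueError("values must not be empty")
    distinct = sorted(set(values))
    k = max(1, min(k, len(distinct)))  # assumption: never more clusters than distinct values
    if k == 1:
        centroids = [distinct[0]]
    else:
        # Deterministic start: k evenly spaced distinct values.
        n = len(distinct)
        centroids = [distinct[(i * (n - 1)) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid (Euclidean distance in 1-D).
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: mean of each cluster; an empty cluster keeps its centroid.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# Hypothetical delta T values of one load type across five samples.
deltas = [0.10, 0.12, 0.50, 0.52, 0.90]
labels, centroids = kmeans_1d(deltas, 3)
print(labels)

# With MPL = 4 the requested cluster count is 8, but only four values exist.
small_labels, _ = kmeans_1d([0.4, 0.1, 0.3, 0.2], 4 * 2)
print(small_labels)
```

In the small embodiment the cap matters, because four values cannot form eight clusters; with the sample counts of a production environment the requested MPL*2 clusters would be used unchanged.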
5. Fill in the sample selection table.
For each sample s selected from a cluster, every load it contains has a number indicating its class.
For example, for s_0(q_1, q_2, q_3, q_4), one possible classification result is K_s0(3, 1, 7, 4), meaning that in the full sample space ΔT_q1_s0 belongs to class 3, ΔT_q2_s0 to class 1, ΔT_q3_s0 to class 7, and ΔT_q4_s0 to class 4.
Each sample s has a corresponding classification result K_s.
Clustering gives the following table:
Based on the above classification results, fill in the following sample selection table:
Here, because the example contains few load types, the sample table has some empty positions. In actual production, some positions conflict, so some positions cannot be filled. In that case the method falls back to random selection and randomly combines samples for the positions that have not been filled.
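The classification result K_s can be pictured as a tuple assembled from the per-load cluster labels. In the Python sketch below, the dictionary `cluster_of` and the function name are hypothetical; the class numbers are the ones given for s_0 in the text.

```python
def classification_vector(sample, cluster_of):
    """Build K_s: the class of delta T_q_s for every load q in sample s."""
    return tuple(cluster_of[(load, sample)] for load in sample)


s0 = ("q1", "q2", "q3", "q4")

# Class of each load's delta T in s0, as produced by the per-load clustering.
cluster_of = {
    ("q1", s0): 3,
    ("q2", s0): 1,
    ("q3", s0): 7,
    ("q4", s0): 4,
}

k_s0 = classification_vector(s0, cluster_of)
print(k_s0)
```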
6. Generate the training samples.
The samples in the sample selection table obtained in step 5 are the required model training samples.
The invention provides the following filling algorithm:
Input: load templates C; MPL = M
Output: selected sample set SampleSeled

/* Generate the full sample space SampleSpace */
SampleSpace = GenerateSampleSpace(M, C);

/* Calculate ΔT_q_s for each load type in each sample */
For S_j ∈ SampleSpace
    ComputeDIF_BAL(S_j);
End For

/* For each load type q_i, cluster all of its ΔT_qi_Sj values; the number of clusters is M*2 */
For i = 1 to C, S_j ∈ SampleSpace
    Kmeans(q_i, ΔT_qi_Sj, M*2);
End For

/* Calculate the insertion mutual-exclusion count Mu of each sample. The Mu value of a sample s is the number of other samples in SampleSpace that can no longer be inserted once s has been inserted first */
For S_j ∈ SampleSpace
    ComputeMutual(S_j);
End For

/* Sort the sample space by the Mu value of each sample, in ascending order */
Sort(Mu_j);

/* Initialize the maximum number of inserted samples */
MaxInsNum = 1;

/* K is the number of filling rounds */
For i = 1 to K
    /* Insert sample S_j first */
    InsertS(S_j);
    InsertNum = 1;
    /* Insert the other insertable samples in turn */
    For m = j+1 to SampleSpace
        /* Check whether S_m can be inserted */
        If (IsInsertS(S_m))
            InsertS(S_m);
            InsertNum++;
        End If
    End For
    /* If this round filled more samples than the best scheme so far, save the current filling */
    If (InsertNum > MaxInsNum)
        MaxInsNum = InsertNum;
        RecordInsertS();
    End If
End For

/* For the remaining positions that were not filled, combine samples randomly */
RandomInsertS();
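The greedy part of the filling algorithm can be sketched in Python under one possible reading of the sample selection table, which is an assumption and not confirmed by the text: each table position is a (load, class) cell, a sample occupies one cell per load it contains, and two samples exclude each other when they share a cell. The names `cells`, `conflicts`, and `select_samples`, the choice of starting round i with the i-th sample in Mu order, and the example classes are all hypothetical; the final random filling step is omitted.

```python
def cells(sample, classes):
    """Table cells (load, class) that a sample would occupy."""
    return {(load, classes[(load, sample)]) for load in sample}


def conflicts(a, b, classes):
    """Two samples exclude each other if they need the same cell."""
    return bool(cells(a, classes) & cells(b, classes))


def select_samples(samples, classes, rounds):
    # Mu of s: how many other samples can no longer be inserted after s.
    mu = {s: sum(conflicts(s, t, classes) for t in samples if t != s) for s in samples}
    ordered = sorted(samples, key=lambda s: mu[s])  # ascending Mu
    best = []
    for start in range(min(rounds, len(ordered))):
        chosen = [ordered[start]]
        occupied = set(cells(ordered[start], classes))
        # Try the later samples in turn and insert those that still fit.
        for candidate in ordered[start + 1:]:
            needed = cells(candidate, classes)
            if not (needed & occupied):
                chosen.append(candidate)
                occupied |= needed
        # Keep the round that managed to insert the most samples.
        if len(chosen) > len(best):
            best = chosen
    return best


# Hypothetical two-load samples and cluster classes.
a, b, c = ("q1", "q2"), ("q1", "q3"), ("q2", "q3")
classes = {
    ("q1", a): 1, ("q2", a): 1,
    ("q1", b): 1, ("q3", b): 1,
    ("q2", c): 2, ("q3", c): 2,
}
selected = select_samples([a, b, c], classes, rounds=3)
print(selected)
```

Here samples a and b compete for the cell (q1, 1), so at most one of them can be selected together with c.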

Claims (1)

1. A method for collecting training samples for a database load response time prediction model, comprising the following steps:
(1) obtaining the response data of each database load when it runs alone;
(2) obtaining the response data when database loads run in pairs;
(3) calculating the change in average page read time;
the change in average page read time is defined by ΔT_q_s = T_q_s - T_q, where T_q_s is the BAL value of load q in sample s and T_q is the BAL value of load q when it runs alone;
and the change in average page read time satisfies the following formula:
where ΔT_q/cij denotes the BAL value of load q when load q and another load c_ij run as a pair, c_ij being the load in sample s_j generated by query template C_ci; and ΔT_q/ci denotes the BAL value of load q when load q and another load c_i run as a pair, c_i being the load in sample s generated by query template C_ci;
the ΔT_q/c values obtained from the paired runs are used to calculate ΔT_q_s for a load q at higher MPL levels (MPL being the maximum degree of parallelism of the database system), and ΔT_q_s is then calculated by the following formula:
(4) clustering the full sample space according to the change in average page read time;
(5) filling in the sample selection table;
(6) generating the training samples.
CN201510171679.5A 2015-04-13 2015-04-13 Method for collecting training samples for a database load response time prediction model Expired - Fee Related CN104794186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510171679.5A CN104794186B (en) 2015-04-13 2015-04-13 Method for collecting training samples for a database load response time prediction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510171679.5A CN104794186B (en) 2015-04-13 2015-04-13 Method for collecting training samples for a database load response time prediction model

Publications (2)

Publication Number Publication Date
CN104794186A true CN104794186A (en) 2015-07-22
CN104794186B CN104794186B (en) 2017-10-27

Family

ID=53558978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510171679.5A Expired - Fee Related CN104794186B (en) 2015-04-13 2015-04-13 Method for collecting training samples for a database load response time prediction model

Country Status (1)

Country Link
CN (1) CN104794186B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070299965A1 (en) * 2006-06-22 2007-12-27 Jason Nieh Management of client perceived page view response time
CN104113590A (en) * 2014-06-30 2014-10-22 南京邮电大学 Copy selection method based on copy response time prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jennie Duggan et al.: "Performance Prediction for Concurrent Database Workloads", SIGMOD'2011 *
Zhao Jianguang et al.: "Adaptive management of transactional workloads in database ***", Computer Engineering and Applications *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512264A (en) * 2015-12-04 2016-04-20 贵州大学 Performance prediction method of concurrency working loads in distributed database
CN108052614A (en) * 2017-12-14 2018-05-18 太原理工大学 A kind of dispatching method of Database Systems load
CN113157814A (en) * 2021-01-29 2021-07-23 东北大学 Query-driven intelligent workload analysis method under relational database
CN113157814B (en) * 2021-01-29 2023-07-18 东北大学 Query-driven intelligent workload analysis method under relational database

Also Published As

Publication number Publication date
CN104794186B (en) 2017-10-27

Similar Documents

Publication Publication Date Title
Singh et al. Napel: Near-memory computing application performance prediction via ensemble learning
CN106708016B (en) fault monitoring method and device
Zhang et al. A weighted kernel possibilistic c‐means algorithm based on cloud computing for clustering big data
Luo et al. A parallel dbscan algorithm based on spark
CN107908536B (en) Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment
Bilal et al. Finding the right cloud configuration for analytics clusters
Guo et al. Machine learning predictions for underestimation of job runtime on HPC system
KR20130101548A (en) Improving reliability in distributed environments
Zhu et al. Monitoring big process data of industrial plants with multiple operating modes based on Hadoop
Greathouse et al. Machine learning for performance and power modeling of heterogeneous systems
Isakov et al. HPC I/O throughput bottleneck analysis with explainable local models
CN104794186B (en) Method for collecting training samples for a database load response time prediction model
Esteves et al. A new approach for accurate distributed cluster analysis for Big Data: competitive K-Means
US11288266B2 (en) Candidate projection enumeration based query response generation
Su et al. Towards optimal decomposition of Boolean networks
Bakhishoff et al. DTHMM ExaLB: discrete-time hidden Markov model for load balancing in distributed exascale computing environment
CN104573331B (en) A kind of k nearest neighbor data predication method based on MapReduce
US9235675B2 (en) Multidimensional monte-carlo simulation for yield prediction
Jiang et al. Hierarchical solving method for large scale TSP problems
Raman et al. BoDS: A benchmark on data sortedness
Tiwari et al. Identification of critical parameters for MapReduce energy efficiency using statistical Design of Experiments
Xing et al. HPC benchmark assessment with statistical analysis
WO2021254413A1 (en) Isolation distribution kernel construction method and apparatus, and anomaly data detection method and apparatus
US20230153491A1 (en) System for estimating feature value of material
EP4348436A1 (en) Point anomaly detection

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171027

Termination date: 20210413
