CN104794186A - Collecting method for training samples of database load response time predicting model - Google Patents
Abstract
The invention relates to a method for collecting training samples for a database load response time prediction model, and belongs to clustering-based sample collection methods. The method comprises the following steps: 1, response data of the database are obtained while each load runs alone; 2, response data of the database are obtained while loads run in pairs; 3, the change in average page read time is calculated; 4, the full sample space is clustered according to the change in average page read time; 5, a sample selection table is filled; 6, training samples are generated. The method reduces the number of samples the statistical model needs, lowering modeling cost while preserving model accuracy.
Description
Technical field
The invention belongs to clustering-based sample collection methods and is a training sample collection method applied to database load response time prediction models.
Background art
In current parallel database systems, predicting load response time is extremely important: it helps database administrators tune database parameters and schedule concurrent loads sensibly.
However, because the interaction mechanism between concurrent database loads is very complex, building a traditional analytical model is complicated and its predictions are poor. The existing literature therefore mainly builds statistical models to predict load response time. A complete statistical model is built in three steps: sample collection, model training (regression), and model evaluation. The main references in this area are:
[1] Duggan J, Cetintemel U, Papaemmanouil O, et al. Performance Prediction for Concurrent Database Workloads [C] // Proc. of 2011 ACM SIGMOD Conference (SIGMOD'2011). Athens, Greece, 2011: 337-348
[2] Ahmad M, Aboulnaga A, Babu S, et al. Modeling and Exploiting Query Interactions in Database Systems [C] // Proc. of the 17th Conference on Information and Knowledge Management (CIKM'2008). Napa Valley, US, 2008: 183-192
[3] Ahmad M, Aboulnaga A, Babu S, et al. QShuffler: Getting the Query Mix Right [C] // Proc. of the 24th International Conference on Data Engineering (ICDE'2008). Cancun, Mexico, 2008: 1415-1417
[4] Ahmad M, Duan S, Aboulnaga A, et al. Predicting Completion Times of Batch Query Workloads Using Interaction-aware Models and Simulation [C] // Proc. of the 14th International Conference on Extending Database Technology (EDBT'2011). Uppsala, Sweden, 2011: 449-460
[5] Ahmad M, Duan S, Aboulnaga A, et al. Interaction-aware Scheduling of Report Generation Workloads [J]. The VLDB Journal, 2011, 20(4): 589-615
[6] Sheikh M B, Minhas U F, Khan O Z, et al. A Bayesian Approach to Online Performance Modeling for Database Appliances Using Gaussian Models [C] // Proc. of the 8th International Conference on Autonomic Computing (ICAC'2011). Karlsruhe, Germany, 2011: 121-130
However, the sampling methods used with the above statistical models do not take the interaction between loads into account; they obtain samples only by targeted or random sampling of the full sample space. As the volume of data in the database grows, load running times grow too. If training samples are not selected carefully, model training takes longer and the cost of building the model becomes very large.
Summary of the invention
To reduce the cost and time of building the model, the invention provides a method for collecting training samples that lowers the cost of building the model without noticeably reducing prediction accuracy.
Technical scheme of the invention: the method for collecting training samples for a database load response time prediction model comprises the following:
1, Obtain the response data of each database load when it runs alone.
That is, while each load q runs alone, obtain its response time, CPU time, number of logical reads, and BAL value. BAL is the Buffer Access Latency defined in [1]: the average time the database system takes to complete one physical read. The present invention calls it the average page read time.
A load q denotes an executable database load generated from load template $C_q$.
A load template is generated from a parameterized database query or update statement; different query or update statements are treated as different load templates. Loads generated from the same load template with different parameters are treated as the same load.
2, Obtain the response data when database loads run in pairs. That is, while a first load $q_i$ and a second load $q_j$ run as a pair, obtain each load's response time, CPU time, number of logical reads, and BAL value, where $q_i$ and $q_j$ are generated from two different load templates (the first load template $C_{q_i}$ and the second load template $C_{q_j}$).
3, Calculate the change in average page read time.
The change in average page read time is defined as $\Delta T_{q\_s} = T_{q\_s} - T_q$, where $T_{q\_s}$ is the BAL value of a load q (generated from load template $C_q$) in sample s, and $T_q$ is the BAL value of load q when it runs alone.
The change in average page read time also satisfies the following formula:
[formula image not available]
where $\Delta T_{q/c_{ij}}$ is the change in BAL value of load q when it runs as a pair with another load $c_{ij}$, and $c_{ij}$ is the load in sample $s_j$ generated from query template $C_{c_i}$; $\Delta T_{q/c_i}$ is the change in BAL value of load q when it runs as a pair with another load $c_i$, and $c_i$ is the load in sample s generated from query template $C_{c_i}$.
The $\Delta T_{q/c}$ values obtained from paired runs are used to calculate $\Delta T_{q\_s}$ for load q at a higher MPL (Multi Programming Level: the maximum concurrency of the database system, i.e., the number of loads that can run simultaneously). $\Delta T_{q\_s}$ is then given by the following formula:
[formula image not available]
4, Cluster the full sample space according to the change in average page read time.
For each load q, at a given MPL (Multi Programming Level), cluster all of its $\Delta T_{q\_s}$ values. The clustering method is the K-means algorithm, with Euclidean distance as the metric. The number of clusters is MPL*2.
5, Fill the sample selection table.
6, Generate the training samples.
The invention reduces the number of samples the statistical model needs, lowering the cost of building the model while preserving model accuracy.
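Steps 3 and 4 can be illustrated with a minimal Python sketch. It computes the change in average page read time from the definition (BAL value in a sample minus standalone BAL value) and clusters the results with a plain one-dimensional K-means, where Euclidean distance reduces to the absolute difference. The function names, the BAL values, and the small MPL used here are illustrative assumptions, not data or code from the patent.

```python
import random


def bal_change(t_in_sample, t_alone):
    """Change in average page read time: BAL in the sample minus standalone BAL."""
    return t_in_sample - t_alone


def kmeans_1d(values, k, iterations=100, seed=0):
    """Plain Lloyd's K-means on scalars; in 1-D, Euclidean distance is |a - b|.

    k must not exceed the number of distinct values.
    """
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(values)), k)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical BAL values of one load q: standalone, and in eight samples.
t_alone = 2.0
t_in_samples = [2.1, 2.2, 3.0, 3.1, 5.0, 5.2, 9.0, 9.3]
deltas = [bal_change(t, t_alone) for t in t_in_samples]

mpl = 2  # kept small so that MPL*2 clusters fit the toy data
labels, centers = kmeans_1d(deltas, k=mpl * 2)
print(labels, centers)
```

With random initialization K-means may settle in a local optimum, so the exact grouping depends on the seed; a library implementation with several restarts would be a reasonable substitute in practice.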
Embodiment
Embodiment: suppose five given load types $q_1$, $q_2$, $q_3$, $q_4$, $q_5$. The MPL is 4, meaning that four loads can run simultaneously in the database, and the current sample is $s_0(q_1, q_2, q_3, q_4)$. The loads $q_1$, $q_2$, $q_3$, $q_4$, $q_5$ are generated from five query templates $C_{q_1}$, $C_{q_2}$, $C_{q_3}$, $C_{q_4}$, $C_{q_5}$, respectively. The database system is IBM DB2, version 9.5.
1, Obtain the response data of each load when it runs alone. The response data comprise the response time, CPU time, number of logical reads, and BAL value $T_q$.
Run loads $q_1$, $q_2$, $q_3$, $q_4$, $q_5$ alone and obtain each one's response time, CPU time, number of logical reads, and standalone BAL value. The data are obtained with the DB2 snapshot monitor command: "db2 get snapshot for dynamic sql on database".
2, Obtain the response data when loads run in pairs. Combine $q_1$, $q_2$, $q_3$, $q_4$, $q_5$ into pairs and, for every pairwise combination (10 pairs of loads run together), obtain the paired response time, paired CPU time, paired number of logical reads, and paired BAL value $T_{q/c}$. These data are likewise obtained with the DB2 snapshot monitor command.
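As a quick check of the count in step 2, the pairwise combinations of the five load types can be enumerated with the Python standard library. This is only an illustrative sketch; the string names for the loads are placeholders.

```python
from itertools import combinations

loads = ["q1", "q2", "q3", "q4", "q5"]

# Unordered pairs of distinct load types: C(5, 2) = 10 paired runs.
pairs = list(combinations(loads, 2))
print(len(pairs))  # 10
```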
3, Calculate the change in average page read time.
The range of $\Delta T_{q_1\_s_0}$ is calculated by the following formula:
[formula image not available]
The current sample is $s_0(q_1, q_2, q_3, q_4)$, with MPL = 4. The MPL one level below the current one is 3; the samples it can generate that contain load $q_1$ are $s_1(q_1, q_2, q_3)$, $s_2(q_1, q_2, q_4)$, and $s_3(q_1, q_3, q_4)$.
Then:
[formula images not available]
The calculated value of $\Delta T_{q_1\_s_0}$ is therefore given by the following formula:
[formula image not available]
This yields the calculated value of $\Delta T_{q_1\_s_0}$, which represents the change in average page read time of load $q_1$ in sample $s_0$.
The changes in average page read time of the other three loads can be obtained in the same way.
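The three lower-level samples named in step 3 are exactly the three-load subsets of $s_0$ that contain $q_1$. A small sketch, using a hypothetical helper name, enumerates them:

```python
from itertools import combinations


def subsamples_containing(sample, load, size):
    """All subsets of `sample` with `size` loads that include `load`."""
    return [c for c in combinations(sample, size) if load in c]


s0 = ("q1", "q2", "q3", "q4")
lower = subsamples_containing(s0, "q1", 3)
print(lower)
# [('q1', 'q2', 'q3'), ('q1', 'q2', 'q4'), ('q1', 'q3', 'q4')]
```

These correspond to $s_1$, $s_2$, and $s_3$ in the text.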
4, Cluster the full sample space according to the change in average page read time.
For MPL = 4, all samples containing $q_1$ are $s_0(q_1, q_2, q_3, q_4)$, $s_4(q_1, q_2, q_4, q_5)$, $s_5(q_1, q_3, q_4, q_5)$, and $s_6(q_1, q_2, q_3, q_5)$.
Calculate $\Delta T_{q_1\_s_0}$, $\Delta T_{q_1\_s_4}$, $\Delta T_{q_1\_s_5}$, and $\Delta T_{q_1\_s_6}$ for these samples respectively, then apply K-means clustering to these four values.
In an actual production environment there are more than 20 load types and the MPL is between 30 and 200, so for each load type q at a given MPL there are many samples containing q. K-means clustering is applied to the set of $\Delta T_{q\_s}$ values, and the number of clusters is generally chosen as MPL*2.
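The sample space for this embodiment can be enumerated in a few lines. The sketch below assumes, as the listed samples suggest, that a sample is a set of distinct load types; whether the patent also allows a load type to appear more than once in a sample is not stated.

```python
from itertools import combinations

loads = ["q1", "q2", "q3", "q4", "q5"]
mpl = 4

# Full sample space under the distinct-loads assumption: C(5, 4) = 5 samples.
sample_space = list(combinations(loads, mpl))

# Samples whose ΔT values for q1 are clustered together.
with_q1 = [s for s in sample_space if "q1" in s]
print(len(sample_space), len(with_q1))  # 5 4
```

The four samples in `with_q1` are $s_0$, $s_6$, $s_4$, and $s_5$ from the text, in the order produced by `combinations`.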
5, Fill the sample selection table.
For each sample s selected by clustering, every load it contains has a number indicating its class.
For example, for $s_0(q_1, q_2, q_3, q_4)$, one possible classification result is $K_{s_0}(3, 1, 7, 4)$, meaning that $\Delta T_{q_1\_s_0}$ belongs to class 3 in the full sample space, $\Delta T_{q_2\_s_0}$ to class 1, $\Delta T_{q_3\_s_0}$ to class 7, and $\Delta T_{q_4\_s_0}$ to class 4.
Every sample s has a corresponding classification result $K_s$.
Clustering yields the following table:
[table image not available]
According to the classification results above, the following sample selection table is filled:
[table image not available]
Because the example contains few load types, the sample table has some empty positions. In actual production, some positions may conflict, so that some positions cannot be filled. In that case the method falls back to random selection and combines samples at random for the unfilled positions.
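The classification result $K_{s_0}(3, 1, 7, 4)$ from the example can be read as a per-load mapping. The sketch below only restates that example in Python; the variable names are illustrative.

```python
s0 = ("q1", "q2", "q3", "q4")
k_s0 = (3, 1, 7, 4)

# Pair each load in the sample with the class of its ΔT value.
cluster_of = dict(zip(s0, k_s0))
print(cluster_of)  # {'q1': 3, 'q2': 1, 'q3': 7, 'q4': 4}
```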
6, Generate the training samples.
The sample selection table obtained in step 5 constitutes the required model training samples.
The present invention provides the following filling algorithm:
input: load templates C, MPL = M;
output: selected sample set SampleSeled;
/* generate the full sample space SampleSpace */
SampleSpace = GenerateSampleSpace(M, C);
/* calculate ΔT_q_s of each load type in each sample */
For S_j ∈ SampleSpace
    ComputeDIF_BAL(S_j);
End For
/* cluster all ΔT_qi_Sj of each load type q_i; the number of clusters is M*2 */
For i = 1 to C, S_j ∈ SampleSpace
    Kmeans(q_i, ΔT_qi_Sj, M*2);
End For
/* calculate the insertion mutual-exclusion number Mu of each sample; the Mu value of sample s is the total number of other samples in SampleSpace that can no longer be inserted once s has been inserted first */
For S_j ∈ SampleSpace
    ComputeMutual(S_j);
End For
/* sort the sample space in ascending order of each sample's Mu value */
Sort(Mu_j);
/* initialize the maximum number of filled samples */
MaxInsNum = 1;
/* K is the number of fill iterations */
For i = 1 to K
    /* insert sample S_i first */
    InsertS(S_i);
    InsertNum = 1;
    /* insert, in order, every other sample that can be inserted */
    For m = i+1 to |SampleSpace|
        /* check whether S_m can be inserted */
        If (IsInsertS(S_m))
            InsertS(S_m);
            InsertNum++;
        End If
    End For
    /* if this round fills more samples than the best so far, save the current fill */
    If (InsertNum > MaxInsNum)
        MaxInsNum = InsertNum;
        RecordInsertS();
    End If
End For
/* for the positions still unfilled, combine samples at random */
RandomInsertS();
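The greedy part of the filling algorithm can be sketched in Python under one explicit assumption: the source does not define when a sample can be inserted, so the sketch treats each (load type, class) pair as one position of the selection table and lets a sample be inserted only if none of its positions is already taken. The function names and the class labels below are hypothetical, and the final random fill is left out.

```python
def cells(sample, labels):
    """Positions a sample occupies: one (load, class) pair per load (assumed layout)."""
    return frozenset(zip(sample, labels))


def mutual_exclusion(name, table):
    """Mu: how many other samples can no longer be inserted once `name` is."""
    return sum(1 for other, c in table.items() if other != name and table[name] & c)


def greedy_fill(table, rounds):
    """Try each of the first `rounds` samples (ascending Mu) as the first insert."""
    order = sorted(table, key=lambda n: (mutual_exclusion(n, table), n))
    best = []
    for i in range(min(rounds, len(order))):
        chosen, used = [order[i]], set(table[order[i]])
        for name in order[i + 1:]:
            if not (table[name] & used):  # no position is taken yet
                chosen.append(name)
                used |= table[name]
        if len(chosen) > len(best):  # keep the fill with the most samples
            best = chosen
    return best


# Hypothetical classification results for the four MPL=4 samples containing q1.
table = {
    "s0": cells(("q1", "q2", "q3", "q4"), (3, 1, 7, 4)),
    "s4": cells(("q1", "q2", "q4", "q5"), (3, 2, 5, 1)),  # clashes with s0 on (q1, 3)
    "s5": cells(("q1", "q3", "q4", "q5"), (1, 2, 6, 2)),
    "s6": cells(("q1", "q2", "q3", "q5"), (2, 4, 1, 3)),
}
best = greedy_fill(table, rounds=4)
print(best)  # ['s5', 's6', 's0']
```

Sorting by Mu first means that samples which block few others are tried early, which tends to leave room for more inserts in each round.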
Claims (1)
1. A method for collecting training samples for a database load response time prediction model, comprising the following steps:
(1) obtaining the response data of each database load when it runs alone;
(2) obtaining the response data when database loads run in pairs;
(3) calculating the change in average page read time;
the change in average page read time is defined as $\Delta T_{q\_s} = T_{q\_s} - T_q$, where $T_{q\_s}$ is the BAL value of load q in sample s and $T_q$ is the BAL value of load q when it runs alone;
and the change in average page read time satisfies the following formula:
[formula image not available]
where $\Delta T_{q/c_{ij}}$ is the change in BAL value of load q when it runs as a pair with another load $c_{ij}$, and $c_{ij}$ is the load in sample $s_j$ generated from query template $C_{c_i}$; $\Delta T_{q/c_i}$ is the change in BAL value of load q when it runs as a pair with another load $c_i$, and $c_i$ is the load in sample s generated from query template $C_{c_i}$;
the $\Delta T_{q/c}$ values obtained from paired runs are used to calculate $\Delta T_{q\_s}$ of load q at a higher MPL, the maximum concurrency of the database system, and $\Delta T_{q\_s}$ is then given by the following formula:
[formula image not available];
(4) clustering the full sample space according to the change in average page read time;
(5) filling the sample selection table;
(6) generating the training samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510171679.5A CN104794186B (en) | 2015-04-13 | 2015-04-13 | The acquisition method of database loads response time forecast model training sample |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510171679.5A CN104794186B (en) | 2015-04-13 | 2015-04-13 | The acquisition method of database loads response time forecast model training sample |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104794186A | 2015-07-22 |
CN104794186B | 2017-10-27 |
Family
ID=53558978
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510171679.5A Expired - Fee Related CN104794186B (en) | 2015-04-13 | 2015-04-13 | The acquisition method of database loads response time forecast model training sample |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104794186B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105512264A (en) * | 2015-12-04 | 2016-04-20 | 贵州大学 | Performance prediction method of concurrency working loads in distributed database |
CN108052614A (en) * | 2017-12-14 | 2018-05-18 | 太原理工大学 | A kind of dispatching method of Database Systems load |
CN113157814A (en) * | 2021-01-29 | 2021-07-23 | 东北大学 | Query-driven intelligent workload analysis method under relational database |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070299965A1 (en) * | 2006-06-22 | 2007-12-27 | Jason Nieh | Management of client perceived page view response time |
CN104113590A (en) * | 2014-06-30 | 2014-10-22 | 南京邮电大学 | Copy selection method based on copy response time prediction |
Non-Patent Citations (2)
Title |
---|
JENNIE DUGGAN 等: "Performance Prediction for Concurrent Database Workloads", 《SIGMOD"2011》 * |
赵建光 等: "数据库***交易型负载自适应管理", 《计算机工程与应用》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105512264A (en) * | 2015-12-04 | 2016-04-20 | 贵州大学 | Performance prediction method of concurrency working loads in distributed database |
CN108052614A (en) * | 2017-12-14 | 2018-05-18 | 太原理工大学 | A kind of dispatching method of Database Systems load |
CN113157814A (en) * | 2021-01-29 | 2021-07-23 | 东北大学 | Query-driven intelligent workload analysis method under relational database |
CN113157814B (en) * | 2021-01-29 | 2023-07-18 | 东北大学 | Query-driven intelligent workload analysis method under relational database |
Also Published As
Publication number | Publication date |
---|---|
CN104794186B (en) | 2017-10-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20171027; Termination date: 20210413 |