CN106126612A

CN106126612A - Data extraction method with dynamic time-slice division for big-data ETL processes

Info

Publication number: CN106126612A
Application number: CN201610456782.9A
Authority: CN
Inventors: 马昭德; 李建; 张韬; 何荣; 夏秋
Original assignee: Chongqing Miao Yin Science And Technology Ltd
Current assignee: Chongqing Miao Yin Science And Technology Ltd
Priority date: 2016-06-22
Filing date: 2016-06-22
Publication date: 2016-11-16

Abstract

The present invention provides a data extraction method that dynamically divides time slices in a big-data ETL process. The system first sets a baseline elapsed time and an initial slice time, then extracts data in a loop. After each extraction, if the difference between the actual elapsed time of that extraction and the baseline elapsed time exceeds a preset threshold, the system computes the next slice time from the current slice time, the baseline elapsed time, and the current actual elapsed time, and performs the next extraction with that slice time. If the difference is less than or equal to the preset threshold, the system keeps the current slice time and uses it for the next extraction. Because time slices are divided dynamically, the actual elapsed time of successive extractions stays close to the baseline elapsed time, which improves the efficiency of data extraction.

Description

A data extraction method with dynamic time-slice division for big-data ETL processes

Technical field

The present invention relates to the field of communication technology, and in particular to a data extraction method that dynamically divides time slices in a big-data ETL process.

Background art

ETL (Extract-Transform-Load), a data warehousing technique, describes the process of extracting data from a source, transforming it, and loading it into a destination. ETL is an important part of building a data warehouse: the user extracts the required data from the data source, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.

Existing ETL tools usually extract data from the data source either in full or incrementally. For large tables in a relational database (for example, when extracting massive historical logs), data is generally extracted by a fixed slice time, since historical log data is usually indexed by time. The slice time is the interval between two adjacent data extractions. In business logs, however, the volume of data falling within each slice time may not be uniform. If the time range is divided into equal slices, a slice that contains too much data causes spikes in system I/O and memory consumption, while a slice that contains too little data wastes computing resources, lengthens the collection period, and leaves the system idling.
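To make the problem concrete, the following sketch counts how many log records fall into each fixed-length slice when the timestamps are unevenly distributed. The helper name `rows_per_fixed_slice` and the sample timestamps are hypothetical and serve only as an illustration.

```python
from collections import Counter

def rows_per_fixed_slice(timestamps, slice_seconds):
    """Count records per fixed-length time slice, ordered by slice index."""
    counts = Counter(int(ts // slice_seconds) for ts in timestamps)
    last = max(counts) if counts else -1
    return [counts.get(i, 0) for i in range(last + 1)]

# Hypothetical log timestamps in seconds: a burst in the first 30 s, then almost idle.
timestamps = [0.5 * i for i in range(60)] + [45.0, 100.0]
per_slice = rows_per_fixed_slice(timestamps, slice_seconds=30)
print(per_slice)  # → [60, 1, 0, 1]
```

With equal slices, the first extraction handles 60 records while the following ones handle one or none, which is the imbalance the background section describes.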

Summary of the invention

In view of the above deficiencies in the prior art, the problem addressed by the present invention is how to provide an ETL-based data extraction method that divides time slices dynamically, improves the efficiency of data extraction, and resolves the prior-art problems of spikes in system I/O and memory consumption and of wasted computing resources.

To solve the above technical problem and achieve the object of the invention, the present invention adopts the following technical solution:

A data extraction method that dynamically divides time slices in a big-data ETL process: first, the baseline elapsed time of the system is set and the slice time of the system is initialized; the system then performs data extraction in a loop and records the actual elapsed time of each extraction. After completing an extraction, if the difference between the current actual elapsed time and the baseline elapsed time is greater than a preset threshold, the system adjusts the next slice time according to the current slice time, the baseline elapsed time, and the current actual elapsed time, and performs the next extraction according to the next slice time. If the difference between the current actual elapsed time and the baseline elapsed time is less than or equal to the preset threshold, the system keeps the current slice time and performs the next extraction according to the current slice time.
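The loop described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: `extract`, `adjust`, and `clock` are hypothetical callables, and the comparison uses the absolute difference between the actual and baseline elapsed times, which is an assumption about how "difference" is meant.

```python
import time

def run_extraction_loop(extract, adjust, a0, k, threshold, rounds,
                        clock=time.perf_counter):
    """Run `rounds` extractions, adapting the slice time after each one.

    extract(start, end) -- hypothetical callable extracting data in [start, end)
    adjust(a_n, e_n, k) -- hypothetical callable returning the next slice time
    a0 -- initial slice time (s); k -- baseline elapsed time (s)
    threshold -- preset tolerance on |e_n - k| (s), assumed to be absolute
    Returns a list of (slice_time, elapsed) pairs, one per extraction.
    """
    slice_time, start, history = a0, 0.0, []
    for _ in range(rounds):
        t0 = clock()
        extract(start, start + slice_time)
        elapsed = clock() - t0                 # actual elapsed time e_n
        history.append((slice_time, elapsed))
        start += slice_time                    # next window begins where this one ended
        if abs(elapsed - k) > threshold:       # deviation too large: adjust
            slice_time = adjust(slice_time, elapsed, k)
        # otherwise the current slice time is kept for the next extraction
    return history
```

Passing the clock as a parameter is a design choice for this sketch only: it lets the loop be exercised with a fake clock instead of real extraction work.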

As a further refinement of the above solution, the step in which the system "adjusts the next slice time according to the current slice time, the baseline elapsed time, and the current actual elapsed time" is specifically as follows. Let the current slice time be $a_n$, in seconds, where $n$ is the number of data extractions the system has performed; let the current actual elapsed time be $e_n$, in seconds, and the baseline elapsed time be $k$, in seconds. Let the coefficient $r = k / e_n$ and the function

$$f(r) = \begin{cases} r_{\min}, & r < r_{\min} \\ r, & r_{\min} \le r \le r_{\max} \\ r_{\max}, & r > r_{\max} \end{cases}$$

where $r_{\min}$ is the preset lower limit of the coefficient $r$ and $r_{\max}$ is the preset upper limit of the coefficient $r$. The next slice time is then

$$a_{n+1} = \begin{cases} g, & g \le m \\ m, & g > m \end{cases}$$

where $g = a_n \cdot f(r)$, in seconds, and $m$ is the preset slice threshold, in seconds, that is, the maximum slice time.
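The adjustment rule can be sketched in a few lines. The sketch assumes that $f(r)$ clamps $r$ to the interval $[r_{\min}, r_{\max}]$ and that the next slice time is capped at $m$; the function name `next_slice_time` and the guard against a non-positive elapsed time are illustrative additions, not part of the source.

```python
def next_slice_time(a_n, e_n, k, *, r_min, r_max, m):
    """Return the next slice time a_{n+1} in seconds.

    a_n -- current slice time (s); e_n -- current actual elapsed time (s)
    k -- baseline elapsed time (s); m -- slice threshold, i.e. maximum slice time (s)
    Assumption: f(r) clamps r = k / e_n to [r_min, r_max], and a_{n+1} = min(g, m).
    """
    if e_n <= 0:
        raise ValueError("elapsed time must be positive")
    r = k / e_n                        # ratio of baseline to actual elapsed time
    f_r = min(max(r, r_min), r_max)    # bounded adjustment coefficient f(r)
    g = a_n * f_r                      # candidate next slice time
    return min(g, m)                   # never exceed the slice threshold m
```

Clamping the coefficient bounds how much two adjacent slices can differ, so a single unusually slow or fast extraction cannot shrink or grow the slice time by more than the configured factor.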

As a further refinement of the above solution, $r_{\min}$ ranges from 0.1 to 1, and $r_{\max}$ ranges from 2 to 5.

As a further refinement of the above solution, the baseline elapsed time $k$ ranges from 1 to 10 seconds.

As a further refinement of the above solution, the slice threshold $m$ is 5 to 10 times the initial slice time of the system.

As a further refinement of the above solution, the initial slice time of the system is $a_0 = \frac{M}{N} \times 60$, in seconds, where $M$ is the throughput of the system per minute and $N$ is the volume of data produced by the system's data source per minute.

Compared with the prior art, the present invention has the following advantages:

In the data extraction method with dynamic time-slice division provided by the present invention, the system adjusts the slice time of the next extraction according to the slice time of the previous extraction, the baseline elapsed time, and the actual elapsed time. Time slices are therefore divided dynamically, the actual elapsed time of the next extraction stays close to the baseline elapsed time, and the efficiency of data extraction improves.

Detailed description of the invention

The present invention is described in further detail below with reference to an embodiment, but embodiments of the present invention are not limited thereto.

Embodiment:

Assume that the volume of data produced by the system's data source per minute is constant. Then the longer the slice time, the longer the interval between two extractions, the larger the volume of data each extraction retrieves, and the longer the actual elapsed time the system needs to process it. Conversely, the shorter the slice time, the shorter the actual elapsed time. The baseline elapsed time and the initial slice time preset for the system can be determined from the performance of the whole system and the data volume of the data source. Before the system goes into production, it can be tuned by setting different slice times and measuring the actual elapsed time under each of them: if the actual elapsed time is too large, the slice time is reduced; if it is too small, the slice time is increased. In this way the slice time corresponding to the best actual elapsed time is selected. Initializing the production system with this slice time solves the cold-start problem on the one hand and, on the other, keeps the actual elapsed time reasonable once the system is running.

In practice, the volume of data produced by the data source per minute is random, so even for the same slice time, the more data the source produces per minute, the longer the actual elapsed time. During data extraction, even if the actual elapsed time under the previous slice time was close to the baseline elapsed time, the data volume at the next extraction may be larger or smaller, making the actual elapsed time larger or smaller, and the slice time then needs to be adjusted. The method continuously adjusts the slice time of the next extraction according to the outcome of the previous one, always with the aim of bringing the actual elapsed time of extraction toward the baseline elapsed time. Time slices are thus divided dynamically throughout the extraction process, so that the actual elapsed time of the next extraction stays close to the baseline elapsed time, which improves the efficiency of data extraction.
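The behaviour described above can be illustrated with a small simulation. It relies on a deliberately simple cost model that is not taken from the source: extracting a slice of length `a` seconds at an arrival rate of `rate` items per second is assumed to take `a * rate / throughput` seconds. The adjustment step assumes a coefficient clamped to $[r_{\min}, r_{\max}]$ and a slice time capped at $m$. All names and numbers are hypothetical.

```python
def simulate(rates, throughput, a0, k, threshold, r_min, r_max, m):
    """Simulate successive extractions under a simple linear cost model.

    Assumption (not from the source): elapsed = slice_time * rate / throughput.
    Returns a list of (slice_time, elapsed) pairs, one per extraction.
    """
    a, out = a0, []
    for rate in rates:
        e = a * rate / throughput              # modelled actual elapsed time e_n
        out.append((a, e))
        if abs(e - k) > threshold:             # adjust only on a large deviation
            a = min(a * min(max(k / e, r_min), r_max), m)
    return out

# Hypothetical arrival rates (items/s): steady, then a surge, then a lull.
rates = [100, 100, 400, 400, 400, 50, 50, 50]
for a, e in simulate(rates, throughput=1000, a0=30, k=2, threshold=0.2,
                     r_min=0.5, r_max=2, m=300):
    print(f"slice={a:7.2f}s  elapsed={e:5.2f}s")
```

Under this model the slice time shrinks during the surge and grows again during the lull, each time moving the elapsed time back toward the 2-second baseline over a few extractions.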

The step in which the system "adjusts the next slice time according to the current slice time, the baseline elapsed time, and the current actual elapsed time" is specifically as follows. Let the current slice time be $a_n$, in seconds, where $n$ is the number of data extractions the system has performed; let the current actual elapsed time be $e_n$, in seconds, and the baseline elapsed time be $k$, in seconds. Let the coefficient $r = k / e_n$ and the function

$$f(r) = \begin{cases} r_{\min}, & r < r_{\min} \\ r, & r_{\min} \le r \le r_{\max} \\ r_{\max}, & r > r_{\max} \end{cases}$$

where $r_{\min}$ is the preset lower limit of the coefficient $r$ and $r_{\max}$ is the preset upper limit of the coefficient $r$. The next slice time is then

$$a_{n+1} = \begin{cases} g, & g \le m \\ m, & g > m \end{cases}$$

where $g = a_n \cdot f(r)$, in seconds, and $m$ is the preset slice threshold, in seconds, that is, the maximum slice time.

To prevent the actual elapsed time from fluctuating excessively, the ratio between adjacent time slices can be bounded: $r_{\min}$ ranges from 0.1 to 1 and is preferably 0.5, and $r_{\max}$ ranges from 2 to 5 and is preferably 2. The baseline elapsed time $k$ ranges from 1 to 10 seconds and is preferably 2 seconds. The slice threshold $m$ is the maximum slice time; it is generally 5 to 10 times the initial slice time of the system, preferably 10 times.

The initial slice time of the system is $a_0 = \frac{M}{N} \times 60$, in seconds, where $M$ is the throughput of the system per minute and $N$ is the volume of data produced by the system's data source per minute. For example, if the throughput permitted by the host environment is only 50W/min (500,000 per minute) and the data source produces 100W (1,000,000) per minute on average, the initial slice time is 30 seconds. Because multiple tasks running at the same time compete for resources while the system is extracting data, it is also required that the volume of data extracted when a slice time elapses generally not exceed 10% of the network card's traffic capacity.
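The initialization values can be computed as in the sketch below. It assumes $a_0 = 60 \cdot M / N$, which reproduces the 30-second figure in the example above; the function names and the range check on the multiplier are illustrative and not part of the source.

```python
def initial_slice_time(throughput_per_min, produced_per_min):
    """Initial slice time a_0 in seconds, assuming a_0 = 60 * M / N."""
    if throughput_per_min <= 0 or produced_per_min <= 0:
        raise ValueError("M and N must be positive")
    return 60.0 * throughput_per_min / produced_per_min

def slice_threshold(a0, factor=10):
    """Maximum slice time m as a multiple (5 to 10) of the initial slice time."""
    if not 5 <= factor <= 10:
        raise ValueError("factor is expected to lie in [5, 10]")
    return factor * a0

a0 = initial_slice_time(500_000, 1_000_000)   # 50W/min throughput, 100W/min produced
m = slice_threshold(a0)                       # preferred multiplier of 10
print(a0, m)  # → 30.0 300.0
```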

Assume $r_{\min} = 0.5$, $r_{\max} = 2$, $k = 2$ s, $a_0 = 30$ s, $m = 10 a_0 = 300$ s, and $e_0 = 4$ s. Then $r = 0.5$, $f(r) = 0.5$, and $g = 15$ s, so the next slice time is $a_1 = 15$ s. In other words, if the first actual elapsed time is 4 s, which is greater than the baseline elapsed time of 2 s, the next slice time is reduced to 15 seconds. Now assume $e_0 = 1$ s. Then $r = 2$, $f(r) = 2$, and $g = 60$ s, so the next slice time is $a_1 = 60$ s. In other words, if the first actual elapsed time is 1 s, which is less than the baseline elapsed time of 2 s, the next slice time is increased to 60 seconds. This shows that the method adjusts the slice time gradually so that the actual elapsed time approaches the baseline elapsed time.

Finally, it should be noted that the above embodiment is intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art will understand that the technical solution of the present invention may be modified or replaced with equivalents without departing from its spirit and scope, and all such modifications and replacements shall fall within the scope of the claims of the present invention.

Claims

1. A data extraction method that dynamically divides time slices in a big-data ETL process, characterized in that the baseline elapsed time of the system is first set and the slice time of the system is initialized; the system then performs data extraction in a loop and records the actual elapsed time of each extraction; after completing an extraction, if the difference between the current actual elapsed time and the baseline elapsed time is greater than a preset threshold, the system adjusts the next slice time according to the current slice time, the baseline elapsed time, and the current actual elapsed time, and performs the next extraction according to the next slice time; if the difference between the current actual elapsed time and the baseline elapsed time is less than or equal to the preset threshold, the system keeps the current slice time and performs the next extraction according to the current slice time.

2. The data extraction method that dynamically divides time slices in a big-data ETL process as claimed in claim 1, characterized in that the step in which the system "adjusts the next slice time according to the current slice time, the baseline elapsed time, and the current actual elapsed time" is specifically: let the current slice time be $a_n$, in seconds, where $n$ is the number of data extractions the system has performed; let the current actual elapsed time be $e_n$, in seconds, and the baseline elapsed time be $k$, in seconds; let the coefficient $r = k / e_n$ and the function

$$f(r) = \begin{cases} r_{\min}, & r < r_{\min} \\ r, & r_{\min} \le r \le r_{\max} \\ r_{\max}, & r > r_{\max} \end{cases}$$

where $r_{\min}$ is the preset lower limit of the coefficient $r$ and $r_{\max}$ is the preset upper limit of the coefficient $r$; the next slice time is then

$$a_{n+1} = \begin{cases} g, & g \le m \\ m, & g > m \end{cases}$$

where $g = a_n \cdot f(r)$, in seconds, and $m$ is the preset slice threshold, in seconds, that is, the maximum slice time.

3. The data extraction method that dynamically divides time slices in a big-data ETL process as claimed in claim 2, characterized in that $r_{\min}$ ranges from 0.1 to 1 and $r_{\max}$ ranges from 2 to 5.

4. The data extraction method that dynamically divides time slices in a big-data ETL process as claimed in claim 2, characterized in that the baseline elapsed time $k$ ranges from 1 to 10 seconds.

5. The data extraction method that dynamically divides time slices in a big-data ETL process as claimed in claim 2, characterized in that the slice threshold $m$ is 5 to 10 times the initial slice time of the system.

6. The data extraction method that dynamically divides time slices in a big-data ETL process as claimed in claim 1, characterized in that the initial slice time of the system is $a_0 = \frac{M}{N} \times 60$, in seconds, where $M$ is the throughput of the system per minute and $N$ is the volume of data produced by the system's data source per minute.