CN106126612A - Data extraction method that dynamically divides time slices for a large-scale ETL process - Google Patents

Info

Publication number
CN106126612A
Authority
CN
China
Prior art keywords
time
consuming
data pick
burst
benchmark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610456782.9A
Other languages
Chinese (zh)
Inventor
马昭德
李建
张韬
何荣
夏秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Miao Yin Science And Technology Ltd
Original Assignee
Chongqing Miao Yin Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Miao Yin Science And Technology Ltd
Priority to CN201610456782.9A
Publication of CN106126612A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention provides a data extraction method that dynamically divides time slices for a large-scale ETL process. The baseline elapsed time and the initial slice time of the system are set first, and the system then starts a loop that performs data extraction. After each extraction, if the difference between the system's current real-time elapsed time and the baseline elapsed time is greater than a preset threshold, the system adjusts the next slice time according to the current slice time, the baseline elapsed time and the current real-time elapsed time, and performs the next extraction according to that next slice time. If the difference is less than or equal to the preset threshold, the system keeps the current slice time and performs the next extraction according to it. The method divides time slices dynamically, so that the real-time elapsed time of the next extraction stabilizes around the baseline elapsed time, which improves the efficiency of data extraction.

Description

A data extraction method that dynamically divides time slices for a large-scale ETL process
Technical field
The present invention relates to the field of communications technology, and in particular to a data extraction method that dynamically divides time slices for a large-scale ETL process.
Background
ETL (Extract-Transform-Load, a data warehouse technology) describes the process of extracting data from a source, transforming it, and loading it into a destination. ETL is an important part of building a data warehouse: the user extracts the required data from the data source, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
Existing ETL tools usually extract data from a data source by full or incremental extraction. For large tables in a relational database (for example, when extracting massive historical logs), data is generally extracted using a fixed slice time, since historical log data is generally indexed by time. The slice time is the interval between two adjacent data extractions. However, the volume of business log data that falls into each slice time is not necessarily uniform. If the data is divided into time slices of equal length, a slice that contains too much data causes spikes in system I/O and memory consumption, while a slice that contains too little data wastes computing resources, lengthens the collection period and leaves the system idling.
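The unevenness described above is easy to see by counting records per fixed-length slice. The following is only an illustrative sketch: the timestamps are made-up values, and `count_per_slice` is a hypothetical helper, not part of the patented method.

```python
from collections import Counter


def count_per_slice(timestamps, start, slice_seconds):
    """Count how many records fall into each fixed-length time slice."""
    counts = Counter()
    for ts in timestamps:
        index = int((ts - start) // slice_seconds)
        counts[index] += 1
    return counts


# Hypothetical log timestamps in seconds: a burst at the start, then a quiet tail.
timestamps = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 35, 70, 110]
counts = count_per_slice(timestamps, start=0, slice_seconds=30)
for index in range(4):
    print(f"slice {index}: {counts[index]} records")
```

With equal 30-second slices, the first slice carries most of the load while the others are nearly empty, which is the imbalance a dynamically adjusted slice time is meant to smooth out.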
Summary of the invention
In view of the above deficiencies of the prior art, the problem addressed by the present invention is how to provide an ETL-based data extraction method that divides time slices dynamically, improves the efficiency of data extraction, and solves problems of the prior art such as spikes in system I/O and memory consumption or wasted computing resources.
To solve the above technical problem and achieve the object of the invention, the present invention adopts the following technical solution:
A data extraction method that dynamically divides time slices for a large-scale ETL process: first, the baseline elapsed time of the system is set and the slice time of the system is initialized; the system then starts a loop that performs data extraction, and records the real-time elapsed time obtained after each extraction completes. After an extraction completes, if the difference between the system's current real-time elapsed time and the baseline elapsed time is greater than a preset threshold, the system adjusts the next slice time according to the current slice time, the baseline elapsed time and the current real-time elapsed time, and performs the next extraction according to that next slice time. If the difference is less than or equal to the preset threshold, the system keeps the current slice time and performs the next extraction according to it.
As a further optimization of the above solution, the step in which the system "adjusts the next slice time according to the current slice time, the baseline elapsed time and the current real-time elapsed time" is specifically as follows. Define the current slice time as $a_n$, in seconds, where $n$ is the number of data extractions the system has performed; the current real-time elapsed time as $e_n$, in seconds; and the baseline elapsed time as $k$, in seconds. Let the coefficient $r = k/e_n$ and the function
$$f(r) = \begin{cases} r_{\min}, & r < r_{\min} \\ r, & r_{\min} \le r \le r_{\max} \\ r_{\max}, & r > r_{\max} \end{cases}$$
where $r_{\min}$ is the preset lower limit of the coefficient $r$ and $r_{\max}$ is the preset upper limit of the coefficient $r$. The next slice time is then
$$a_{n+1} = \begin{cases} g, & g \le m \\ m, & g > m \end{cases}$$
where $g = a_n \cdot f(r)$, in seconds, and $m$ is the preset slice threshold, in seconds, i.e. the maximum slice time.
As a further optimization of the above solution, the value of $r_{\min}$ ranges from 0.1 to 1, and the value of $r_{\max}$ ranges from 2 to 5.
As a further optimization of the above solution, the value of the baseline elapsed time $k$ ranges from 1 to 10, in seconds.
As a further optimization of the above solution, the slice threshold $m$ is 5 to 10 times the initial slice time of the system.
As a further optimization of the above solution, the initial slice time of the system is $a_0 = 60M/N$, in seconds, where $M$ is the throughput of the system per minute and $N$ is the volume of data generated per minute by the system's data source.
Compared with the prior art, the present invention has the following advantage:
With the data extraction method provided by the present invention, the system adjusts the slice time of the next extraction according to the slice time of the previous extraction, the baseline elapsed time and the real-time elapsed time. Time slices are therefore divided dynamically, the real-time elapsed time of the next extraction stabilizes around the baseline elapsed time, and the efficiency of data extraction improves.
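The initialization of the slice time and of the slice threshold can be sketched as follows. This is a minimal sketch under stated assumptions: the formula $a_0 = 60M/N$ is inferred from the embodiment's example (a throughput of 500,000 per minute against a source rate of 1,000,000 per minute gives 30 seconds), and the function names are hypothetical.

```python
def initial_slice_time(throughput_per_min, source_rate_per_min):
    """Initial slice time a0 in seconds, assuming a0 = 60 * M / N."""
    if throughput_per_min <= 0 or source_rate_per_min <= 0:
        raise ValueError("both rates must be positive")
    return 60.0 * throughput_per_min / source_rate_per_min


def slice_threshold(a0, multiple=10):
    """Maximum slice time m, taken as 5 to 10 times the initial slice time."""
    if not 5 <= multiple <= 10:
        raise ValueError("multiple should lie in the range 5 to 10")
    return a0 * multiple


a0 = initial_slice_time(500_000, 1_000_000)
m = slice_threshold(a0)
print(a0, m)
```

The default multiple of 10 follows the preferred value given in the embodiment; any value from 5 to 10 is within the stated range.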
Detailed description
The present invention is described in further detail below with reference to an embodiment, but embodiments of the present invention are not limited thereto.
Embodiment:
A data extraction method that dynamically divides time slices for a large-scale ETL process: first, the baseline elapsed time of the system is set and the slice time of the system is initialized; the system then starts a loop that performs data extraction, and records the real-time elapsed time obtained after each extraction completes. After an extraction completes, if the difference between the system's current real-time elapsed time and the baseline elapsed time is greater than a preset threshold, the system adjusts the next slice time according to the current slice time, the baseline elapsed time and the current real-time elapsed time, and performs the next extraction according to that next slice time. If the difference is less than or equal to the preset threshold, the system keeps the current slice time and performs the next extraction according to it.
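The control flow of this embodiment can be sketched as a simple loop. This is a hedged illustration, not the patent's implementation: `run_extraction_loop`, `extract` and `adjust` are hypothetical names, `extract` is assumed to return its own elapsed time in seconds, and the "difference" is assumed to be the absolute difference so that extractions that are too fast also trigger an adjustment.

```python
def run_extraction_loop(extract, adjust, baseline, threshold, initial_slice, rounds):
    """Run `rounds` extractions, adjusting the slice time only when needed.

    extract(slice_time) performs one extraction and returns its elapsed seconds.
    adjust(slice_time, baseline, elapsed) returns the next slice time.
    """
    slice_time = initial_slice
    history = []
    for _ in range(rounds):
        elapsed = extract(slice_time)
        history.append((slice_time, elapsed))
        # Assumption: compare the absolute difference against the threshold.
        if abs(elapsed - baseline) > threshold:
            slice_time = adjust(slice_time, baseline, elapsed)
        # Otherwise the current slice time is kept for the next round.
    return history


# Toy stand-ins: elapsed time grows linearly with the slice time, and the
# adjustment simply rescales the slice time by baseline / elapsed.
history = run_extraction_loop(
    extract=lambda s: s / 10,
    adjust=lambda s, k, e: s * k / e,
    baseline=2.0,
    threshold=0.5,
    initial_slice=30.0,
    rounds=3,
)
print(history)
```

Passing the adjustment rule in as a callable keeps the loop independent of any particular formula for the next slice time.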
Assume that the volume of data generated per minute by the system's data source is constant. Then the longer the slice time, the longer the interval between two extractions, the larger the volume of data each extraction pulls, and the longer the real-time elapsed time of processing. Conversely, the shorter the slice time, the shorter the real-time elapsed time of processing. The preset baseline elapsed time and the initial slice time can be determined from the performance of the whole system and the data volume of the data source. Before the system goes into formal operation, it can be tuned: different slice times are set and the real-time elapsed time under each is measured. If the real-time elapsed time is large, the slice time is reduced; if it is small, the slice time is increased. In this way the slice time that corresponds to the best real-time elapsed time is selected. Initializing the system with this slice time when it goes into formal operation solves the cold-start problem and also keeps the real-time elapsed time reasonable once the system is running.
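The tuning step above can be sketched as a search over candidate slice times. This is only one possible reading: "best" is assumed here to mean "closest to the baseline elapsed time", and `calibrate_slice_time` and `measure_elapsed` are hypothetical names.

```python
def calibrate_slice_time(measure_elapsed, candidates, baseline):
    """Return the candidate slice time whose measured elapsed time is
    closest to the baseline elapsed time.

    measure_elapsed(slice_time) runs one trial extraction and returns
    its elapsed time in seconds.
    """
    if not candidates:
        raise ValueError("at least one candidate slice time is required")
    return min(candidates, key=lambda s: abs(measure_elapsed(s) - baseline))


# Toy model for illustration: elapsed time proportional to the slice time.
best = calibrate_slice_time(
    measure_elapsed=lambda s: s / 15,
    candidates=[10, 20, 30, 40, 60],
    baseline=2.0,
)
print(best)
```

In practice each trial would run a real extraction, so the candidate list should stay short.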
In practice, the volume of data generated per minute by the system's data source is random, so even with the same slice time, the larger the volume of data generated per minute, the longer the real-time elapsed time. During data extraction, even if the real-time elapsed time under the previous slice time was close to the baseline elapsed time, the data volume of the data source at the next extraction may be larger or smaller, making the real-time elapsed time larger or smaller, and the slice time then needs to be adjusted. The method continuously adjusts the slice time of the next extraction according to the outcome of the previous extraction. The adjustment always aims to bring the actual elapsed time of data extraction close to the baseline elapsed time. Time slices are thus divided dynamically throughout the extraction process, the real-time elapsed time of the next extraction stabilizes around the baseline elapsed time, and the efficiency of data extraction improves.
The step in which the system "adjusts the next slice time according to the current slice time, the baseline elapsed time and the current real-time elapsed time" is specifically as follows. Define the current slice time as $a_n$, in seconds, where $n$ is the number of data extractions the system has performed; the current real-time elapsed time as $e_n$, in seconds; and the baseline elapsed time as $k$, in seconds. Let the coefficient $r = k/e_n$ and the function
$$f(r) = \begin{cases} r_{\min}, & r < r_{\min} \\ r, & r_{\min} \le r \le r_{\max} \\ r_{\max}, & r > r_{\max} \end{cases}$$
where $r_{\min}$ is the preset lower limit of the coefficient $r$ and $r_{\max}$ is the preset upper limit of the coefficient $r$. The next slice time is then
$$a_{n+1} = \begin{cases} g, & g \le m \\ m, & g > m \end{cases}$$
where $g = a_n \cdot f(r)$, in seconds, and $m$ is the preset slice threshold, in seconds, i.e. the maximum slice time.
To prevent excessive jitter in the real-time elapsed time, the ratio by which adjacent time slices may fluctuate can be limited: the value of $r_{\min}$ ranges from 0.1 to 1, preferably 0.5, and the value of $r_{\max}$ ranges from 2 to 5, preferably 2. The value of the baseline elapsed time $k$ ranges from 1 to 10, in seconds, preferably 2 seconds. The slice threshold $m$ is the maximum slice time, generally 5 to 10 times the initial slice time of the system, preferably 10 times.
The initial slice time of the system is $a_0 = 60M/N$, in seconds, where $M$ is the throughput of the system per minute and $N$ is the volume of data generated per minute by the system's data source. For example, if the host environment of the system has a throughput of only 500,000 per minute (50万/min) and the data source generates 1,000,000 (100万) per minute on average, the initial slice time is 30 seconds. Because multiple tasks running at the same time compete for resources while the system performs data extraction, it is also required that the volume of data extracted when a slice time elapses generally does not exceed 10% of the network card's traffic.
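The adjustment rule can be sketched in a few lines. This is a minimal sketch under stated assumptions: $f(r)$ is taken to clamp $r$ to the interval from $r_{\min}$ to $r_{\max}$ and the next slice time is taken to be capped at $m$, which is what the variable definitions and the numeric example in the text suggest. The function names are hypothetical, and the defaults are the preferred values from the text with $m = 300$ s.

```python
def clamp_ratio(r, r_min=0.5, r_max=2.0):
    """f(r): limit the coefficient r to the range r_min..r_max (assumed form)."""
    return max(r_min, min(r, r_max))


def next_slice_time(a_n, e_n, k=2.0, m=300.0, r_min=0.5, r_max=2.0):
    """Next slice time a_{n+1} in seconds.

    a_n: current slice time, e_n: current real-time elapsed time,
    k: baseline elapsed time, m: maximum slice time (all in seconds).
    """
    if a_n <= 0 or e_n <= 0 or k <= 0:
        raise ValueError("a_n, e_n and k must be positive")
    r = k / e_n                      # > 1 when the extraction ran faster than baseline
    g = a_n * clamp_ratio(r, r_min, r_max)
    return min(g, m)                 # never exceed the maximum slice time


print(next_slice_time(30.0, 4.0))    # slower than baseline: the slice shrinks
print(next_slice_time(30.0, 1.0))    # faster than baseline: the slice grows
```

Clamping the ratio bounds how much the slice time can change in a single step, and the cap at `m` bounds how large it can ever become.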
Assume $r_{\min} = 0.5$, $r_{\max} = 2$, $k = 2$ s, $a_0 = 30$ s, $m = 10a_0 = 300$ s and $e_0 = 4$ s. Then $r = 0.5$, $f(r) = 0.5$ and $g = 15$ s, so the next slice time is $a_1 = 15$ s. In other words, if the first real-time elapsed time is 4 s, which is greater than the baseline elapsed time of 2 s, the next slice time is reduced to 15 seconds. Now assume $e_0 = 1$ s. Then $r = 2$, $f(r) = 2$ and $g = 60$ s, so the next slice time is $a_1 = 60$ s. In other words, if the first real-time elapsed time is 1 s, which is less than the baseline elapsed time of 2 s, the next slice time is increased to 60 seconds. This shows that the method adjusts the slice time gradually, so that the actual elapsed time approaches the baseline elapsed time.
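The two cases above do not reach either limit. As a supplementary illustration, the following boundary cases use the same parameters ($r_{\min} = 0.5$, $r_{\max} = 2$, $k = 2$ s, $m = 300$ s) with hypothetical inputs that are not taken from the patent, and they assume that $f(r)$ clamps $r$ to the range from $r_{\min}$ to $r_{\max}$ and that the next slice time is capped at $m$.

```latex
\begin{aligned}
a_n = 30\,\mathrm{s},\ e_n = 0.5\,\mathrm{s}: \quad
  & r = \tfrac{2}{0.5} = 4 > r_{\max}, \quad f(r) = 2, \quad
    g = 30 \times 2 = 60\,\mathrm{s}, \quad a_{n+1} = 60\,\mathrm{s} \\
a_n = 30\,\mathrm{s},\ e_n = 10\,\mathrm{s}: \quad
  & r = \tfrac{2}{10} = 0.2 < r_{\min}, \quad f(r) = 0.5, \quad
    g = 30 \times 0.5 = 15\,\mathrm{s}, \quad a_{n+1} = 15\,\mathrm{s} \\
a_n = 240\,\mathrm{s},\ e_n = 1\,\mathrm{s}: \quad
  & r = 2, \quad f(r) = 2, \quad
    g = 240 \times 2 = 480\,\mathrm{s} > m, \quad a_{n+1} = m = 300\,\mathrm{s}
\end{aligned}
```

In the first two cases the coefficient limits keep a single step from more than doubling or halving the slice time; in the third, the slice threshold takes over.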
Finally, it should be noted that the above embodiment is intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art will understand that the technical solution may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications shall be covered by the claims of the present invention.

Claims (6)

1. A data extraction method that dynamically divides time slices for a large-scale ETL process, characterized in that: first, the baseline elapsed time of the system is set and the slice time of the system is initialized; the system then starts a loop that performs data extraction, and records the real-time elapsed time obtained after each extraction completes; after an extraction completes, if the difference between the system's current real-time elapsed time and the baseline elapsed time is greater than a preset threshold, the system adjusts the next slice time according to the current slice time, the baseline elapsed time and the current real-time elapsed time, and performs the next extraction according to that next slice time; if the difference is less than or equal to the preset threshold, the system keeps the current slice time and performs the next extraction according to it.
2. The data extraction method that dynamically divides time slices for a large-scale ETL process according to claim 1, characterized in that the step in which the system "adjusts the next slice time according to the current slice time, the baseline elapsed time and the current real-time elapsed time" is specifically: define the current slice time as $a_n$, in seconds, where $n$ is the number of data extractions the system has performed; the current real-time elapsed time as $e_n$, in seconds; and the baseline elapsed time as $k$, in seconds; let the coefficient $r = k/e_n$ and the function
$$f(r) = \begin{cases} r_{\min}, & r < r_{\min} \\ r, & r_{\min} \le r \le r_{\max} \\ r_{\max}, & r > r_{\max} \end{cases}$$
where $r_{\min}$ is the preset lower limit of the coefficient $r$ and $r_{\max}$ is the preset upper limit of the coefficient $r$; the next slice time is then
$$a_{n+1} = \begin{cases} g, & g \le m \\ m, & g > m \end{cases}$$
where $g = a_n \cdot f(r)$, in seconds, and $m$ is the preset slice threshold, in seconds, i.e. the maximum slice time.
3. The data extraction method that dynamically divides time slices for a large-scale ETL process according to claim 2, characterized in that the value of $r_{\min}$ ranges from 0.1 to 1 and the value of $r_{\max}$ ranges from 2 to 5.
4. The data extraction method that dynamically divides time slices for a large-scale ETL process according to claim 2, characterized in that the value of the baseline elapsed time $k$ ranges from 1 to 10, in seconds.
5. The data extraction method that dynamically divides time slices for a large-scale ETL process according to claim 2, characterized in that the slice threshold $m$ is 5 to 10 times the initial slice time of the system.
6. The data extraction method that dynamically divides time slices for a large-scale ETL process according to claim 1, characterized in that the initial slice time of the system is $a_0 = 60M/N$, in seconds, where $M$ is the throughput of the system per minute and $N$ is the volume of data generated per minute by the system's data source.
CN201610456782.9A 2016-06-22 2016-06-22 Data extraction method that dynamically divides time slices for a large-scale ETL process Pending CN106126612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610456782.9A CN106126612A (en) 2016-06-22 2016-06-22 Data extraction method that dynamically divides time slices for a large-scale ETL process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610456782.9A CN106126612A (en) 2016-06-22 2016-06-22 Data extraction method that dynamically divides time slices for a large-scale ETL process

Publications (1)

Publication Number Publication Date
CN106126612A (en) 2016-11-16

Family

ID=57267992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610456782.9A Pending CN106126612A (en) 2016-06-22 2016-06-22 Data extraction method that dynamically divides time slices for a large-scale ETL process

Country Status (1)

Country Link
CN (1) CN106126612A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20200310
