CN104391929A - Transmission method of data stream in ETL - Google Patents

Info

Publication number
CN104391929A
Authority
CN
China
Prior art keywords
link
data
etl
queue
links
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410671540.2A
Other languages
Chinese (zh)
Inventor
潘博存
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur General Software Co Ltd
Original Assignee
Inspur General Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur General Software Co Ltd filed Critical Inspur General Software Co Ltd
Priority to CN201410671540.2A priority Critical patent/CN104391929A/en
Publication of CN104391929A publication Critical patent/CN104391929A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to a method for transmitting a data stream in ETL, comprising the following steps. Step one: determine the data outflow mode of each link according to the defined ETL process. Step two: determine the number of threads that execute each link. Step three: determine the maximum size of the input and output queues of each link. Step four: determine the number of data items that each link in the ETL process takes from or puts into a queue at one time. Step five: complete the initialization of the ETL process and the creation and initialization of the ETL links. Step six: all links in the ETL process start executing in parallel, and data begins to flow normally. Step seven: after each link of the ETL process has finished processing its data, the links stop in turn and the whole process ends. By splitting large data, using in-memory queues as data buffers between links, and processing the links in parallel, the invention allows large amounts of data to be transferred efficiently among the links of the ETL process.

Description

Transmission method of data stream in ETL
Technical field
The present invention relates to the field of data integration, and specifically to a method for transmitting a data stream in ETL (Extract-Transform-Load, i.e. data extraction, transformation, and loading).
Background art
With the development of science and technology, every industry is becoming more information-driven, and data volumes in all industries are growing toward massive scale. In the data integration field, rising data volumes and performance requirements place ever higher demands on data integration tools. Common data integration tools share data between steps mainly through a database or through memory. Because the data is not split and every step executes sequentially in a single thread, memory clearly becomes the bottleneck when massive data is processed, and the CPU resources of existing servers are not fully used. This wastes system resources and lowers data integration performance.
Therefore, in view of the problems of the prior art, a scheme is needed that makes full use of system resources to transmit massive data efficiently as a stream between the steps of an ETL process, so that the extraction, transformation, and loading of massive data complete efficiently, system resources are saved, and data integration performance improves.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a method for transmitting a data stream in ETL that makes full use of system resources to transmit massive data efficiently between the steps of an ETL process, so that the extraction, transformation, and loading of massive data complete efficiently, system resources are saved, and data integration performance improves.
To achieve the above object, the technical scheme of the present invention is as follows:
A method for transmitting a data stream in ETL, comprising the following steps:
Step 1: according to the defined ETL process, determine the data outflow mode of each link;
Step 2: according to the defined ETL process, determine the number of execution threads of each link;
Step 3: according to the defined ETL process, determine the maximum size of the input and output queues of each link;
Step 4: determine the number of data items that each link in the ETL process takes from or puts into a queue at one time;
Step 5: complete the initialization of the ETL process and the creation and initialization of the ETL links;
Step 6: all links in the ETL process start executing in parallel, and data begins to flow normally;
Step 7: after each link of the ETL process has finished processing its data, the links stop in turn and the whole process ends.
Further, in Step 1, the data outflow mode of each link is set to copy or distribute according to the number of adjacent links and the amount of data each link needs to process. In copy mode, the data of the link is copied once for each direct successor link, and one copy is put into the input queue of each successor. In distribute mode, all output data of the link is put into the input queues of the successor links in rotation, one item per queue in turn.
Further, in Step 3, the maximum queue size is set reasonably according to the processing performance of the link and the speed at which the next link processes data.
Further, Step 5 specifically comprises the following steps:
According to the overall process definition, initialize the process and create all the queues of the process;
According to the definitions made when the ETL process was designed, create the entity of each link, which mainly carries the configuration of that link;
According to the process definition, create the executor corresponding to each link, which performs the actual execution of that link;
Assign the created queues to the executors of the corresponding links according to the rules.
Further, in Step 6, each link uses its own in-memory queue as the intermediate storage carrier for data, so the links do not interfere with one another; the queue mechanism lets massive data flow between the links in split form.
The method for transmitting a data stream in ETL of the present invention makes full use of system resources: by splitting large data, using in-memory queues as data buffers between links, and processing the links in parallel, it achieves efficient flow of massive data between the links of the ETL process.
Term " first ", " second " etc. in instructions of the present invention and claims and above-mentioned accompanying drawing are for distinguishing similar object, and need not be used for describing specific order or precedence.Should be appreciated that the term used like this can exchange in the appropriate case, this is only describe in embodiments of the invention the differentiation mode that the object of same alike result adopts when describing.In addition, term " comprises " and " having " and their any distortion, intention is to cover not exclusive comprising, to comprise the process of a series of unit, method, system, product or equipment being not necessarily limited to those unit, but can comprise clearly do not list or for intrinsic other unit of these processes, method, product or equipment.
Each step is described in detail below.
A method for transmitting a data stream in ETL according to the present invention comprises the following steps:
Step 1: according to the defined ETL process, determine the data outflow mode of each link;
In Step 1, the data outflow mode of each link is set to copy or distribute according to the number of adjacent links and the amount of data each link needs to process. In copy mode, the data of the link is copied once for each direct successor link, and one copy is put into the input queue of each successor. In distribute mode, all output data of the link is put into the input queues of the successor links in rotation, one item per queue in turn.
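The two outflow modes can be illustrated with a minimal Python sketch. The patent does not give an implementation, so the function names `copy_out` and `distribute_out`, the `drain` helper, and the sample records are all hypothetical; the standard-library `queue.Queue` stands in for the in-memory queues.

```python
from itertools import cycle
from queue import Queue


def copy_out(records, successor_queues):
    """Copy mode: every successor queue receives its own copy of each record."""
    for record in records:
        for q in successor_queues:
            q.put(dict(record))  # shallow copy so successors do not share one object


def distribute_out(records, successor_queues):
    """Distribute mode: records are dealt to the successor queues in rotation."""
    targets = cycle(successor_queues)
    for record in records:
        next(targets).put(record)  # one record per queue, then move to the next queue


def drain(q):
    """Demo helper (hypothetical): empty a queue into a list."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items


records = [{"id": i} for i in range(4)]

a, b = Queue(), Queue()
copy_out(records, [a, b])
copied = (drain(a), drain(b))

c, d = Queue(), Queue()
distribute_out(records, [c, d])
distributed = (drain(c), drain(d))
```

With two successors, copy mode leaves all four records in both queues, while distribute mode leaves records 0 and 2 in the first queue and records 1 and 3 in the second.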
Step 2: according to the defined ETL process, determine the number of execution threads of each link;
In Step 2, the number of threads of each link is set reasonably according to the available system resources and the processing capacity of the adjacent links in the ETL process.
Step 3: according to the defined ETL process, determine the maximum size of the input and output queues of each link;
In Step 3, the maximum queue size is set reasonably according to the processing performance of the link and the speed at which the next link processes data. An unsuitable setting makes the queues occupy too much memory and degrades the performance of the whole system; a reasonable setting makes full use of system resources and improves the performance of the whole process.
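As a rough illustration of why a queue limit bounds memory use, the sketch below pairs a fast producer with a slower consumer over a bounded `queue.Queue`. The cap of 3, the item count, and the sleep time are arbitrary values assumed for the demo, not figures from the patent.

```python
import threading
import time
from queue import Queue

MAX_QUEUE_SIZE = 3  # hypothetical cap chosen for the demo


def fast_producer(out_q, count, observed_sizes):
    for i in range(count):
        out_q.put(i)  # blocks while the queue already holds MAX_QUEUE_SIZE items
        observed_sizes.append(out_q.qsize())
    out_q.put(None)  # sentinel: no more data


def slow_consumer(in_q, results):
    while True:
        item = in_q.get()
        if item is None:
            break
        time.sleep(0.001)  # simulate a slower downstream link
        results.append(item)


shared_q = Queue(maxsize=MAX_QUEUE_SIZE)
observed_sizes, results = [], []
producer = threading.Thread(target=fast_producer, args=(shared_q, 20, observed_sizes))
consumer = threading.Thread(target=slow_consumer, args=(shared_q, results))
producer.start()
consumer.start()
producer.join()
consumer.join()
```

Because `put` blocks when the queue is full, the producer is throttled to the consumer's pace and the queue never holds more than `MAX_QUEUE_SIZE` items, whatever the speed difference between the two links.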
Step 4: determine the number of data items that each link in the ETL process takes from or puts into a queue at one time;
In Step 4, the number is set flexibly to handle cases where a link can only start processing after the previous link has finished entirely. For example, in a transformation link, some aggregation scenarios require all data to arrive before the transformation can start. In other cases, data is processed in amounts that match the data volume and the actual processing capacity of the link.
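One way to read the per-link item count is as a batch size for each queue access. The helper below is a sketch under that assumption; `take_batch` and its `timeout` parameter are hypothetical names, not part of the patent.

```python
from queue import Empty, Queue


def take_batch(in_q, batch_size, timeout=0.1):
    """Return up to batch_size items; fewer if the queue stays empty for `timeout` seconds."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(in_q.get(timeout=timeout))
        except Empty:
            break  # nothing more arrived in time, hand back what we have
    return batch


q = Queue()
for i in range(5):
    q.put(i)

first = take_batch(q, batch_size=2)                 # a small batch for record-level work
rest = take_batch(q, batch_size=10, timeout=0.01)   # a larger batch takes whatever is left
```

A batch size of 1 corresponds to record-by-record processing, while an aggregating link would use a large batch size or wait until its upstream link has finished.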
Step 5: complete the initialization of the ETL process and the creation and initialization of the ETL links;
Step 5 specifically comprises the following steps:
According to the overall process definition, initialize the process and create all the queues of the process;
According to the definitions made when the ETL process was designed, create the entity of each link, which mainly carries the configuration of that link;
According to the process definition, create the executor corresponding to each link, which performs the actual execution of that link;
Assign the created queues to the executors of the corresponding links according to the rules.
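The initialization sub-steps might be sketched as follows. The `LinkConfig` and `Executor` classes, their field names, and the `build_flow` function are hypothetical; the only assumption taken from the text is that each link has a configuration entity, an executor, and queues assigned so that one link's output queue is the next link's input queue.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import List, Optional


@dataclass
class LinkConfig:
    """Entity of a link: carries the settings chosen in Steps 1 to 4."""
    name: str
    threads: int = 1
    max_queue_size: int = 100
    batch_size: int = 1
    out_mode: str = "distribute"  # or "copy"


@dataclass
class Executor:
    """Executor of a link: holds the queues it reads from and writes to."""
    config: LinkConfig
    input_queue: Optional[Queue] = None
    output_queues: List[Queue] = field(default_factory=list)


def build_flow(configs, edges):
    """Create one executor per link and wire queues along (upstream, downstream) edges."""
    executors = {c.name: Executor(c) for c in configs}
    for upstream, downstream in edges:
        target = executors[downstream]
        if target.input_queue is None:
            target.input_queue = Queue(maxsize=target.config.max_queue_size)
        # The downstream input queue doubles as an upstream output queue.
        executors[upstream].output_queues.append(target.input_queue)
    return executors


flow = build_flow(
    [LinkConfig("extract"), LinkConfig("transform", threads=2, max_queue_size=50), LinkConfig("load")],
    [("extract", "transform"), ("transform", "load")],
)
```

In this sketch the first link has no input queue because it reads from the data source, and the last link has no output queues because it writes to the target.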
Step 6: all links in the ETL process start executing in parallel, and data begins to flow normally;
In Step 6, all the executors start at the same time and run in parallel on multiple threads; depending on the configuration, some executors may also run as multiple instances with multiple threads. This takes full advantage of the multiple CPUs of current computers and improves the vertical scalability of the system.
In Step 6, each link uses its own in-memory queue as the intermediate storage carrier for data, so the links do not interfere with one another. The queue mechanism lets massive data flow efficiently between the links in split form, which ensures that each link can handle large data volumes.
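A single link served by several threads could look like the sketch below. The `STOP` sentinel, the `worker` and `run_link` names, and the squaring transform are assumptions made for illustration; the patent itself only states that executors run multi-threaded.

```python
import threading
from queue import Queue

STOP = object()  # hypothetical sentinel telling one worker thread to exit


def worker(in_q, out_q, transform):
    while True:
        item = in_q.get()
        if item is STOP:
            break
        out_q.put(transform(item))


def run_link(in_q, out_q, transform, threads):
    """Start `threads` workers that all consume the same input queue."""
    pool = [threading.Thread(target=worker, args=(in_q, out_q, transform)) for _ in range(threads)]
    for t in pool:
        t.start()
    return pool


in_q, out_q = Queue(), Queue()
pool = run_link(in_q, out_q, lambda x: x * x, threads=3)
for i in range(10):
    in_q.put(i)
for _ in pool:
    in_q.put(STOP)  # one sentinel per worker
for t in pool:
    t.join()

# Output order is not guaranteed with several workers, so sort for comparison.
squares = sorted(out_q.get() for _ in range(10))
```

Note that with more than one worker per link the output order can differ from the input order, which matters for links that depend on record order.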
Step 7: after each link of the ETL process has finished processing its data, the links stop in turn and the whole process ends.
In Step 7, each link decides whether to end its execution by jointly checking whether the previous link has finished and whether its own input queue has remained empty throughout a specified time;
After each link ends, its execution threads are terminated; once all links have finished, one round of data processing is complete.
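The combined stop condition can be written as a small predicate. This is a sketch only: `should_stop`, the `idle_limit` parameter, and the use of a `threading.Event` as the "previous link finished" signal are assumptions, since the patent does not say how the signal or the time window is represented.

```python
import time
from queue import Queue
from threading import Event


def should_stop(upstream_done, in_q, idle_since, idle_limit, now=None):
    """Stop only if upstream has finished AND the input queue has been empty for idle_limit seconds."""
    now = time.monotonic() if now is None else now
    return upstream_done.is_set() and in_q.empty() and (now - idle_since) >= idle_limit


done, q = Event(), Queue()
t0 = 100.0  # fixed clock values keep the demo deterministic

checks = [should_stop(done, q, t0, 0.5, now=t0 + 1.0)]       # upstream still running
done.set()
q.put("row")
checks.append(should_stop(done, q, t0, 0.5, now=t0 + 1.0))   # data still pending
q.get()
checks.append(should_stop(done, q, t0, 0.5, now=t0 + 0.1))   # not idle long enough
checks.append(should_stop(done, q, t0, 0.5, now=t0 + 1.0))   # all conditions met
```

Only the last check returns true, which reflects the rule that neither an idle queue nor a finished upstream link is sufficient on its own.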
In an embodiment of the present invention, when a link has several different successor links and the system's memory and CPU resources are ample, the preceding link sets its data outflow mode to copy and each successor link runs several threads at the same time, so that each link processes the data in its input queue with multiple instances simultaneously. When a link takes a long time to execute, or executes on a remote server, several identical instances of the link can be created in the process and the preceding link sets its data outflow mode to distribute; this both ensures the correctness of the data and makes better use of system resources.
For a given link and given system resources, more threads are not always better; the thread count of each link can be tuned by testing in the target environment. For two adjacent links, the output queue of the former is the input queue of the latter. When the former processes data very quickly, it fills that queue with a large amount of data in a short time; if the latter processes data more slowly, a large amount of data accumulates in the in-memory queue and memory usage grows. The maximum capacity of the input and output queues therefore needs to be set according to the processing capacity of each link, so as to control memory usage.
The number of data items handled at a time by each link is set according to the granularity of the data that link processes. If the data is processed record by record, the number can be set to 1. If a link needs to wait for everything from the previous link, for example to aggregate all the data, a larger number can be set, or the link can be set to wait until the previous link has finished before it starts processing.
In an embodiment of the present invention, the links execute in parallel as follows:
All links of the ETL process start executing, and the executor of each link starts listening on its input queue;
After the first link of the ETL process obtains data from the data source, it puts the data into its output queue (queue1), then obtains the next portion of data and puts it into the queue, repeating this until all data has been obtained, at which point the link ends;
After the first link (step1) puts data into its output queue (queue1) for the first time, the link listening on that queue (step2) immediately detects the arrival of data, takes the data from the queue, and processes it. After processing, if there is a successor link, step2 puts the data into its own output queue (queue2) and then continues listening on its input queue, repeating the above. Once it has received the notification that the previous link has finished and its input queue has no more data to take, the link ends.
The subsequent links of the ETL process all execute in the same way; after all links have finished, the whole process ends and one ETL run is complete.
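The step1, queue1, step2, queue2 chain described above can be sketched end to end in Python. The names `queue1`, `queue2`, `step1`, and `step2` follow the text; everything else (the `source_link` and `processing_link` functions, the `Event` used as the "previous link finished" notification, the doubling transform, the queue size of 4, and the 0.05 second poll interval) is assumed for the demo.

```python
import threading
from queue import Empty, Queue


def source_link(rows, out_q, done):
    """step1: read from the data source and push every row into queue1."""
    for row in rows:
        out_q.put(row)
    done.set()  # tell the next link that no more data will arrive


def processing_link(in_q, out_q, upstream_done, done, fn):
    """step2 and later: listen on the input queue until upstream is done and the queue is empty."""
    while True:
        try:
            item = in_q.get(timeout=0.05)
        except Empty:
            if upstream_done.is_set() and in_q.empty():
                break
            continue
        result = fn(item)
        if out_q is not None:  # the last link has no successor queue
            out_q.put(result)
    done.set()


queue1, queue2 = Queue(maxsize=4), Queue(maxsize=4)
step1_done, step2_done, step3_done = threading.Event(), threading.Event(), threading.Event()
loaded = []  # stands in for the load target

links = [
    threading.Thread(target=source_link, args=(range(10), queue1, step1_done)),
    threading.Thread(target=processing_link, args=(queue1, queue2, step1_done, step2_done, lambda x: x * 2)),
    threading.Thread(target=processing_link, args=(queue2, None, step2_done, step3_done, loaded.append)),
]
for t in links:
    t.start()
for t in links:
    t.join()
```

All three links run concurrently, each link ends only after its predecessor has signalled completion and its own input queue is drained, and with one thread per link the rows arrive at the target in source order.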
The present invention moves data between the links of an ETL process by splitting massive data, using in-memory queues as temporary storage for the data between links, and making full use of system resources to execute the links in parallel. This achieves efficient flow of the data stream between the links and also improves the vertical scalability of the ETL tool.
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and can of course also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, and dedicated components. In general, any function performed by a computer program can readily be implemented with corresponding hardware, and the specific hardware structure used to implement a given function can take many forms, such as analog circuits, digital circuits, or dedicated circuits. For the present invention, however, a software implementation is the better embodiment in most cases. On this understanding, the technical scheme of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
In summary, the above embodiments only illustrate the technical scheme of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the technical schemes described in the embodiments can still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical schemes to depart from the spirit and scope of the technical schemes of the embodiments of the present invention.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of the data stream transmission flow of the method of the present invention.
Detailed description of the embodiments
Embodiments of the present invention provide a method for transmitting a data stream in ETL.
To make the object, features, and advantages of the present invention clearer and easier to understand, the technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the embodiments described are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.

Claims (5)

1. A method for transmitting a data stream in ETL, characterized in that it comprises the following steps:
Step 1: according to the defined ETL process, determine the data outflow mode of each link;
Step 2: according to the defined ETL process, determine the number of execution threads of each link;
Step 3: according to the defined ETL process, determine the maximum size of the input and output queues of each link;
Step 4: determine the number of data items that each link in the ETL process takes from or puts into a queue at one time;
Step 5: complete the initialization of the ETL process and the creation and initialization of the ETL links;
Step 6: all links in the ETL process start executing in parallel, and data begins to flow normally;
Step 7: after each link of the ETL process has finished processing its data, the links stop in turn and the whole process ends.
2. The method for transmitting a data stream in ETL as claimed in claim 1, characterized in that, in Step 1, the data outflow mode of each link is set to copy or distribute according to the number of adjacent links and the amount of data each link needs to process; in copy mode, the data of the link is copied once for each direct successor link, and one copy is put into the input queue of each successor; in distribute mode, all output data of the link is put into the input queues of the successor links in rotation, one item per queue in turn.
3. The method for transmitting a data stream in ETL as claimed in claim 1, characterized in that, in Step 3, the maximum queue size is set reasonably according to the processing performance of the link and the speed at which the next link processes data.
4. The method for transmitting a data stream in ETL as claimed in claim 2 or 3, characterized in that Step 5 specifically comprises the following steps:
According to the overall process definition, initialize the process and create all the queues of the process;
According to the definitions made when the ETL process was designed, create the entity of each link, which mainly carries the configuration of that link;
According to the process definition, create the executor corresponding to each link, which performs the actual execution of that link;
Assign the created queues to the executors of the corresponding links according to the rules.
5. The method for transmitting a data stream in ETL as claimed in claim 4, characterized in that, in Step 6, each link uses its own in-memory queue as the intermediate storage carrier for data, so the links do not interfere with one another, and the queue mechanism lets massive data flow between the links in split form.
CN201410671540.2A 2014-11-21 2014-11-21 Transmission method of data stream in ETL Pending CN104391929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410671540.2A CN104391929A (en) 2014-11-21 2014-11-21 Transmission method of data stream in ETL

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410671540.2A CN104391929A (en) 2014-11-21 2014-11-21 Transmission method of data stream in ETL

Publications (1)

Publication Number Publication Date
CN104391929A (en) 2015-03-04

Family

ID=52609833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410671540.2A Pending CN104391929A (en) 2014-11-21 2014-11-21 Transmission method of data stream in ETL

Country Status (1)

Country Link
CN (1) CN104391929A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469065A (en) * 2016-09-06 2017-03-01 广西科技大学第附属医院 A kind of software secondary development method based on data drain
CN114385136A (en) * 2021-12-29 2022-04-22 武汉达梦数据库股份有限公司 Flow decomposition method and device for running ETL (extract transform load) by Flink framework
US12026005B2 (en) 2022-10-18 2024-07-02 Sap Se Control mechanism of extract transfer and load (ETL) processes to improve memory usage

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071842A1 (en) * 2003-08-04 2005-03-31 Totaletl, Inc. Method and system for managing data using parallel processing in a clustered network
US20080222634A1 (en) * 2007-03-06 2008-09-11 Yahoo! Inc. Parallel processing for etl processes
CN101388844A (en) * 2008-11-07 2009-03-18 东软集团股份有限公司 Data flow processing method and system
CN101882165A (en) * 2010-08-02 2010-11-10 山东中创软件工程股份有限公司 Multithreading data processing method based on ETL (Extract Transform Loading)
CN102722355A (en) * 2012-06-04 2012-10-10 南京中兴软创科技股份有限公司 Workflow mechanism-based concurrent ETL (Extract, Transform and Load) conversion method
GB2505938A (en) * 2012-09-17 2014-03-19 Ibm ETL debugging

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469065A (en) * 2016-09-06 2017-03-01 广西科技大学第附属医院 A kind of software secondary development method based on data drain
CN114385136A (en) * 2021-12-29 2022-04-22 武汉达梦数据库股份有限公司 Flow decomposition method and device for running ETL (extract transform load) by Flink framework
CN114385136B (en) * 2021-12-29 2022-11-22 武汉达梦数据库股份有限公司 Flow decomposition method and device for running ETL (extract transform load) by Flink framework
US12026005B2 (en) 2022-10-18 2024-07-02 Sap Se Control mechanism of extract transfer and load (ETL) processes to improve memory usage

Similar Documents

Publication Publication Date Title
EP3404587B1 (en) Cnn processing method and device
EP3129870B1 (en) Data parallel processing method and apparatus based on multiple graphic procesing units
US10705878B2 (en) Task allocating method and system capable of improving computational efficiency of a reconfigurable processing system
CN106325967B (en) A kind of hardware-accelerated method, compiler and equipment
WO2016078008A1 (en) Method and apparatus for scheduling data flow task
EP4209902A1 (en) Memory allocation method, related device, and computer readable storage medium
CN103765384A (en) Data processing system and method for task scheduling in a data processing system
DE112010005705T5 (en) Reschedule workload in a hybrid computing environment
CN105808328A (en) Task scheduling method, device and system
CN103885826B (en) Real-time task scheduling implementation method of multi-core embedded system
DE112011101469T5 (en) Compiling software for a hierarchical distributed processing system
CN107656813A (en) The method, apparatus and terminal of a kind of load dispatch
Ahmadinia et al. Task scheduling for heterogeneous reconfigurable computers
CN110300959B (en) Method, system, device, apparatus and medium for dynamic runtime task management
CN106651748B (en) A kind of image processing method and image processing apparatus
CN108021449A (en) One kind association journey implementation method, terminal device and storage medium
US20210026696A1 (en) Scheduling of a plurality of graphic processing units
CN106528065B (en) A kind of thread acquisition methods and equipment
CN104391929A (en) Transmission method of data stream in ETL
DE102022105725A1 (en) METHODS AND EQUIPMENT FOR PERFORMING WEIGHT AND ACTIVATION COMPRESSION AND DECOMPRESSION
US9753769B2 (en) Apparatus and method for sharing function logic between functional units, and reconfigurable processor thereof
DE102015116036A1 (en) Distributed real-time computational structure using in-memory processing
CN106293670B (en) Event processing method and device and server
CN105653347A (en) Server, resource management method and virtual machine manager
CN107195144A (en) Method, device and the computer-readable recording medium of managing payment terminal hardware module

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150304