CN104317751A - Data stream processing system on GPU (Graphic Processing Unit) and data stream processing method thereof - Google Patents


Info

Publication number
CN104317751A
Authority
CN
China
Prior art keywords
data
data stream
gpu
loading
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410657243.2A
Other languages
Chinese (zh)
Other versions
CN104317751B (en)
Inventor
卢晓伟 (Lu Xiaowei)
沈铂 (Shen Bo)
周勇 (Zhou Yong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201410657243.2A
Publication of CN104317751A
Application granted
Publication of CN104317751B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/22Handling requests for interconnection or transfer for access to input/output bus using successive scanning, e.g. polling
    • G06F13/225Handling requests for interconnection or transfer for access to input/output bus using successive scanning, e.g. polling with priority control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0016Inter-integrated circuit (I2C)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data stream processing system on a GPU (Graphics Processing Unit) and a data stream processing method thereof, belonging to the technical field of data stream processing on GPUs. In the system, the data stream from a data source is delivered to a client through the data stream processing system, which comprises a CPU host and a GPU device. The CPU host comprises a CPU-side load engine module, a CPU-side buffer module, a data stream preprocessing module, a data stream load shedding module and a visualization module; the GPU device comprises a GPU-side load engine module, a GPU-side buffer module, a data stream synopsis extraction module, a data stream processing model library and a data stream processing module. The load/store unit of the CPU-side load engine module interacts with the data source, with the load/store unit of the GPU-side load engine module and with the client through an interconnection network. The system offers a significant speed advantage, satisfies the real-time requirements of high-dimensional data streams well, and can be widely applied as a general analysis method in high-dimensional data stream mining.

Description

Data stream processing system on a GPU and data stream processing method thereof
Technical field
The present invention relates to the technical field of data stream processing on GPUs, and specifically to a data stream processing system on a GPU and a data stream processing method thereof.
Background technology
The GPU (Graphics Processing Unit) is the "heart" of a graphics card, playing a role analogous to that of the CPU in a computer. A GPU has very high memory bandwidth and a large number of execution units; it can take over some complex computational work from the CPU, reducing the graphics card's dependence on it.
Traditionally, the application of GPUs was limited to graphics processing and rendering tasks, which was undoubtedly a significant waste of computational resources. As GPU programmability improved, research on using GPUs for general-purpose computation gradually became active. Using the GPU for computation in fields beyond graphics rendering became known as GPGPU (General-Purpose computing on Graphics Processing Units). GPGPU computing usually adopts a heterogeneous CPU+GPU model: the CPU executes the complex logic and transaction management that are unsuited to data parallelism, while the GPU performs the compute-intensive, large-scale data-parallel work. Using the GPU's powerful processing capability and high bandwidth to make up for insufficient CPU performance, and thereby tap the latent performance of the machine, offers significant advantages in cost and cost-effectiveness. However, traditional GPGPU was constrained by hardware programmability and development methods; its applications were restricted and development was very difficult.
In 2007 NVIDIA released CUDA (Compute Unified Device Architecture), a programming interface that makes up for the shortcomings of traditional GPGPU. With the CUDA programming interface, GPU resources can be invoked directly from C without mapping the computation onto a graphics API, which removes the obstacle to non-graphics programming of GPUs.
The CUDA model treats the CPU as the host (Host) and the GPU as a coprocessor (co-processor) or device (device), with the two working cooperatively. The CPU handles logic-heavy transactions and serial computation, while the GPU concentrates on executing highly threaded parallel processing tasks. CPU and GPU each have their own independent memory address spaces: host-side main memory and device-side video memory (device memory). Once the parallel computation functions (kernels) in a program are identified, that part of the computation can be handed to the GPU.
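To make the host/device division of labor concrete, here is a minimal self-contained CUDA sketch (not part of the patent; the kernel and all names are illustrative): the host allocates device memory, copies data over, launches a kernel for the data-parallel part, and copies the result back.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;                   // data-parallel work on the GPU
}

int main() {
    const int n = 1024;
    float h_data[n];                                // host-side main memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float* d_data;                                  // device-side memory
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    printf("h_data[0] = %f\n", h_data[0]);          // prints 2.0
    return 0;
}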
In fact, a data stream (by definition) is a continuously moving sequence of elements, where each element consists of a set of related data. Let t denote an arbitrary timestamp and a_t the data arriving at that timestamp; the stream can then be written as ..., a_{t-1}, a_t, a_{t+1}, .... Unlike traditional application models, the streaming data model has four common properties: (1) data arrive in real time; (2) the arrival order is independent and not controlled by the application system; (3) the data scale is huge and its maximum cannot be predicted; (4) once an element has been processed it cannot be retrieved again unless it was deliberately saved, or retrieving it again is very expensive.
At the same time, a stream has a dual identity: (1) it exists as a program variable visible to software; (2) it exists as a management unit visible to hardware. In practical applications a stream often carries many attributes; when the stream is mapped onto hardware, these attributes are either preserved or transformed into a form the hardware can discover.
In prior-art data mining, preprocessing is usually performed before mining the static data sets in a database, in order to eliminate erroneous data such as noise, null values and outliers and thus guarantee the accuracy of the results. Data streams inevitably contain the same kinds of erroneous data, so preprocessing them is likewise very necessary for improving the accuracy of mining results. However, data stream mining is generally performed online, so the data cannot be preprocessed before mining begins.
How can GPU parallel computing be applied to the field of data stream mining? And under constrained computational resources, how can the real-time behavior and generality of data stream processing be guaranteed?
Summary of the invention
The technical task of the present invention is to provide a data stream processing system on a GPU, and a data stream processing method thereof, that has a significant speed advantage, satisfies the real-time requirements of high-dimensional data streams well, and can be widely applied as a general analysis method in high-dimensional data stream mining.
The technical task of the present invention is achieved in the following manner.
The data source outputs a high-dimensional time-series data stream; after processing by the data stream processing system, the frequent patterns or query results of the data stream are output to the client.
A data stream processing system on a GPU: the data stream from the data source (data sources) passes through the data stream processing system to the client (client); the system comprises a CPU host (CPU-Host) and a GPU device (GPU-Device).
The CPU host comprises a CPU-side load engine module (CPU-Side Load Engine Area), a CPU-side buffer module (CPU-Side Buffer Area), a data stream preprocessing module (Data Stream Preprocessing Area), a data stream load shedding module (Data Stream Load Shedding Area) and a visualization module (Visual Area). The CPU-side load engine module is provided with a load/store unit (Load/Store Unit), and the CPU-side buffer module with main memory (Main Memory, MM). The load/store unit of the CPU-side load engine module, the data stream preprocessing module, the data stream load shedding module and the visualization module all interact with the main memory of the CPU-side buffer module, and the load/store unit of the CPU-side load engine module also interacts with the visualization module.
The GPU device comprises a GPU-side load engine module (GPU-Side Load Engine Area), a GPU-side buffer module (GPU-Side Buffer Area), a data stream synopsis extraction module (Data Stream Synopsis Extraction Area), a data stream processing model library (Data Stream Processing Model Library) and a data stream processing module (Data Stream Processing Area). The GPU-side load engine module is provided with a load/store unit (Load/Store Unit), and the GPU-side buffer module with device memory (Device Memory, DM). The data stream synopsis extraction module integrates synopsis extraction methods for the data stream processing module to call, and the data stream processing model library integrates data stream processing algorithms for the data stream processing module to call. The load/store unit of the GPU-side load engine module and the data stream processing module both interact with the device memory of the GPU-side buffer module; the data stream synopsis extraction module and the data stream processing model library are both connected to the load/store unit of the GPU-side load engine module; and a storage space opened up in the device memory of the GPU-side buffer module serves as the sliding window.
The load/store unit of the CPU-side load engine module interacts with the data source, with the load/store unit of the GPU-side load engine module and with the client through an interconnection network (Interconnection Network).
The CPU-side buffer module is also provided with a memory manager for managing main memory; the memory manager contains an input monitor, which monitors the unprocessed data streams temporarily stored in main memory. The CPU-side load engine module comprises a speed regulator (Speed Regulator), the load/store unit (Load/Store Unit) and an initialization integrator (Initialization Integrator). The speed regulator adjusts, according to the buffer status of main memory, the rate at which the data stream from the data source flows into the load/store unit of the CPU-side load engine module; a feedback mechanism (Feedback Mechanism) is provided in the speed regulator. The initialization integrator integrates the initialization operations of the CPU host and the GPU device.
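To make the feedback mechanism concrete, the following sketch shows one plausible shape for a buffer-driven speed regulator. It is purely illustrative: the SpeedRegulator type, the thresholds and the fill-ratio interface are assumptions, not taken from the patent.

/* Hypothetical feedback-based speed regulator: the ingest rate granted to the
 * load/store unit shrinks as main memory fills and recovers as it drains. */
#include <stdio.h>

typedef struct {
    double max_rate;     /* maximum ingest rate (tuples per second) */
    double current_rate; /* rate currently granted to the load/store unit */
} SpeedRegulator;

void regulate(SpeedRegulator* r, double buffer_fill_ratio) {
    if (buffer_fill_ratio > 0.9)
        r->current_rate *= 0.5;   /* back off hard when the buffer is nearly full */
    else if (buffer_fill_ratio < 0.5)
        r->current_rate *= 1.1;   /* cautiously ramp back up when it drains */
    if (r->current_rate > r->max_rate)
        r->current_rate = r->max_rate;
}

int main(void) {
    SpeedRegulator reg = { 100000.0, 100000.0 };
    regulate(&reg, 0.95);         /* buffer nearly full: rate is halved */
    printf("granted rate: %.0f tuples/s\n", reg.current_rate);
    return 0;
}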
A data stream processing method on a GPU: the data stream output by the data source is processed by the data stream processing system and the resulting data are transferred to the client. The processing flow of the data stream is as follows:
(1) Load the data stream: the data stream from the data source flows into the load/store unit of the CPU-side load engine module, which stores it into the main memory of the CPU-side buffer module;
(2) Preprocess the data stream: the data stream preprocessing module preprocesses the raw data stream in main memory and stores the preprocessed stream back into main memory;
(3) Transmit the data stream: the preprocessed data stream passes from main memory to the load/store unit of the CPU-side load engine module, from there onto the interconnection network, arrives at the load/store unit of the GPU-side load engine module, and is then loaded by that unit into the sliding window in device memory;
(4) Extract the data stream synopsis: the data stream processing module calls a synopsis extraction method from the data stream synopsis extraction module, performs synopsis extraction on the data stream in the sliding window, and stores the resulting synopsis data structure in device memory;
(5) Process the data stream: the data stream processing module calls a data stream processing algorithm from the data stream processing model library, processes the synopsis data, and stores the results in device memory;
(6) Transmit the results: the results pass from the device memory of the GPU-side buffer module to the load/store unit of the GPU-side load engine module, are sent onto the interconnection network, and arrive at the load/store unit of the CPU-side load engine module, which either loads them into main memory or sends them to the visualization module;
(7) Visualize the results: the visualization module standardizes the results and sends them to the load/store unit of the CPU-side load engine module, which presents them to the client.
In the data stream processing method on a GPU, in step (2) the data stream preprocessing module applies preprocessing methods to the raw data stream in main memory. The preprocessing methods comprise data cleaning, data integration, data transformation and data reduction; a single method may be used, or several methods may be combined. A sketch of a simple cleaning pass follows the list below.
(i) Data cleaning: fill in missing values, smooth noisy data, remove outliers and resolve data inconsistencies. Data cleaning is a very important step in data preprocessing, but also the most time-consuming one; missing values, noise and inconsistencies all make data inaccurate, and data cleaning effectively prevents this.
(ii) Data integration: schema integration and object matching, removal of redundant data, and detection and handling of data value conflicts. Data integration merges data originally stored in multiple data sources into one data source, stored centrally in a unified format, to facilitate subsequent data processing.
(iii) Data transformation: data smoothing, data aggregation, data generalization, data normalization and attribute construction. Data transformation converts the data into a form suitable for data mining. For example, the scales of different data items may be inconsistent, in which case the dimensions of high-dimensional data items need to be reduced to diminish the differences between them and ease processing.
(iv) Data reduction: data cube aggregation, attribute subset selection, dimensionality reduction, numerosity reduction, discretization and concept hierarchy generation. Data reduction is also known as data condensation technology.
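As a concrete illustration of the cleaning step, the hypothetical sketch below (not from the patent) fills missing values with the column mean and clamps outliers; real cleaning logic would be application-specific. It assumes missing values are encoded as NaN.

#include <math.h>
#include <stdio.h>

void clean(float* x, int n, float limit) {
    /* first pass: mean over the values that are present */
    double sum = 0.0; int cnt = 0;
    for (int i = 0; i < n; ++i)
        if (!isnan(x[i])) { sum += x[i]; ++cnt; }
    float mean = cnt ? (float)(sum / cnt) : 0.0f;

    /* second pass: fill missing values, clamp outliers */
    for (int i = 0; i < n; ++i) {
        if (isnan(x[i]))              x[i] = mean;          /* fill missing */
        else if (x[i] > mean + limit) x[i] = mean + limit;  /* clamp high */
        else if (x[i] < mean - limit) x[i] = mean - limit;  /* clamp low */
    }
}

int main(void) {
    float v[5] = { 1.0f, NAN, 2.0f, 100.0f, 3.0f };
    clean(v, 5, 10.0f);
    for (int i = 0; i < 5; ++i) printf("%g ", v[i]);
    return 0;
}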
In the data stream processing method on a GPU, after step (2), when the data stream is overloaded, it is subjected to load shedding by the data stream load shedding module. The concrete steps are:
I. The input monitor of the memory manager of the CPU-side buffer module monitors the unprocessed data streams temporarily stored in main memory and determines whether, within one unit of time, the volume of newly arrived data exceeds the processing capacity of the data stream processing module of the GPU device;
II. If the data stream processing module of the GPU device can process all of the data, the data stream is transmitted onward; if the newly arrived volume exceeds its processing capacity, i.e. the data stream is overloaded, the data stream is handed to the data stream load shedding module;
III. The data stream load shedding module performs load shedding on the data stream;
IV. The data stream load shedding module moves the remaining data back into main memory for the next step, transmitting the data stream.
In the data stream processing method on a GPU, the load shedding performed by the data stream load shedding module adopts one or a combination of the following strategies (a sketch of strategy III follows this list):
I. Data-based dropping: among the received but unprocessed data in the time-series data stream, find the longest data items and drop them;
II. Attribute-based trimming: each data item in the stream has several attributes; remove from the data the attributes that occur least frequently, thereby trimming the attributes of the data;
III. Priority-based dropping: each newly arrived data item is assigned a priority; among the received but unprocessed data, select the items with the lowest priority and drop them. The three strategies each have their own aim. The data-based dropping strategy discards the long items that would cost the system expensive time to process, attempting to reduce system load as quickly as possible; it is efficiency-oriented. The attribute-based trimming strategy deletes from the data those attributes that have no significant effect on the results and occur least frequently. The priority-based dropping strategy deletes the items with the lowest priority. By comparison, the latter two strategies are precision-oriented, because they attempt to retain high-precision mining results.
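As one concrete shape for strategy III, the hypothetical sketch below drops the lowest-priority pending items until the backlog fits a given capacity; the Item type and the capacity threshold are inventions of this sketch, not taken from the patent.

#include <stdio.h>

typedef struct { int id; int priority; int alive; } Item;

void shed(Item* pending, int n, int capacity) {
    int live = n;
    while (live > capacity) {
        int victim = -1;
        for (int i = 0; i < n; ++i)            /* find the lowest priority */
            if (pending[i].alive &&
                (victim < 0 || pending[i].priority < pending[victim].priority))
                victim = i;
        pending[victim].alive = 0;             /* drop it */
        --live;
    }
}

int main(void) {
    Item q[4] = { {1, 5, 1}, {2, 1, 1}, {3, 9, 1}, {4, 2, 1} };
    shed(q, 4, 2);                             /* capacity: 2 items per unit time */
    for (int i = 0; i < 4; ++i)
        if (q[i].alive) printf("kept item %d (priority %d)\n", q[i].id, q[i].priority);
    return 0;
}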
In the data stream processing method on a GPU, device memory uses global memory (Global Memory) to store the various data (synopsis data, intermediate data, result data, and so on). The sliding window (Sliding Window, SW) opened up in device memory stores the data stream that arrives at the GPU device from the CPU host. (Because the data stream is unbounded while storage space is finite, and in order to reduce data copying between main memory and device memory and obtain more useful mining results, a portion of device memory is set aside as a sliding window for temporarily storing data.) The sliding window is defined over a tuple count, i.e. it is a window of fixed size that holds the K most recently arrived data items.
The data in the sliding window are handled with an overwriting circular sliding window method: new data directly overwrite the data about to expire, and a layout transformation function is provided to maintain the logical layout of the data in the window during updates.
In the data stream processing method on a GPU, the synopsis extraction of step (4) turns the data stream into a synopsis data structure. The synopsis extraction methods comprise the sampling (Sampling), wavelet (Wavelet), sketch (Sketch) and histogram (Histogram) methods. The data stream is compressed, and a data structure far smaller than the scale of the whole stream is constructed to preserve its principal characteristics; this is called the synopsis data structure, and the approximate values obtained from it lie within the user's tolerance range.
In the data stream processing method on a GPU, the data stream processing model library integrates the various algorithms used in data stream processing, comprising query processing algorithms, clustering algorithms, classification algorithms, frequent itemset mining algorithms, and correlation analysis algorithms between multiple data streams.
In the data stream processing method on a GPU, the task of the data stream processing module is to call the synopsis extraction methods of the data stream synopsis extraction module to perform synopsis extraction on the data stream, and to call the data stream processing algorithms of the data stream processing model library to perform parallel computation on the synopsis data. The data stream processing module comprises a data stream input assembler (Data Stream Input Assembler), a global thread block scheduler (Global Block Scheduler) and a compute array (Compute Array); the compute array is provided with shared memory and load/store units. The data stream input assembler reads data from device memory into the shared memory of the data stream processing module; the global thread block scheduler dispatches and manages the thread blocks, threads and instructions in shared memory; and the compute array performs the threads' computation. During computation, the load/store units in the compute array load data from device memory into shared memory.
The data stream processing system on a GPU of the present invention and its data stream processing method have the following advantages:
1. Generality: previous GPU-accelerated data stream processing systems were confined to a single task, such as clustering or classification. The system of the present invention, by contrast, suits the high-dimensional time-series data streams of many applications; it covers preprocessing, load shedding, synopsis extraction, mining and other functions, and the data stream processing model library in the GPU part integrates a variety of data stream processing algorithms (query processing, clustering, classification, frequent itemset mining and correlation analysis), so it can complete the multiple tasks of data stream processing. This gives the present invention its generality.
2. Efficiency: the parallel portions of the synopsis extraction methods and of all the data stream processing algorithms are GPU-accelerated, making full use of the GPU's powerful processing capability and pipelining characteristics and further raising execution efficiency.
3. Control of extra I/O overhead: the sliding window is opened up in device memory, which avoids frequent copying of data between main memory and device memory during synopsis extraction; in addition, synopsis extraction and data stream processing operate on device memory, greatly reducing the number of reads and writes to main memory.
4. Because the raw data volume is very large, sending the data stream directly to the GPU device for preprocessing would greatly increase I/O overhead. The present invention therefore performs data stream preprocessing in the preprocessing module of the CPU host, eliminating erroneous data such as noise, null values and outliers while also reducing I/O overhead.
5. In the prior art, while the sliding window is not yet full, new data simply fill it; once it is full, each slide causes incoming data to shift the other data forward in the window, overwriting the earlier data. The overwriting circular sliding window method adopted by the present invention does not need to move data once the window is full: new data directly overwrite (rewrite) the expiring data, and a layout transformation function maintains the logical layout of the data in the window, saving a great deal of time.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawing.
Figure 1 is the block diagram of the data stream processing system on a GPU.
Detailed description of the embodiments
The data stream processing system on a GPU of the present invention and its data stream processing method are described in detail below with reference to the accompanying drawing and specific embodiments.
Embodiment 1:
The data stream processing system on a GPU of the present invention: the data stream from the data source (data sources) passes through the data stream processing system to the client (client); the system comprises a CPU host (CPU-Host) and a GPU device (GPU-Device).
The CPU host comprises a CPU-side load engine module (CPU-Side Load Engine Area), a CPU-side buffer module (CPU-Side Buffer Area), a data stream preprocessing module (Data Stream Preprocessing Area), a data stream load shedding module (Data Stream Load Shedding Area) and a visualization module (Visual Area). The CPU-side load engine module is provided with a load/store unit (Load/Store Unit), and the CPU-side buffer module with main memory (Main Memory, MM). The load/store unit of the CPU-side load engine module, the data stream preprocessing module, the data stream load shedding module and the visualization module all interact with the main memory of the CPU-side buffer module, and the load/store unit of the CPU-side load engine module also interacts with the visualization module.
The GPU device comprises a GPU-side load engine module (GPU-Side Load Engine Area), a GPU-side buffer module (GPU-Side Buffer Area), a data stream synopsis extraction module (Data Stream Synopsis Extraction Area), a data stream processing model library (Data Stream Processing Model Library) and a data stream processing module (Data Stream Processing Area). The GPU-side load engine module is provided with a load/store unit (Load/Store Unit), and the GPU-side buffer module with device memory (Device Memory, DM). The data stream synopsis extraction module integrates synopsis extraction methods for the data stream processing module to call, and the data stream processing model library integrates data stream processing algorithms for the data stream processing module to call. The load/store unit of the GPU-side load engine module and the data stream processing module both interact with the device memory of the GPU-side buffer module; the data stream synopsis extraction module and the data stream processing model library are both connected to the load/store unit of the GPU-side load engine module; and a storage space opened up in the device memory of the GPU-side buffer module serves as the sliding window.
The load/store unit of the CPU-side load engine module interacts with the data source, with the load/store unit of the GPU-side load engine module and with the client through an interconnection network (Interconnection Network).
The CPU-side buffer module is also provided with a memory manager for managing main memory; the memory manager contains an input monitor, which monitors the unprocessed data streams temporarily stored in main memory. The CPU-side load engine module comprises a speed regulator (Speed Regulator), the load/store unit (Load/Store Unit) and an initialization integrator (Initialization Integrator). The speed regulator adjusts, according to the buffer status of main memory, the rate at which the data stream from the data source flows into the load/store unit of the CPU-side load engine module; a feedback mechanism (Feedback Mechanism) is provided in the speed regulator. The initialization integrator integrates the initialization operations of the CPU host and the GPU device.
Embodiment 2:
The data stream processing method on a GPU of the present invention: the data stream output by the data source is processed by the data stream processing system and the resulting data are transferred to the client. The processing flow of the data stream is as follows:
(1) Load the data stream: the data stream from the data source flows into the load/store unit of the CPU-side load engine module (data flow ① in Fig. 1), which stores it into the main memory of the CPU-side buffer module (data flow ② in Fig. 1);
(2) Preprocess the data stream: the data stream preprocessing module preprocesses the raw data stream in main memory (data flow ③ in Fig. 1) and stores the preprocessed stream back into main memory (data flow ④ in Fig. 1);
(3) Transmit the data stream: the preprocessed data stream passes from main memory to the load/store unit of the CPU-side load engine module (data flow ⑦ in Fig. 1), from there onto the interconnection network (data flow ⑧ in Fig. 1), arrives at the load/store unit of the GPU-side load engine module (data flow ⑨ in Fig. 1), and is then loaded by that unit into the sliding window in device memory (data flow ⑩ in Fig. 1);
(4) Extract the data stream synopsis: the data stream processing module calls a synopsis extraction method from the data stream synopsis extraction module, performs synopsis extraction on the data stream in the sliding window, and stores the resulting synopsis data structure in device memory (data flow directions shown in Fig. 1);
(5) Process the data stream: the data stream processing module calls a data stream processing algorithm from the data stream processing model library, processes the synopsis data, and stores the results in device memory (data flow directions shown in Fig. 1);
(6) Transmit the results: the results pass from the device memory of the GPU-side buffer module to the load/store unit of the GPU-side load engine module, are sent onto the interconnection network, and arrive at the load/store unit of the CPU-side load engine module, which either loads them into main memory or sends them to the visualization module (data flow directions shown in Fig. 1);
(7) Visualize the results: the visualization module standardizes the results and sends them to the load/store unit of the CPU-side load engine module, which presents them to the client (data flow directions shown in Fig. 1).
Embodiment 3:
The data stream processing method on a GPU of the present invention: the data stream output by the data source is processed by the data stream processing system and the resulting data are transferred to the client. Steps (1) through (7) of the processing flow are the same as in Embodiment 2. Additionally:
In step (2), the data stream preprocessing module applies the preprocessing methods (data cleaning, data integration, data transformation and data reduction, used singly or in combination) to the raw data stream in main memory, as described above.
After step (2), when the data stream is overloaded, it is subjected to load shedding by the data stream load shedding module. The concrete steps are:
I. The input monitor of the memory manager of the CPU-side buffer module monitors the unprocessed data streams temporarily stored in main memory and determines whether, within one unit of time, the volume of newly arrived data exceeds the processing capacity of the data stream processing module of the GPU device;
II. If the data stream processing module of the GPU device can process all of the data, the data stream is transmitted onward; if the newly arrived volume exceeds its processing capacity, i.e. the data stream is overloaded, the data stream is handed to the data stream load shedding module (data flow ⑤ in Fig. 1);
III. The data stream load shedding module performs load shedding on the data stream;
IV. The data stream load shedding module moves the remaining data back into main memory (data flow ⑥ in Fig. 1) for the next step, transmitting the data stream.
The load shedding strategies (data-based dropping, attribute-based trimming and priority-based dropping), the organization of device memory and the overwriting circular sliding window, the synopsis extraction of step (4), the data stream processing model library and the data stream processing module are all as described above.
In the data stream processing system, the data stream is processed as follows; the processing steps on the CPU host side are:
1. Start CUDA (the abbreviation used for the data stream processing system environment);
2. Allocate MM (main memory) for the input data;
3. The CPU-side load engine module obtains the input data from the data source and performs initialization;
4. The data stream preprocessing module performs preprocessing such as cleaning and integration on the input data stream;
5. When the system is overloaded, the data stream load shedding module performs load shedding on the data stream;
6. Allocate the sliding window in device memory on the GPU device for holding the input data;
7. The initialization integrator initializes the synopsis extraction method and the data stream processing algorithm;
8. Copy the data in main memory into the sliding window in device memory;
9. Allocate device memory on the GPU device for the synopsis data extracted by the data stream processing module;
10. Call the parallel computation functions (kernels) of the data stream processing algorithm on the GPU device side to perform parallel computation, obtain the synopsis data and write them to the corresponding region of device memory;
11. Allocate device memory on the GPU device for the output data to be sent back;
12. Read the results in device memory back into main memory;
13. Use the visualization module to perform subsequent processing on the data, such as standardization and visualization;
14. Free the main memory and device memory;
15. Exit CUDA.
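The following compact CUDA sketch shows one plausible way steps 2-14 fit together in code. It is an illustration only: preprocess(), shedLoad(), synopsisKernel() and processStream() are stand-ins invented here for the patent's modules, while the cudaMalloc/cudaMemcpy/cudaFree calls are the standard CUDA API the steps rely on.

#include <cuda_runtime.h>
#include <stdlib.h>

// stand-in kernel for step 10 (a real system would call a synopsis/processing kernel)
__global__ void synopsisKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

static void preprocess(float* data, int n) { (void)data; (void)n; } // step 4 stub
static int  shedLoad(float* data, int n)   { (void)data; return n; } // step 5 stub

void processStream(float* h_in, int n, float* h_out) {
    preprocess(h_in, n);                                   // 4. preprocess in MM
    n = shedLoad(h_in, n);                                 // 5. shed load if overloaded

    float *d_window = NULL, *d_synopsis = NULL;
    cudaMalloc((void**)&d_window,   n * sizeof(float));    // 6. sliding window in DM
    cudaMalloc((void**)&d_synopsis, n * sizeof(float));    // 9./11. DM for results
    cudaMemcpy(d_window, h_in, n * sizeof(float),
               cudaMemcpyHostToDevice);                    // 8. copy MM -> sliding window

    synopsisKernel<<<(n + 255) / 256, 256>>>(d_window, d_synopsis, n); // 10. kernel

    cudaMemcpy(h_out, d_synopsis, n * sizeof(float),
               cudaMemcpyDeviceToHost);                    // 12. read results back to MM
    cudaFree(d_window);                                    // 14. free DM (MM freed by caller)
    cudaFree(d_synopsis);
}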
The processing steps on the GPU device side are as follows:
1. Allocate shared memory (Shared Memory);
2. Read the data from the global memory of device memory into shared memory;
3. Compute, writing the results to shared memory;
4. Write the results in shared memory back to global memory.
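A minimal kernel illustrating these four device-side steps (not the patent's kernel; the names are illustrative) could look like this:

__global__ void deviceSideSteps(const float* g_in, float* g_out, int n) {
    extern __shared__ float s_data[];          // 1. shared memory, sized at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) s_data[threadIdx.x] = g_in[i];  // 2. global memory -> shared memory
    __syncthreads();
    if (i < n) s_data[threadIdx.x] *= 2.0f;    // 3. compute, result in shared memory
    __syncthreads();
    if (i < n) g_out[i] = s_data[threadIdx.x]; // 4. shared memory -> global memory
}
// launch sketch: deviceSideSteps<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);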
The code of each module in the data stream processing system is introduced below.
One. The function code of the initialization integrator:
First, start and exit the data stream processing system (CUDA for short) environment:
CUT_DEVICE_INIT(argc, argv); // start CUDA
CUT_EXIT(argc, argv); // exit CUDA
Then, allocate main memory and device memory.
Allocate main memory on the CPU host; h_ denotes the CPU host side, i denotes input, o denotes output, and mem_size is the amount of memory allocated for the data:
float* h_idata = (float*) malloc(mem_size);
float* h_odata = (float*) malloc(mem_size);
Allocate device memory on the GPU device; d_ denotes the GPU device side:
float* d_idata; CUDA_SAFE_CALL(cudaMalloc((void**) &d_idata, mem_size));
float* d_odata; CUDA_SAFE_CALL(cudaMalloc((void**) &d_odata, mem_size));
Then, divide the task, i.e. design the dimensions of the two-dimensional grid and blocks:
dim3 grid(gridDim.x, gridDim.y, 1); // the third dimension is always 1
dim3 block(blockDim.x, blockDim.y, 1); // blocks are two-dimensional, so the third dimension is set to 1
TestKernel<<<grid, block>>>(d_idata, d_odata); // call the kernel function to perform parallel computation
Then, copy data between main memory and device memory.
Read the values in main memory into device memory:
CUDA_SAFE_CALL(cudaMemcpy(d_idata, h_idata, mem_size, cudaMemcpyHostToDevice));
Write the results from device memory back into main memory:
CUDA_SAFE_CALL(cudaMemcpy(h_odata, d_odata, mem_size, cudaMemcpyDeviceToHost));
The code below frees the main memory and device memory storage space:
free(h_idata); // free main memory
free(h_odata);
CUDA_SAFE_CALL(cudaFree(d_idata)); // free device memory
CUDA_SAFE_CALL(cudaFree(d_odata));
Besides the functions above, the initialization integrator also has functions for selecting the data stream processing algorithm (a specific algorithm within a class of algorithms), selecting the synopsis extraction method, and initializing the various data stream processing algorithms and synopsis extraction methods (for example, the number of decomposition levels used when a Haar wavelet is fully decomposed, or the initial cluster center points of k-means).
Two. The formal representation of the sliding window is: CircularSW = <w, num, front, fun>;
where w is the width of the sliding window (SW); num is the amount of stream data currently in the window; front is the index of the newest data item in the window, the position at which newly arrived data are placed; and fun is the layout transformation function of the window, which determines how the layout of the existing data changes when new data arrive. Fun is defined as shown in Table 1:
Table 1: layout transformation function of the overwriting circular sliding window
The novelty of the overwriting circular sliding window is that it directly computes the position of the expired data (the data about to be moved out) and places the newly arrived data at that position, directly overwriting the original data in the window; in addition the front value is updated so that it points to the end of the data in the window, i.e. front always points to the newest data. Compared with earlier sliding windows, the overwriting circular sliding window improves the efficiency of the data stream processing system: it reuses the same storage space, avoids moving data within the window, and permits finer-grained concurrency control.
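A minimal C sketch of this overwriting behavior (assuming a window width of 4; the CircularSW field names follow the formalization above, while insert() stands in for the layout transformation function fun):

#include <stdio.h>

#define W 4                         /* window width w */

typedef struct {
    float data[W];
    int   num;                      /* items currently in the window */
    int   front;                    /* index of the newest item */
} CircularSW;

void insert(CircularSW* sw, float x) {
    sw->front = (sw->front + 1) % W;    /* position of the expiring item */
    sw->data[sw->front] = x;            /* overwrite in place: no data movement */
    if (sw->num < W) sw->num++;
}

int main(void) {
    CircularSW sw = { {0}, 0, W - 1 };
    for (int t = 1; t <= 6; ++t) insert(&sw, (float)t); /* 5 and 6 overwrite 1 and 2 */
    for (int i = 0; i < sw.num; ++i) printf("%g ", sw.data[i]); /* prints: 5 6 3 4 */
    return 0;
}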
Three. To improve processing efficiency, the synopsis extraction methods all need GPU acceleration, for example the wavelet decomposition of the wavelet method and the hash mapping of the sketch method. The Haar wavelet (Haar Wavelet) of the wavelet method is described in detail below. The wavelet method is an important data compression method: by applying a wavelet transform to the raw data set and preserving the important part of the wavelet coefficients, the raw data set can be approximately restored. The Haar wavelet is the simplest of the wavelets, and a two- or three-dimensional (2D or 3D) wavelet transform can be decomposed into two or three one-dimensional (1D) wavelet transforms. A one-dimensional Haar wavelet decomposition transforms a vector into an equal number of wavelet coefficients. For example, let S = (6, 2, 5, 3, 4, 4, 4, 8); Table 2 shows the Haar wavelet transform of this sequence.

Resolution l   Averages                   Detail Coefficients
l=3            (6, 2, 5, 3, 4, 4, 4, 8)   -
l=2            (4, 4, 4, 6)               (2, 1, 0, -2)
l=1            (4, 5)                     (0, -1)
l=0            (4.5)                      (-0.5)
Table 2: Haar wavelet transform of the sequence S

The computation is as follows. The Resolution column gives the decomposition level l; at l = 3 the Averages column holds the original sequence. The sequence is divided into pairs and each pair is averaged, yielding the Averages of the next level down, i.e. a_i = (s_{2i} + s_{2i+1}) / 2.
Clearly, in computing the Averages we lose some information from the original sequence: the mean values alone are only approximations and cannot reconstruct it. To reconstruct the raw data, the detail coefficients must also be saved. The Detail Coefficients column stores, for each pair, the difference between the pair's mean and the second element of the pair, i.e. d_i = a_i - s_{2i+1} = (s_{2i} - s_{2i+1}) / 2.
This continues level by level down to l = 0. The wavelet coefficients of the sequence consist of the level-0 mean together with all the detail coefficients; for S above they are (4.5, -0.5, 0, -1, 2, 1, 0, -2).
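To make the computation concrete, here is a small plain-C reference (not from the patent) of this decomposition; running it reproduces the coefficients in Table 2. Note it uses plain pair means, without the 1/sqrt(2) normalization that the dwtHaar1D kernel below applies.

#include <stdio.h>
#include <stdlib.h>

void haar1d(float* s, int n) {   /* in place: averages first, then details */
    float* tmp = (float*)malloc(n * sizeof(float));
    for (int len = n; len > 1; len /= 2) {
        for (int i = 0; i < len / 2; ++i) {
            tmp[i]         = (s[2*i] + s[2*i+1]) / 2.0f;  /* average a_i */
            tmp[len/2 + i] = (s[2*i] - s[2*i+1]) / 2.0f;  /* detail  d_i */
        }
        for (int i = 0; i < len; ++i) s[i] = tmp[i];
    }
    free(tmp);
}

int main(void) {
    float s[8] = { 6, 2, 5, 3, 4, 4, 4, 8 };
    haar1d(s, 8);
    for (int i = 0; i < 8; ++i) printf("%g ", s[i]); /* 4.5 -0.5 0 -1 2 1 0 -2 */
    return 0;
}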
For the one-dimensional Haar wavelet transform, the core code on the GPU device is as follows:
/*--------------------------------------
 * one-dimensional Haar wavelet transform
 *--------------------------------------*/
////////////////////////////////////////////////////
//! @param id input data
//! @param od output data
//! @param approx_final stores the final approximation coefficients, i.e. the mean values
//! @param dlevels number of decomposition levels of the transform
//! @param slength_step_half offset (in global memory) of the detail coefficients of the current decomposition level: half the signal length
//! @param bdim dimension of the thread block
///////////////////////////////////////////////////
__global__ void
dwtHaar1D( float* id, float* od, float* approx_final,
           const unsigned int dlevels,
           const unsigned int slength_step_half,
           const int bdim )
{
    // shared memory is allocated dynamically with the extern declaration;
    // its size is set by the Ns launch parameter on the host side
    extern __shared__ float shared[];
    // thread execution environment, one-dimensional parameters
    const int gdim = gridDim.x;
    // bdim is passed in as a kernel parameter (it equals blockDim.x)
    const int bid = blockIdx.x;
    const int tid = threadIdx.x;
    // global thread id
    const int tid_global = (bid * bdim) + tid;
    unsigned int idata = (bid * (2 * bdim)) + tid;
    // read data in from global memory
    shared[tid] = id[idata];
    shared[tid + bdim] = id[idata + bdim];
    __syncthreads();
    // read data from shared memory
    float data0 = shared[2*tid];
    float data1 = shared[(2*tid) + 1];
    __syncthreads();
    // detail coefficient: not referenced again, so store it directly in global memory
    od[tid_global + slength_step_half] = (data0 - data1) * INV_SQRT_2;
    // offset within the thread block (to avoid bank conflicts)
    unsigned int atid = tid + (tid >> LOG_NUM_BANKS);
    // average: kept in shared memory for further decomposition
    shared[atid] = (data0 + data1) * INV_SQRT_2;
    __syncthreads();
    if( dlevels > 1 )
    {   unsigned int offset_neighbor = 1;
        unsigned int num_threads = bdim >> 1;
        unsigned int idata0 = tid * 2;
        for( unsigned int i = 1; i < dlevels; ++i)
        {
            if( tid < num_threads)
            {
                // neighbor offset: with each decomposed level the step doubles
                unsigned int idata1 = idata0 + offset_neighbor;
                // write position in global memory
                unsigned int g_wpos = (num_threads * gdim) + (bid * num_threads) + tid;
                //----- the wavelet decomposition step is as follows -----
                // offsets shifted to avoid bank conflicts
                unsigned int c_idata0 = idata0 + (idata0 >> LOG_NUM_BANKS);
                unsigned int c_idata1 = idata1 + (idata1 >> LOG_NUM_BANKS);
                // detail coefficient: not modified further, stored directly in global memory
                od[g_wpos] = (shared[c_idata0] - shared[c_idata1]) * INV_SQRT_2;
                // average: note that the representation in shared memory becomes quite sparse
                shared[c_idata0] = (shared[c_idata0] + shared[c_idata1]) * INV_SQRT_2;
                // update the storage offsets
                num_threads = num_threads >> 1;   // divide by 2
                offset_neighbor <<= 1;            // multiply by 2
                idata0 = idata0 << 1;             // multiply by 2
            }
            __syncthreads();   // synchronize after each decomposition step
        }
    }
    if( 0 == tid )
    {   // write the top-level element for the next decomposition step,
        // which is executed on the host side
        approx_final[bid] = shared[0];
    }
}
Four. The k-means algorithm among the clustering algorithms of the data stream processing algorithms:
The k-means algorithm is the most representative clustering algorithm; its main purpose is to partition sample data of the same type into groups by the shortest-distance rule, finally obtaining the equivalence classes. Distance is used here to express the degree of similarity between data. Because of the peculiarities of data streams, traditional clustering algorithms are hard to apply to them directly, so synopsis extraction is first performed with the Haar wavelet in the data stream processing model library; based on this wavelet synopsis, the approximate distance between a data stream and a cluster center can be computed quickly, which makes k-means much easier to implement.
In clustering, the Euclidean distance has a very intuitive meaning. Here, the Euclidean distance between two data items (data points) X = (x_1, ..., x_d) and Y = (y_1, ..., y_d) is computed as dist(X, Y) = sqrt((x_1 - y_1)^2 + ... + (x_d - y_d)^2).
The distance between a data item and a data set is defined as the minimum of the distances between that item and all the data items in the set, i.e. for a set S, dist(X, S) = min over Y in S of dist(X, Y). A small reference implementation of both definitions follows.
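The following plain-C reference (illustrative, not from the patent) states the two definitions in code; the GPU kernel below computes the same squared sums in parallel across threads.

#include <float.h>
#include <math.h>

/* Euclidean distance between two d-dimensional data items */
float dist_item(const float* x, const float* y, int d) {
    float sum = 0.0f;
    for (int i = 0; i < d; ++i)
        sum += (x[i] - y[i]) * (x[i] - y[i]);
    return sqrtf(sum);
}

/* Distance between a data item and a data set: the minimum over the set.
 * 'set' holds n items of dimension d, stored row by row. */
float dist_set(const float* x, const float* set, int n, int d) {
    float best = FLT_MAX;
    for (int j = 0; j < n; ++j) {
        float dj = dist_item(x, set + j * d, d);
        if (dj < best) best = dj;
    }
    return best;
}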
The problem corresponding to this is a compute-intensive task (a large number of "distances" must be computed), so there is little room for optimization at the algorithmic level. Working on enhancing parallelism, however, can undoubtedly bring a significant speedup, because there are no dependences whatsoever between the computations of different distances. One can say this is a task setting in which the GPU has an absolute advantage.
In high-dimensional data streams the dimensionality of each data item is usually very high, so the implementation stores the data in a user-defined matrix type. During clustering, the GPU must frequently perform very time-consuming operations such as matrix transposition and computing sums of squared differences between data items. The core device-side code of the k-means algorithm is given below; each thread is responsible for one distance.
/*--------------------------------------
 * matrix definition
 *--------------------------------------*/
/*----- structure on the CPU -----*/
typedef struct {
    int width;        // number of columns
    int height;       // number of rows
    //int stride;     // access stride; not used here because stride = width
    float* elements;  // pointer to the first matrix element
} CPU_Matrix;
/*----- structure on the GPU -----*/
typedef struct {
    int width;        // number of columns
    int height;       // number of rows
    float* elements;  // pointer to the first matrix element
} GPU_Matrix;
/*--------------------------------------
 * matrix transpose
 *--------------------------------------*/
__global__ void transpose(float *odata, float *idata, int width, int height) {
    // statically allocated shared memory; the +1 column of padding avoids bank conflicts
    __shared__ float block[BLOCK_SIZE][BLOCK_SIZE+1];
    // read one tile of the matrix into shared memory
    unsigned int xIndex = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    if ((xIndex < width) && (yIndex < height)) {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }
    __syncthreads();
    // write the transposed tile back to global memory
    xIndex = blockIdx.y * BLOCK_SIZE + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_SIZE + threadIdx.y;
    if ((xIndex < height) && (yIndex < width)) {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}
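A hedged sketch of how the transpose kernel might be launched from the host follows; the rounding-up of the grid dimensions and the buffer names are our own assumptions, not taken from the patent.
// Hypothetical launch of the transpose kernel for a width x height matrix.
// d_idata and d_odata are assumed to be device buffers already allocated and filled.
dim3 block_t(BLOCK_SIZE, BLOCK_SIZE, 1);
dim3 grid_t((width  + BLOCK_SIZE - 1) / BLOCK_SIZE,
            (height + BLOCK_SIZE - 1) / BLOCK_SIZE, 1);
transpose<<<grid_t, block_t>>>(d_odata, d_idata, width, height);
cudaDeviceSynchronize();  // wait for the transpose to finish before reusing the buffers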
/*--------------------------------------
 * distance calculation: computes the sum of squared differences
 *--------------------------------------*/
__global__ void
DistanceOfSquareKernel(GPU_Matrix A, GPU_Matrix B, GPU_Matrix C, size_t pitch) {
    int blockRow = blockIdx.y; int blockCol = blockIdx.x;
    GPU_Matrix subC = GetSubMatrix(C, blockRow, blockCol);
    float Cvalue = 0; int row = threadIdx.y; int col = threadIdx.x;
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        GPU_Matrix subA = GetSubMatrix(A, blockRow, m);
        GPU_Matrix subB = GetSubMatrix(B, m, blockCol);
        // declare shared-memory arrays for holding the sub-blocks of A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // copy data from global memory to shared memory; each thread copies one element
        As[row][col] = GetGPUElement(&subA, row, col, pitch);
        Bs[row][col] = GetGPUElement(&subB, row, col, pitch);
        __syncthreads();
        // parallel computation; each thread accumulates one element value of C
        for (int n = 0; n < BLOCK_SIZE; ++n)
            Cvalue += powf((As[row][n] - Bs[n][col]), 2.0f);
        __syncthreads();
    }
    SetGPUElement(&subC, row, col, Cvalue, pitch);
}
Five, the thread grids (Grid), thread blocks (Block) and threads (Thread) of the computing array are loaded onto the streaming processor array (SPA), the streaming multiprocessors (SM) and the streaming processors (SP) for execution, respectively. Thread grids exchange data through the video memory; the thread blocks execute in parallel and cannot communicate with one another directly, so they can share data only through the video memory; threads within the same thread block can communicate and synchronize through shared memory (Shared Memory).
Owing to the high-dimensional character of the data stream, we design both the thread grid (Grid) and the thread block (Block) to be two-dimensional when dividing the tasks. Below is the CPU host-side code for setting the execution parameters, i.e. the shape of the thread grid and the shape of the thread block. Here gridDim, blockDim, blockIdx and threadIdx are built-in variables of CUDA C.
dim3 grid(gridDim.x, gridDim.y, 1);    // the third dimension is always 1
dim3 block(blockDim.x, blockDim.y, 1); // the third dimension need not be 1, but since the block is two-dimensional it is set to 1
Here blockIdx.x ∈ [0, gridDim.x-1], blockIdx.y ∈ [0, gridDim.y-1], threadIdx.x ∈ [0, blockDim.x-1], threadIdx.y ∈ [0, blockDim.y-1].
As the figure shows, two levels of parallelism exist: the thread blocks (Block) within the thread grid (Grid) run in parallel, and the threads (Thread) within each thread block (Block) run in parallel. (N+1) denotes the total number of thread blocks and (M+1) the total number of threads per block:
(N+1) = gridDim.x * gridDim.y ≤ 65535 * 65535, where gridDim.x ≤ 65535 and gridDim.y ≤ 65535;
(M+1) = blockDim.x * blockDim.y ≤ 1024, where blockDim.x ≤ 512 and blockDim.y ≤ 512.
Because the thread grid (Grid) and the thread block (Block) are both two-dimensional, two-dimensional indices must also be used inside the kernel function. Below is the device-side code for computing the thread index, i.e. for determining the position of a thread (Thread) within the whole thread grid (Grid).
unsigned int bid_in_grid   = blockIdx.x + blockIdx.y * gridDim.x;   // block index within the grid
unsigned int tid_in_block  = threadIdx.x + threadIdx.y * blockDim.x; // thread index within its block
unsigned int tid_in_grid_x = threadIdx.x + blockIdx.x * blockDim.x;  // x coordinate within the grid
unsigned int tid_in_grid_y = threadIdx.y + blockIdx.y * blockDim.y;  // y coordinate within the grid
unsigned int tid_in_grid   = tid_in_grid_x + tid_in_grid_y * blockDim.x * gridDim.x; // linear offset within the grid
In addition, in order to use the execution units effectively, when designing a thread block (Block) the number of threads in each block should as far as possible be an integral multiple of 32, and preferably stay between 64 and 256. To make full use of the GPU's resources and improve its execution efficiency, particular attention must be paid during implementation to the dimension design of the thread grid (Grid) and the thread block (Block). An illustrative sizing helper is sketched below.
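As an illustration of these sizing rules, the following host-side sketch (our own example, not from the patent) chooses a 16 x 16 block, i.e. 256 threads, which is a multiple of 32 within the recommended 64-256 range, and derives a two-dimensional grid covering an H x W data matrix while respecting the limits above:
// Hypothetical execution-parameter setup for an H x W high-dimensional data block.
#include <assert.h>
void make_launch_config(int W, int H, dim3 *grid, dim3 *block)
{
    const int BX = 16, BY = 16;          // 16*16 = 256 threads: a multiple of 32, within 64~256
    int gx = (W + BX - 1) / BX;          // round up so every element is covered
    int gy = (H + BY - 1) / BY;
    assert(gx <= 65535 && gy <= 65535);  // grid-dimension limits noted above
    *block = dim3(BX, BY, 1);
    *grid  = dim3(gx, gy, 1);
}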
The above description gives only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims (10)

1. A data stream processing system on a GPU, characterized in that the data stream of a data source is sent to a client through the data stream processing system, and the data stream processing system comprises a CPU host and a GPU device;
The CPU host comprises a CPU-side load engine module, a CPU-side buffer module, a data stream pre-processing module, a data stream load shedding module and a visualization module; the CPU-side load engine module is provided with a load or storage unit, and the CPU-side buffer module is provided with a memory; the load or storage unit of the CPU-side load engine module, the data stream pre-processing module, the data stream load shedding module and the visualization module all interact with the memory of the CPU-side buffer module; the load or storage unit of the CPU-side load engine module also interacts with the visualization module;
The GPU device comprises a GPU-side load engine module, a GPU-side buffer module, a data stream synopsis extraction module, a data stream processing model library and a data stream processing module; the GPU-side load engine module is provided with a load or storage unit, and the GPU-side buffer module is provided with a video memory; the data stream synopsis extraction module integrates synopsis extraction methods for the data stream processing module to call, and the data stream processing model library integrates data stream processing algorithms for the data stream processing module to call; the load or storage unit of the GPU-side load engine module and the data stream processing module both interact with the video memory of the GPU-side buffer module; the data stream synopsis extraction module and the data stream processing model library are both connected with the load or storage unit of the GPU-side load engine module; a storage space opened up in the video memory of the GPU-side buffer module serves as a sliding window;
The load or storage unit of the CPU-side load engine module interacts through the Internet with the data source, with the load or storage unit of the GPU-side load engine module, and with the client.
2. The data stream processing system on a GPU according to claim 1, characterized in that the CPU-side buffer module is further provided with a memory manager for managing the memory, an input monitor is provided in the memory manager, and the input monitor is used for monitoring the unprocessed data stream temporarily stored in the memory; the CPU-side load engine module comprises a speed regulator, the load or storage unit and an initialization integrator; the speed regulator adjusts, according to the buffer status of the memory, the rate at which the data stream of the data source flows into the load or storage unit of the CPU-side load engine module, and a feedback mechanism is provided in the speed regulator; the initialization integrator is used for integrating the initialization operations of the CPU host and the GPU device.
3. A data stream processing method on a GPU, characterized in that the data stream output by a data source is processed by the data stream processing system of claim 1 or 2 and the data results are then transferred to a client; the processing flow of the data stream is as follows:
(1) Loading the data stream: the data stream of the data source flows into the load or storage unit of the CPU-side load engine module, which stores the data stream into the memory of the CPU-side buffer module;
(2) Data stream pre-processing: the data stream pre-processing module pre-processes the raw data stream in the memory and stores the pre-processed data stream back into the memory;
(3) Transmitting the data stream: the pre-processed data stream goes from the memory to the load or storage unit of the CPU-side load engine module, is sent by it over the Internet to the load or storage unit of the GPU-side load engine module, and is then loaded into the sliding window in the video memory;
(4) Data stream synopsis extraction: the data stream processing module calls a synopsis extraction method of the data stream synopsis extraction module to extract a synopsis from the data stream in the sliding window, and the resulting synopsis data structure is stored in the video memory;
(5) Data stream processing: the data stream processing module calls a data stream processing algorithm of the data stream processing model library to process the synopsis data, and the processed data results are stored in the video memory;
(6) Transmitting the data results: the data results go from the video memory of the GPU-side buffer module to the load or storage unit of the GPU-side load engine module, are sent by it over the Internet to the load or storage unit of the CPU-side load engine module, which then either loads the data results into the memory or sends them to the visualization module;
(7) Result visualization: the visualization module standardizes the data results and sends them to the load or storage unit of the CPU-side load engine module, which presents the data results to the client.
4. The data stream processing method on a GPU according to claim 3, characterized in that in step (2) the data stream pre-processing module pre-processes the raw data stream in the memory using pre-processing methods that include the data cleaning method, the data integration method, the data transformation method and the data reduction method; one of these methods may be used alone, or several may be combined.
5. The data stream processing method on a GPU according to claim 3, characterized in that after step (2), when the data stream is overloaded, the data stream load shedding module performs load shedding on the data stream; the concrete steps are:
I. The input monitor of the memory manager of the CPU-side buffer module monitors the unprocessed data stream temporarily stored in the memory, and determines whether the volume of newly arrived data stream within one time unit has exceeded the processing capacity of the data stream processing module of the GPU device;
II. If the data stream processing module of the GPU device can process all the data stream, the data stream is transmitted; if the volume of newly arrived data stream has exceeded the processing capacity of the data stream processing module of the GPU device, i.e. the data stream is overloaded, the data stream is forwarded to the data stream load shedding module;
III. The data stream load shedding module performs load shedding on the data stream;
IV. The data stream load shedding module forwards the remaining data stream into the memory for the next step of data stream transmission.
6. The data stream processing method on a GPU according to claim 5, characterized in that the data stream load shedding module sheds load from the data stream using one or a combination of the following strategies:
I. Data-based discarding: among the received but unprocessed time-series data streams, find the data of greatest length and discard them;
II. Attribute-based trimming: each data item in the data stream has several attributes; the attributes occurring with the lowest frequency in the data are removed, thereby trimming the attributes of the data;
III. Priority-based discarding: each newly arrived data stream is assigned a priority; among the received but unprocessed data streams, select those data with the lowest priority and discard them.
7. The data stream processing method on a GPU according to claim 3, characterized in that the video memory uses global memory to store the various data, and the sliding window opened up in the video memory is used for saving the data stream arriving at the GPU device from the CPU host; the sliding window is a tuple-count-based sliding window, i.e. a window of fixed size, used for saving the most recently arrived data stream;
The data stream in the sliding window is handled with a circularly-overwritable sliding window method: new data directly overwrites the data that is due to expire, and a layout transformation function is provided to maintain the logical layout state of the data in the sliding window during updates.
8. The data stream processing method on a GPU according to claim 3, characterized in that in the data stream synopsis extraction of step (4), a synopsis extraction method turns the data stream into a synopsis data structure, and the synopsis extraction methods include the sampling method, the wavelet method, the sketch method and the histogram method.
9. The data stream processing method on a GPU according to claim 3, characterized in that the data stream processing model library integrates the various data stream processing algorithms used in data stream processing, including query processing algorithms, clustering algorithms, classification algorithms, frequent itemset mining algorithms and correlation analysis algorithms between multiple data streams.
10. The data stream processing method on a GPU according to claim 3, 7, 8 or 9, characterized in that the task of the data stream processing module is to call the synopsis extraction methods of the data stream synopsis extraction module to extract synopses from the data stream, and to call the data stream processing algorithms of the data stream processing model library to perform parallel computation on the synopsis data; the data stream processing module comprises a data stream input assembler, a global thread block scheduler and a computing array, and the computing array is provided with shared memory and load or storage units; the data stream input assembler is responsible for reading the data in the video memory into the shared memory of the data stream processing module; the global thread block scheduler is responsible for scheduling, distributing and managing the thread blocks, threads and instructions in the shared memory; the computing array is used for the computation of the threads; during computation, the load or storage units in the computing array load data from the video memory into the shared memory.
GR01 Patent grant