CN109213793A

CN109213793A - Stream data processing method and system

Info

Publication number: CN109213793A
Application number: CN201810889376.0A
Authority: CN
Inventors: 左梅兰; 郭子森
Original assignee: Jingxian County Mai Lan Network Technology Service Co Ltd
Current assignee: Jingxian County Mai Lan Network Technology Service Co Ltd
Priority date: 2018-08-07
Filing date: 2018-08-07
Publication date: 2019-01-15

Abstract

The invention discloses a stream data processing method and system. Summary feature data is extracted from the large volume of stream data in e-commerce, multiple processing threads are established, the summary feature data is merged into multiple data sets, and the data is preprocessed to reduce its dimensionality. The data similarity value between the reference data and the other data is then calculated to determine whether each piece of data in a data set is sufficiently associated with that data set, and finally whether the data is retained. As a result, when the stream data volume is large and access is highly concurrent, the system can respond to requests in a timely manner, erroneous data is filtered out, query time is reduced, and transmission performance is improved.

Description

A stream data processing method and system

Technical field

The present invention relates to the field of computer data processing technology, and in particular to a stream data processing method and system.

Background art

E-commerce is a booming business model that brings new opportunities for the development of small and medium-sized enterprises (SMEs). In the cooperative development of SMEs and e-commerce, informatization is an essential intermediate link. However, informatization in SMEs is currently progressing slowly, there is little research on building warehouse logistics information systems for SMEs, and the systems implemented so far provide basic functions but lack good detailed design and user experience. For an e-commerce company, a backward internal level of informatization is likely to become a major factor restricting its service efficiency. The design of an e-commerce application must be centered on data storage and management and on database technology, realizing a high-performance, data-centric network system in terms of both logical concepts and software and hardware technology, and providing users with an effective data storage management system.

However, the prior art generally uses the concurrency control mechanism of a client/server architecture: the client receives requests, the server responds to the data sent by the client, and the data is processed in parallel. When the stream data volume is large and access is highly concurrent, the system cannot respond to requests in a timely manner, client management is cumbersome, query time increases, and transmission performance is hard to guarantee. Some of the data is neither screened and filtered nor optimized, so the data stored in database tables often suffers from missing data, redundant information, data errors, and other problems. A method for processing stream data is therefore urgently needed.

Summary of the invention

The embodiments of the invention provide a stream data processing method and system that optimize the processing of stream data, so as to solve problems in existing stream data processing such as data errors, the system failing to respond to requests in time, increased query time, and transmission performance that is hard to guarantee.

To solve the above problems, the invention discloses the following technical solutions:

In a first aspect, a stream data processing method is provided, comprising:

establishing a window of length S, and extracting summary feature data from the current window of a plurality of data streams using the CPU unit of a processor;

establishing multiple thread parallel processing units using the GPU unit of the processor, where each thread parallel processing unit corresponds to one of the plurality of data streams;

merging the summary feature data to form multiple summary feature data sets, where the first record in each summary feature data set serves as the reference data of that summary feature data set;

preprocessing the data in the multiple summary feature data sets to reduce the dimensionality of the data by deleting redundant or weakly relevant attributes;

traversing the data of each summary feature data set record by record and performing a string matching operation, comparing the first record of the summary feature data set with each subsequent record;

calculating the data similarity value between the reference data and each of the other data in the summary feature data set, and comparing the obtained data similarity value Q with a preset reference data similarity value to obtain a comparison result;

determining, according to the comparison result, whether the other data is retained, the retained data being the archive data of the current window.

In a second aspect, a stream data processing system is provided, comprising:

an abstraction module, which establishes a window of length S and extracts summary feature data from the current window of a plurality of data streams;

a multiple threads module, which establishes multiple thread parallel processing units, where each thread parallel processing unit corresponds to one of the plurality of data streams;

a merging module, which merges the summary feature data to form multiple summary feature data sets, where the first record in each summary feature data set serves as the reference data of that summary feature data set;

a preprocessing module, which preprocesses the data in the multiple summary feature data sets to reduce the dimensionality of the data by deleting redundant or weakly relevant attributes;

a comparison module, which traverses the data of each summary feature data set record by record and performs a string matching operation, comparing the first record of the summary feature data set with each subsequent record;

a computing module, which calculates the data similarity value between the reference data and each of the other data in the summary feature data set, and compares the obtained data similarity value Q with a preset reference data similarity value to obtain a comparison result;

a result confirmation module, which determines, according to the comparison result, whether the other data is retained, the retained data being the archive data of the current window.

The invention discloses an e-commerce data processing method and system. Summary feature data is extracted from the large volume of stream data in e-commerce, multiple processing threads are established, the summary feature data is merged into multiple data sets, and the data is preprocessed to reduce its dimensionality. The data similarity value between the reference data and the other data is then calculated to determine whether each piece of data in a data set is sufficiently associated with that data set, and finally whether the data is retained. With this method, when the stream data volume is large and access is highly concurrent, the system can respond to requests in a timely manner, erroneous data is filtered out, query time is reduced, and transmission performance is improved.

Brief description of the drawings

To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow diagram of the stream data processing method in one embodiment of the invention.

Fig. 2 is a structural schematic diagram of the stream data processing system in another embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

Referring to Fig. 1, one embodiment of the invention provides a flow chart of a stream data processing method. First, a window of length S is established, and summary feature data is extracted from the current window of a plurality of data streams using the CPU unit of a processor. Summary feature data is the data within a data stream that best reflects the attributes of that stream; it can be obtained by word-frequency analysis or by other algorithms.
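
The windowing and word-frequency extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `sliding_windows` and `summary_features`, the whitespace tokenization, and the `top_k` parameter are all assumptions introduced here.

```python
from collections import Counter, deque


def sliding_windows(stream, size):
    """Yield each full window of `size` consecutive records from an iterable stream."""
    window = deque(maxlen=size)  # the oldest record is dropped automatically
    for record in stream:
        window.append(record)
        if len(window) == size:
            yield list(window)


def summary_features(window, top_k=2):
    """Return the top_k most frequent whitespace-separated tokens in a window.

    Word frequency is only one possible way to pick summary feature data.
    """
    counts = Counter(token for record in window for token in record.split())
    return [token for token, _ in counts.most_common(top_k)]
```

With a window of three records such as `["red shoe", "red hat", "red shoe"]`, the most frequent tokens are `red` and `shoe`, which would then stand in for that window in the later steps.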

Next, multiple thread parallel processing units are established using the GPU unit of the processor, with each thread parallel processing unit corresponding to one of the plurality of data streams. GPU threads are lightweight, and switching between them can be zero-overhead. The advantage of this kind of switching is that the GPU can switch to a thread in the ready state and hide the latency of one thread behind computation in other threads; the more threads there are, the better the latency is hidden. A CPU, by contrast, implements multithreading as coarse-grained multithreading in software, where a thread switch generally costs hundreds of clock cycles, which is a large overhead. CPUs are typically multi-core, with 2 to 8 compute cores, but improvements in hardware performance are limited, so continually adding compute cores is not easy. In comparison, a GPU usually has 1 to 30 streaming multiprocessors and, when fully loaded, has a clear advantage in floating-point processing, so the performance of a mainstream GPU is 10 times that of a CPU or even higher. Comparing GPU and CPU on memory bandwidth and compute capability, the GPU exceeds a CPU of the same period several times over on both. In addition, given the characteristics of stream data, it is reasonable to hand the parallel algorithm, or the parallel portion of an algorithm, to the GPU when processing stream data, using its high memory bandwidth and multiple stream processors to perform large-scale parallel computation and thereby accelerate stream data processing.
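
The one-unit-per-stream mapping can be illustrated on the CPU. The sketch below is not GPU code: it swaps in a Python thread pool as a stand-in for the GPU thread parallel processing units, purely to show the mapping of one worker to one data stream. `process_stream` and `process_all` are hypothetical names, and the per-stream work (counting records) is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def process_stream(records):
    """Placeholder per-stream work: count the records in one data stream."""
    return sum(1 for _ in records)


def process_all(streams):
    """Run one worker per data stream and return results in stream order."""
    # One worker per stream mirrors the one-unit-per-stream mapping;
    # max(1, ...) keeps the pool valid when there are no streams.
    with ThreadPoolExecutor(max_workers=max(1, len(streams))) as pool:
        return list(pool.map(process_stream, streams))
```

A real GPU implementation would launch a kernel rather than a thread pool; the design point carried over here is only that each stream gets its own independent unit of execution.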

Next, the summary feature data is merged to form multiple summary feature data sets, where the first record in each summary feature data set serves as the reference data of that data set. Because the volume of stream data is massive, the data processing task can be decomposed starting from the data itself, splitting the original data set into multiple small data sets. Suppose the data consists of M records and processing each record takes time t; processing all M records one after another then takes M*t. If the M records are split into n small data sets of M/n records each and the n data sets are processed simultaneously, then, ignoring the effects of memory and CPU, the processing time can be taken as (M/n)*t.
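
The time estimate above can be checked with a small calculation. The sketch assumes the M records are split as evenly as possible into n data sets that are processed concurrently, and it ignores memory and CPU effects as the text does; the function names are illustrative.

```python
def serial_time(m, t):
    """Time to process m records one after another at t per record."""
    return m * t


def parallel_time(m, n, t):
    """Time when m records are split into n concurrently processed data sets.

    The largest data set holds ceil(m / n) records and determines the total.
    """
    largest = -(-m // n)  # integer ceiling of m / n
    return largest * t
```

For example, with M = 10000 records, t = 2 ms per record, and n = 10 data sets, the serial estimate is 20000 ms and the parallel estimate is 2000 ms, a factor of n.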

Specifically, the summary feature data is merged into multiple summary feature data sets as follows. The first record in the summary feature data is extracted, treated as a new summary feature data set, and saved. The second record in the summary feature data is then analyzed by comparing it against the attributes of the summary feature data sets that already exist; if it matches one of them, it is assigned to the matching summary feature data set. If the record matches none of the existing summary feature data sets, a new summary feature data set is created for it, together with its match attribute. These two steps are repeated until every record has been scanned, finally yielding multiple summary feature data sets.
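
The merging steps above can be sketched as a single pass over the records. The text does not say how the match attribute is derived, so the `match_key` function below is an assumption supplied by the caller, and `merge_into_data_sets` is a hypothetical name.

```python
def merge_into_data_sets(records, match_key):
    """Group records into data sets keyed by their match attribute.

    The first record assigned to each data set acts as its reference data.
    """
    data_sets = {}  # match attribute -> records; dicts preserve insertion order
    for record in records:
        key = match_key(record)
        # Append to the matching data set, or create a new one for this key.
        data_sets.setdefault(key, []).append(record)
    return list(data_sets.values())
```

Called with `["shoe-red", "hat-blue", "shoe-blue"]` and a key that takes the text before the hyphen, this yields one data set for `shoe` (reference record `shoe-red`) and one for `hat`.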

Next, the data in the multiple summary feature data sets is preprocessed to reduce its dimensionality by deleting redundant or weakly relevant attributes. Reducing the dimensionality of the small data sets obtained from the decomposition greatly lowers the time complexity of the algorithm and reduces errors.
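
One simple form of this dimensionality reduction is sketched below. The text does not define how redundancy or relevance is measured, so the sketch makes an explicit assumption: an attribute is treated as redundant when it has the same value in every record of the data set. Records are modeled as dictionaries with identical keys and hashable values, and `drop_constant_attributes` is a hypothetical name.

```python
def drop_constant_attributes(rows):
    """Remove attributes whose value is identical across all rows.

    Assumes every row is a dict with the same keys and hashable values.
    """
    if not rows:
        return []
    # Keep only attributes that take more than one distinct value.
    keep = [k for k in rows[0] if len({row[k] for row in rows}) > 1]
    return [{k: row[k] for k in keep} for row in rows]
```

An attribute such as a site code that is `"cn"` in every record carries no information for comparing records within the data set, so removing it shortens every later string comparison.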

Next, the data of each summary feature data set is traversed record by record and a string matching operation is performed, comparing the first record of the summary feature data set with each subsequent record. The data sliding window model is a processing window over the data set that can slide: when data is processed, the window starts at the first record of the data set and slides continuously toward the later records.

Next, the data similarity value between the reference data and each of the other data in the summary feature data set is calculated, and the obtained data similarity value Q is compared with a preset reference data similarity value to obtain a comparison result.

Finally, whether the other data is retained is determined according to the comparison result, and the retained data is the archive data of the current window. If the data similarity value Q of a piece of data is greater than or equal to the reference data similarity value, the data is highly associated with its data set and is not erroneous data; conversely, if Q is less than the reference data similarity value, the data is weakly associated with its data set and is erroneous data.

The data similarity value Q is calculated as follows:

where D is the total length of the data window of the summary feature data set, q_i is the similarity of field i, p is the number of identical characters in the two compared strings, N_max is the larger of the lengths of the two compared strings, and m_i is the weight of field i.
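
The comparison and retention steps can be sketched under stated assumptions. Only the symbol definitions are available here, so the sketch assumes one reading consistent with them: q_i = p / N_max for each field, and Q as the weighted sum of the q_i with weights m_i. It further assumes that p counts characters that are equal at the same position. None of these choices is confirmed by the text, and all function names are hypothetical.

```python
def field_similarity(a, b):
    """Assumed q_i = p / N_max, with p counted position by position."""
    n_max = max(len(a), len(b))
    if n_max == 0:
        return 1.0  # two empty fields are treated as identical
    p = sum(1 for x, y in zip(a, b) if x == y)
    return p / n_max


def record_similarity(reference, other, weights):
    """Assumed Q: sum over fields of weight m_i times field similarity q_i."""
    return sum(m * field_similarity(r, o)
               for r, o, m in zip(reference, other, weights))


def retained_records(data_set, weights, threshold):
    """Return the non-reference records whose Q reaches the threshold.

    data_set[0] is the reference data; `threshold` plays the role of the
    preset reference data similarity value.
    """
    reference, others = data_set[0], data_set[1:]
    return [rec for rec in others
            if record_similarity(reference, rec, weights) >= threshold]
```

For a reference record `("shoe", "red")` with equal weights of 0.5, the record `("shoe", "rad")` scores 0.5 * 1 + 0.5 * 2/3 and is retained at a threshold of 0.8, while `("xxxx", "blue")` scores 0 and is discarded.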

In the invention, summary feature data is extracted from the large volume of stream data in e-commerce, multiple processing threads are established, the summary feature data is merged into multiple data sets, and the data is preprocessed to reduce its dimensionality. The data similarity value between the reference data and the other data is then calculated to determine whether each piece of data in a data set is sufficiently associated with that data set, and finally whether the data is retained. With this method, when the stream data volume is large and access is highly concurrent, the system can respond to requests in a timely manner, erroneous data is filtered out, query time is reduced, and transmission performance is improved.

Fig. 2 is a structural schematic diagram of the stream data processing system in another embodiment of the invention. The proposed stream data processing system comprises: abstraction module 201, multiple threads module 202, merging module 203, preprocessing module 204, comparison module 205, computing module 206, and result confirmation module 207, where:

Abstraction module 201 establishes a window of length S and extracts summary feature data from the current window of a plurality of data streams using the CPU unit of a processor. Summary feature data is the data within a data stream that best reflects the attributes of that stream; it can be obtained by word-frequency analysis or by other algorithms.

Multiple threads module 202 establishes multiple thread parallel processing units using the GPU unit of the processor, with each thread parallel processing unit corresponding to one of the plurality of data streams. GPU threads are lightweight, and switching between them can be zero-overhead. The advantage of this kind of switching is that the GPU can switch to a thread in the ready state and hide the latency of one thread behind computation in other threads; the more threads there are, the better the latency is hidden. A CPU, by contrast, implements multithreading as coarse-grained multithreading in software, where a thread switch generally costs hundreds of clock cycles, which is a large overhead. CPUs are typically multi-core, with 2 to 8 compute cores, but improvements in hardware performance are limited, so continually adding compute cores is not easy. In comparison, a GPU usually has 1 to 30 streaming multiprocessors and, when fully loaded, has a clear advantage in floating-point processing, so the performance of a mainstream GPU is 10 times that of a CPU or even higher. Comparing GPU and CPU on memory bandwidth and compute capability, the GPU exceeds a CPU of the same period several times over on both. In addition, given the characteristics of stream data, it is reasonable to hand the parallel algorithm, or the parallel portion of an algorithm, to the GPU when processing stream data, using its high memory bandwidth and multiple stream processors to perform large-scale parallel computation and thereby accelerate stream data processing.

Merging module 203 merges the summary feature data to form multiple summary feature data sets, where the first record in each summary feature data set serves as the reference data of that data set. Because the volume of stream data is massive, the data processing task can be decomposed starting from the data itself, splitting the original data set into multiple small data sets. Suppose the data consists of M records and processing each record takes time t; processing all M records one after another then takes M*t. If the M records are split into n small data sets of M/n records each and the n data sets are processed simultaneously, then, ignoring the effects of memory and CPU, the processing time can be taken as (M/n)*t.

Preprocessing module 204 preprocesses the data in the multiple summary feature data sets to reduce its dimensionality by deleting redundant or weakly relevant attributes. Reducing the dimensionality of the small data sets obtained from the decomposition greatly lowers the time complexity of the algorithm and reduces errors.

Comparison module 205 traverses the data of each summary feature data set record by record and performs a string matching operation, comparing the first record of the summary feature data set with each subsequent record. The data sliding window model is a processing window over the data set that can slide: when data is processed, the window starts at the first record of the data set and slides continuously toward the later records.

Computing module 206 calculates the data similarity value between the reference data and each of the other data in the summary feature data set, and compares the obtained data similarity value Q with a preset reference data similarity value to obtain a comparison result.

Result confirmation module 207 determines, according to the comparison result, whether the other data is retained; the retained data is the archive data of the current window. If the data similarity value Q of a piece of data is greater than or equal to the reference data similarity value, the data is highly associated with its data set and is not erroneous data; conversely, if Q is less than the reference data similarity value, the data is weakly associated with its data set and is erroneous data.

The data similarity value Q is calculated by the same formula as in the method embodiment above, with D, q_i, p, N_max, and m_i defined as before.

In the above system, summary feature data is extracted from the large volume of stream data in e-commerce, multiple processing threads are established, the summary feature data is merged into multiple data sets, and the data is preprocessed to reduce its dimensionality. The data similarity value between the reference data and the other data is then calculated to determine whether each piece of data in a data set is sufficiently associated with that data set, and finally whether the data is retained. With this system, when the stream data volume is large and access is highly concurrent, the system can respond to requests in a timely manner, erroneous data is filtered out, query time is reduced, and transmission performance is improved.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.

Finally, it should be noted that the foregoing is merely a preferred embodiment of the invention, intended only to illustrate its technical solution and not to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention falls within its scope of protection.

Claims

1. A stream data processing method, characterized in that the method comprises:

establishing a window of length S, and extracting summary feature data from the current window of a plurality of data streams using the CPU unit of a processor;

establishing multiple thread parallel processing units using the GPU unit of the processor, where each thread parallel processing unit corresponds to one of the plurality of data streams;

merging the summary feature data to form multiple summary feature data sets, where the first record in each summary feature data set serves as the reference data of that summary feature data set;

preprocessing the data in the multiple summary feature data sets to reduce the dimensionality of the data by deleting redundant or weakly relevant attributes;

traversing the data of each summary feature data set record by record and performing a string matching operation, comparing the first record of the summary feature data set with each subsequent record;

calculating the data similarity value between the reference data and each of the other data in the summary feature data set, and comparing the obtained data similarity value Q with a preset reference data similarity value to obtain a comparison result;

determining, according to the comparison result, whether the other data is retained, the retained data being the archive data of the current window.

2. The method according to claim 1, characterized in that determining whether the other data is retained according to the comparison result specifically comprises: if the data similarity value of the other data is greater than or equal to the reference data similarity value, adding the other data to a record set and finally saving it into a new data table; if the obtained data similarity value Q is less than the reference data similarity value, deleting the other data from the summary feature data.

3. The method according to claim 1, characterized in that merging the summary feature data to form multiple summary feature data sets specifically comprises: extracting the first record in the summary feature data, treating the first record as a new summary feature data set, and saving it; analyzing the second record in the summary feature data by comparing the second record against the attributes of the summary feature data sets that already exist and, if it matches one of them, assigning the second record to the matching summary feature data set; if the record matches none of the existing summary feature data sets, creating a new summary feature data set for the record and creating a match attribute for it; and repeating these two steps until every record has been scanned, finally obtaining multiple summary feature data sets.

4. The method according to claim 1, characterized in that the data similarity value Q is calculated as follows:

where D is the total length of the data window of the summary feature data set, q_i is the similarity of field i, p is the number of identical characters in the two compared strings, N_max is the larger of the lengths of the two compared strings, and m_i is the weight of field i.

5. A stream data processing system, characterized in that the system comprises:

an abstraction module, which establishes a window of length S and extracts summary feature data from the current window of a plurality of data streams;

a multiple threads module, which establishes multiple thread parallel processing units, where each thread parallel processing unit corresponds to one of the plurality of data streams;

a merging module, which merges the summary feature data to form multiple summary feature data sets, where the first record in each summary feature data set serves as the reference data of that summary feature data set;

a preprocessing module, which preprocesses the data in the multiple summary feature data sets to reduce the dimensionality of the data by deleting redundant or weakly relevant attributes;

a comparison module, which traverses the data of each summary feature data set record by record and performs a string matching operation, comparing the first record of the summary feature data set with each subsequent record;

a computing module, which calculates the data similarity value between the reference data and each of the other data in the summary feature data set, and compares the obtained data similarity value Q with a preset reference data similarity value to obtain a comparison result;

a result confirmation module, which determines, according to the comparison result, whether the other data is retained, the retained data being the archive data of the current window.

6. The system according to claim 5, characterized in that determining whether the other data is retained according to the comparison result specifically comprises: if the data similarity value of the other data is greater than or equal to the reference data similarity value, adding the other data to a record set and finally saving it into a new data table; if the obtained data similarity value Q is less than the reference data similarity value, deleting the other data from the summary feature data.

7. The system according to claim 5, characterized in that merging the summary feature data to form multiple summary feature data sets specifically comprises: extracting the first record in the summary feature data, treating the first record as a new summary feature data set, and saving it; analyzing the second record in the summary feature data by comparing the second record against the attributes of the summary feature data sets that already exist and, if it matches one of them, assigning the second record to the matching summary feature data set; if the record matches none of the existing summary feature data sets, creating a new summary feature data set for the record and creating a match attribute for it; and repeating these two steps until every record has been scanned, finally obtaining multiple summary feature data sets.

8. The system according to claim 5, characterized in that the data similarity value Q is calculated as follows: