CN106126721A

CN106126721A - Data processing method and device for a real-time computing platform

Info

Publication number: CN106126721A
Application number: CN201610512866.XA
Authority: CN
Inventors: 王素梅; 沈迪; 徐胜国
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Priority date: 2016-06-30
Filing date: 2016-06-30
Publication date: 2016-11-16

Abstract

The invention discloses a data processing method and device for a real-time computing platform. The method includes: receiving a computing task and reading the configuration information of the computing task; receiving, according to the data source information in the configuration information, to-be-processed data input in real time from the corresponding data source; determining whether the configuration information contains a deduplication statistics rule; if so, storing the received to-be-processed data in a space-compressing data structure; and performing deduplication statistics on the stored to-be-processed data according to the deduplication statistics rule in the configuration information. Because the received data is stored in a space-compressing data structure and the deduplication statistics are computed on that structure, the scheme exploits the fact that deduplication statistics only require "deduplication": it saves space by changing how the to-be-processed data is stored, which reduces the workload of the real-time computing platform and improves its efficiency.

Description

Data processing method and device for a real-time computing platform

Technical field

The present invention relates to the field of Internet technology, and in particular to a data processing method and device for a real-time computing platform.

Background technology

In the prior art, a real-time computing platform processes data based on a computing framework, and its statistical processing relies on the memory of the computing framework's own cluster: after receiving to-be-processed data from a data source, the platform places the data directly in memory and then performs statistical processing on the in-memory data. Typically, when the platform receives 2 GB of to-be-processed data per day from the data source, statistical processing of that volume can be handled under normal load; but when the platform receives 10 GB of to-be-processed data per day, data of that volume cannot be placed in memory and counted.

Therefore, how to perform statistical processing on to-be-processed data when its volume is large is a problem that urgently needs to be solved in current real-time computing platform schemes.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a data processing method and device for a real-time computing platform that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a data processing method for a real-time computing platform is provided, the method including:

receiving a computing task and reading the configuration information of the computing task;

receiving, according to the data source information in the configuration information, to-be-processed data input in real time from the corresponding data source;

determining whether the configuration information contains a deduplication statistics rule;

if so, storing the received to-be-processed data in a space-compressing data structure; and

performing deduplication statistics on the stored to-be-processed data according to the deduplication statistics rule in the configuration information.

Optionally, the space-compressing data structure includes a HyperLogLog data structure;

in that case, performing deduplication statistics on the stored to-be-processed data according to the deduplication statistics rule in the configuration information includes: obtaining the deduplication statistics result of the to-be-processed data according to the deduplication statistics rule in the configuration information and the cardinality value of the to-be-processed data in the HyperLogLog data structure.

Optionally, determining whether the configuration information contains a deduplication statistics rule includes:

when the configuration information contains a rule that performs statistics using a unique identifier of the data as the indicator, determining that the configuration information contains a deduplication statistics rule.

Optionally, the rule that performs statistics using a unique identifier of the data as the indicator includes:

a statistics rule for the number of unique visitors, a statistics rule for the number of unique IPs, and/or a statistics rule for unique search terms.

Optionally, before determining whether the configuration information contains a deduplication statistics rule, the method further includes:

parsing, according to the parsing condition in the configuration information, the fields in the to-be-processed data that meet the parsing condition into metadata in a specified format;

in that case, storing the received to-be-processed data in a space-compressing data structure includes: storing the metadata in the specified format in the specified data structure;

and performing statistics on the stored to-be-processed data according to the deduplication statistics rule in the configuration information includes: performing deduplication statistics on the stored metadata in the specified format according to the deduplication statistics rule in the configuration information.

Optionally, the metadata in the specified format takes the form of key-value pairs, each consisting of a field and a field value.

Optionally, the method further includes: pre-storing multiple basic parsers, each basic parser being adapted to one basic data format;

parsing the fields in the to-be-processed data that meet the parsing condition into metadata in a specified format includes:

when the format of the to-be-processed data is a single basic data format, looking up, among the pre-stored basic parsers, the basic parser adapted to that basic data format, and calling the basic parser found to parse the fields in the to-be-processed data that meet the parsing condition into metadata in the specified format.

Optionally, parsing the fields in the to-be-processed data that meet the parsing condition into metadata in a specified format further includes:

when the format of the to-be-processed data is a combination of multiple basic data formats, looking up, for each basic data format, the basic parser adapted to that format among the pre-stored basic parsers, and calling the combination of the basic parsers found to parse the fields in the to-be-processed data that meet the parsing condition into metadata in the specified format.

Optionally, parsing the fields in the to-be-processed data that meet the parsing condition into metadata in a specified format includes:

determining, according to the format of the to-be-processed data, one or more parsing functions adapted to the to-be-processed data;

creating a parser corresponding to the to-be-processed data, and dynamically registering the one or more parsing functions in the parser;

calling the created parser to parse the fields in the to-be-processed data that meet the parsing condition into metadata in the specified format.

Optionally, after performing deduplication statistics on the stored to-be-processed data, the method further includes:

saving, according to the storage rule in the configuration information, the deduplication statistics result obtained to the corresponding storage medium, so that users can query the deduplication statistics result.

Optionally, the data source includes one or more of the following:

a Kafka data source, a Qbus data source, a Scribe data source, an Apache data source, a Kestrel data source.

Optionally, the real-time computing platform performs data processing based on the Storm computing framework or the Spark Streaming computing framework;

when the real-time computing platform performs data processing based on the Spark Streaming computing framework, storing the received to-be-processed data in a space-compressing data structure includes:

storing the received to-be-processed data in memory in the space-compressing data structure.

According to another aspect of the present invention, a data processing device for a real-time computing platform is provided, the device including:

a task receiving unit, adapted to receive a computing task and read the configuration information of the computing task;

a data receiving unit, adapted to receive, according to the data source information in the configuration information, to-be-processed data input in real time from the corresponding data source, and adapted to determine whether the configuration information contains a deduplication statistics rule and, if so, to store the received to-be-processed data in a space-compressing data structure;

a data statistics unit, adapted to perform deduplication statistics on the stored to-be-processed data according to the deduplication statistics rule in the configuration information.

Accordingly, the data statistics unit is adapted to obtain the deduplication statistics result of the to-be-processed data according to the deduplication statistics rule in the configuration information and the cardinality value of the to-be-processed data in the HyperLogLog data structure.

Optionally, the data receiving unit is adapted to determine that the configuration information contains a deduplication statistics rule when the configuration information contains a rule that performs statistics using a unique identifier of the data as the indicator.

Optionally, the device further includes:

a data parsing unit, adapted to parse, before the data receiving unit determines whether the configuration information contains a deduplication statistics rule and according to the parsing condition in the configuration information, the fields in the to-be-processed data that meet the parsing condition into metadata in a specified format;

in that case, the data receiving unit is adapted to store the metadata in the specified format in the specified data structure;

and the data statistics unit is adapted to perform deduplication statistics on the stored metadata in the specified format according to the deduplication statistics rule in the configuration information.

Optionally, the data parsing unit is further adapted to pre-store multiple basic parsers, each basic parser being adapted to one basic data format; and, when the format of the to-be-processed data is a single basic data format, to look up among the pre-stored basic parsers the basic parser adapted to that basic data format and call the basic parser found to parse the fields in the to-be-processed data that meet the parsing condition into metadata in the specified format.

Optionally, the data parsing unit is further adapted, when the format of the to-be-processed data is a combination of multiple basic data formats, to look up for each basic data format the basic parser adapted to that format among the pre-stored basic parsers, and to call the combination of the basic parsers found to parse the fields in the to-be-processed data that meet the parsing condition into metadata in the specified format.

Optionally, the data parsing unit is adapted to determine, according to the format of the to-be-processed data, one or more parsing functions adapted to the to-be-processed data; to create a parser corresponding to the to-be-processed data and dynamically register the one or more parsing functions in the parser; and to call the created parser to parse the fields in the to-be-processed data that meet the parsing condition into metadata in the specified format.

Optionally, the device further includes:

a data storage unit, adapted to save, according to the storage rule in the configuration information, the deduplication statistics result obtained to the corresponding storage medium, so that users can query the deduplication statistics result.

Optionally, the data source includes one or more of the following:

The data receiving unit is adapted, when the real-time computing platform performs data processing based on the Spark Streaming computing framework, to store the received to-be-processed data in memory in the space-compressing data structure.

As can be seen from the above, the technical solution provided by the present invention describes the data processing flow on a real-time computing platform. For a computing task whose configuration information contains a deduplication statistics rule, after to-be-processed data is received from the corresponding data source, the solution differs from the common prior-art approach of placing the received data directly in memory and performing statistical processing on the in-memory data: it stores the received to-be-processed data in a space-compressing data structure and then performs deduplication statistics on the data stored in that structure according to the deduplication statistics rule in the configuration information. This exploits the fact that deduplication statistics only require "deduplication", saves space by changing how the to-be-processed data is stored, reduces the workload of the real-time computing platform, and improves its efficiency.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:

Fig. 1 shows a flowchart of a data processing method for a real-time computing platform;

Fig. 2 shows a schematic diagram of a data processing device for a real-time computing platform;

Fig. 3 shows a schematic diagram of a data processing device for a real-time computing platform.

Detailed description of the invention

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

Fig. 1 shows a flowchart of a data processing method for a real-time computing platform. As shown in Fig. 1, the method includes:

Step S110: receive a computing task and read the configuration information of the computing task.

Step S120: according to the data source information in the configuration information, receive to-be-processed data input in real time from the corresponding data source.

Step S130: determine whether the configuration information contains a deduplication statistics rule.

Step S140: if so, store the received to-be-processed data in a space-compressing data structure.

Step S150: perform deduplication statistics on the stored to-be-processed data according to the deduplication statistics rule in the configuration information.

It can be seen that the method shown in Fig. 1 describes the data processing flow on a real-time computing platform. For a computing task whose configuration information contains a deduplication statistics rule, after to-be-processed data is received from the corresponding data source, the method differs from the common prior-art approach of placing the received data directly in memory and performing statistical processing on the in-memory data: it stores the received to-be-processed data in a space-compressing data structure and then performs deduplication statistics on the data stored in that structure according to the deduplication statistics rule in the configuration information. This exploits the fact that deduplication statistics only require "deduplication", saves space by changing how the to-be-processed data is stored, reduces the workload of the real-time computing platform, and improves its efficiency.
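The control flow of steps S110 to S150 can be sketched roughly as follows. This is only an illustrative sketch: the configuration key `dedup_rule` and the function `run_task` are hypothetical names, and a plain Python set stands in for the space-compressing data structure (the patent's own example is HyperLogLog).

```python
from typing import Any, Dict, Iterable, Set


def run_task(config: Dict[str, Any], source: Iterable[str]) -> Dict[str, int]:
    """Sketch of steps S110-S150 for one computing task (hypothetical schema)."""
    # S130: does the configuration contain a deduplication statistics rule?
    has_dedup_rule = config.get("dedup_rule") is not None

    if has_dedup_rule:
        # S140: keep the incoming data only in a compact structure.
        # A set is a stand-in here; it is exact but not space-compressing.
        seen: Set[str] = set()
        for record in source:  # S120: records arrive one at a time
            seen.add(record)
        # S150: the deduplicated count is the size of the structure.
        return {"distinct": len(seen)}

    # No deduplication rule: a plain count needs no per-record storage.
    return {"total": sum(1 for _ in source)}
```

For example, `run_task({"dedup_rule": "unique_visitors"}, ["a", "b", "a"])` counts two distinct records, while the same input without a rule is counted three times.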

In one embodiment of the invention, the configuration information of the received computing task is entered by the user. Specifically, the front end of the real-time computing platform interacts with the user and creates the computing task from the configuration information the user enters; for example, it presents several input boxes to the user as a web page, and the user completes the configuration by filling in those boxes. The front end submits the created computing task to the real-time computing platform, which receives the task, reads its configuration information, and starts the corresponding data processing according to the data source information in the configuration information. The platform thus offers a unified interface for different data processing needs: the user does not have to write complete program code for the data processing procedure, and only needs to enter the configuration information that corresponds to the processing need into the front end to create a computing task. This is easy to implement and saves time and effort; the platform is well integrated, highly real-time, efficient and user-friendly, can run multiple computing tasks at the same time, and fits the current trend in big data development.

In one embodiment of the invention, the space-compressing data structure includes a HyperLogLog data structure.

HyperLogLog is a data structure for estimating the cardinality of a set. Its space efficiency is very high: with 1.5 KB of memory it can estimate the cardinality of a data set with more than 1 billion elements with an error of no more than 2%. Cardinality here means the number of distinct elements: for example, the collection {0, 1, 3, 3, 4, 5} has a cardinality of 5 and a count of 6, because 3 appears twice. Cardinality counting is therefore simply deduplicated counting. In some scenarios this matters: for example, when using the HyperLogLog of redis for PV (page view) statistics, calling the function pfadd(url) directly on the URL links in the to-be-processed data without any further processing will certainly give an unsatisfactory result unless every URL link was visited only once; PV can nevertheless be counted by appending a number to the url each time. UV (unique visitor) statistics with HyperLogLog, by contrast, is usually a direct cardinality count.
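The difference between count and cardinality, and the trick of appending a number to each url so that a cardinality counter can also report PV, can be illustrated with a short sketch. A Python set is used as an exact stand-in for a cardinality counter; the function name `pv_and_uv` and the `url#n` key layout are assumptions made for illustration only.

```python
from typing import Iterable, Tuple

values = [0, 1, 3, 3, 4, 5]
count = len(values)             # every element counted, duplicates included
cardinality = len(set(values))  # distinct elements only


def pv_and_uv(visits: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Return (PV, UV) for an iterable of (visitor_id, url) pairs."""
    pv_keys = set()  # stand-in for one cardinality counter
    uv_keys = set()  # stand-in for another cardinality counter
    for seq, (visitor_id, url) in enumerate(visits):
        # Appending a sequence number makes every visit a distinct element,
        # so the cardinality of pv_keys equals the number of page views.
        pv_keys.add(f"{url}#{seq}")
        # Visitor IDs are added as-is, so repeats collapse: this is UV.
        uv_keys.add(visitor_id)
    return len(pv_keys), len(uv_keys)
```

Here `count` is 6 and `cardinality` is 5, matching the example in the text.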

To make the implementation principle of the HyperLogLog data structure clear, it is explained here by following the development of cardinality counting algorithms. Common direct cardinality counting algorithms are usually implemented with hash tables or B-trees, but their space efficiency is mediocre, which is a direct drawback for big data statistics. The optimal exact cardinality counting algorithm for big data uses a bitmap, as mentioned in "Programming Pearls": a bit is either 0 or 1, which means the bit is the smallest unit of measurement, so for a one-to-one mapping no counting method can be more space-efficient than a bitmap. The drawback of bitmap counting is that its space efficiency depends on the upper bound of the counted range: if the cardinality is 100 million, each bitmap count has to allocate 12.5 MB of memory, and if the UV of individual URLs is counted, the memory wasted on URLs with few visits is obvious. Sparse bitmaps can be compressed, but the limit is still one bit per counted element. Therefore, when a certain error range is acceptable and better space efficiency is wanted, cardinality counting has to use a probabilistic algorithm, that is, a cardinality estimation algorithm that trades the accuracy of the result for storage space. HyperLogLog was created to meet exactly this need.
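The memory figure quoted for bitmap counting follows directly from one bit per possible value. A minimal sketch, assuming the counted values are non-negative integers below a known upper bound (the class and function names are hypothetical):

```python
def bitmap_bytes(upper_bound: int) -> int:
    """Bytes needed for a bitmap with one bit per value in [0, upper_bound)."""
    return (upper_bound + 7) // 8


class Bitmap:
    """Exact distinct counter for integers in [0, upper_bound)."""

    def __init__(self, upper_bound: int) -> None:
        self.bits = bytearray(bitmap_bytes(upper_bound))

    def add(self, value: int) -> None:
        # Set the bit that corresponds to this value; repeats are no-ops.
        self.bits[value >> 3] |= 1 << (value & 7)

    def count(self) -> int:
        # Cardinality is the number of set bits.
        return sum(bin(byte).count("1") for byte in self.bits)
```

`bitmap_bytes(100_000_000)` gives 12,500,000 bytes, which is the 12.5 MB mentioned above, and that memory is allocated even if only a handful of values are ever added.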

The core idea of HyperLogLog is to free the bitmap from one-to-one mapping and to think instead of repeated coin tossing, where each toss is either heads or tails (each with probability 0.5). In one such run, the probability that the number of tosses exceeds k is 0.5^k (k tails in a row), and the probability that the number of tosses does not exceed k is 1 - 0.5^k. Therefore, over n runs, the probability that every run needs no more than k tosses is P(x ≤ k) = (1 - 0.5^k)^n, and P(x ≥ k) = 1 - (1 - 0.5^k)^n.

From these formulas it can be seen that when n ≫ k, P(x ≤ k) is close to 0; that is, it is almost impossible that no run needs more than k tosses. Treat each run as a bit string in which tails is 0 and heads is 1, so that the number of tosses k corresponds to the position of the first 1. When enough bit strings have been counted and the largest first-1 position among them is j, then P(x ≤ j) is close to 0 when n ≫ 2^j, and P(x ≥ j) also tends to 0 when n ≪ 2^j. In other words, given an observed maximum position of j, n can be taken to be about 2^j, which gives the probabilistic estimate: n = 2^j.

Put more simply: suppose an 8-bit hash string is generated for each element of a data set. The probability of obtaining 00000111 is very low; in other words, the probability of generating a long run of consecutive 0s is very low. The probability of generating 5 consecutive 0s is 1/32, so when this string is observed, the cardinality of the data set can be estimated to be 32.
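The probability formula and the single-value estimate described above can be checked numerically. The sketch below uses hypothetical helper names and takes the hash strings as text made of '0' and '1' characters, as in the 8-bit example.

```python
from typing import Iterable


def p_all_at_most(k: int, n: int) -> float:
    """P(x <= k) = (1 - 0.5^k)^n: every one of n runs needs at most k tosses."""
    return (1 - 0.5 ** k) ** n


def leading_zeros(bits: str) -> int:
    """Number of consecutive '0' characters at the start of a bit string."""
    return len(bits) - len(bits.lstrip("0"))


def single_estimate(hashes: Iterable[str]) -> int:
    """Estimate the cardinality as 2^j, where j is the longest run of leading zeros."""
    return 2 ** max(leading_zeros(h) for h in hashes)
```

With the string from the text, `leading_zeros("00000111")` is 5 and the estimate is 2^5 = 32. `p_all_at_most(5, 1000)` is vanishingly small, which is the sense in which a large data set almost always contains at least one long run of zeros.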

Of course, as the above process shows, an estimate based on a single such value is subject to chance and has a large error. In practice, the principle of averaging over buckets can be used to reduce the error, together with a bias correction. In applications, an acceptable error range is usually set, and this error range determines the number of buckets in the implementation.
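Bucket averaging with bias correction can be sketched as a small HyperLogLog-style counter. This is a minimal illustration, not the implementation used by the platform or by redis: the class name, the choice of SHA-1 as the hash, and the default of 2^12 buckets are all assumptions. The constant 0.7213 / (1 + 1.079 / m) is the standard bias correction for m ≥ 128 buckets, and the relative standard error is roughly 1.04 / sqrt(m), which is how an accepted error range translates into a bucket count.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog-style sketch with 2**p buckets (p >= 7)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an assumption).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket = h >> (64 - self.p)            # first p bits choose the bucket
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank: 1-based position of the first 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def count(self) -> int:
        # Harmonic-mean style combination of the per-bucket estimates.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)
```

Adding the same item again never changes a register, so repeated visitors do not inflate the estimate; the memory used depends only on the number of buckets, not on how many items are added.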

From the above it can be seen that HyperLogLog is a data structure for estimating the cardinality of a set. If the received to-be-processed data is stored in a HyperLogLog data structure, the structure yields an estimate of the cardinality of a specific field in the to-be-processed data (the deduplication statistics result). In general, the accuracy of this estimate can reach 99%, which is within the range acceptable to this solution.

Specifically, once the received to-be-processed data has been stored in the HyperLogLog data structure, performing deduplication statistics on the stored to-be-processed data according to the deduplication statistics rule in the configuration information in step S150 includes: obtaining the deduplication statistics result of the to-be-processed data according to the deduplication statistics rule in the configuration information and the cardinality value of the to-be-processed data in the HyperLogLog data structure.

Determining in step S130 whether the configuration information contains a deduplication statistics rule includes: when the configuration information contains a rule that performs statistics using a unique identifier of the data as the indicator, determining that the configuration information contains a deduplication statistics rule. Rules that perform statistics using a unique identifier of the data as the indicator include: a statistics rule for the number of unique visitors, a statistics rule for the number of unique IPs, and/or a statistics rule for unique search terms.
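The check in step S130 amounts to looking for a rule whose indicator is a unique identifier. A minimal sketch, assuming a hypothetical configuration layout in which each statistics rule is a dictionary with an `indicator` key (the patent does not specify the configuration schema, and the indicator names below are invented for illustration):

```python
from typing import Any, Dict

# Hypothetical indicator names for the three rule types listed in the text.
DEDUP_INDICATORS = {"unique_visitors", "unique_ips", "unique_search_terms"}


def has_dedup_rule(config: Dict[str, Any]) -> bool:
    """Return True if any statistics rule uses a unique identifier as its indicator."""
    rules = config.get("statistics_rules", [])
    return any(rule.get("indicator") in DEDUP_INDICATORS for rule in rules)
```

A configuration with only a page-view rule would return False here and would be handled without the space-compressing data structure.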

For example, the real-time computing platform receives a computing task, reads its configuration information and, according to the data source information in that configuration information, receives 10 GB of to-be-processed data input in real time from the corresponding data source. The configuration information of the computing task contains a statistics rule for the number of unique visitors of a specified web page, which is a deduplication statistics rule, so the received to-be-processed data is stored in a HyperLogLog data structure; storing it in the HyperLogLog data structure requires only 20 MB of space. From the deduplication statistics rule in the configuration information, the platform knows that it has to compute the deduplicated number of visitor IDs that accessed the specified web page, so it looks up, among the cardinality values of the to-be-processed data in the HyperLogLog data structure, the cardinality value of the data corresponding to the visitor ID. This cardinality value is an estimate whose accuracy is within the range acceptable to this solution, and it gives the deduplication statistics result for the visitor IDs of the specified web page, that is, the UV statistics result.

In one embodiment of the invention, before determining in step S130 whether the configuration information contains a deduplication statistics rule, the method shown in Fig. 1 further includes: parsing, according to the parsing condition in the configuration information, the fields in the to-be-processed data that meet the parsing condition into metadata in a specified format. In that case, storing the received to-be-processed data in a space-compressing data structure in step S140 includes: storing the metadata in the specified format in the specified data structure; and performing statistics on the stored to-be-processed data according to the deduplication statistics rule in the configuration information in step S150 includes: performing deduplication statistics on the stored metadata in the specified format according to the deduplication statistics rule in the configuration information. The metadata in the specified format takes the form of key-value pairs, each consisting of a field and a field value; metadata in this form can reflect all the data parameters in the to-be-processed data.

Continuing the UV example above, according to this embodiment the real-time computing platform receives the computing task, reads its configuration information and, according to the data source information in that configuration information, receives 10 GB of to-be-processed data input in real time from the corresponding data source. Before determining whether the configuration information contains a deduplication statistics rule, it reads the parsing condition in the configuration information. In this example the parsing condition is: parse the fields in the to-be-processed data that indicate the visitor ID of the specified web page. According to this parsing condition, all fields in the to-be-processed data that indicate the visitor ID of the specified web page are parsed into metadata in the specified format. The parsing result in this example is: (visitorID, a), (visitorID, b), (visitorID, a), (visitorID, c), (visitorID, b). Because the configuration information of the computing task contains a statistics rule for the number of unique visitors of the specified web page, which is a deduplication statistics rule, the 5 parsed metadata items are stored in the HyperLogLog data structure. The cardinality value obtained from the HyperLogLog data structure is 3, so the number of unique visitors in the to-be-processed data is determined to be 3.

In addition, in one embodiment of the invention, before the to-be-processed data is stored in the space-compressing data structure, a hash can first be computed over the to-be-processed data, and the resulting hash values are then stored in the space-compressing data structure. For example, before the 5 metadata items above are stored in the HyperLogLog data structure, each of them is hashed, giving 5 hash values, and these 5 hash values are stored in the HyperLogLog data structure. Because of the dispersing property of hash algorithms, small differences between different data items are clearly separated, which further reduces the error of the cardinality estimated by the HyperLogLog data structure.
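The hashing step applied to the five metadata items of the example can be sketched as follows. SHA-256 and the `key=value` encoding are assumptions chosen for illustration, and a Python set stands in for the HyperLogLog data structure, so the count here is exact rather than estimated.

```python
import hashlib
from typing import Iterable, Tuple

# The five parsed metadata items from the example in the text.
metadata = [
    ("visitorID", "a"),
    ("visitorID", "b"),
    ("visitorID", "a"),
    ("visitorID", "c"),
    ("visitorID", "b"),
]


def hash_metadata(field: str, value: str) -> str:
    """Hash one key-value metadata item (encoding is a hypothetical choice)."""
    return hashlib.sha256(f"{field}={value}".encode("utf-8")).hexdigest()


def distinct_count(items: Iterable[Tuple[str, str]]) -> int:
    """Count distinct metadata items by their hash values."""
    hashes = {hash_metadata(field, value) for field, value in items}
    return len(hashes)
```

Identical metadata items always produce the same hash value, so the five items collapse to three distinct hashes, matching the unique visitor count of 3 in the example.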

In one embodiment of the invention, parsing the fields in the to-be-processed data that meet the parsing condition into metadata in a specified format includes:

Mode one: according to the format of the to-be-processed data, determine one or more parsing functions adapted to the to-be-processed data; create a parser corresponding to the to-be-processed data and dynamically register the one or more parsing functions in the parser; call the created parser to parse the fields in the to-be-processed data that meet the parsing condition into metadata in the specified format.

The parsing functions include one or more of the following: the Base64decode function, the base64encode function, the urldecode function, the urlencode function, the isNum function, the isVer function, the getDay function, the getHour function and the getMin function. The Base64decode function decodes Base64-encoded data; the base64encode function Base64-encodes data; the urldecode function restores a url-encoded string; the urlencode function url-encodes a string; the isNum function determines whether a value is a number; the isVer function determines whether a value is a version; the getDay function obtains the date information of a time; the getHour function obtains the hour information of a time; and the getMin function obtains the minute information of a time. By dynamically registering in the created parser the parsing functions needed to parse the to-be-processed data, this embodiment allows the parser to be customized dynamically and to adapt to changes in the format of the to-be-processed data.
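Dynamic registration of parsing functions in a parser can be sketched as below. The `Parser` class, its `register` and `parse` methods, and the per-field registration scheme are hypothetical; only two of the listed functions are shown, implemented with Python's standard `base64` and `urllib.parse` modules.

```python
import base64
from typing import Callable, Dict, List
from urllib.parse import unquote


def base64_decode(text: str) -> str:
    """Decode Base64-encoded text (counterpart of the Base64decode function)."""
    return base64.b64decode(text).decode("utf-8")


def url_decode(text: str) -> str:
    """Restore a url-encoded string (counterpart of the urldecode function)."""
    return unquote(text)


class Parser:
    """Parser whose parsing functions are registered at run time."""

    def __init__(self) -> None:
        self._functions: Dict[str, List[Callable[[str], str]]] = {}

    def register(self, field: str, func: Callable[[str], str]) -> None:
        # Functions registered for the same field are applied in order.
        self._functions.setdefault(field, []).append(func)

    def parse(self, record: Dict[str, str]) -> Dict[str, str]:
        # Only fields with registered functions appear in the key-value output.
        metadata: Dict[str, str] = {}
        for field, funcs in self._functions.items():
            if field not in record:
                continue
            value = record[field]
            for func in funcs:
                value = func(value)
            metadata[field] = value
        return metadata
```

A parser for a new data format is then obtained by creating a `Parser` and registering a different set of functions, without changing the parser code itself.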

Mode two: multiple basic parsers are stored in advance, each basic parser suited to one basic data format. Specifically, the basic parsers include one or more of the following: an Apache log parser, an Nginx log parser, an array log parser, a JSON log parser, and a delimiter parser. The Apache log parser is suited to the data format of Apache logs, the Nginx log parser to the data format of Nginx logs, the array log parser to the data format of array logs, the JSON log parser to the data format of JSON logs, and the delimiter parser to data formats in which fields are separated by a specified delimiter.

When the format of the data to be processed is a single basic data format, the basic parser suited to that basic data format is looked up among the pre-stored basic parsers, and the fields in the data to be processed that satisfy the parse condition are parsed into metadata of the specified format by calling the basic parser that was found.

When the format of the data to be processed is a combination of multiple basic data formats, then for each basic data format the basic parser suited to it is looked up among the pre-stored basic parsers, and the fields in the data to be processed that satisfy the parse condition are parsed into metadata of the specified format by calling the combination of the basic parsers that were found.

For example, if the received data to be processed is an Apache log, its format is the Apache log data format, which is a single basic data format. The received data is then parsed as follows: the Apache log parser is found among the pre-stored basic parsers, and the fields in the data to be processed are parsed into metadata of the specified format by calling this Apache log parser. Alternatively, suppose the fields of the received data to be processed are separated by a delimiter, for example "field 1&field 2", where "&" is the delimiter, field 1 is in array format, and field 2 is in JSON format. Parsing the received data then requires calling a combination of the delimiter parser, the array log parser, and the JSON log parser to parse the fields into metadata of the specified format. The array log parser and the JSON log parser form a parallel combination, and the delimiter parser and this parallel combination form a hierarchical combination. Specifically, the delimiter parser is called first to separate field 1 and field 2; field 1 is then parsed by calling the array log parser, and field 2 is parsed by calling the JSON log parser.
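The hierarchical combination in the second example can be sketched as follows. All function names are hypothetical, and the array format is assumed here to be a comma-separated list of values, which the text does not specify.

```python
import json
from typing import Dict, List


def delimiter_parser(raw: str, delimiter: str = "&") -> List[str]:
    # Split on the first delimiter only, so field 2 may itself contain it.
    return raw.split(delimiter, 1)


def array_parser(field: str) -> Dict[str, str]:
    # Assumed array format: comma-separated values, keyed by position.
    return {f"item{i}": value.strip() for i, value in enumerate(field.split(","))}


def json_parser(field: str) -> Dict[str, str]:
    return {key: str(value) for key, value in json.loads(field).items()}


def combined_parser(raw: str) -> Dict[str, str]:
    # Upper level: the delimiter parser separates field 1 and field 2.
    field1, field2 = delimiter_parser(raw)
    # Lower level: the array and JSON parsers work side by side.
    metadata = array_parser(field1)
    metadata.update(json_parser(field2))
    return metadata


metadata = combined_parser('10.0.0.1, GET&{"uid": "u1", "status": 200}')
```

Each basic parser stays unaware of the others; only `combined_parser` encodes how the formats nest.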

Further, after the fields in the data to be processed that satisfy the parse condition have been parsed into metadata of the specified format by calling the parser corresponding to that data, the method further includes: placing the called parser into a designated global variable library. Parsing the fields that satisfy the parse condition into metadata of the specified format by calling the parser corresponding to the data to be processed then includes: according to the format of the data to be processed, looking up the corresponding parser in the designated global variable library; if it is found, directly calling the parser that was found to parse the fields in the data to be processed that satisfy the parse condition into metadata of the specified format; if it is not found, creating the parser corresponding to the data to be processed and calling the created parser to parse the fields that satisfy the parse condition into metadata of the specified format.

For example, data to be processed 1 and data to be processed 2 are received from the same data source and have the same data format. Data 1 is parsed first: parser 1, corresponding to data 1, is created, and the fields in data 1 are parsed into metadata of the specified format. After parsing, parser 1 is placed in the designated global variable library, so that it exists as a global variable and can be called conveniently. When data 2 is then parsed, the designated global variable library is first searched for a parser corresponding to data 2. Because data 2 has the same data format as data 1, parser 1 is equally suited to data 2, so data 2 is parsed directly by calling parser 1 from the designated global variable library. This avoids repeatedly creating parsers for the same data format and avoids unnecessary use of system resources. Looking up a parser in the global library is also faster than creating a new one, which speeds up parsing and preserves the real-time nature of log processing.
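A minimal sketch of the lookup-then-create logic follows. A module-level dictionary stands in for the designated global variable library; the names `PARSER_LIBRARY`, `create_parser`, and `get_parser`, and the "kv" format, are assumptions. For brevity the sketch caches the parser when it is created rather than after the first parse.

```python
from typing import Callable, Dict

ParserFn = Callable[[str], Dict[str, str]]

# Stand-in for the designated global variable library, keyed by data format.
PARSER_LIBRARY: Dict[str, ParserFn] = {}
creations = {"count": 0}     # counts how many parsers were actually created


def create_parser(data_format: str) -> ParserFn:
    creations["count"] += 1
    if data_format == "kv":
        def parse(raw: str) -> Dict[str, str]:
            return dict(pair.split("=", 1) for pair in raw.split("&"))
        return parse
    raise ValueError(f"no parser known for format {data_format!r}")


def get_parser(data_format: str) -> ParserFn:
    parser = PARSER_LIBRARY.get(data_format)
    if parser is None:                       # not found: create and remember it
        parser = create_parser(data_format)
        PARSER_LIBRARY[data_format] = parser
    return parser


metadata1 = get_parser("kv")("uid=u1&ip=10.0.0.1")   # creates parser 1
metadata2 = get_parser("kv")("uid=u2&ip=10.0.0.2")   # reuses parser 1
```

After both calls, `creations["count"]` is 1: the second record is parsed with the parser created for the first.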

In one embodiment of the present invention, after deduplication statistics are performed on the stored data to be processed in step S130, the method further includes: according to the storage rule in the configuration information, saving the resulting deduplication statistics result to the corresponding storage medium, so that users can query the deduplication statistics result. The storage medium includes one or more of the following: a Redis database, a large-storage Redis database, a Mysql database, an HBase database, an HDFS database, and a GreenPlum database. Different storage media have different characteristics, and a suitable storage medium can be selected according to the storage requirements. For example, a Redis database stores data in memory in key-value form; when the data volume reaches a certain level, a disk-based large-storage Redis database can be used to share the storage load, or a distributed GreenPlum database can be used to share it. In this way both writing data to the storage medium and reading data from it remain fast, which safeguards the timeliness, effectiveness, and stability of the real-time computing platform.

In a specific example, before the statistical results are saved to the storage medium, they may also be aggregated to relieve the load on the storage medium. Alternatively, within the range allowed by the real-time requirements, a condition that triggers storage may be set: after a statistical result is obtained, it is not stored immediately but only once the trigger condition is satisfied, which likewise relieves the storage load.
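Both ideas can be sketched together under stated assumptions: aggregation is taken to mean keeping only the latest result per key, the trigger condition is taken to be a fixed number of buffered updates, and a plain list stands in for the storage medium. The class and key names are hypothetical.

```python
from typing import Dict, List


class BufferedResultWriter:
    """Buffers statistical results and writes them only when triggered."""

    def __init__(self, sink: List[Dict[str, int]], max_updates: int = 3) -> None:
        self.sink = sink                  # stand-in for the storage medium
        self.max_updates = max_updates    # assumed trigger condition
        self.pending: Dict[str, int] = {}
        self.updates = 0

    def submit(self, key: str, count: int) -> None:
        self.pending[key] = count         # aggregate: latest value per key wins
        self.updates += 1
        if self.updates >= self.max_updates:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.sink.append(dict(self.pending))   # one write for the whole batch
        self.pending = {}
        self.updates = 0


sink: List[Dict[str, int]] = []
writer = BufferedResultWriter(sink, max_updates=3)
writer.submit("uv:2016-06-30", 10)
writer.submit("uv:2016-06-30", 12)
writes_before_trigger = len(sink)
writer.submit("ip:2016-06-30", 7)         # third update satisfies the trigger
```

Three results reach the writer, but the storage medium receives a single write containing two keys. A time-based trigger would follow the same pattern.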

In one embodiment of the present invention, the data source includes one or more of the following: a Kafka data source, a Qbus data source, a Scribe data source, an Apache data source, and a Kestrel data source.

In one embodiment of the present invention, the real-time computing platform processes data based on the Storm computing framework or the Spark Streaming computing framework. In the prior art, the statistical processing of a real-time computing platform based on the Spark Streaming computing framework relies on the memory of the framework's own Spark cluster: after receiving data to be processed from a data source, the platform places that data directly into memory and then performs statistical processing on the data in memory. In this embodiment, by contrast, when the real-time computing platform processes data based on the Spark Streaming computing framework, storing the received data to be processed in the space-compressing data structure includes: storing the received data to be processed into memory in the space-compressing data structure.

Fig. 2 shows a schematic diagram of a data processing device of a real-time computing platform. As shown in Fig. 2, the data processing device 200 of the real-time computing platform includes:

A task receiving unit 210, adapted to receive a computing task and read the configuration information of the computing task.

A data receiving unit 220, adapted to receive, according to the data source information in the configuration information, the data to be processed that is input in real time from the corresponding data source; and adapted to determine whether the configuration information contains a deduplication statistical rule and, if so, to store the received data to be processed in a space-compressing data structure.

A data statistics unit 230, adapted to perform deduplication statistics on the stored data to be processed according to the deduplication statistical rule in the configuration information.

It can be seen that the device shown in Fig. 2 carries out the data processing flow on the real-time computing platform. For a computing task whose configuration information contains a deduplication statistical rule, after the data to be processed is received from the corresponding data source, this solution differs from the common prior-art approach of placing the received data directly into memory and performing statistical processing on the data in memory. Instead, it stores the received data in a space-compressing data structure and then performs deduplication statistics on the data so stored according to the deduplication statistical rule in the configuration information. This exploits the "deduplication" property of deduplication statistics: changing how the data to be processed is stored saves space, reduces the workload of the real-time computing platform, and improves its efficiency.

In one embodiment of the present invention, the space-compressing data structure includes the HyperLogLog data structure. The data statistics unit 230 is then adapted to obtain the deduplication statistics result of the data to be processed according to the deduplication statistical rule in the configuration information and the cardinality of the data to be processed in the HyperLogLog data structure.

In one embodiment of the present invention, the data receiving unit 220 is adapted to determine that the configuration information contains a deduplication statistical rule when the configuration information contains a rule that performs statistics with a unique identifier of the data as the index. Rules that perform statistics with a unique identifier of the data as the index include: a statistical rule for the number of unique visitors, a statistical rule for the number of unique IP addresses, and/or a statistical rule for unique search terms.
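The check performed by the data receiving unit can be sketched as follows. The configuration layout (a `statistics` list whose entries carry a `metric` name) and the metric names are hypothetical; the text does not define the configuration format.

```python
from typing import Any, Dict

# Assumed names for rules indexed by a unique identifier of the data.
DEDUP_METRICS = {"unique_visitors", "unique_ips", "unique_search_terms"}


def has_dedup_rule(config: Dict[str, Any]) -> bool:
    """Return True if any statistical rule counts by a unique identifier."""
    return any(
        rule.get("metric") in DEDUP_METRICS
        for rule in config.get("statistics", [])
    )


uv_task = {
    "source": "kafka",
    "statistics": [{"metric": "page_views"}, {"metric": "unique_visitors"}],
}
pv_task = {"source": "kafka", "statistics": [{"metric": "page_views"}]}

uses_compact_storage = has_dedup_rule(uv_task)
uses_plain_storage = not has_dedup_rule(pv_task)
```

Only tasks for which `has_dedup_rule` returns `True` would take the space-compressing storage path; other tasks would be handled as before.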

In one embodiment of the present invention, the real-time computing platform processes data based on the Storm computing framework or the Spark Streaming computing framework. The data receiving unit 220 is adapted, when the real-time computing platform processes data based on the Spark Streaming computing framework, to store the received data to be processed into memory in the space-compressing data structure.

Fig. 3 shows a schematic diagram of a data processing device of a real-time computing platform. As shown in Fig. 3, the data processing device 300 of the real-time computing platform includes: a task receiving unit 310, a data receiving unit 320, a data statistics unit 330, a data parsing unit 340, and a data storage unit 350.

The task receiving unit 310, the data receiving unit 320, and the data statistics unit 330 have the same functions as the task receiving unit 210, the data receiving unit 220, and the data statistics unit 230 shown in Fig. 2, respectively; the identical parts are not repeated here.

The data parsing unit 340 is adapted, before the data receiving unit determines whether the configuration information contains a deduplication statistical rule, to parse the fields in the data to be processed that satisfy the parse condition into metadata of a specified format according to the parse condition in the configuration information.

The data receiving unit 320 is adapted to store the metadata of the specified format in the specified data structure.

The data statistics unit 330 is adapted to perform deduplication statistics on the stored metadata of the specified format according to the deduplication statistical rule in the configuration information.

The data storage unit 350 is adapted to save the resulting deduplication statistics result to the corresponding storage medium according to the storage rule in the configuration information, so that users can query the deduplication statistics result.

In one embodiment of the present invention, the metadata of the specified format takes the form of key-value pairs, each consisting of a field and its field value.
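As a small illustration of the key-value form, the sketch below keeps only the fields that satisfy a parse condition. The parse condition is assumed here to be a list of wanted field names, and the function and field names are hypothetical.

```python
from typing import Dict, Iterable


def to_metadata(fields: Dict[str, str], wanted: Iterable[str]) -> Dict[str, str]:
    # Keep only the fields named by the (assumed) parse condition,
    # as field -> field value pairs.
    return {name: fields[name] for name in wanted if name in fields}


raw_fields = {"ip": "10.0.0.1", "uid": "u1", "agent": "curl/7.47", "query": "storm"}
metadata = to_metadata(raw_fields, ["uid", "query"])
```

Here `metadata` holds two key-value pairs, `uid` and `query`, which are the only values that later stages need to store and count.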

In one embodiment of the present invention, the data parsing unit 340 is adapted to determine, according to the format of the data to be processed, one or more parsing functions suited to the data to be processed; to create a parser corresponding to the data to be processed and dynamically register the one or more parsing functions in this parser; and to parse the fields in the data to be processed that satisfy the parse condition into metadata of the specified format by calling the created parser.

In one embodiment of the present invention, the data parsing unit 340 is further adapted to store multiple basic parsers in advance, each basic parser suited to one basic data format; and, when the format of the data to be processed is a single basic data format, to look up the basic parser suited to that basic data format among the pre-stored basic parsers and parse the fields in the data to be processed that satisfy the parse condition into metadata of the specified format by calling the basic parser that was found.

Further, the data parsing unit 340 is adapted, when the format of the data to be processed is a combination of multiple basic data formats, to look up, for each basic data format, the basic parser suited to it among the pre-stored basic parsers, and to parse the fields in the data to be processed that satisfy the parse condition into metadata of the specified format by calling the combination of the basic parsers that were found.

It should be noted that the embodiments of the devices shown in Fig. 2 and Fig. 3 correspond to the embodiments of the method shown in Fig. 1, which have been described in detail above and are not repeated here.

In summary, in the technical solution provided by the present invention, the method shown in Fig. 1 describes the data processing flow on a real-time computing platform. For a computing task whose configuration information contains a deduplication statistical rule, after the data to be processed is received from the corresponding data source, this solution differs from the common prior-art approach of placing the received data directly into memory and performing statistical processing on the data in memory. Instead, it stores the received data in a space-compressing data structure and then performs deduplication statistics on the data so stored according to the deduplication statistical rule in the configuration information. This exploits the "deduplication" property of deduplication statistics: changing how the data to be processed is stored saves space, reduces the workload of the real-time computing platform, and improves its efficiency.

It should be understood that

The algorithms and displays provided herein are not inherently related to any particular computer, virtual device, or other apparatus. Various general-purpose devices may also be used with the teachings herein. From the description above, the structure required to construct such devices is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the content of the invention described herein, and that the description of a specific language above is given to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to fall within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the data processing device of a real-time computing platform according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.

The invention discloses A1, a data processing method for a real-time computing platform, wherein the method includes:

A2, the method as described in A1, wherein the space-compressing data structure includes: a HyperLogLog data structure;

A3, the method as described in A1, wherein determining whether the configuration information contains a deduplication statistical rule includes:

A4, the method as described in A3, wherein the rule that performs statistics with a unique identifier of the data as the index includes:

A5, the method as described in A1, wherein, before determining whether the configuration information contains a deduplication statistical rule, the method further includes:

A6, the method as described in A5, wherein the metadata of the specified format takes the form of key-value pairs, each consisting of a field and its field value.

A7, the method as described in A5, wherein the method further includes: storing multiple basic parsers in advance, each basic parser suited to one basic data format;

A8, the method as described in A7, wherein parsing the fields in the data to be processed that satisfy the parse condition into metadata of the specified format further includes:

A9, the method as described in A5, wherein parsing the fields in the data to be processed that satisfy the parse condition into metadata of the specified format includes:

A10, the method as described in A1, wherein, after deduplication statistics are performed on the stored data to be processed, the method further includes:

A11, the method as described in A1, wherein the data source includes one or more of the following:

A12, the method as described in A1, wherein the real-time computing platform processes data based on the Storm computing framework or the Spark Streaming computing framework;

The invention also discloses B13, a data processing device for a real-time computing platform, wherein the device includes:

B14, the device as described in B13, wherein the space-compressing data structure includes: a HyperLogLog data structure;

B15, the device as described in B13, wherein

the data receiving unit is adapted to determine that the configuration information contains a deduplication statistical rule when the configuration information contains a rule that performs statistics with a unique identifier of the data as the index.

B16, the device as described in B15, wherein the rule that performs statistics with a unique identifier of the data as the index includes:

B17, the device as described in B13, wherein the device further includes:

B18, the device as described in B17, wherein the metadata of the specified format takes the form of key-value pairs, each consisting of a field and its field value.

B19, the device as described in B17, wherein

the data parsing unit is further adapted to store multiple basic parsers in advance, each basic parser suited to one basic data format; and, when the format of the data to be processed is a single basic data format, to look up the basic parser suited to that basic data format among the pre-stored basic parsers and parse the fields in the data to be processed that satisfy the parse condition into metadata of the specified format by calling the basic parser that was found.

B20, the device as described in B19, wherein

the data parsing unit is further adapted, when the format of the data to be processed is a combination of multiple basic data formats, to look up, for each basic data format, the basic parser suited to it among the pre-stored basic parsers, and to parse the fields in the data to be processed that satisfy the parse condition into metadata of the specified format by calling the combination of the basic parsers that were found.

B21, the device as described in B17, wherein

the data parsing unit is adapted to determine, according to the format of the data to be processed, one or more parsing functions suited to the data to be processed; to create a parser corresponding to the data to be processed and dynamically register the one or more parsing functions in this parser; and to parse the fields in the data to be processed that satisfy the parse condition into metadata of the specified format by calling the created parser.

B22, the device as described in B13, wherein the device further includes:

B23, the device as described in B13, wherein the data source includes one or more of the following:

B24, the device as described in B13, wherein the real-time computing platform processes data based on the Storm computing framework or the Spark Streaming computing framework;

Claims

1. A data processing method for a real-time computing platform, wherein the method includes:

2. The method of claim 1, wherein the space-compressing data structure includes: a HyperLogLog data structure;

and performing deduplication statistics on the stored data to be processed according to the deduplication statistical rule in the configuration information then includes: obtaining the deduplication statistics result of the data to be processed according to the deduplication statistical rule in the configuration information and the cardinality of the data to be processed in the HyperLogLog data structure.

3. The method of claim 1, wherein determining whether the configuration information contains a deduplication statistical rule includes:

4. The method of claim 3, wherein the rule that performs statistics with a unique identifier of the data as the index includes:

5. The method of claim 1, wherein, before determining whether the configuration information contains a deduplication statistical rule, the method further includes:

parsing, according to the parse condition in the configuration information, the fields in the data to be processed that satisfy the parse condition into metadata of a specified format;

storing the received data to be processed in the space-compressing data structure then includes: storing the metadata of the specified format in the specified data structure;

and performing deduplication statistics on the stored data to be processed according to the deduplication statistical rule in the configuration information then includes: performing deduplication statistics on the stored metadata of the specified format according to the deduplication statistical rule in the configuration information.

6. A data processing device for a real-time computing platform, wherein the device includes:

a data receiving unit, adapted to receive, according to the data source information in the configuration information, the data to be processed that is input in real time from the corresponding data source; and adapted to determine whether the configuration information contains a deduplication statistical rule and, if so, to store the received data to be processed in a space-compressing data structure;

a data statistics unit, adapted to perform deduplication statistics on the stored data to be processed according to the deduplication statistical rule in the configuration information.

7. The device of claim 6, wherein the space-compressing data structure includes: a HyperLogLog data structure;

the data statistics unit is then adapted to obtain the deduplication statistics result of the data to be processed according to the deduplication statistical rule in the configuration information and the cardinality of the data to be processed in the HyperLogLog data structure.

8. The device of claim 6, wherein

the data receiving unit is adapted to determine that the configuration information contains a deduplication statistical rule when the configuration information contains a rule that performs statistics with a unique identifier of the data as the index.

9. The device of claim 8, wherein the rule that performs statistics with a unique identifier of the data as the index includes:

10. The device of claim 6, wherein the device further includes:

a data parsing unit, adapted, before the data receiving unit determines whether the configuration information contains a deduplication statistical rule, to parse the fields in the data to be processed that satisfy the parse condition into metadata of a specified format according to the parse condition in the configuration information;

the data statistics unit is then adapted to perform deduplication statistics on the stored metadata of the specified format according to the deduplication statistical rule in the configuration information.