CN102214236B

CN102214236B - Method and system for processing mass data

Info

Publication number: CN102214236B
Application number: CN 201110182296
Authority: CN
Inventors: 祝博立
Original assignee: Beijing Feinno Communication Technology Co Ltd
Current assignee: Beijing Feinno Communication Technology Co Ltd
Priority date: 2011-06-30
Filing date: 2011-06-30
Publication date: 2013-10-23
Anticipated expiration: 2031-06-30
Also published as: CN102214236A

Abstract

The invention discloses a method for processing mass data. The method comprises the following steps that: a scheduling module judges whether to call a data warehouse operation statement (HQL) according to acquired current service information and a predetermined scheduling strategy, acquires a calling sequence according to the acquired current service information and the predetermined scheduling strategy if the HQL is called, and calls the HQL to a data warehouse platform according to the calling sequence; and the data warehouse platform reads configuration information which corresponds to a data warehouse from a relational database, triggers the HQL to perform operation on data stored in a distributed platform according to the calling sequence, generates result data and stores the result data into the distributed platform. The invention also discloses a system for processing the mass data. By the method and the system provided by the invention, the flexibility of processing of the mass data can be improved.

Description

A mass data processing method and system

Technical field

The present invention relates to data processing technology, and in particular to a mass data processing method and system.

Background technology

With the rapid development of Internet technology, the number of Internet users has increased sharply, and so has the demand for processing Internet user data, including collection, cleaning, statistics, and analysis. At the same time, the volume of Internet user data is growing explosively, which further increases the pressure on such processing.

At present, mass Internet user data is processed by combining distributed platform (Hadoop) technology with data warehouse platform (Hive) technology. The mass data is stored on the distributed platform, and data warehouse action statements (HQL) are invoked through console commands to perform statistics, analysis, and other processing on the stored data. This method offers poor flexibility in how the commands are invoked.

Summary of the invention

The invention provides a mass data processing method that improves the flexibility of mass data processing.

The invention also provides a mass data processing system that improves the flexibility of mass data processing.

To achieve the above objects, the technical solution of the present invention is implemented as follows:

The invention discloses a mass data processing method, comprising:

A scheduler module judges, according to obtained current business information and a preset scheduling strategy, whether to call a data warehouse action statement, and if so, obtains a call order according to the obtained current business information and the preset scheduling strategy;

The scheduler module calls the data warehouse action statement on a data warehouse platform according to the call order;

The data warehouse platform reads configuration information corresponding to the data warehouse action statement from a relational database;

The data warehouse platform triggers the data warehouse action statement, according to the call order, to perform computation on data stored on a distributed platform, generates result data, and stores the result data on the distributed platform.

After the result file is generated and stored on the distributed platform, the method further comprises:

The scheduler module controls the distributed platform to import the result data into the relational database;

The scheduler module controls a cache module to extract commonly used result data from the relational database according to a preset display strategy;

A data exhibiting platform reads the commonly used result data from the cache module and displays it.

After the data exhibiting platform reads the commonly used result file from the cache module and displays it, the method further comprises:

The data exhibiting platform reads the result data from the relational database and displays it.

Before the scheduler module judges, according to the obtained current business information and the preset scheduling strategy, whether to call the data warehouse action statement, the method further comprises:

A data access platform transmits data to the distributed platform at least once;

Each time a transmission completes, the data access platform sends a data-transmission-complete message to a message interface module;

The scheduler module obtains the data-transmission-complete message from the message interface module at least once and uses it as the current business information.

The data access platform sending the data-transmission-complete message to the message interface module comprises:

The data access platform sends the data-transmission-complete message to the message interface module using Google's message transmission scheme, protoBuffer.

The invention discloses a mass data processing system, comprising:

A scheduler module, configured to judge, according to obtained current business information and a preset scheduling strategy, whether to call a data warehouse action statement; if so, to obtain a call order according to the obtained current business information and the preset scheduling strategy; and to call the data warehouse action statement on a data warehouse platform according to the call order;

The data warehouse platform, configured to read configuration information corresponding to the data warehouse action statement from a relational database, trigger the data warehouse action statement according to the call order to perform computation on data stored on a distributed platform, generate result data, and store the result data on the distributed platform;

The relational database, configured to store the configuration information corresponding to the data warehouse action statement;

The distributed platform, configured to store the data and the result data.

The scheduler module is further configured to control the distributed platform to import the result data into the relational database, and to control a cache module to extract commonly used result data from the relational database according to a preset display strategy;

The system further comprises:

The cache module, configured to cache the commonly used result data;

A data exhibiting platform, configured to read the commonly used result data from the cache module and display it.

The data exhibiting platform is further configured to read the result data from the relational database and display it.

The system further comprises:

A data access platform, configured to transmit data to the distributed platform at least once and, each time a transmission completes, to send a data-transmission-complete message to a message interface module;

The message interface module, configured to receive the data-transmission-complete message;

The scheduler module is further configured to obtain the data-transmission-complete message from the message interface module at least once and use it as the current business information.

The data access platform is specifically configured to send the data-transmission-complete message to the message interface module using Google's message transmission scheme, protoBuffer.

As can be seen from the above, a scheduler module is added to the mass data processing system. According to current business information and a preset scheduling strategy, this module determines whether to call data warehouse action statements and in what call order, and data processing is completed under its control. This avoids issuing commands one by one from a console, as in existing mass data processing systems. Because processing is controlled by the scheduler module, the scheduling strategy and call order can be configured flexibly to match the logic of the business to be implemented, which improves the flexibility of mass data processing.

Description of drawings

Fig. 1 is a flowchart of the mass data processing method of Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the mass data processing method of Embodiment 2 of the present invention;

Fig. 3 is a schematic structural diagram of the mass data processing system of Embodiment 3 of the present invention.

Embodiment

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.

The basic idea of the present invention is to add a scheduler module to the mass data processing system. According to current business information and a preset scheduling strategy, this module determines whether to call data warehouse action statements and in what call order, and data processing is completed under the control of the scheduler module.

Fig. 1 is a flowchart of the mass data processing method of Embodiment 1 of the present invention. As shown in Fig. 1, the method comprises at least the following steps.

Step 101: The scheduler module judges, according to the obtained current business information and the preset scheduling strategy, whether to call the data warehouse action statement, and if so, obtains a call order according to the obtained current business information and the preset scheduling strategy.

Step 102: The scheduler module calls the data warehouse action statement on the data warehouse platform according to the call order.

Step 103: The data warehouse platform reads configuration information corresponding to the data warehouse action statement from a relational database (MySQL).

Step 104: The data warehouse platform triggers the data warehouse action statement, according to the call order, to perform computation on the data stored on the distributed platform, generates result data, and stores the result data on the distributed platform.
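Steps 101 to 104 can be sketched as a single control loop. The following is a minimal illustration only, not the implementation described by the invention: the function name `run_pipeline`, the dictionary standing in for the relational database, and the `execute` callback standing in for the data warehouse platform are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Optional


def run_pipeline(
    business_info: List[str],
    should_call: Callable[[List[str]], bool],            # hypothetical scheduling strategy
    pick_call_order: Callable[[List[str]], List[str]],   # hypothetical call-order lookup
    config_store: Dict[str, dict],                       # stand-in for the relational database
    execute: Callable[[str, dict], str],                 # stand-in for the data warehouse platform
) -> Optional[List[str]]:
    """Run steps 101-104 once; return the stored results, or None if nothing was called."""
    # Step 101: decide whether to call, then obtain the call order.
    if not should_call(business_info):
        return None
    call_order = pick_call_order(business_info)

    results: List[str] = []  # stand-in for result data stored on the distributed platform
    # Step 102: call the statements in the given order.
    for statement_name in call_order:
        # Step 103: read the configuration information for this statement.
        config = config_store[statement_name]
        # Step 104: trigger the statement and store its result.
        results.append(execute(statement_name, config))
    return results
```

In this sketch the decision and the ordering are passed in as callables, which mirrors the idea that both are configured in the scheduler module rather than typed at a console.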

Fig. 2 is a flowchart of the mass data processing method of Embodiment 2 of the present invention. As shown in Fig. 2, the method comprises the following steps.

Step 201: The data access platform transmits data to the distributed platform at least once.

In this step, in a preferred implementation, the data access platform periodically transfers the data it has received to the distributed platform. The distributed platform supports receiving data from peripheral systems, organizing it, computing on it, and distributing the computation results to systems such as a reporting system. Specifically, the distributed platform is a data storage platform under the open-source Apache foundation, composed of components such as a distributed file system (HDFS) and distributed file processing, which are its two most basic and most important components.

The distributed file system (HDFS) is an open-source version of the Google File System (GFS). It is a highly fault-tolerant distributed file system that provides high-throughput data access and is suited to storing large files in massive volume, for example PB-scale data in files larger than 64 MB. It splits a large file into N small files distributed across different machines, and the number of backups can be configured, so the system keeps working when some machines fail.

Distributed file processing is a powerful tool for large-scale computation, for example on TB-scale data. It comprises a distributed data extraction (Map) module and a distributed data processing (Reduce) module. The distributed data extraction module is responsible for breaking the data up; the distributed data processing module is responsible for gathering the data together. A user only needs to implement the two interfaces, distributed data extraction and distributed data processing, to complete a TB-scale computation. Distributed file processing can be applied to data analysis such as log analysis and data mining, and also to scientific computation such as calculating pi.
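The division of labor between distributed data extraction (Map) and distributed data processing (Reduce) can be illustrated with a small single-process sketch. This is not the Hadoop API; the function names and the word-count task are assumptions chosen only to show the "break up, then gather" pattern.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: break each record up into (key, 1) pairs."""
    pairs: List[Tuple[str, int]] = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    return pairs


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: gather the pairs by key and sum their values."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

On a real distributed platform the map and reduce phases would run on many machines in parallel; here they run in one process so the data flow is easy to follow.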

Step 202: Each time a transmission completes, the data access platform sends a data-transmission-complete message to the message interface module.

In this step, each time the data access platform finishes transmitting data to the distributed platform, it sends a data-transmission-complete message to the message interface module; through this message, the fact that the transmission has completed is synchronized to the application system of the data platform. In a preferred implementation, the data access platform sends the message to the message interface module using protoBuffer, a message transmission scheme from Google.

Step 203: The scheduler module obtains the data-transmission-complete message from the message interface module at least once and uses it as the current business information.

In this step, for example, if the data access platform has transmitted data to the distributed platform three times, the scheduler module correspondingly obtains three data-transmission-complete messages from the message interface module and uses these three messages as the current business information.
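The message flow of steps 201 to 203 can be sketched with an in-process queue standing in for the message interface module. This is an assumption-laden illustration: the `TransferComplete` fields and both function names are hypothetical, and a plain Python dataclass is used in place of the protoBuffer encoding mentioned above.

```python
import queue
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TransferComplete:
    """Hypothetical data-transmission-complete message."""
    source: str  # which kind of data was transferred
    path: str    # where it was written on the distributed platform


def notify_complete(interface: queue.Queue, source: str, path: str) -> None:
    """Data access platform side: report one finished transfer."""
    interface.put(TransferComplete(source, path))


def collect_business_info(interface: queue.Queue) -> List[TransferComplete]:
    """Scheduler side: drain all pending messages as the current business information."""
    messages: List[TransferComplete] = []
    while True:
        try:
            messages.append(interface.get_nowait())
        except queue.Empty:
            return messages
```

With three transfers reported, `collect_business_info` returns three messages, matching the example in the text.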

Step 204: The scheduler module judges, according to the obtained current business information and the preset scheduling strategy, whether to call the data warehouse action statement. If yes, step 205 is executed; if no, the method returns to step 201.

In this step, the scheduling strategy is set in advance in the scheduler module. The scheduling strategy indicates the trigger condition for calling the data warehouse action statement: if the current business information satisfies the condition defined by the scheduling strategy, the scheduler module decides to call the data warehouse action statement; otherwise, it decides not to call it. For example, the data received by the data access platform covers several aspects and is imported into the distributed platform in several batches, so the scheduler module obtains several data-transmission-complete messages from the message interface module and judges from them whether to call the data warehouse action statement. Under the scheduling strategy, the statement is not called when only some of the data-transmission-complete messages have been received. Only after the data for all of these aspects has been completely imported into the distributed platform, and all data-transmission-complete messages have been received, does the scheduler module decide to begin calling the data warehouse action statement to perform the computation.
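The trigger condition in the example above, "call only once every aspect of the data has arrived", can be expressed as a set comparison. The sketch below is one possible reading, under the assumption that each data-transmission-complete message names the data source it belongs to; the function and parameter names are hypothetical.

```python
from typing import Iterable, Set


def should_call_hql(required_sources: Set[str], completed_messages: Iterable[str]) -> bool:
    """Return True only when every required data source has reported completion."""
    return required_sources.issubset(set(completed_messages))
```

Repeated messages from the same source do not matter in this sketch; only whether each required source has reported at least once.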

Step 205: The scheduler module obtains the call order according to the obtained current business information and the preset scheduling strategy.

In this step, the computation comprises many steps. Some steps have no logical relationship to one another, while others must be executed in a particular order, so the data warehouse action statements are called in a specific call order to perform the computation. The call order is set in advance in the scheduler module. Several call orders can be preset in the scheduler module, and the scheduler module selects the appropriate one according to the obtained current business information and the preset scheduling strategy.
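Selecting one of several preset call orders, and checking that an order respects the dependencies between steps, might look like the following sketch. The statement names, the `business_type` key, and the dependency map are all hypothetical; the source only states that multiple call orders are preset and one is selected.

```python
from typing import Dict, List, Set

# Hypothetical preset call orders, keyed by a business type.
PRESET_CALL_ORDERS: Dict[str, List[str]] = {
    "daily_report": ["clean_logs", "count_active_users", "build_report"],
    "hourly_stats": ["clean_logs", "count_messages"],
}


def select_call_order(business_type: str) -> List[str]:
    """Pick the preset call order that matches the current business type."""
    return list(PRESET_CALL_ORDERS[business_type])


def respects_dependencies(order: List[str], depends_on: Dict[str, Set[str]]) -> bool:
    """Check that every statement appears after all statements it depends on."""
    seen: Set[str] = set()
    for name in order:
        if not depends_on.get(name, set()).issubset(seen):
            return False
        seen.add(name)
    return True
```

Steps with no entry in `depends_on` are treated as having no ordering constraint, which corresponds to steps that have no logical relationship to one another.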

Step 206: The scheduler module calls the data warehouse action statement on the data warehouse platform according to the call order.

Step 207: The data warehouse platform reads configuration information corresponding to the data warehouse action statement from the relational database.

In this step, the data warehouse platform is a Structured Query Language (SQL) parsing engine. It translates SQL statements into distributed data extraction/distributed data processing jobs, which are then executed on the distributed platform, for the purpose of rapid development. A table stored in the data warehouse platform is a directory on the distributed platform. Specifically, by default the data warehouse platform stores tables under the data warehouse directory of the current working directory, with one folder per table name; if a table is partitioned, each partition value is a sub-folder. This data can be used directly in other distributed data extraction/distributed data processing jobs.

The data warehouse platform can be associated with the relational database. The file or directory that a data warehouse action statement operates on is mapped to table name information stored in the relational database, and the fields in the file are mapped to field information of the table to be operated on, which is also stored in the relational database. The table name information and field information obtained from this mapping serve as the configuration information of the data warehouse action statement. When the data warehouse platform receives a command to call a data warehouse action statement for computation, it parses the command, reads the configuration information of the called statement from the relational database, and, according to this configuration information, translates the statement into a distributed data extraction/distributed data processing procedure that performs the statistical computation.
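The mapping from a table to a directory, with partition values as sub-folders, can be sketched as below. The `StatementConfig` record and its fields are hypothetical stand-ins for the configuration information kept in the relational database, and the `column=value` folder naming follows Hive's usual layout rather than anything stated in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StatementConfig:
    """Hypothetical configuration record for one data warehouse action statement."""
    table_name: str
    directory: str  # directory on the distributed platform that backs the table
    fields: List[str] = field(default_factory=list)


def partition_path(config: StatementConfig, partition: Dict[str, str]) -> str:
    """Build the sub-folder path for a partition, one level per partition column."""
    parts = [f"{column}={value}" for column, value in partition.items()]
    return "/".join([config.directory.rstrip("/")] + parts)
```

For an unpartitioned table the function simply returns the table's own directory.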

Step 208: The data warehouse platform triggers the data warehouse action statement, according to the call order, to perform computation on the data stored on the distributed platform, generates result data, and stores the result data on the distributed platform.

Step 209: The scheduler module controls the distributed platform to import the result data into the relational database.

In this step, specifically, the scheduler module uses an import algorithm to read, from the distributed platform, the result data generated by the data warehouse computation; this result data may be stored in the form of a result file. The scheduler module then imports the result data into multiple data tables in the relational database according to business requirements.

Step 210: The scheduler module controls the cache module to extract commonly used result data from the relational database according to a preset display strategy.

In this step, the display strategy is set in advance in the scheduler module and indicates which data is commonly used by the data exhibiting platform. According to this strategy, the scheduler module has the result data that is commonly used by the data exhibiting platform extracted from the result data stored in the relational database into the cache module. Specifically, the cache module can use memory cache (memcache) technology, a high-performance distributed in-memory object caching system that stores data of various formats, including images, video, files, and database query results, by maintaining a single large hash table in memory. The cache module is distributed, meaning it can be accessed simultaneously by multiple users on different hosts. This overcomes the drawback that shared memory is limited to a single machine, reduces the pressure of database queries, and speeds up data access.
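The extraction in step 210 can be sketched with plain dictionaries standing in for both the relational database and the memory cache. This is only an illustration under stated assumptions: `refresh_cache` and `common_keys` are hypothetical names, and a real deployment would use a memcache client rather than a local dictionary.

```python
from typing import Dict, Iterable


def refresh_cache(
    database: Dict[str, object],  # stand-in for result tables in the relational database
    common_keys: Iterable[str],   # hypothetical display strategy: keys shown most often
    cache: Dict[str, object],     # stand-in for the memory cache
) -> int:
    """Copy commonly used results from the database into the cache; return how many were copied."""
    copied = 0
    for key in common_keys:
        if key in database:
            cache[key] = database[key]
            copied += 1
    return copied
```

Keys named by the strategy but absent from the database are skipped, so the cache never holds entries without a backing result.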

Step 211: The data exhibiting platform reads the commonly used result data from the cache module and displays it.

In this step, the data exhibiting platform obtains its commonly used data by reading the result data from the cache module and displays it once obtained. Data that the data exhibiting platform does not commonly use cannot be read from the cache module, so step 212 below is executed for it.

Step 212: The data exhibiting platform reads the result data from the relational database and displays it.

In this step, for data that is not commonly used, for example data that requires dynamic transformation or querying, the data exhibiting platform obtains the result data by reading it from the relational database and displays it once obtained.
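The read path of steps 211 and 212 amounts to trying the cache first and falling back to the relational database on a miss. A minimal sketch, again with dictionaries as stand-ins for both stores and with a hypothetical function name:

```python
from typing import Dict, Optional


def read_result(key: str, cache: Dict[str, object], database: Dict[str, object]) -> Optional[object]:
    """Read from the cache first; fall back to the relational database on a miss."""
    if key in cache:
        return cache[key]
    return database.get(key)
```

A key found in neither store yields `None`, leaving it to the caller to decide how to present missing data.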

Fig. 3 is a schematic structural diagram of the mass data processing system of Embodiment 3 of the present invention. As shown in Fig. 3, the system comprises at least a scheduler module 31, a data warehouse platform 32, a relational database 33, and a distributed platform 34. On this basis, it may further comprise a data access platform 35, a message interface module 36, a cache module 37, and a data exhibiting platform 38. The message interface module 36 and the scheduler module 31 may both be located in the application system. For the processing and flow carried out by each component, see the descriptions of Embodiment 1 and Embodiment 2.

The scheduler module 31 judges, according to the obtained current business information and the preset scheduling strategy, whether to call the data warehouse action statement; if so, it obtains a call order according to the obtained current business information and the preset scheduling strategy, and calls the data warehouse action statement on the data warehouse platform 32 according to the call order.

The data warehouse platform 32 reads configuration information corresponding to the data warehouse action statement from the relational database 33, triggers the data warehouse action statement according to the call order to perform computation on the data stored on the distributed platform 34, generates result data, and stores the result data on the distributed platform 34.

The relational database 33 stores the configuration information corresponding to the data warehouse action statement.

The distributed platform 34 stores the above data and the above result data.

On the basis of the above technical solution, when the system includes the data access platform 35 and the message interface module 36, the data access platform 35 transmits data to the distributed platform 34 at least once and, each time a transmission completes, sends a data-transmission-complete message to the message interface module 36. The message interface module 36 receives the data-transmission-complete message. The scheduler module 31 obtains the data-transmission-complete message from the message interface module 36 at least once and uses it as the current business information. Specifically, the data access platform 35 may send the message to the message interface module 36 using a message transmission scheme from Google, for example protoBuffer. The data access platform 35 handles data access for peripheral systems and supports real-time interface access. It writes the data it receives into text files in a specified format, for example txt files, and periodically transfers these text files to the HDFS file system of the distributed platform 34.

On the basis of the above technical solution, when the system includes the cache module 37, the scheduler module 31 also controls the distributed platform 34 to import the result data into the relational database 33, and controls the cache module 37 to extract commonly used result data from the relational database 33 according to the preset display strategy. The cache module 37 caches the commonly used result data.

The data exhibiting platform 38 is the interface that displays the final, organized result data of this data processing system. It has two data sources: first, the cache module 37; second, the relational database 33. Specifically, the data exhibiting platform 38 reads commonly used result data from the cache module 37 and displays it, and it also reads result data from the relational database 33 and displays it.

As can be seen from the above embodiments, a scheduler module is added to the mass data processing system. According to current business information and a preset scheduling strategy, this module determines whether to call data warehouse action statements and in what call order, and data processing is completed under its control. This avoids issuing commands one by one from a console, as in existing mass data processing systems. Because processing is controlled by the scheduler module, the scheduling strategy and call order can be configured flexibly to match the logic of the business to be implemented, which improves the flexibility of mass data processing. In addition, commonly used result data is stored in the cache module, and the data exhibiting platform preferentially reads result data from the cache module and displays it; only when the required result data is not stored in the cache module does the data exhibiting platform read from the database. Adding the cache module thus reduces the pressure that a large volume of access places on the data exhibiting platform.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A mass data processing method, characterized in that a scheduler module is added to a mass data processing system, the method comprising:

A data access platform transmits data to a distributed platform at least once; each time a transmission completes, the data access platform sends a data-transmission-complete message to a message interface module; the scheduler module obtains the data-transmission-complete message from the message interface module at least once and uses it as current business information;

The scheduler module judges, according to the obtained current business information and a preset scheduling strategy, whether to call a data warehouse action statement, and if so, obtains a call order according to the obtained current business information and the preset scheduling strategy; the scheduling strategy is set in advance in the scheduler module; the obtained call order is set in advance in the scheduler module, in which a plurality of call orders are preset;

2. The mass data processing method according to claim 1, characterized in that, after the result file is generated and stored on the distributed platform, the method further comprises:

3. The mass data processing method according to claim 2, characterized in that, after the data exhibiting platform reads the commonly used result file from the cache module and displays it, the method further comprises:

4. The mass data processing method according to any one of claims 1 to 3, characterized in that the data access platform sending the data-transmission-complete message to the message interface module comprises:

5. A mass data processing system, characterized in that a scheduler module is added to the mass data processing system, the system comprising:

A message interface module, configured to receive the data-transmission-complete message;

A scheduler module, configured to obtain the data-transmission-complete message from the message interface module at least once and use it as current business information; to judge, according to the obtained current business information and a preset scheduling strategy, whether to call a data warehouse action statement; if so, to obtain a call order according to the obtained current business information and the preset scheduling strategy; and to call the data warehouse action statement on a data warehouse platform according to the call order; wherein the scheduling strategy is set in advance in the scheduler module, and the obtained call order is set in advance in the scheduler module, in which a plurality of call orders are preset;

6. The mass data processing system according to claim 5, characterized in that,

The system further comprises:

7. The mass data processing system according to claim 6, characterized in that,

8. The mass data processing system according to any one of claims 5 to 7, characterized in that,