CN103559036A

CN103559036A - Data batch processing system and method based on Hadoop

Info

Publication number: CN103559036A
Application number: CN201310538259.7A
Authority: CN
Inventors: 王欢龙
Original assignee: Beijing Zhongsou Network Technology Co., Ltd.
Current assignee: Beijing Zhongsou Network Technology Co., Ltd.
Priority date: 2013-11-04
Filing date: 2013-11-04
Publication date: 2014-02-05

Abstract

The invention provides a Hadoop-based data batch processing system and method. The system comprises an application interface module, a task scheduling module, a task execution module and a result collection module. The application interface module comprises a Push interface and a Pop interface; the task scheduling module schedules the tasks received through the Push interface; the task execution module processes the tasks; and the result collection module parses and pushes the processing results of the tasks. With the system and method, developers can meet big data processing requirements by invoking just two simple interfaces, so they can concentrate on flow control before and after data processing and on other development requirements. Mastery of Hadoop is not needed, and workload and development time are greatly reduced.

Description

A Hadoop-based data batch processing system and method

Technical field

The invention belongs to the field of data processing, and specifically relates to a Hadoop-based data batch processing system and method.

Background art

We live in an era of exploding Internet information. Big data processing is ubiquitous, and how to process big data better and faster has become a major challenge for all Internet companies.

At present, Hadoop is one of the most widely used big data processing technologies. It has become a popular open-source data processing framework and is practically synonymous with big data. Hadoop is implemented using MapReduce and a distributed file system (HDFS). MapReduce splits an application into many small task fragments for execution. To guarantee the reliability of data processing, HDFS creates multiple replicas of each data block and places them on the computing nodes of the cluster, and MapReduce processes the data on the nodes where the replicas are stored.
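
The split-then-combine idea behind MapReduce can be illustrated with a small word count. This is only a single-process sketch in plain Python, not the Hadoop API: the function names `map_phase`, `reduce_phase` and `word_count` are hypothetical, and on a real cluster each chunk would be mapped on a node that holds a replica of that data block.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]


def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(chunks):
    """Run the map step on every chunk, then reduce all emitted pairs.

    The chunks are processed sequentially here; a cluster would run the
    map step for each chunk in parallel on different nodes.
    """
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    return reduce_phase(mapped)
```

Calling `word_count(["a b a", "b c"])` counts the words across both chunks as if they were two data blocks of one file.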

Existing batch processing systems run on a single machine and cannot achieve real-time, highly efficient data processing. Developing a new distributed batch processing system would consume a great deal of time and effort, so from a project development perspective, using Hadoop technology is worth considering.

However, Hadoop's big data processing advantages only become apparent once the data volume reaches a certain scale. In general, using Hadoop directly requires two things: designing and developing a source data collection module and a result data parsing module. Although Hadoop's big data processing capability is very powerful, it is still quite difficult for a developer focused on applications to use Hadoop successfully and correctly. Doing so requires a great deal of time and experience, especially for understanding Hadoop, setting up and configuring it, and starting and building HDFS, libhdfs and so on. All of this greatly affects the use of Hadoop and the development cycle of the project.

Summary of the invention

To overcome the above deficiencies of the prior art, the invention provides a Hadoop-based data batch processing system and method.

To achieve the above object, the invention adopts the following technical solutions:

In one aspect, the invention provides a Hadoop-based data batch processing system, characterized in that the system comprises:

an application interface module, comprising a Push interface and a Pop interface;

a task scheduling module, for scheduling the tasks received through the Push interface;

a task execution module, for processing the tasks; and

a result collection module, for parsing and pushing the processing results of the tasks.

Preferably, the Push interface is used to receive tasks, and the Pop interface is used to obtain the system's execution results for the tasks.

Preferably, scheduling the tasks received through the Push interface comprises: obtaining the tasks, sorting and packing them, writing them to a cache file, and invoking the task execution module according to the file size to generate a result file.

Preferably, the task execution module comprises a Hadoop module and a local execution module; the Hadoop module comprises the programs and scripts that invoke Hadoop; the local execution module executes the Done function locally and returns the result data.

Preferably, invoking the task execution module according to the file size comprises: if the number of data items in the cache file is greater than 2000 or the file size is greater than 1G, invoking the Hadoop module of the task execution module; otherwise, invoking the local execution module of the task execution module.
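
The selection rule in the preceding paragraph can be sketched as a small function. This is a minimal illustration under stated assumptions: the name `choose_executor` and the constant names are hypothetical, and "1G" is read here as 1 GiB, which the text does not specify.

```python
# Thresholds taken from the text; the names are hypothetical.
MAX_LOCAL_RECORDS = 2000           # "greater than 2000" data items goes to Hadoop
MAX_LOCAL_BYTES = 1 * 1024 ** 3    # assumption: "1G" interpreted as 1 GiB


def choose_executor(record_count, file_size_bytes):
    """Return "hadoop" or "local" for a cache file of the given size.

    Both comparisons are strict, matching "greater than" in the text, so a
    file with exactly 2000 records and exactly 1 GiB is still run locally.
    """
    if record_count > MAX_LOCAL_RECORDS or file_size_bytes > MAX_LOCAL_BYTES:
        return "hadoop"
    return "local"
```

Keeping the rule in one pure function makes the thresholds easy to tune, since the break-even point between local execution and a Hadoop job depends on the cluster.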

In another aspect, the invention provides a Hadoop-based data batch processing method, characterized in that the method comprises the following steps:

A. the Push interface receives a task and submits it to the task scheduling module;

B. the task scheduling module sorts and packs the task, writes it to a cache file, obtains the cache file size, and hands the cache file to the task execution module;

C. the task execution module processes the cache file and generates a result file;

D. the result collection module parses the result file and pushes it to the Pop interface.

Preferably, step C comprises: if the number of data items in the cache file is greater than 2000 or the file size is greater than 1G, the Hadoop module of the task execution module is invoked to perform the processing and generate the result file; otherwise, the local execution module of the task execution module is invoked to perform the processing and generate the result file.

Step C comprises: if the task scheduling module detects that the Hadoop module cannot be used normally, the local execution module of the task execution module is invoked to perform the processing and generate the result file.

Preferably, the local execution module reads the current cache file through a Done thread and calls the Done function to obtain the result data.
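
A Done thread of this kind might look like the sketch below. The text does not define the cache file format or the signature of the Done function, so this assumes one record per line and a `done` callable that maps one record to one result; `run_done_thread` and the `results` queue are hypothetical names.

```python
import queue
import threading


def run_done_thread(cache_path, done, results):
    """Start a thread that applies `done` to every record in a cache file.

    Assumption: the cache file is UTF-8 text with one record per line.
    Each result is put on `results`, a queue.Queue standing in for the
    result output queue. The started thread is returned so that the
    caller can join it.
    """
    def worker():
        with open(cache_path, encoding="utf-8") as cache_file:
            for line in cache_file:
                record = line.rstrip("\n")
                if record:  # skip blank lines
                    results.put(done(record))

    thread = threading.Thread(target=worker, name="done-thread")
    thread.start()
    return thread
```

The caller keeps the returned thread object, joins it when the results are needed, and then drains the queue.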

Compared with the prior art, the beneficial effects of the invention are:

With the invention, a developer only needs to call two simple interfaces to meet big data processing requirements. The developer can focus on flow control before and after data processing and on other development requirements without having to master Hadoop, which greatly reduces workload and development time.

Brief description of the drawings

Fig. 1 is a structural diagram of the batch processing system of the invention;

Fig. 2 is a usage flowchart of the batch processing system of the invention;

Fig. 3 is a detailed flowchart of the batch processing method of the invention.

Embodiment

The invention is described in further detail below with reference to the accompanying drawings.

The ideal way to process big data is to provide two interfaces, Push and Pop: the developer only needs to call the Push interface to supply source data to the batch processing system and call the Pop interface to retrieve the processing results, as shown in Fig. 2.
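
From the developer's point of view, the two-interface idea might look like the following sketch. The class `BatchSystem`, its method names, and the `run_pending` helper are hypothetical: `run_pending` merely stands in for the internal scheduling, execution and collection modules so that the example is self-contained.

```python
import queue


class BatchSystem:
    """Hypothetical facade exposing only push() and pop() to the developer."""

    def __init__(self, done):
        self._done = done              # processing function applied to each task
        self._input = queue.Queue()    # source data input queue
        self._output = queue.Queue()   # result output queue

    def push(self, task):
        """Push interface: hand one task to the batch processing system."""
        self._input.put(task)

    def pop(self):
        """Pop interface: return the next result, or None if none is ready."""
        try:
            return self._output.get_nowait()
        except queue.Empty:
            return None

    def run_pending(self):
        """Stand-in for the internal modules: process every queued task."""
        while True:
            try:
                task = self._input.get_nowait()
            except queue.Empty:
                break
            self._output.put(self._done(task))
```

A caller would construct `BatchSystem(done)`, call `push` for each task, and later call `pop` until it returns `None`. Whether a task ran on Hadoop or locally is invisible at this level.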

Based on the above analysis, the architecture of the system (see Fig. 1) is described in detail:

1) Application interface layer: this module is the only channel through which the system interacts with the outside. It provides two interfaces, Push and Pop. The user calls the Push interface to push tasks to the system and calls the Pop interface to obtain results from the system.

2) Task scheduling module: this module is responsible for sorting and packing the tasks pushed in through the application layer, writing them to a cache file, and invoking the task execution module according to the system state to generate a result file.

3) Task execution module: this module is responsible for invoking the Hadoop core or the local execution module and returning the result data.

4) Result collection module: this module is responsible for parsing the result file generated by the task scheduling module and pushing it to the result output buffer, from which the user pops the task results.

5) Hadoop module: this module encapsulates the specific programs and scripts that invoke Hadoop.

6) Local execution module: this module is only active when Hadoop cannot provide normal service. It is responsible for executing the Done function locally and returning the result data.

The specific usage of the system is as follows (as shown in Fig. 3):

1. The task scheduling module takes source data from the source data input queue; go to 2.

2. The task scheduling module checks whether Hadoop is working normally; if so, go to 3; otherwise, go to 4.

3. The task scheduling module writes the current data to the cache file and checks whether the current cache file meets the submission condition (for example, more than 5 minutes have elapsed or the file size is greater than 1G); if so, go to 5; otherwise, go to 1.

4. The local execution module starts the local queue Done thread and writes the current data to the local input queue. The local queue Done thread takes source data from the local input queue, executes the Done function, and writes the result data to the result output queue. At the same time, it checks whether there is an unsubmitted cache file; if so, go to 8.

5. The task execution module checks whether the current cache file meets the condition for submission to Hadoop (for example, the number of tasks in the cache file is greater than 2000; only then is it worth running on Hadoop, because starting and stopping a Hadoop job requires launching a large number of processes, which is time-consuming); if so, go to 6; otherwise, go to 8.

6. The Hadoop module submits the current cache file to Hadoop for execution and retrieves the result file; go to 7.

7. The result collection module reads the result file and writes it to the result output queue, where it waits for the user to call the Pop interface of the application interface layer to obtain the result data; go to 1.

8. The local file Done thread of the local execution module reads the current cache file, calls the Done function to obtain the result data, and writes it to the result output queue; go to 1.
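
The control flow of steps 1 to 8 can be condensed into the sketch below. It is a simplified, synchronous illustration rather than the disclosed implementation: an in-memory list stands in for the cache file, plain lists stand in for the queues, the Done threads are replaced by direct calls, and `Scheduler`, `hadoop_ok`, `submit_to_hadoop` and the constant names are hypothetical. "1G" is again assumed to mean 1 GiB.

```python
import time

SUBMIT_AGE_SECONDS = 5 * 60    # step 3: "time is greater than 5 minutes"
SUBMIT_BYTES = 1024 ** 3       # step 3: assumption, "1G" read as 1 GiB
HADOOP_MIN_TASKS = 2000        # step 5: more than 2000 tasks goes to Hadoop


class Scheduler:
    def __init__(self, done, hadoop_ok, submit_to_hadoop, clock=time.monotonic):
        self.done = done                          # Done function, record -> result
        self.hadoop_ok = hadoop_ok                # callable returning True if Hadoop is usable
        self.submit_to_hadoop = submit_to_hadoop  # callable, list of records -> list of results
        self.clock = clock
        self.cache = []                           # stand-in for the cache file
        self.cache_bytes = 0
        self.cache_started = None
        self.results = []                         # stand-in for the result output queue

    def handle(self, record):
        """Run steps 2 to 8 for one string record taken from the input queue (step 1)."""
        if not self.hadoop_ok():                      # step 2
            self.results.append(self.done(record))    # step 4: process locally
            if self.cache:                            # unsubmitted cache exists
                self._run_local()                     # step 8
            return
        if not self.cache:                            # step 3: append to cache
            self.cache_started = self.clock()
        self.cache.append(record)
        self.cache_bytes += len(record.encode("utf-8"))
        if self._ready_to_submit():
            if len(self.cache) > HADOOP_MIN_TASKS:    # step 5
                self._flush(self.submit_to_hadoop(self.cache))  # steps 6 and 7
            else:
                self._run_local()                     # step 8

    def _ready_to_submit(self):
        age = self.clock() - self.cache_started
        return age > SUBMIT_AGE_SECONDS or self.cache_bytes > SUBMIT_BYTES

    def _run_local(self):
        self._flush([self.done(r) for r in self.cache])

    def _flush(self, results):
        """Publish results and start a fresh, empty cache."""
        self.results.extend(results)
        self.cache, self.cache_bytes, self.cache_started = [], 0, None
```

Injecting `hadoop_ok`, `submit_to_hadoop` and `clock` keeps the decision logic testable without a cluster; a real deployment would replace them with a health check and a job submission wrapper.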

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the invention and not to limit them. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims

1. A Hadoop-based data batch processing system, characterized in that the system comprises:

an application interface module, comprising a Push interface and a Pop interface;

a task scheduling module, for scheduling the tasks received through the Push interface;

a task execution module, for processing the tasks;

a result collection module, for parsing and pushing the processing results of the tasks.

2. The system as claimed in claim 1, characterized in that the Push interface is used to receive tasks, and the Pop interface is used to obtain the system's execution results for the tasks.

3. The system as claimed in claim 1, characterized in that scheduling the tasks received through the Push interface comprises: obtaining the tasks, sorting and packing them, writing them to a cache file, and invoking the task execution module according to the file size to generate a result file.

4. The system as claimed in claim 1, characterized in that the task execution module comprises a Hadoop module and a local execution module; the Hadoop module comprises the programs and scripts that invoke Hadoop; the local execution module executes the Done function locally and returns the result data.

5. The system as claimed in claim 3, characterized in that invoking the task execution module according to the file size comprises: if the number of tasks in the file is greater than 2000 or the file size is greater than 1G, invoking the Hadoop module of the task execution module; otherwise, invoking the local execution module of the task execution module.

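6. A Hadoop-based data batch processing method, characterized in that the method comprises the following steps:

A. the Push interface receives a task and submits it to the task scheduling module;

B. the task scheduling module sorts and packs the task, writes it to a cache file, obtains the cache file size, and hands the cache file to the task execution module;

C. the task execution module processes the cache file and generates a result file;

D. the result collection module parses the result file and pushes it to the Pop interface.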

7. The method as claimed in claim 6, characterized in that step C comprises: if the number of tasks in the cache file is greater than 2000 or the cache file size is greater than 1G, the Hadoop module of the task execution module is invoked to perform the processing and generate the result file; otherwise, the local execution module of the task execution module is invoked to perform the processing and generate the result file.

8. The method as claimed in claim 6, characterized in that step C comprises: if the task scheduling module detects that the Hadoop module cannot be used normally, the local execution module of the task execution module is invoked to perform the processing and generate the result file.

9. The method as claimed in claim 7 or 8, characterized in that the local execution module reads the current cache file through a Done thread and calls the Done function to obtain the result data.