CN103559036A - Data batch processing system and method based on Hadoop - Google Patents

Data batch processing system and method based on Hadoop

Info

Publication number
CN103559036A
Authority
CN
China
Prior art keywords
task
module
execution module
hadoop
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310538259.7A
Other languages
Chinese (zh)
Inventor
王欢龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Network Technology Co ltd
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd
Priority to CN201310538259.7A
Publication of CN103559036A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data batch processing system and method based on Hadoop. The system comprises an application interface module, a task scheduling module, a task execution module and a result collection module. The application interface module comprises a Push interface and a Pop interface; the task scheduling module schedules the tasks received by the Push interface; the task execution module processes the tasks; and the result collection module parses and pushes the processing results of the tasks. With the system and method, developers can meet big data processing requirements by calling just two simple interfaces, so they can focus on flow control before and after data processing and on other development requirements without mastering Hadoop, which greatly reduces workload and development time.

Description

Data batch processing system and method based on Hadoop
Technical field
The invention belongs to the field of data processing, and specifically relates to a data batch processing system and method based on Hadoop.
Background art
We live in an era of exploding Internet information. Big data processing is ubiquitous, and how to process big data better and faster has become a major challenge for every Internet company.
At present, Hadoop is one of the most widely used technologies in big data processing. It has become a popular open-source data processing framework and is almost synonymous with big data. Hadoop is implemented with MapReduce and a distributed file system (HDFS). MapReduce splits an application into many small task blocks for execution. To guarantee reliable data processing, HDFS creates multiple replicas of each data block and places them on the compute nodes of the cluster, and MapReduce processes the data where the replicas are stored.
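The MapReduce idea described above can be illustrated with a minimal single-process sketch. It is not Hadoop's API: the function names `map_reduce`, `word_mapper` and `count_reducer` are hypothetical, and the distribution across cluster nodes and HDFS replicas is left out entirely.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map phase, group values by key, then run a reduce phase."""
    grouped = defaultdict(list)
    for record in records:
        # Map: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            grouped[key].append(value)
    # Reduce: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    """Hypothetical reducer: sum the counts for one word."""
    return sum(values)


counts = map_reduce(["big data", "big batch"], word_mapper, count_reducer)
print(counts)
```

In a real cluster each mapper call would run as a separate small task close to the data it reads; the sketch only shows the shape of the computation.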
Existing batch processing systems run on a single machine and cannot achieve real-time, efficient data processing. Developing a new distributed batch processing system would consume a great deal of time and effort, so from a project development standpoint Hadoop technology is considered.
However, Hadoop's big data processing advantage only shows once the data volume reaches a certain scale. In general, using Hadoop directly requires two things: designing and developing a source data collection module and a result data parsing module. Although Hadoop's big data processing capability is very powerful, it is still quite difficult for a developer focused on application work to use Hadoop successfully and correctly. It takes a great deal of time and experience, especially to understand Hadoop, to set up and configure it, and to start and build HDFS, libhdfs and so on. All of this greatly affects the use of Hadoop and the development cycle of the project.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a data batch processing system and method based on Hadoop.
To achieve the above object, the invention adopts the following technical solution:
In one aspect, the invention provides a data batch processing system based on Hadoop, characterized in that the system comprises:
an application interface module, comprising a Push interface and a Pop interface;
a task scheduling module, for scheduling the tasks received by the Push interface;
a task execution module, for processing the tasks; and
a result collection module, for parsing and pushing the processing results of the tasks.
Preferably, the Push interface is used to receive tasks, and the Pop interface is used to obtain the system's execution results for the tasks.
Preferably, scheduling the tasks received by the Push interface comprises: obtaining the tasks, organizing and packaging them, writing them to a cache file, and calling the task execution module according to the file size to generate a result file.
Preferably, the task execution module comprises a Hadoop module and a local execution module. The Hadoop module contains the programs and scripts that invoke Hadoop. The local execution module executes the Done function locally and returns the result data.
Preferably, calling the task execution module according to the file size comprises: if the number of data items in the cache file is greater than 2000 or the file size is greater than 1G, the Hadoop module of the task execution module is called; otherwise, the local execution module of the task execution module is called.
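The selection rule in the preceding paragraph might be sketched as follows. The function and constant names are hypothetical, and reading "1G" as 1 GiB (1024^3 bytes) is an assumption, since the text does not define the unit precisely.

```python
MAX_LOCAL_ITEMS = 2000          # threshold stated in the text
MAX_LOCAL_BYTES = 1024 ** 3     # assumption: "1G" interpreted as 1 GiB


def choose_executor(item_count: int, size_bytes: int) -> str:
    """Return which execution module should handle a cache file.

    Mirrors the stated rule: more than 2000 items OR more than 1G
    goes to the Hadoop module, everything else runs locally.
    """
    if item_count > MAX_LOCAL_ITEMS or size_bytes > MAX_LOCAL_BYTES:
        return "hadoop"
    return "local"


print(choose_executor(150, 4096))
print(choose_executor(5000, 4096))
```

Both comparisons are strict, so a cache file with exactly 2000 items and exactly 1G of data still runs locally under this reading.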
In another aspect, the invention provides a data batch processing method based on Hadoop, characterized in that the method comprises the following steps:
A. The Push interface receives tasks and submits them to the task scheduling module;
B. The task scheduling module organizes and packages the tasks, writes them to a cache file, obtains the cache file size, and hands the cache file to the task execution module;
C. The task execution module processes the cache file and generates a result file;
D. The result collection module parses the result file and pushes it to the Pop interface.
Preferably, step C comprises: if the number of data items in the cache file is greater than 2000 or the file size is greater than 1G, the Hadoop module of the task execution module performs the processing and generates the result file; otherwise, the local execution module of the task execution module performs the processing and generates the result file.
Step C comprises: if the task scheduling module detects that the Hadoop module cannot be used normally, the local execution module of the task execution module performs the processing and generates the result file.
Preferably, the local execution module reads the current cache file through a Done thread and calls the Done function to obtain the result data.
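A rough sketch of the local execution path follows: a thread reads the cache file and applies the Done function to each record. The text does not specify the cache file format or what the Done function computes, so the one-record-per-line layout and the upper-casing `done` function below are placeholders, and `done_worker` and `run_local` are hypothetical names.

```python
import os
import queue
import tempfile
import threading


def done(record: str) -> str:
    """Placeholder for the user-supplied Done function (assumption)."""
    return record.upper()


def done_worker(cache_path: str, results: queue.Queue) -> None:
    """Body of the Done thread: read the cache file, process each record."""
    with open(cache_path, encoding="utf-8") as cache:
        for line in cache:
            results.put(done(line.rstrip("\n")))


def run_local(cache_path: str) -> list:
    """Process one cache file on a Done thread and collect the results."""
    results = queue.Queue()
    worker = threading.Thread(target=done_worker, args=(cache_path, results))
    worker.start()
    worker.join()
    collected = []
    while not results.empty():
        collected.append(results.get_nowait())
    return collected


# Demo with a throwaway cache file holding one record per line.
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".cache", delete=False, encoding="utf-8"
) as handle:
    handle.write("task-a\ntask-b\n")
    demo_path = handle.name

demo_results = run_local(demo_path)
os.remove(demo_path)
print(demo_results)
```

A single worker thread keeps the results in file order; a pool of workers would be faster but would need an explicit ordering scheme.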
Compared with the prior art, the beneficial effects of the invention are as follows:
With the invention, developers only need to call two simple interfaces to meet the need to process big data. They can focus their effort on flow control before and after data processing and on other development requirements, without having to master Hadoop, which greatly reduces their workload and development time.
Brief description of the drawings
Fig. 1 is a structural diagram of the batch processing system of the invention;
Fig. 2 is a usage flowchart of the batch processing system of the invention;
Fig. 3 is a detailed flowchart of the batch processing method of the invention.
Embodiments
The invention is described in further detail below with reference to the accompanying drawings.
The ideal way to process big data is to provide two interfaces, Push and Pop. The developer only needs to call the Push interface to hand source data to the batch processing system and call the Pop interface to fetch the processed results, as shown in Fig. 2.
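From the caller's side, the two-interface design might look like the sketch below. The class `BatchClient` and its method signatures are hypothetical, as the text names only the Push and Pop interfaces; `process_pending` is a stand-in for the scheduling, execution and collection modules that sit between them.

```python
import queue


class BatchClient:
    """Hypothetical facade exposing only Push and Pop to the developer."""

    def __init__(self):
        self._source = queue.Queue()    # source data input queue
        self._results = queue.Queue()   # result output queue

    def push(self, task):
        """Hand one task (source data item) to the batch system."""
        self._source.put(task)

    def pop(self):
        """Fetch one processed result, or None if nothing is ready yet."""
        try:
            return self._results.get_nowait()
        except queue.Empty:
            return None

    def process_pending(self, handler):
        """Stand-in for the internal modules: drain input, emit results."""
        while True:
            try:
                task = self._source.get_nowait()
            except queue.Empty:
                break
            self._results.put(handler(task))


client = BatchClient()
client.push("page-1")
client.process_pending(str.upper)   # the real system would do this internally
print(client.pop())
```

The developer's code touches only `push` and `pop`; whether a task ran on Hadoop or locally is invisible at this level, which is the point of the design.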
Based on the above analysis, the architecture of the system (see Fig. 1) is described in detail:
1) Application interface layer: this module is the system's only channel for external interaction. It provides two interfaces, Push and Pop. The user calls the Push interface to push tasks to the system and calls the Pop interface to obtain results from the system.
2) Task scheduling module: this module organizes and packages the tasks pushed in by the application layer, writes them to a cache file, and calls the task execution module according to the system state to generate a result file.
3) Task execution module: this module calls the Hadoop kernel or the local execution module and returns the result data.
4) Result collection module: this module parses the result file generated by the task scheduling module and pushes it to the result output buffer, from which the user Pops the task results.
5) Hadoop module: this module encapsulates the specific programs and scripts that invoke Hadoop.
6) Local execution module: this module is only active when Hadoop cannot provide normal service. It executes the Done function locally and returns the result data.
The system is used as follows (as shown in Fig. 3):
1. The task scheduling module takes source data from the source data input queue. Go to step 2.
2. The task scheduling module checks whether Hadoop is working normally. If so, go to step 3; otherwise go to step 4.
3. The task scheduling module writes the current data to the cache file and checks whether the current cache file meets the submission condition (for example, its age is greater than 5 minutes or its size is greater than 1G). If so, go to step 5; otherwise go to step 1.
4. The local execution module starts the local queue Done thread and writes the current data to the local input queue. The local queue Done thread takes source data from the local input queue, executes the Done function, and writes the result data to the result output queue. It also checks whether an unsubmitted cache file exists; if so, go to step 8.
5. The task execution module checks whether the current cache file meets the condition for submission to Hadoop (for example, the number of tasks in the cache file is greater than 2000; only then is it worth running on Hadoop, because starting and stopping a Hadoop job requires launching a large number of processes, which is time-consuming). If so, go to step 6; otherwise go to step 8.
6. The Hadoop module submits the current cache file to Hadoop for execution and fetches the result file. Go to step 7.
7. The result collection module reads the result file and writes it to the result output queue, where it waits for the user to call the Pop interface of the application interface layer to obtain the result data. Go to step 1.
8. The local file Done thread of the local execution module reads the current cache file, calls the Done function to obtain the result data, and writes it to the result output queue. Go to step 1.
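The branching in steps 2, 3 and 5 above can be condensed into one decision function. This is a sketch under stated assumptions: `CacheState`, `next_action` and the returned labels are hypothetical names, the thresholds are the example values from the steps, and "1G" is again read as 1 GiB.

```python
from dataclasses import dataclass

SUBMIT_AGE_SECONDS = 5 * 60      # example from step 3: older than 5 minutes
SUBMIT_SIZE_BYTES = 1024 ** 3    # assumption: "1G" interpreted as 1 GiB
HADOOP_MIN_TASKS = 2000          # example from step 5


@dataclass
class CacheState:
    age_seconds: float
    size_bytes: int
    task_count: int


def next_action(hadoop_ok: bool, cache: CacheState) -> str:
    """Decide what happens to the current data and cache file."""
    # Step 2: with Hadoop unavailable, fall back to the local queue (step 4).
    if not hadoop_ok:
        return "local_queue"
    # Step 3: keep buffering until the cache file is old or large enough.
    ready = (cache.age_seconds > SUBMIT_AGE_SECONDS
             or cache.size_bytes > SUBMIT_SIZE_BYTES)
    if not ready:
        return "keep_buffering"
    # Step 5: only large batches justify Hadoop's job start-up cost.
    if cache.task_count > HADOOP_MIN_TASKS:
        return "submit_to_hadoop"    # steps 6 and 7
    return "local_file"              # step 8


print(next_action(True, CacheState(age_seconds=400, size_bytes=10_000, task_count=3500)))
```

Keeping the decision in a pure function like this makes the thresholds easy to test in isolation from the queues, threads and Hadoop invocation that surround it.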
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims (9)

1. A data batch processing system based on Hadoop, characterized in that the system comprises:
an application interface module, comprising a Push interface and a Pop interface;
a task scheduling module, for scheduling the tasks received by the Push interface;
a task execution module, for processing the tasks;
a result collection module, for parsing and pushing the processing results of the tasks.
2. The system as claimed in claim 1, characterized in that: the Push interface is used to receive tasks, and the Pop interface is used to obtain the system's execution results for the tasks.
3. The system as claimed in claim 1, characterized in that: scheduling the tasks received by the Push interface comprises: obtaining the tasks, organizing and packaging them, writing them to a cache file, and calling the task execution module according to the file size to generate a result file.
4. The system as claimed in claim 1, characterized in that: the task execution module comprises a Hadoop module and a local execution module; the Hadoop module contains the programs and scripts that invoke Hadoop; the local execution module executes the Done function locally and returns the result data.
5. The system as claimed in claim 3, characterized in that: calling the task execution module according to the file size comprises: if the number of tasks in the file is greater than 2000 or the file size is greater than 1G, the Hadoop module of the task execution module is called; otherwise, the local execution module of the task execution module is called.
6. A data batch processing method based on Hadoop, characterized in that the method comprises the following steps:
A. The Push interface receives tasks and submits them to the task scheduling module;
B. The task scheduling module organizes and packages the tasks, writes them to a cache file, obtains the cache file size, and hands the cache file to the task execution module;
C. The task execution module processes the cache file and generates a result file;
D. The result collection module parses the result file and pushes it to the Pop interface.
7. The method as claimed in claim 6, characterized in that step C comprises: if the number of tasks in the cache file is greater than 2000 or the cache file size is greater than 1G, the Hadoop module of the task execution module performs the processing and generates the result file; otherwise, the local execution module of the task execution module performs the processing and generates the result file.
8. The method as claimed in claim 6, characterized in that step C comprises: if the task scheduling module detects that the Hadoop module cannot be used normally, the local execution module of the task execution module performs the processing and generates the result file.
9. The method as claimed in claim 7 or 8, characterized in that: the local execution module reads the current cache file through a Done thread and calls the Done function to obtain the result data.
CN201310538259.7A 2013-11-04 2013-11-04 Data batch processing system and method based on Hadoop Pending CN103559036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310538259.7A CN103559036A (en) 2013-11-04 2013-11-04 Data batch processing system and method based on Hadoop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310538259.7A CN103559036A (en) 2013-11-04 2013-11-04 Data batch processing system and method based on Hadoop

Publications (1)

Publication Number Publication Date
CN103559036A true CN103559036A (en) 2014-02-05

Family

ID=50013292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310538259.7A Pending CN103559036A (en) 2013-11-04 2013-11-04 Data batch processing system and method based on Hadoop

Country Status (1)

Country Link
CN (1) CN103559036A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729435A (en) * 2017-09-29 2018-02-23 郑州云海信息技术有限公司 Method, apparatus, equipment and the storage medium that distributed file system task is assigned

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110154341A1 (en) * 2009-12-20 2011-06-23 Yahoo! Inc. System and method for a task management library to execute map-reduce applications in a map-reduce framework
CN102279730A (en) * 2010-06-10 2011-12-14 阿里巴巴集团控股有限公司 Parallel data processing method, device and system
US20120182891A1 (en) * 2011-01-19 2012-07-19 Youngseok Lee Packet analysis system and method using hadoop based parallel computation
CN102880658A (en) * 2012-08-31 2013-01-16 电子科技大学 Distributed file management system based on seismic data processing
CN102902716A (en) * 2012-08-27 2013-01-30 苏州两江科技有限公司 Storage system based on Hadoop distributed computing platform

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李云桃: "基于Hadoop的海量数据处理***的设计与实现", 《中国优秀硕士学位论文个文数据库信息科技辑》 *

Similar Documents

Publication Publication Date Title
US20200326992A1 (en) Acceleration method for fpga-based distributed stream processing system
US10942716B1 (en) Dynamic computational acceleration using a heterogeneous hardware infrastructure
CN108804140B (en) Batch instruction analysis method, device and equipment
Gu et al. SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters
US9146830B2 (en) Hybrid local/remote infrastructure for data processing with lightweight setup, powerful debuggability, controllability, integration, and productivity features
Jia et al. Improving the performance of distributed tensorflow with RDMA
US20120284730A1 (en) System to provide computing services
CN104050007A (en) QOS based binary translation and application streaming
CN110502583B (en) Distributed data synchronization method, device, equipment and readable storage medium
CN104572290A (en) Method and device for controlling message processing threads
CN103645944B (en) Batch data conflict detection method, device and system
US8949835B2 (en) Yielding input/output scheduler to increase overall system throughput
US11321090B2 (en) Serializing and/or deserializing programs with serializable state
CN111814959A (en) Model training data processing method, device and system and storage medium
US12032655B2 (en) Asynchronous document ingestion and enrichment system
Nabi Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark
US11556381B2 (en) Asynchronous distributed data flow for machine learning workloads
Abbasi et al. A preliminary study of incorporating GPUs in the Hadoop framework
CN102193831B (en) Method for establishing hierarchical mapping/reduction parallel programming model
Perumalla et al. Discrete event execution with one-sided and two-sided gvt algorithms on 216,000 processor cores
CN103577604B (en) A kind of image index structure for Hadoop distributed environments
CN103559036A (en) Data batch processing system and method based on Hadoop
CN116954944A (en) Distributed data stream processing method, device and equipment based on memory grid
CN110502337A (en) For the optimization system and method for shuffling the stage in Hadoop MapReduce
CN108491220B (en) Method of skill training and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140205
