CN104317899A

CN104317899A - Big-data analyzing and processing system and access method

Info

Publication number: CN104317899A
Application number: CN201410577412.1A
Authority: CN
Inventors: 王茜; 葛新; 李安颖; 史晨昱; 梁小江
Original assignee: Xi'an Following International Information Ltd Co
Current assignee: Xi'an Following International Information Ltd Co
Priority date: 2014-10-24
Filing date: 2014-10-24
Publication date: 2015-01-28

Abstract

The invention discloses a big-data analysis and processing system comprising a Hadoop MapReduce module, a mongo-hadoop connector, and a MongoDB sharded database cluster, all distributed across physical servers. With this Hadoop- and MongoDB-based system, data in MongoDB can be processed directly by the MapReduce component of Hadoop, and the processing results can be written directly back to the MongoDB database. The invention further provides a big-data access method for this system, so that data in MongoDB can be processed directly by Hadoop MapReduce and the results written directly back to the MongoDB database.

Description

Big-data analysis and processing system and access method

Technical field

The invention belongs to the technical field of big-data processing and relates to a big-data analysis and processing system; the invention further relates to a big-data access method.

Background technology

With the development of information technology, the volume of information is growing geometrically, and the Internet is full of non-relational data structures that traditional relational databases struggle to handle. At the same time, it is becoming increasingly difficult for centralized data analysis and processing to quickly analyze massive amounts of information and extract what is actually needed. Data storage and data analysis should therefore both be capable of distributed processing, so that the system can be scaled out as the information grows, increasing its storage, analysis, and processing capacity. The emergence of NoSQL database technology offers a new solution to these problems: it adopts a distributed, multi-node approach that is better suited to storing and managing big data. NoSQL databases are designed with particular attention to highly concurrent reads and writes and to the storage of massive data; compared with relational databases, they "subtract" in architecture and data model and "add" in scalability and concurrency. Today's computer architectures demand large-scale horizontal scalability in data storage, and NoSQL sets out to meet that demand. Google, Yahoo, Facebook, Twitter, and Amazon currently make extensive use of NoSQL databases, and NoSQL databases are gradually becoming an indispensable part of the database field.

MongoDB is one of the most popular NoSQL database products. It sits between relational and non-relational databases: among non-relational databases it is the most feature-rich and the most similar to a relational database. The data structures it supports are very loose, using BSON, a JSON-like format, so it can store relatively complex data types. MongoDB's most notable feature is its powerful query language, whose syntax somewhat resembles an object-oriented query language; it can provide most of the functionality of single-table queries in a relational database, and it also supports indexing data. It is characterized by high performance, easy deployment, and ease of use, and it makes storing data very convenient.

Distributed cloud computing consolidates resources to provide a simplified, centralized computing platform that reduces cost and energy consumption. Hadoop is an open-source distributed parallel computing platform whose Map/Reduce computing capability is widely used in data analysis and processing, and Hadoop is developing into an excellent means of big-data analysis.

Hadoop is a complete open-source framework for big-data analysis. It comprises a distributed file system (HDFS), a parallel processing framework (Apache Hadoop MapReduce), and a number of other components that support functions such as data acquisition, workflow coordination, task management, and cluster monitoring. Hadoop can process large unstructured data sets efficiently and more economically than traditional methods.

When massive data is stored in a NoSQL database and Hadoop is to process it, the usual approach is to first import the data to be analyzed from the NoSQL database into HDFS, then run MapReduce on it, write the data to HDFS again after MapReduce completes, and finally write the results back to the NoSQL database. Throughout this process HDFS serves only as an intermediate data store and performs no substantive analysis or processing of the data, while the NoSQL database is itself a data persistence tool. If the HDFS stage is omitted, the efficiency of data processing improves considerably.

Summary of the invention

One object of the invention is to provide a big-data analysis and processing system in which data in MongoDB can be processed directly by the MapReduce component of Hadoop and the results written directly back to the MongoDB database.

Another object of the invention is to provide a big-data access method in which data in MongoDB can be processed directly by the MapReduce component of Hadoop and the results written directly back to the MongoDB database.

The first technical solution adopted by the invention is a big-data analysis and processing system comprising a Hadoop MapReduce module, a mongo-hadoop connector, and a MongoDB sharded database cluster, distributed across physical servers.

The first technical solution is further characterized as follows.

The physical servers comprise a master node physical server and slave node physical servers.

The Hadoop MapReduce module comprises a JobTracker unit and a TaskTracker unit; the JobTracker unit is deployed on the master node physical server, and the TaskTracker unit is deployed on the slave node physical servers.

The MongoDB sharded database cluster comprises a mongod process unit, a routing process unit, and a configuration server unit; the routing process unit is deployed on the master node physical server, and the mongod process unit and the configuration server unit are both deployed on the slave node physical servers.

There are at least 2 slave node physical servers.

The second technical solution adopted by the invention is a big-data access method that uses a big-data analysis and processing system structured as follows: it comprises a Hadoop MapReduce module, a mongo-hadoop connector, and a MongoDB sharded database cluster, distributed across physical servers;

the physical servers comprise a master node physical server and slave node physical servers;

the Hadoop MapReduce module comprises a JobTracker unit and a TaskTracker unit; the JobTracker unit is deployed on the master node physical server, and the TaskTracker unit is deployed on the slave node physical servers;

the MongoDB sharded database cluster comprises a mongod process unit, a routing process unit, and a configuration server unit; the routing process unit is deployed on the master node physical server, and the mongod process unit and the configuration server unit are both deployed on the slave node physical servers;

there are at least 2 slave node physical servers.

The big-data access method using the above big-data analysis and processing system is carried out according to the following steps:

Step 1: the user submits a MapReduce job to Hadoop and configures the MongoDB database as the data source of Hadoop MapReduce; the MapReduce job comprises the data source address, the output address for the result data, and the concrete map and reduce procedures;

Step 2: Hadoop obtains the storage information of the data by accessing the routing process unit and splits the data into input blocks for Hadoop MapReduce;

Step 3: the JobTracker unit distributes the data block information to different TaskTracker units, and each TaskTracker unit fetches the actual data from the MongoDB sharded cluster according to the data block information it received;

Step 4: the fetched data is adapted by the mongo-hadoop connector into data types that Hadoop MapReduce can process directly and is passed to MapReduce,

where the data types are the BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, and Text formats;

Step 5: MapReduce processes the data adapted in Step 4 by parallel computation;

Step 6: after the mongo-hadoop connector has adapted the results into a data format that MongoDB can write, the TaskTracker unit sends the results to the MongoDB sharded cluster, where they are stored in the MongoDB database; the data format that MongoDB can write is the BSON format.
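As an illustration of what Step 1 says a job carries (a data source address, an output address, and the map and reduce procedures), the following minimal Python sketch models a job description. The class name `MongoMapReduceJob`, the host name `master`, and the collection names are hypothetical and chosen only for this example; the patent does not specify a job format, and a real deployment would configure a Java Hadoop job instead.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class MongoMapReduceJob:
    """Hypothetical container for the three parts of a job named in Step 1."""
    input_uri: str    # data source address (a MongoDB collection)
    output_uri: str   # address to which the result data is written
    mapper: Callable[[dict], Iterable[Tuple[str, int]]]
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]]

    def validate(self) -> None:
        # Both the source and the sink must be MongoDB; no HDFS stage is used.
        for uri in (self.input_uri, self.output_uri):
            if not uri.startswith("mongodb://"):
                raise ValueError(f"not a MongoDB address: {uri}")


def count_by_city(doc):
    # map procedure: emit one (key, 1) pair per input document
    yield doc["city"], 1


def sum_counts(key, values):
    # reduce procedure: add up all counts emitted for one key
    return key, sum(values)


job = MongoMapReduceJob(
    input_uri="mongodb://master:27017/demo.visits",            # assumed address
    output_uri="mongodb://master:27017/demo.visits_by_city",   # assumed address
    mapper=count_by_city,
    reducer=sum_counts,
)
job.validate()
```

Pointing both addresses at the routing process on the master node reflects the deployment described here, in which clients reach the sharded cluster through that single interface.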

The beneficial effect of the invention is that the HDFS stage of the Hadoop workflow is eliminated: the MapReduce component of Hadoop accesses the data in MongoDB directly, so Hadoop can efficiently read and process the data stored in MongoDB and return the results to the MongoDB database without an intermediate copy, which markedly improves the efficiency of data processing.

Brief description of the drawings

Fig. 1 is a structural diagram of the big-data analysis and processing system of the invention;

Fig. 2 is a flow chart of the big-data access method of the invention.

In the figures: 1. mongo-hadoop connector; 2. JobTracker unit; 3. TaskTracker unit; 4. mongod process unit; 5. routing process unit; 6. configuration server unit.

Detailed description of the embodiments

The invention is described in detail below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the big-data analysis and processing system of the invention comprises a Hadoop MapReduce module, a mongo-hadoop connector 1, and a MongoDB sharded database cluster, distributed across physical servers. The physical servers comprise a master node physical server and slave node physical servers. The Hadoop MapReduce module comprises a JobTracker unit 2 and a TaskTracker unit 3; the JobTracker unit 2 is deployed on the master node physical server, and the TaskTracker unit 3 is deployed on the slave node physical servers. The MongoDB sharded database cluster comprises a mongod process unit 4, a routing process unit 5, and a configuration server unit 6; the routing process unit 5 is deployed on the master node physical server, and the mongod process unit 4 and the configuration server unit 6 are both deployed on the slave node physical servers. There are at least 2 slave node physical servers.
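The placement of units on servers described above can be written down as a small data structure. The following Python sketch is illustrative only: the function name `build_layout` and the host names are assumptions made for this example, and the unit labels simply follow the legend of Fig. 1.

```python
from typing import Dict, List

# Units placed on the master node and on every slave node, per the description.
MASTER_UNITS = ["JobTracker", "routing process"]
SLAVE_UNITS = ["TaskTracker", "mongod process", "configuration server"]


def build_layout(master: str, slaves: List[str]) -> Dict[str, List[str]]:
    """Return a host -> units mapping; host names are hypothetical."""
    if len(slaves) < 2:
        # The description requires at least 2 slave node physical servers.
        raise ValueError("at least 2 slave node physical servers are required")
    layout = {master: list(MASTER_UNITS)}
    for host in slaves:
        layout[host] = list(SLAVE_UNITS)
    return layout


layout = build_layout("master", ["slave1", "slave2"])
for host, units in layout.items():
    print(host, "->", ", ".join(units))
```

Co-locating the TaskTracker and mongod units on the same slave nodes means a map task can, in principle, read from a shard on its own host, although the description does not state that such locality is enforced.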

The specific roles of the modules in the big-data analysis and processing system of the invention are as follows:

1. MapReduce module

(a) The TaskTracker unit 3 manages and executes the individual Map and Reduce tasks on the compute nodes of the cluster;

(b) The JobTracker unit 2 accepts job submissions, provides job monitoring and control, manages tasks, and dispatches work to the nodes running the TaskTracker unit 3.

2. MongoDB sharded database cluster

(1) Non-relational database shard 1 and non-relational database shard 2: three mongod process units 4 form one non-relational database replica set (used in the invention for reliable data storage; its built-in replication mechanism allows automatic failover), and each replica set constitutes one non-relational database shard that stores part of the cluster's actual data blocks;

(2) The configuration server unit 6 stores the cluster metadata of the whole MongoDB sharded cluster, including the global cluster configuration, the location of each database, collection, and specific data range, and a change log;

(3) The routing process unit 5 provides a single interface for connecting to the whole cluster and directs all read and write requests to the appropriate shard.
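The division of labor among the three roles above can be sketched in a few lines of Python. This is a simplified in-memory model under stated assumptions, not MongoDB's implementation: the class names `ConfigServer` and `Router`, the use of `_id` as the shard key, and the range boundaries are all hypothetical, and replica sets are reduced to plain lists.

```python
import bisect


class ConfigServer:
    """Holds cluster metadata: which shard owns which key range."""

    def __init__(self, split_points, shard_names):
        # split_points [100, 200] with shards [s1, s2, s3] means:
        # s1 owns keys < 100, s2 owns 100..199, s3 owns keys >= 200.
        assert len(shard_names) == len(split_points) + 1
        self.split_points = split_points
        self.shard_names = shard_names

    def shard_for(self, key):
        return self.shard_names[bisect.bisect_right(self.split_points, key)]


class Router:
    """Single entry point that forwards reads and writes to the right shard."""

    def __init__(self, config, shards):
        self.config = config
        self.shards = shards  # shard name -> list of documents

    def insert(self, doc):
        self.shards[self.config.shard_for(doc["_id"])].append(doc)

    def find_one(self, key):
        shard = self.shards[self.config.shard_for(key)]
        return next((d for d in shard if d["_id"] == key), None)


config = ConfigServer([100, 200], ["shard1", "shard2", "shard3"])
router = Router(config, {"shard1": [], "shard2": [], "shard3": []})
for i in (5, 100, 150, 250):
    router.insert({"_id": i, "payload": f"doc-{i}"})
```

A client only ever talks to `router`; the range table lives in `config`, which mirrors the description of the configuration server as the holder of metadata and the routing process as the interface to the cluster.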

3. mongo-hadoop connector 1

Its role is to connect MongoDB and Hadoop for data exchange. It adapts the data that MongoDB supplies as input to Hadoop (BSON) into data types that Hadoop MapReduce can process directly (BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text, and so on), and it adapts the result data types of Hadoop MapReduce (BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text, and so on) into a data type that can be stored directly in MongoDB (BSON).
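The two-way adaptation described above can be illustrated with a small Python sketch. It only labels plain Python values with the name of the Hadoop type they would correspond to; the selection rules (for example, choosing IntWritable or LongWritable by 32-bit range, and mapping every float to DoubleWritable) are assumptions made for this example, and the function names are hypothetical. The actual connector is a Java library, and its conversion logic is not described in this document.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def to_writable(value):
    """Label a document field value with an assumed Hadoop Writable type name."""
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return ("BooleanWritable", value)
    if isinstance(value, int):
        name = "IntWritable" if INT32_MIN <= value <= INT32_MAX else "LongWritable"
        return (name, value)
    if isinstance(value, float):
        return ("DoubleWritable", value)
    if isinstance(value, str):
        return ("Text", value)
    raise TypeError(f"no mapping assumed for {type(value).__name__}")


def from_writable(pair):
    """Unwrap a labeled value back into a plain value that a document can hold."""
    _name, value = pair
    return value


def adapt_document(doc):
    # Input direction: document fields -> labeled MapReduce values.
    return {k: to_writable(v) for k, v in doc.items()}


def to_document(fields):
    # Output direction: labeled MapReduce values -> document fields.
    return {k: from_writable(p) for k, p in fields.items()}


doc = {"name": "sensor-1", "count": 3, "total": 2**40, "ok": True, "avg": 1.5}
adapted = adapt_document(doc)
```

Converting a document with `adapt_document` and then `to_document` returns the original field values, which is the property the connector needs so that results can be written back without loss.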

The big-data analysis and processing system of the invention works as follows. The data is stored in the MongoDB sharded cluster. The user submits a job to Hadoop; Hadoop obtains the storage information of the data by accessing the routing process unit and splits the data into input blocks for Hadoop MapReduce. The JobTracker unit 2 distributes the data block information to different TaskTracker units 3. Each TaskTracker unit 3 fetches the actual data from the MongoDB sharded cluster according to the data block information it received (the data in transit is adapted by the mongo-hadoop connector) and performs the MapReduce processing. When processing is complete, the TaskTracker unit 3 returns the results to the MongoDB sharded cluster (again, the data in transit is adapted by the mongo-hadoop connector).

The big-data access method of the invention uses the big-data analysis and processing system described above and, as shown in Fig. 2, is carried out according to the following steps:

Step 1: the user submits a MapReduce job to Hadoop and configures the MongoDB database as the data source of Hadoop MapReduce; the MapReduce job comprises the data source address, the output address for the result data, and the concrete map and reduce procedures;

Step 2: Hadoop obtains the storage information of the data by accessing the routing process unit 5 and splits the data into input blocks for Hadoop MapReduce;

Step 3: the JobTracker unit 2 distributes the data block information to different TaskTracker units 3, and each TaskTracker unit 3 fetches the actual data from the MongoDB sharded cluster according to the data block information it received;

Step 4: the fetched data is adapted by the mongo-hadoop connector 1 into data types that Hadoop MapReduce can process directly and is passed to MapReduce; the data types are the BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, and Text formats;

Step 5: MapReduce processes the data adapted in Step 4 by parallel computation;

Step 6: after the mongo-hadoop connector 1 has adapted the results into a data format that MongoDB can write, the TaskTracker unit 3 sends the results to the MongoDB sharded cluster, where they are stored in the MongoDB database; the data format that MongoDB can write is the BSON format.
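The flow of the steps above can be traced end to end with a small in-memory simulation in Python. This is a sketch under simplifying assumptions, not the patented system: shards are plain lists, one input split is created per shard, tasks run sequentially rather than in parallel, type adaptation is omitted, and the function `run_job` and the sample data are hypothetical.

```python
from collections import defaultdict


def run_job(shards, mapper, reducer):
    """Simulate the access method without any intermediate HDFS stage."""
    # Step 2: derive the input splits from the cluster's storage layout
    # (assumed here: one split per shard).
    splits = list(shards.keys())

    # Steps 3 and 4: each task reads the documents of its split and
    # feeds them to the map procedure.
    intermediate = defaultdict(list)
    for split in splits:
        for doc in shards[split]:
            for key, value in mapper(doc):
                intermediate[key].append(value)

    # Step 5: reduce each key's values (sequentially in this sketch).
    results = [reducer(k, vs) for k, vs in sorted(intermediate.items())]

    # Step 6: shape the results as documents ready to be written back.
    return [{"_id": k, "value": v} for k, v in results]


def count_by_city(doc):
    yield doc["city"], 1


def sum_counts(key, values):
    return key, sum(values)


shards = {
    "shard1": [{"city": "Xi'an"}, {"city": "Beijing"}],
    "shard2": [{"city": "Xi'an"}],
}
output_collection = run_job(shards, count_by_city, sum_counts)
```

The input comes straight from the shard contents and the output is already a list of documents, so no file-system copy appears anywhere in the flow, which is the efficiency argument the description makes.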

Claims

1. A big-data analysis and processing system, characterized in that it comprises a Hadoop MapReduce module, a mongo-hadoop connector (1), and a MongoDB sharded database cluster, distributed across physical servers.

2. The big-data analysis and processing system according to claim 1, characterized in that the physical servers comprise a master node physical server and slave node physical servers.

3. The big-data analysis and processing system according to claim 2, characterized in that the Hadoop MapReduce module comprises a JobTracker unit (2) and a TaskTracker unit (3), the JobTracker unit (2) being deployed on the master node physical server and the TaskTracker unit (3) being deployed on the slave node physical servers.

4. The big-data analysis and processing system according to claim 1, characterized in that the MongoDB sharded database cluster comprises a mongod process unit (4), a routing process unit (5), and a configuration server unit (6), the routing process unit (5) being deployed on the master node physical server and the mongod process unit (4) and the configuration server unit (6) both being deployed on the slave node physical servers.

5. The big-data analysis and processing system according to claim 2, 3, or 4, characterized in that there are at least 2 slave node physical servers.

6. A big-data access method, characterized in that it uses a big-data analysis and processing system structured as follows: it comprises a Hadoop MapReduce module, a mongo-hadoop connector (1), and a MongoDB sharded database cluster, distributed across physical servers;

the physical servers comprise a master node physical server and slave node physical servers;

the Hadoop MapReduce module comprises a JobTracker unit (2) and a TaskTracker unit (3), the JobTracker unit (2) being deployed on the master node physical server and the TaskTracker unit (3) being deployed on the slave node physical servers;

the MongoDB sharded database cluster comprises a mongod process unit (4), a routing process unit (5), and a configuration server unit (6), the routing process unit (5) being deployed on the master node physical server and the mongod process unit (4) and the configuration server unit (6) both being deployed on the slave node physical servers;

there are at least 2 slave node physical servers;

the big-data access method using the above Hadoop- and MongoDB-based big-data analysis and processing system is carried out according to the following steps:

Step 1: the user submits a MapReduce job to Hadoop and configures the MongoDB database as the data source of Hadoop MapReduce; the MapReduce job comprises the data source address, the output address for the result data, and the concrete map and reduce procedures;

Step 2: Hadoop obtains the storage information of the data by accessing the routing process unit (5) and splits the data into input blocks for Hadoop MapReduce;

Step 3: the JobTracker unit (2) distributes the data block information to different TaskTracker units (3), and each of the different TaskTracker units (3) fetches the actual data from the MongoDB sharded cluster according to the data block information it received;

Step 4: the fetched data is adapted by the mongo-hadoop connector (1) into data types that Hadoop MapReduce can process directly and is passed to MapReduce,

the data types being the BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, and Text formats;

Step 5: MapReduce processes the data adapted in Step 4 by parallel computation;

Step 6: after the mongo-hadoop connector (1) has adapted the results into a data format that MongoDB can write, the TaskTracker unit (3) sends the results to the MongoDB sharded cluster, where they are stored in the MongoDB database; the data format that MongoDB can write is the BSON format.