CN104317899A - Big-data analyzing and processing system and access method - Google Patents

Big-data analyzing and processing system and access method

Info

Publication number
CN104317899A
Authority
CN
China
Prior art keywords
unit
hadoop
data
physical server
mongodb
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410577412.1A
Other languages
Chinese (zh)
Inventor
王茜
葛新
李安颖
史晨昱
梁小江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Following International Information Ltd Co
Original Assignee
Xi'an Following International Information Ltd Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Following International Information Ltd Co
Priority to CN201410577412.1A
Publication of CN104317899A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big-data analysis and processing system comprising a Hadoop MapReduce module, a mongo-hadoop connector and a MongoDB sharded database cluster, all distributed across physical servers. With this system, which is based on Hadoop and MongoDB, data in MongoDB can be processed directly by Hadoop's MapReduce component and the results can be written directly back to the MongoDB database. The invention further provides a big-data access method for this system, by which data in MongoDB is likewise processed directly by Hadoop MapReduce and the results are written directly back to the MongoDB database.

Description

A big-data analysis and processing system and access method
Technical field
The invention belongs to the technical field of big-data processing and relates to a big-data analysis and processing system; it also relates to a big-data access method.
Background art
With the development of information technology, the volume of information is growing geometrically and the internet is filled with all kinds of non-relational data structures, so traditional relational databases struggle to meet the new demands. At the same time, it is becoming increasingly difficult for centralized data analysis and processing to analyze massive amounts of information quickly and extract the information that is actually needed. Data storage and data analysis should therefore both be capable of distributed processing, so that the system can be scaled out as the information grows, increasing its storage, analysis and processing capacity. The emergence of NoSQL database technology offers a new solution to these problems: it uses a distributed, multi-node approach that is better suited to storing and managing big data. NoSQL databases are designed with particular attention to highly concurrent reads and writes and to the storage of massive data. Compared with relational databases, they "subtract" in architecture and data model and "add" in scalability and concurrency. Today's computer architectures demand massive horizontal scalability in data storage, and NoSQL aims to meet that demand. Google, Yahoo, Facebook, Twitter and Amazon all make extensive use of NoSQL databases, and NoSQL databases are gradually becoming an indispensable part of the database field.
MongoDB is one of the most popular NoSQL database products. It sits between relational and non-relational databases: among non-relational databases it is the most feature-rich and the most similar to a relational database. The data structures it supports are very loose; they use BSON, a JSON-like format, so fairly complex data types can be stored. MongoDB's most notable feature is its powerful query language, whose syntax somewhat resembles an object-oriented query language. It can perform most of the functions of single-table queries in a relational database and also supports indexing the data. MongoDB is characterized by high performance, easy deployment and ease of use, and it makes storing data very convenient.
Distributed cloud computing technology consolidates resources to provide a simplified, centralized computing platform that reduces cost and energy consumption. Hadoop is an open-source distributed parallel computing platform whose Map/Reduce computation is widely used in data analysis and processing, and Hadoop is developing into an excellent approach to big-data analysis.
Hadoop is a complete open-source framework for big-data analysis. It comprises a distributed file system (HDFS), a parallel processing framework (Apache Hadoop MapReduce) and a number of other components that support functions such as data acquisition, workflow coordination, task management and cluster monitoring. Hadoop can process large unstructured data sets more economically and efficiently than traditional methods.
When massive data is stored in a NoSQL database, the usual way for Hadoop to process it is first to import the data to be analyzed from the NoSQL database into HDFS, then run MapReduce on it, write the data to HDFS again once MapReduce has finished, and finally write the results back to the NoSQL database. Throughout this process HDFS acts only as intermediate storage and performs no substantive analysis or processing of the data, while the NoSQL database is itself a data persistence tool. If the HDFS stage were omitted, the efficiency of data processing would improve considerably.
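The difference between the two data paths can be made concrete by listing the transfers each one performs. The sketch below is illustrative only: the stage labels paraphrase the paragraph above, and the count says nothing about actual run time.

```python
# Illustrative only: the stage labels paraphrase the text; they are not measurements.
CONVENTIONAL = [
    ("NoSQL", "HDFS"),       # import the data to be analyzed into HDFS
    ("HDFS", "MapReduce"),   # MapReduce reads its input from HDFS
    ("MapReduce", "HDFS"),   # MapReduce writes its output to HDFS
    ("HDFS", "NoSQL"),       # results are copied back to the database
]

DIRECT = [
    ("NoSQL", "MapReduce"),  # MapReduce reads straight from the database
    ("MapReduce", "NoSQL"),  # results are written straight back
]


def hdfs_hops(pipeline):
    """Count the transfers in a pipeline that read from or write to HDFS."""
    return sum(1 for src, dst in pipeline if "HDFS" in (src, dst))


if __name__ == "__main__":
    print("conventional:", len(CONVENTIONAL), "transfers,", hdfs_hops(CONVENTIONAL), "via HDFS")
    print("direct:      ", len(DIRECT), "transfers,", hdfs_hops(DIRECT), "via HDFS")
```

In the conventional path every transfer touches HDFS; in the direct path none does, which is the saving the text argues for.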
Summary of the invention
One object of the invention is to provide a big-data analysis and processing system in which data in MongoDB can be processed directly by Hadoop's MapReduce component and the results written directly back to the MongoDB database.
Another object of the invention is to provide a big-data access method by which data in MongoDB can be processed directly by Hadoop's MapReduce component and the results written directly back to the MongoDB database.
The first technical solution adopted by the invention is a big-data analysis and processing system comprising a Hadoop MapReduce module, a mongo-hadoop connector and a MongoDB sharded database cluster, all distributed across physical servers.
The first technical solution is further characterized as follows.
The physical servers comprise a master-node physical server and slave-node physical servers.
The Hadoop MapReduce module comprises a jobtracker unit and tasktracker units; the jobtracker unit is deployed on the master-node physical server and the tasktracker units are deployed on the slave-node physical servers.
The MongoDB sharded database cluster comprises mongod process units, a routing process unit and a configuration server unit; the routing process unit is deployed on the master-node physical server, and the mongod process units and the configuration server unit are all deployed on the slave-node physical servers.
There are at least 2 slave-node physical servers.
The second technical solution adopted by the invention is a big-data access method that uses a big-data analysis and processing system with the following structure: a Hadoop MapReduce module, a mongo-hadoop connector and a MongoDB sharded database cluster, all distributed across physical servers;
The physical servers comprise a master-node physical server and slave-node physical servers;
The Hadoop MapReduce module comprises a jobtracker unit and tasktracker units; the jobtracker unit is deployed on the master-node physical server and the tasktracker units are deployed on the slave-node physical servers;
The MongoDB sharded database cluster comprises mongod process units, a routing process unit and a configuration server unit; the routing process unit is deployed on the master-node physical server, and the mongod process units and the configuration server unit are all deployed on the slave-node physical servers;
There are at least 2 slave-node physical servers;
The big-data access method using the above big-data analysis and processing system is carried out in the following steps:
Step 1, the user submits a MapReduce job to Hadoop and configures the MongoDB database as the data source of Hadoop MapReduce; the MapReduce job specifies the data source address, the address to which the result data is output, and the concrete map and reduce procedures;
Step 2, Hadoop obtains the storage information of the data by accessing the routing process unit and splits the data into input splits for Hadoop MapReduce;
Step 3, the jobtracker unit distributes the split information to different tasktracker units, and each tasktracker unit fetches the actual data from the MongoDB sharded cluster according to the split information it has received;
Step 4, the fetched data is adapted by the mongo-hadoop connector into data types that Hadoop MapReduce can process directly and is passed to MapReduce,
where the data types are BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable and Text;
Step 5, MapReduce performs parallel computation on the adapted data from step 4;
Step 6, the tasktracker unit sends the results to the MongoDB sharded cluster after the mongo-hadoop connector has adapted them into a data format that MongoDB can write, and they are stored in the MongoDB database; the data format that MongoDB can write is BSON.
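Step 1 amounts to assembling a small job description. The sketch below shows one way to collect and sanity-check it. The property names `mongo.input.uri` and `mongo.output.uri` are modelled on the open-source mongo-hadoop connector and should be treated as assumptions here; the host, database and class names are hypothetical placeholders, and `build_job_config` is an illustrative helper, not part of any library.

```python
from urllib.parse import urlparse


def build_job_config(input_uri, output_uri, mapper, reducer):
    """Collect the pieces of a job submission named in step 1.

    The property names are assumptions modelled on the mongo-hadoop
    connector; check them against the connector version actually in use.
    """
    for uri in (input_uri, output_uri):
        if urlparse(uri).scheme != "mongodb":
            raise ValueError(f"not a mongodb:// URI: {uri}")
    return {
        "mongo.input.uri": input_uri,    # data source address
        "mongo.output.uri": output_uri,  # address the results are written to
        "mapper": mapper,                # concrete map procedure
        "reducer": reducer,              # concrete reduce procedure
    }


if __name__ == "__main__":
    # Hypothetical host, database, collection and class names.
    config = build_job_config(
        "mongodb://master:27017/logs.events",
        "mongodb://master:27017/logs.results",
        mapper="example.EventMapper",
        reducer="example.EventReducer",
    )
    for key, value in config.items():
        print(f"{key} = {value}")
```

Both URIs point at the routing process on the master node rather than at an individual shard, which matches step 2, where Hadoop asks the routing process for the storage information.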
The beneficial effect of the invention is that the HDFS stage of Hadoop is eliminated: the data in MongoDB is accessed directly by Hadoop's MapReduce component, so Hadoop can efficiently read and process data stored in MongoDB and return the results to the MongoDB database without an intermediate copy, which significantly improves the efficiency of data processing.
Brief description of the drawings
Fig. 1 is a structural diagram of the big-data analysis and processing system of the invention;
Fig. 2 is a flow chart of the big-data access method of the invention.
In the figures: 1. mongo-hadoop connector, 2. jobtracker unit, 3. tasktracker unit, 4. mongod process unit, 5. routing process unit, 6. configuration server unit.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the big-data analysis and processing system of the invention comprises a Hadoop MapReduce module, a mongo-hadoop connector 1 and a MongoDB sharded database cluster, all distributed across physical servers. The physical servers comprise a master-node physical server and slave-node physical servers. The Hadoop MapReduce module comprises a jobtracker unit 2 and tasktracker units 3; the jobtracker unit 2 is deployed on the master-node physical server and the tasktracker units 3 are deployed on the slave-node physical servers. The MongoDB sharded database cluster comprises mongod process units 4, a routing process unit 5 and a configuration server unit 6; the routing process unit 5 is deployed on the master-node physical server, and the mongod process units 4 and the configuration server unit 6 are all deployed on the slave-node physical servers. There are at least 2 slave-node physical servers.
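The placement rules in this paragraph can be written down as a small checker. The sketch below is one strict reading of the text, not something the patent prescribes: the role names, the `Server` class and `check_layout` are hypothetical, and the rule that the master hosts none of the slave-side roles is an interpretation.

```python
from dataclasses import dataclass, field

# Role names are hypothetical labels for the units in Fig. 1.
MASTER_ROLES = {"jobtracker", "router"}
SLAVE_ROLES = {"tasktracker", "mongod", "configsvr"}
MIN_SLAVES = 2


@dataclass
class Server:
    name: str
    roles: set = field(default_factory=set)


def check_layout(master, slaves):
    """Return a list of problems; an empty list means the layout fits the text."""
    problems = []
    missing = MASTER_ROLES - master.roles
    if missing:
        problems.append(f"master {master.name} lacks: {sorted(missing)}")
    misplaced = SLAVE_ROLES & master.roles
    if misplaced:
        problems.append(f"master {master.name} must not host: {sorted(misplaced)}")
    if len(slaves) < MIN_SLAVES:
        problems.append(f"need at least {MIN_SLAVES} slaves, got {len(slaves)}")
    hosted = set()
    for slave in slaves:
        hosted |= slave.roles
        stray = MASTER_ROLES & slave.roles
        if stray:
            problems.append(f"slave {slave.name} must not host: {sorted(stray)}")
    unplaced = SLAVE_ROLES - hosted
    if unplaced:
        problems.append(f"no slave hosts: {sorted(unplaced)}")
    return problems


if __name__ == "__main__":
    master = Server("master", {"jobtracker", "router"})
    slaves = [
        Server("slave1", {"tasktracker", "mongod", "configsvr"}),
        Server("slave2", {"tasktracker", "mongod"}),
    ]
    print(check_layout(master, slaves) or "layout OK")
```

Running tasktracker and mongod units on the same slave servers keeps computation next to the shard data, which is presumably why the text places both on the slave nodes.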
The modules of the big-data analysis and processing system of the invention have the following specific functions:
1. MapReduce module
(a) The TaskTracker unit 3 manages and executes the individual Map and Reduce tasks on the compute nodes of the cluster;
(b) The JobTracker unit 2 accepts job submissions, provides job monitoring and control, manages tasks, and distributes jobs to the nodes running TaskTracker units 3.
2. MongoDB sharded database cluster
(1) Non-relational database shard 1, non-relational database shard 2: three mongod process units 4 form a non-relational database replica set (used in this invention for reliable data storage; through its own replication mechanism it can fail over automatically), and the replica set constitutes one non-relational database shard, which stores part of the data blocks of the actual cluster;
(2) The configuration server unit 6 stores the cluster metadata of the whole MongoDB sharded cluster, including the global cluster configuration, the location of each database, collection and specific data range, and a change record;
(3) The routing process unit 5 provides a single interface to the whole cluster and directs every read and write request to the appropriate shard;
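The division of labour between the configuration server (which records which data range lives on which shard) and the routing process (which uses that record to direct requests) can be sketched as a range lookup. This is a toy model under stated assumptions: `Router`, the integer shard key and the chunk table are hypothetical, and a real MongoDB router (the mongos process) does considerably more.

```python
import bisect


class Router:
    """Toy stand-in for the routing process: maps a shard-key value to a shard."""

    def __init__(self, chunk_table):
        # chunk_table plays the role of the configuration server's metadata:
        # (inclusive lower bound of a key range, shard name), sorted by bound.
        self._bounds = [bound for bound, _ in chunk_table]
        self._shards = [shard for _, shard in chunk_table]

    def shard_for(self, key):
        # Find the last range whose lower bound is <= key.
        index = bisect.bisect_right(self._bounds, key) - 1
        if index < 0:
            raise KeyError(f"no chunk covers key {key!r}")
        return self._shards[index]


if __name__ == "__main__":
    router = Router([(0, "shard1"), (100, "shard2"), (200, "shard1")])
    for key in (5, 100, 250):
        print(key, "->", router.shard_for(key))
```

Because clients only talk to the router, ranges can move between shards by updating the table without the clients noticing.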
3. mongo-hadoop connector 1
Its function is to connect MongoDB and Hadoop for data exchange. It adapts MongoDB's input data for Hadoop (BSON) into data types that Hadoop MapReduce can process directly (BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text, etc.), and it adapts the result data types of Hadoop MapReduce (BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text, etc.) into a data type that can be stored directly in MongoDB (BSON).
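The two-way adaptation can be illustrated with a toy mapping from document field values to the Writable type names listed above. This is a sketch of the idea only: the real connector is Java code operating on BSON objects, and the mapping rules below (for example, choosing IntWritable or LongWritable by magnitude) are assumptions made for illustration.

```python
def to_writable(value):
    """Tag a field value with the name of a Hadoop Writable type (toy mapping)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return ("BooleanWritable", value)
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return ("IntWritable", value)
        if -2**63 <= value < 2**63:
            return ("LongWritable", value)
        raise OverflowError(f"integer does not fit in 64 bits: {value}")
    if isinstance(value, float):
        return ("DoubleWritable", value)
    if isinstance(value, str):
        return ("Text", value)
    raise TypeError(f"no Writable mapping for {type(value).__name__}")


def adapt_document(document):
    """Input direction: document fields -> tagged values for MapReduce."""
    return {name: to_writable(value) for name, value in document.items()}


def to_document(fields):
    """Output direction: tagged values -> a plain document ready to be stored."""
    return {name: value for name, (_, value) in fields.items()}


if __name__ == "__main__":
    doc = {"user": "alice", "clicks": 7, "bytes": 2**40, "ratio": 0.25, "active": True}
    adapted = adapt_document(doc)
    for name, (type_name, value) in adapted.items():
        print(f"{name}: {type_name}({value!r})")
    assert to_document(adapted) == doc  # the round trip preserves the document
```

The round trip in the usage example mirrors the connector's role on both sides of the job: adapt on the way in, adapt back on the way out.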
The system works as follows. The data is stored in the MongoDB sharded cluster. The user submits a job to Hadoop; Hadoop obtains the storage information of the data by accessing the routing process and splits the data into input splits for Hadoop MapReduce. The jobtracker unit 2 distributes the split information to different tasktracker units 3. Each tasktracker unit 3 fetches the actual data from the MongoDB sharded cluster according to the split information it has received (the data in transit is adapted by the mongo-hadoop connector) and performs the MapReduce processing. When processing is finished, the tasktracker unit 3 returns the results to the MongoDB sharded cluster (again adapted in transit by the mongo-hadoop connector).
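The splitting and distribution described here can be sketched in a few lines. The text does not say how splits are sized or how the jobtracker chooses a tasktracker, so the fixed split size and round-robin assignment below are assumptions, and `make_splits` and `assign` are hypothetical helpers.

```python
def make_splits(chunks, max_keys):
    """Cut each (shard, low, high) key range into splits of at most max_keys keys.

    Ranges are half-open: low is included, high is not.
    """
    splits = []
    for shard, low, high in chunks:
        start = low
        while start < high:
            end = min(start + max_keys, high)
            splits.append((shard, start, end))
            start = end
    return splits


def assign(splits, trackers):
    """Hand splits to tasktrackers round-robin (a stand-in for real scheduling)."""
    plan = {tracker: [] for tracker in trackers}
    for index, split in enumerate(splits):
        plan[trackers[index % len(trackers)]].append(split)
    return plan


if __name__ == "__main__":
    # Storage information as it might be obtained through the routing process.
    chunks = [("shard1", 0, 250), ("shard2", 250, 400)]
    plan = assign(make_splits(chunks, max_keys=100), ["tasktracker-1", "tasktracker-2"])
    for tracker, work in plan.items():
        print(tracker, work)
```

Only split descriptions travel from the jobtracker to the tasktrackers; the documents themselves are fetched later by each tasktracker, as the text states.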
The big-data access method of the invention uses the structure of the above big-data analysis and processing system and, as shown in Fig. 2, is carried out in the following steps:
Step 1, the user submits a MapReduce job to Hadoop and configures the MongoDB database as the data source of Hadoop MapReduce; the MapReduce job specifies the data source address, the address to which the result data is output, and the concrete map and reduce procedures;
Step 2, Hadoop obtains the storage information of the data by accessing the routing process unit 5 and splits the data into input splits for Hadoop MapReduce;
Step 3, the jobtracker unit 2 distributes the split information to different tasktracker units 3, and each tasktracker unit 3 fetches the actual data from the MongoDB sharded cluster according to the split information it has received;
Step 4, the fetched data is adapted by the mongo-hadoop connector 1 into data types that Hadoop MapReduce can process directly and is passed to MapReduce; the data types are BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable and Text;
Step 5, MapReduce performs parallel computation on the adapted data from step 4;
Step 6, the tasktracker unit 3 sends the results to the MongoDB sharded cluster after the mongo-hadoop connector 1 has adapted them into a data format that MongoDB can write, and they are stored in the MongoDB database; the data format that MongoDB can write is BSON.
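The six steps can be walked through with an in-memory simulation. The sketch below uses no Hadoop or MongoDB at all: the shards are plain Python lists, one split per shard is assumed, type adaptation (step 4) is omitted, and `run_job` and the sample documents are hypothetical.

```python
from collections import defaultdict


def run_job(shards, map_fn, reduce_fn):
    """Simulate steps 2 to 6 for a job whose input lives in `shards`."""
    intermediate = defaultdict(list)
    # Steps 2-3: treat each shard as one split and read its documents directly.
    for documents in shards.values():
        for document in documents:
            # Step 5 (map side): emit key/value pairs for each document.
            for key, value in map_fn(document):
                intermediate[key].append(value)
    # Step 5 (reduce side) and step 6: build result documents for write-back.
    return [
        {"_id": key, "value": reduce_fn(key, values)}
        for key, values in sorted(intermediate.items())
    ]


def count_by_user(document):
    """Example map procedure: one count per event, keyed by user."""
    yield document["user"], 1


def total(key, values):
    """Example reduce procedure: sum the counts for one key."""
    return sum(values)


if __name__ == "__main__":
    shards = {
        "shard1": [{"user": "alice"}, {"user": "bob"}],
        "shard2": [{"user": "alice"}],
    }
    for result in run_job(shards, count_by_user, total):
        print(result)
```

At no point does the simulation stage data in an intermediate store, which is the property the method claims over the HDFS-based workflow.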

Claims (6)

1. A big-data analysis and processing system, characterized in that it comprises a Hadoop MapReduce module, a mongo-hadoop connector (1) and a MongoDB sharded database cluster, all distributed across physical servers.
2. The big-data analysis and processing system according to claim 1, characterized in that the physical servers comprise a master-node physical server and slave-node physical servers.
3. The big-data analysis and processing system according to claim 2, characterized in that the Hadoop MapReduce module comprises a jobtracker unit (2) and tasktracker units (3), the jobtracker unit (2) being deployed on the master-node physical server and the tasktracker units (3) being deployed on the slave-node physical servers.
4. The big-data analysis and processing system according to claim 1, characterized in that the MongoDB sharded database cluster comprises mongod process units (4), a routing process unit (5) and a configuration server unit (6), the routing process unit (5) being deployed on the master-node physical server, and the mongod process units (4) and the configuration server unit (6) all being deployed on the slave-node physical servers.
5. The big-data analysis and processing system according to claim 2, 3 or 4, characterized in that there are at least 2 slave-node physical servers.
6. A big-data access method, characterized in that it uses a big-data analysis and processing system with the following structure: a Hadoop MapReduce module, a mongo-hadoop connector (1) and a MongoDB sharded database cluster, all distributed across physical servers;
the physical servers comprise a master-node physical server and slave-node physical servers;
the Hadoop MapReduce module comprises a jobtracker unit (2) and tasktracker units (3), the jobtracker unit (2) being deployed on the master-node physical server and the tasktracker units (3) being deployed on the slave-node physical servers;
the MongoDB sharded database cluster comprises mongod process units (4), a routing process unit (5) and a configuration server unit (6), the routing process unit (5) being deployed on the master-node physical server, and the mongod process units (4) and the configuration server unit (6) all being deployed on the slave-node physical servers;
there are at least 2 slave-node physical servers;
the big-data access method using the above big-data analysis and processing system based on Hadoop and MongoDB is carried out in the following steps:
Step 1, the user submits a MapReduce job to Hadoop and configures the MongoDB database as the data source of Hadoop MapReduce; the MapReduce job specifies the data source address, the address to which the result data is output, and the concrete map and reduce procedures;
Step 2, Hadoop obtains the storage information of the data by accessing the routing process unit (5) and splits the data into input splits for Hadoop MapReduce;
Step 3, the jobtracker unit (2) distributes the split information to different tasktracker units (3), and each of the tasktracker units (3) fetches the actual data from the MongoDB sharded cluster according to the split information it has received;
Step 4, the fetched data is adapted by the mongo-hadoop connector (1) into data types that Hadoop MapReduce can process directly and is passed to MapReduce,
the data types being BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable and Text;
Step 5, MapReduce performs parallel computation on the adapted data from step 4;
Step 6, the tasktracker unit (3) sends the results to the MongoDB sharded cluster after the mongo-hadoop connector (1) has adapted them into a data format that MongoDB can write, and they are stored in the MongoDB database; the data format that MongoDB can write is BSON.
CN201410577412.1A 2014-10-24 2014-10-24 Big-data analyzing and processing system and access method Pending CN104317899A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410577412.1A CN104317899A (en) 2014-10-24 2014-10-24 Big-data analyzing and processing system and access method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410577412.1A CN104317899A (en) 2014-10-24 2014-10-24 Big-data analyzing and processing system and access method

Publications (1)

Publication Number Publication Date
CN104317899A true CN104317899A (en) 2015-01-28

Family

ID=52373131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410577412.1A Pending CN104317899A (en) 2014-10-24 2014-10-24 Big-data analyzing and processing system and access method

Country Status (1)

Country Link
CN (1) CN104317899A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130144605A1 (en) * 2011-12-06 2013-06-06 Mehrman Law Office, PC Text Mining Analysis and Output System
CN102937980A (en) * 2012-10-18 2013-02-20 亿赞普(北京)科技有限公司 Method for inquiring data of cluster database

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
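张广弟: "Research on storage and parallel query techniques for massive spatial data in a distributed environment", China Master's Theses Full-text Database, Basic Sciences *
雷德龙: "MongoDB-based cloud storage and processing *** for vector spatial data", Journal of Geo-information Science *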

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104918117A (en) * 2015-03-24 2015-09-16 四川长虹电器股份有限公司 Intelligent television advertisement and user label recommending method
CN106844399B (en) * 2015-12-07 2022-08-09 中兴通讯股份有限公司 Distributed database system and self-adaptive method thereof
CN106844399A (en) * 2015-12-07 2017-06-13 中兴通讯股份有限公司 Distributed data base system and its adaptive approach
CN106599253A (en) * 2016-12-21 2017-04-26 济南浪潮高新科技投资发展有限公司 Method for achieving distributed computation by using NoSQL database
CN106778351B (en) * 2016-12-30 2020-04-21 中国民航信息网络股份有限公司 Data desensitization method and device
CN106778351A (en) * 2016-12-30 2017-05-31 中国民航信息网络股份有限公司 Data desensitization method and device
CN108446371A (en) * 2018-03-15 2018-08-24 平安科技(深圳)有限公司 Data return guiding method, device, computer equipment and storage medium
CN108446371B (en) * 2018-03-15 2020-10-27 平安科技(深圳)有限公司 Data back-leading method and device, computer equipment and storage medium
CN109471837A (en) * 2018-10-08 2019-03-15 国网经济技术研究院有限公司 Distributed storage method of power infrastructure data
EP3819774A1 (en) * 2019-11-06 2021-05-12 Microsoft Technology Licensing, LLC Confidential computing mechanism
WO2021091744A1 (en) * 2019-11-06 2021-05-14 Microsoft Technology Licensing, Llc Confidential computing mechanism
US12013794B2 (en) 2019-11-06 2024-06-18 Microsoft Technology Licensing, Llc Confidential computing mechanism
CN114911876A (en) * 2022-05-18 2022-08-16 山东浪潮科学研究院有限公司 Distributed computing method for realizing digital energy management system
CN114911876B (en) * 2022-05-18 2024-05-31 山东浪潮科学研究院有限公司 Distributed computing method for realizing digital energy management system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150128