CN105512336A - Method and device for mass data processing based on Hadoop

Info

Publication number: CN105512336A
Application number: CN201511009913.0A
Authority: CN
Inventors: 王明龙; 王力; 彭塨烨; 谢潇宇; 王伟; 包辰明; 赵金鑫; 张舜华; 陈暑生
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2015-12-29
Filing date: 2015-12-29
Publication date: 2016-04-20

Abstract

The invention provides a method and device for mass data processing based on Hadoop. The method comprises: collecting data; integrating the collected data; storing the integrated data in an HBase database; computing indicator statistics separately for each update cycle of the data in the HBase database; and storing the results of the indicator statistics in a relational database. The method and device relieve the database management burden of mass data processing and make the statistical results of mass data easier to present.

Description

Method and device for mass data processing based on Hadoop

Technical field

The present invention relates to the field of data processing, and in particular to a method and device for mass data processing.

Background

In data processing for e-commerce websites, the asynchronous, discrete data from each business master database, access logs, transaction logs and so on is usually processed in a unified way, so that system indicators such as business traffic, visits, users and products can be monitored in near real time and periodically over recent periods. With the rapid development of e-commerce, the data produced by websites is growing explosively, and how to store and process mass data quickly and efficiently has become an important technical problem.

At present, relational databases are mainly used to process mass data. Traditional relational databases, however, all enforce database transaction consistency, whereas data mining and data analysis do not require strict transactional properties or read consistency. The transaction processing of a relational database is therefore an unnecessary burden for data computation and data mining. Designing a mass data processing scheme suited to data computation and mining has thus become a technical problem in urgent need of a solution.

Summary of the invention

To solve the above technical problem, the invention provides a method and device for mass data processing based on Hadoop.

According to a first aspect of embodiments of the invention, a method for mass data processing based on Hadoop is provided. The method may comprise: collecting data; integrating the collected data; storing the integrated data in an HBase database; computing indicator statistics separately according to the update cycle of the data in the HBase database; and storing the results of the indicator statistics in a relational database.

In some embodiments of the invention, collecting data comprises: collecting log data asynchronously through a JavaScript script embedded in front-end pages together with rsyslog, and/or collecting business data from application servers synchronously through rsync.

In some embodiments of the invention, the collected data is integrated on the basis of the Flume NG framework.

In some embodiments of the invention, the collected data is buffered in a file-type queue in the Flume NG framework.

In some embodiments of the invention, the method further comprises: saving the results of the indicator statistics as periodic snapshot files, and providing the periodic snapshot files externally through BDE.

In some embodiments of the invention, the method further comprises: receiving a query condition entered by a user, accessing the relational database according to the query condition to obtain the results of the indicator statistics, and displaying those results to the user.

According to a second aspect of embodiments of the invention, a device for mass data processing based on Hadoop is provided. The device may comprise: an acquisition module for collecting data; an integration module for integrating the data collected by the acquisition module; a storage module for storing the data integrated by the integration module in an HBase database; and a processing module for computing indicator statistics separately according to the update cycle of the data in the HBase database, wherein the storage module is also used to store the results of the indicator statistics from the processing module in a relational database.

In some embodiments of the invention, the acquisition module collecting data comprises: collecting log data asynchronously through a JavaScript script embedded in front-end pages together with rsyslog, and/or collecting business data from application servers synchronously through rsync.

In some embodiments of the invention, the integration module is based on the Flume NG framework.

In some embodiments of the invention, the integration module buffers data in a file-type queue in the Flume NG framework.

In some embodiments of the invention, the processing module is also used to save the results of the indicator statistics as periodic snapshot files and to provide the periodic snapshot files externally through BDE.

In some embodiments of the invention, the device further comprises a display module for receiving a query condition entered by a user, accessing the relational database according to the query condition to obtain the results of the indicator statistics, and displaying those results to the user.

The method and device for mass data processing based on Hadoop provided by embodiments of the invention store the collected and integrated mass data, and the statistical results obtained by processing that data, in different types of database. This improves the efficiency of database management for mass data and also makes the statistical results easier to query and display. In addition, data with different update cycles is supplied externally in a uniform way in the form of snapshots, which unifies the frequency of external data supply and facilitates analysis and mining of mass data.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method for mass data processing based on Hadoop according to one embodiment of the invention;

Fig. 2 is an architecture diagram of mass data processing based on Hadoop according to one embodiment of the invention;

Fig. 3 is a schematic structural diagram of the device for mass data processing based on Hadoop according to one embodiment of the invention;

Fig. 4 is a schematic structural diagram of the device for mass data processing based on Hadoop according to one embodiment of the invention.

Detailed description

Various aspects of the invention are described in detail below with reference to the drawings and specific embodiments. Well-known modules and units, and their interconnections, links, communications or operations, are not shown or not described in detail. The described features, architectures or functions may be combined in any manner in one or more embodiments. Those skilled in the art will understand that the following embodiments are for illustration only and do not limit the scope of the invention. It will also be readily understood that the modules, units or processing methods of the embodiments described herein and shown in the drawings can be combined and designed in a variety of different configurations.

Referring to Fig. 1, which is a schematic flowchart of the method for mass data processing based on Hadoop according to one embodiment of the invention, the method may comprise:

S101, collecting data;

S102, integrating the collected data;

S103, storing the integrated data in an HBase database;

S104, computing indicator statistics separately according to the update cycle of the data in the HBase database;

S105, storing the results of the indicator statistics in a relational database.

Hadoop, as used in the invention, refers to the distributed system infrastructure developed by the Apache Software Foundation, which makes full use of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, HDFS (Hadoop Distributed File System), which can be deployed on inexpensive hardware and provides high-throughput access to application data. MapReduce is Hadoop's main execution framework: a programming model for distributed parallel processing that processes very large data sets in parallel in a reliable, fault-tolerant manner. HBase is a column-oriented NoSQL database built on HDFS.

The data processing method of the invention may comprise step S101, collecting data, for example through an acquisition module placed at the data source. Taking an online shopping mall as an example, asynchronous, discrete data such as the mall website's business data, access logs and transaction logs can be collected to provide source data for subsequent processing.

In some embodiments, collecting data in step S101 may include embedding a JavaScript script in the website's front-end pages and using rsyslog to collect log data (for example, access logs and transaction logs) asynchronously in near real time, and using the rsync backup tool to collect business data synchronously from the application servers of the mall's main site. In other embodiments, only one of log data and business data may be collected, according to the needs of the application.

Next, step S102 is performed: the data collected in step S101 is integrated. The integration may include parsing the collected business data and performing join queries and filtering on the key fields obtained by parsing the raw data. Data integration in the invention is based on the Apache Flume NG framework. The Flume framework mainly comprises three modules. The Source (receiving) module is responsible for data access and listening; a different listening agent (Agent) is used for each of the collection modes described above. For example, a LogAgent (log agent) listens for and integrates the log data transmitted asynchronously by rsyslog, and a DBAgent (database agent) integrates the business data of the relational database on the application server. The Channel (queue) module is responsible for data buffering: data received by a Source is sent uniformly to a Channel, where it waits to be processed by the Sink module. In embodiments of the invention, the collected data is buffered in a file-type queue, that is, every Channel is uniformly defined as a FileChannel. The Sink module is responsible for writing the data.
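The Source, Channel and Sink roles described above can be illustrated with a minimal Python sketch. It is not Flume code and does not reproduce the LogAgent or DBAgent of the embodiment: the class `FileChannel`, the functions `log_source` and `dict_sink`, the three-field log format and the row-key layout are all hypothetical stand-ins, chosen only to show how a file-backed queue decouples receiving from writing.

```python
import json
import os
import tempfile


class FileChannel:
    """File-backed FIFO queue (hypothetical stand-in for a file-type channel)."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte position of the next unread event
        open(self.path, "ab").close()  # create the queue file if it is missing

    def put(self, event):
        # Append one JSON-encoded event per line so events persist on disk.
        with open(self.path, "ab") as f:
            f.write(json.dumps(event).encode("utf-8") + b"\n")

    def take_all(self):
        # Return every event appended since the previous call.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
            self.offset = f.tell()
        return [json.loads(line) for line in data.decode("utf-8").splitlines() if line]


def log_source(lines):
    """Source role: parse raw lines (assumed format: 'timestamp user url')."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # filter out records that do not parse
        ts, user, url = parts
        yield {"ts": ts, "user": user, "url": url}


def dict_sink(events, table):
    """Sink role: write events into a dict that stands in for an HBase table."""
    for e in events:
        table[f'{e["user"]}|{e["ts"]}'] = e  # hypothetical row key: user|timestamp


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        channel = FileChannel(os.path.join(tmp, "channel.log"))
        raw = [
            "2015-12-29T10:00:00 u1 /home",
            "garbage",
            "2015-12-29T10:00:05 u2 /cart",
        ]
        for event in log_source(raw):
            channel.put(event)
        table = {}
        dict_sink(channel.take_all(), table)
        return table
```

Because events sit in a file between `put` and `take_all`, a slow or restarted sink does not block the source, which is the property a file-type queue is chosen for here.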

Then step S103 is performed: the data integrated in step S102 is stored in the HBase database. Data storage in the invention may use the HBaseSink interface, so that the Sink module of the Flume NG framework saves the data to the HBase database in the storage layer. The data stored in HBase may include access logs, transaction logs, business data, and redundant data tables formed by joining business data with business data tables; this data can serve as the base data for report generation and data mining. The invention exploits the fact that data used for report generation, data analysis and data mining does not require strict database transactional properties or read consistency. By storing the collected mass data in HBase, a non-relational database, it can significantly improve the database management efficiency of mass data processing and facilitate analysis and mining of mass data.

Then step S104 is performed: indicator statistics are computed separately according to the update cycle of the data in the HBase database. As described above, the data collected in step S101 may be asynchronous, and such data may have different update cycles (for example, 10 minutes or 1 hour). Indicator statistics can be computed separately according to the update cycle of the data used; that is, data with a given update cycle is aggregated at that same cycle. The statistics in step S104 may also include deduplication of the data, and both deduplication and statistics may be run as scheduled tasks. Once a scheduled task has produced the indicator statistics for data of a given update cycle, the results can be cached. For example, the results can be saved as periodic snapshot files (for example, daily snapshot files), and the saved snapshot files can be provided externally through BDE (Borland Database Engine) for data analysis, data mining and the like; for example, the snapshot files can be transmitted externally through BDE.net. By using snapshot files, the invention unifies the external supply cycle of data sources that have different update cycles, which greatly facilitates the statistical processing of many different types of mass data.
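The per-cycle statistics, deduplication and snapshot steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: the record fields (`ts`, `user`, `url`), the choice of "distinct user and URL pairs per bucket" as the indicator, and the JSON snapshot layout are all hypothetical, and the scheduling of the task itself is left out.

```python
import json
from collections import defaultdict


def bucket_start(ts, period_seconds):
    """Round a Unix timestamp down to the start of its update-cycle bucket."""
    return ts - ts % period_seconds


def count_by_cycle(records, period_seconds):
    """Count distinct (user, url) pairs per bucket; repeats inside a bucket are dropped."""
    seen = defaultdict(set)
    for r in records:
        seen[bucket_start(r["ts"], period_seconds)].add((r["user"], r["url"]))
    return {start: len(keys) for start, keys in sorted(seen.items())}


def daily_snapshot(day, metrics):
    """Serialize all indicators for one day into a single JSON snapshot string."""
    return json.dumps({"day": day, "metrics": metrics}, sort_keys=True)


# Hypothetical sample records; ts is seconds since the start of the day.
RECORDS = [
    {"ts": 0, "user": "u1", "url": "/home"},
    {"ts": 30, "user": "u1", "url": "/home"},  # duplicate within the same bucket
    {"ts": 100, "user": "u2", "url": "/home"},
    {"ts": 700, "user": "u1", "url": "/cart"},
    {"ts": 3700, "user": "u1", "url": "/cart"},
]

TEN_MINUTE = count_by_cycle(RECORDS, 600)   # 10-minute update cycle
HOURLY = count_by_cycle(RECORDS, 3600)      # 1-hour update cycle
SNAPSHOT = daily_snapshot("2015-12-29", {"views_10min": TEN_MINUTE, "views_1h": HOURLY})
```

Each source is aggregated at its own cycle, while the snapshot bundles every indicator into one daily file, so consumers see a single supply frequency regardless of how often the underlying data changes.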

Then step S105 is performed: the results of the indicator statistics from step S104 are stored in a relational database, for example an Oracle database. Storing the results in a relational database improves the efficiency of join queries when the statistical results are queried and displayed, reduces the query time for mass data statistics, and facilitates their visual presentation.
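A minimal sketch of step S105 follows. It uses Python's built-in `sqlite3` module purely as a stand-in for the relational database, since the embodiment names Oracle; the table name `metric_stats`, its columns, and the functions `open_stats_db` and `save_stats` are hypothetical. The composite primary key lets a scheduled task be re-run for the same period without creating duplicate rows.

```python
import sqlite3


def open_stats_db(path=":memory:"):
    """Open the statistics store and create the (hypothetical) result table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metric_stats (
               metric       TEXT    NOT NULL,
               period_start INTEGER NOT NULL,
               value        INTEGER NOT NULL,
               PRIMARY KEY (metric, period_start)
           )"""
    )
    return conn


def save_stats(conn, metric, counts):
    """Insert or overwrite one row per (metric, period_start)."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO metric_stats (metric, period_start, value) "
            "VALUES (?, ?, ?)",
            [(metric, start, value) for start, value in counts.items()],
        )
```

`INSERT OR REPLACE` is SQLite syntax; on Oracle the same idempotent write would typically be expressed with a `MERGE` statement.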

The method for mass data processing based on Hadoop of the invention may also comprise: receiving a query condition entered by a user (for example, a keyword to search for), accessing the relational database according to the query condition to obtain the results of the indicator statistics, and displaying those results to the user. For example, the statistical results in the relational database can be queried through standard Structured Query Language (SQL) or through an HQL (Hibernate Query Language) interface, and can be presented to the client in document form. In other embodiments, a Hive interface may also be provided for querying the data in the mass database HBase. Hive is a data warehouse tool for Hadoop: it can map structured data files to database tables, provides SQL query functionality, and can convert SQL statements into MapReduce jobs for execution.
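One way to turn a user's query condition into a SQL query over the stored statistics is sketched below. Again `sqlite3` stands in for the relational database, and the table `metric_stats`, the filter names in `ALLOWED_FILTERS` and the function `query_stats` are hypothetical. The sketch only accepts a fixed set of filters and binds every value as a parameter, so user input never becomes part of the SQL text.

```python
import sqlite3

# Hypothetical mapping from user-facing filter names to SQL predicates.
ALLOWED_FILTERS = {
    "metric": "metric = ?",
    "since": "period_start >= ?",
    "until": "period_start < ?",
}


def query_stats(conn, conditions):
    """Build and run a parameterized query from a dict of user conditions."""
    clauses, params = [], []
    for key, value in conditions.items():
        if key not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {key}")
        clauses.append(ALLOWED_FILTERS[key])
        params.append(value)
    sql = "SELECT metric, period_start, value FROM metric_stats"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY metric, period_start"
    return conn.execute(sql, params).fetchall()


def demo_connection():
    """Create an in-memory table with a few sample statistics rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metric_stats (metric TEXT, period_start INTEGER, value INTEGER)"
    )
    conn.executemany(
        "INSERT INTO metric_stats VALUES (?, ?, ?)",
        [("pv", 0, 2), ("pv", 600, 1), ("uv", 0, 1)],
    )
    return conn
```

For example, `query_stats(demo_connection(), {"metric": "pv", "since": 600})` selects only the `pv` rows whose period starts at or after 600.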

In one specific embodiment, the architecture of mass data processing based on Hadoop of the invention may be as shown in Fig. 2. In Fig. 2, data is collected; this data may include script data (for example, from JavaScript scripts), web log data and business data. The collected data is integrated through Flume data collection. The integrated data can be stored in HBase/HDFS/Oracle. The stored data can be exchanged with a search engine through a SQL API (application programming interface) and HQL, and the search engine can provide search results to a data query interface through a Hive interface for users to view. Scheduled tasks can also compute statistics over the stored data, generate report records, and form embedded report controls. The stored data can also be supplied externally in the form of snapshot files, forming daily full snapshot files.

The flow of the method for mass data processing based on Hadoop has been described above with reference to specific embodiments. The device for mass data processing based on Hadoop that applies this method is described below with reference to specific embodiments.

Referring to Fig. 3, which is a schematic structural diagram of the device for mass data processing based on Hadoop according to one embodiment of the invention, the device 200 may comprise:

an acquisition module 201 for collecting data;

an integration module 202 for integrating the data collected by the acquisition module 201;

a storage module 203 for storing the data integrated by the integration module 202 in an HBase database;

a processing module 204 for computing indicator statistics separately according to the update cycle of the data in the HBase database, wherein

the storage module 203 is also used to store the results of the indicator statistics from the processing module 204 in a relational database.

The device 200 for mass data processing based on Hadoop of the invention may comprise the acquisition module 201, the integration module 202, the storage module 203 and the processing module 204, and the modules may each be deployed on a different server.
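How the four modules hand data to one another can be sketched with plain callables. This is only an illustration of the wiring: the class `MassDataDevice`, its field names and the toy lambdas in `build_demo` are hypothetical, in-memory lists and dicts stand in for HBase and the relational database, and nothing here models deployment on separate servers.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


@dataclass
class MassDataDevice:
    acquire: Callable[[], Iterable[str]]                # acquisition module 201
    integrate: Callable[[Iterable[str]], List[Record]]  # integration module 202
    store_raw: Callable[[List[Record]], None]           # storage module 203, HBase side
    compute: Callable[[], Dict[str, int]]               # processing module 204
    store_stats: Callable[[Dict[str, int]], None]       # storage module 203, relational side

    def run_once(self):
        """Run one pass: collect, integrate, store, compute statistics, store results."""
        records = self.integrate(self.acquire())
        self.store_raw(records)
        stats = self.compute()  # statistics are computed over the stored data
        self.store_stats(stats)
        return stats


def build_demo():
    raw_store: List[Record] = []       # stand-in for the HBase database
    stats_store: Dict[str, int] = {}   # stand-in for the relational database
    device = MassDataDevice(
        acquire=lambda: ["u1 /home", "u2 /home", "bad"],
        integrate=lambda lines: [
            dict(zip(("user", "url"), line.split()))
            for line in lines
            if len(line.split()) == 2  # drop lines that do not parse
        ],
        store_raw=raw_store.extend,
        compute=lambda: {"page_views": len(raw_store)},
        store_stats=stats_store.update,
    )
    return device, raw_store, stats_store
```

Keeping each module behind a narrow callable interface is one way to let the modules be replaced or relocated independently, in line with the statement that they may run on different servers.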

The acquisition module 201 can be used to collect data and may, for example, be placed at the data source (for example, the database of an application server that provides various business processes). Taking an online shopping mall as an example, asynchronous, discrete data such as the mall website's business data, access logs and transaction logs can be collected to provide source data for subsequent processing.

In some embodiments, data collection by the acquisition module 201 may include embedding a JavaScript script in the website's front-end pages and using rsyslog to collect log data (for example, access logs and transaction logs) asynchronously in near real time, and using the rsync backup tool to collect business data synchronously from the application servers of the mall's main site. In other embodiments, the acquisition module 201 may collect only one of log data and business data, according to the needs of the application.

The integration module 202 integrates the data collected by the acquisition module 201. The integration may include parsing the collected business data and performing join queries and filtering on the key fields obtained by parsing the raw data. Data integration in the invention is based on the Apache Flume NG framework. The Flume framework mainly comprises three modules. The Source (receiving) module is responsible for data access and listening; a different listening Agent is used for each of the collection modes described above. For example, a LogAgent listens for and integrates the data transmitted asynchronously by rsyslog, and a DBAgent integrates the business data of the relational database on the application server. The Channel (queue) module is responsible for data buffering: data received by a Source is sent uniformly to a Channel, where it waits to be processed by the Sink module. In embodiments of the invention, the collected data is buffered in a file-type queue, that is, every Channel is uniformly defined as a FileChannel. The Sink module is responsible for writing the data.

The storage module 203 stores the data integrated by the integration module 202 in the HBase database. Data storage in the invention may use the HBaseSink interface, so that the Sink module of the Flume NG framework saves the data to the HBase database in the storage layer. The data stored in HBase may include access logs, transaction logs, business data, and redundant data tables formed by joining business data with business data tables; this data can serve as the base data for report generation and data mining. The invention exploits the fact that data used for report generation, data analysis and data mining does not require strict database transactional properties or read consistency. By storing the collected mass data in HBase, a non-relational database, it can significantly improve the database management efficiency of mass data processing and facilitate analysis and mining of mass data.

The processing module 204 computes indicator statistics separately according to the update cycle of the data in the HBase database. As described above, the data collected by the acquisition module 201 may be asynchronous, and such data may have different update cycles (for example, 10 minutes or 1 hour). Indicator statistics can be computed separately according to the update cycle of the data used; that is, data with a given update cycle is aggregated at that same cycle. The statistics computed by the processing module 204 may also include deduplication of the data, and both deduplication and statistics may be run as scheduled tasks. Once a scheduled task has produced the indicator statistics for data of a given update cycle, the results can be cached. For example, the results can be saved as periodic snapshot files (for example, daily snapshot files), and the saved snapshot files can be provided externally through BDE; for example, the snapshot files can be transmitted externally through BDE.net. By using snapshot files, the invention unifies the external supply cycle of data sources that have different update cycles, which greatly facilitates the statistical processing of many different types of mass data.

The storage module 203 can also be used to store the results of the indicator statistics from the processing module 204 in a relational database, for example an Oracle database. Storing the results in a relational database improves the efficiency of join queries when the statistical results are queried and displayed, reduces the query time for mass data statistics, and facilitates their visual presentation.

The device for mass data processing based on Hadoop of the invention may also comprise a display module 205, as shown in Fig. 4. The display module 205 can receive a query condition entered by a user (for example, a keyword to search for), access the relational database according to the query condition to obtain the results of the indicator statistics, and display those results to the user. For example, the statistical results in the relational database can be queried through standard Structured Query Language (SQL) or through an HQL interface, and can be presented to the client in document form. In other embodiments, a Hive interface may also be provided for querying the data in the mass database HBase. Hive is a data warehouse tool for Hadoop: it can map structured data files to database tables, provides SQL query functionality, and can convert SQL statements into MapReduce jobs for execution.

From the above description of the embodiments, those skilled in the art will clearly understand that the invention can be implemented by software combined with a hardware platform. On this understanding, all or part of the contribution that the technical solution of the invention makes over the background art can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a smartphone, a network device or the like) to perform the methods described in the embodiments of the invention or in parts of those embodiments.

The terms and wording used in this specification are for illustration only and are not intended to be limiting. Those skilled in the art will understand that various changes can be made to the details of the above embodiments without departing from the basic principles of the disclosed embodiments. The scope of the invention is therefore determined only by the claims, in which, unless otherwise stated, all terms should be understood in their broadest reasonable sense.

Claims

1. A method for mass data processing based on Hadoop, characterized in that the method comprises:

collecting data;

integrating the collected data;

storing the integrated data in an HBase database;

computing indicator statistics separately according to the update cycle of the data in the HBase database; and

storing the results of the indicator statistics in a relational database.

2. The method according to claim 1, characterized in that collecting data comprises: collecting log data asynchronously through a JavaScript script embedded in front-end pages together with rsyslog, and/or collecting business data from application servers synchronously through rsync.

3. The method according to claim 1, characterized in that the collected data is integrated on the basis of the Flume NG framework.

4. The method according to claim 3, characterized in that the collected data is buffered in a file-type queue in the Flume NG framework.

5. The method according to claim 1, characterized in that the method further comprises:

saving the results of the indicator statistics as periodic snapshot files, and providing the periodic snapshot files externally through BDE.

6. The method according to claim 1, characterized in that the method further comprises:

receiving a query condition entered by a user, accessing the relational database according to the query condition to obtain the results of the indicator statistics, and displaying the results of the indicator statistics to the user.

7. A device for mass data processing based on Hadoop, characterized in that the device comprises:

an acquisition module for collecting data;

an integration module for integrating the data collected by the acquisition module;

a storage module for storing the data integrated by the integration module in an HBase database; and

a processing module for computing indicator statistics separately according to the update cycle of the data in the HBase database, wherein

the storage module is also used to store the results of the indicator statistics from the processing module in a relational database.

8. The device according to claim 7, characterized in that the acquisition module collecting data comprises: collecting log data asynchronously through a JavaScript script embedded in front-end pages together with rsyslog, and/or collecting business data from application servers synchronously through rsync.

9. The device according to claim 7, characterized in that the integration module is based on the Flume NG framework.

10. The device according to claim 9, characterized in that the integration module buffers data in a file-type queue in the Flume NG framework.

11. The device according to claim 7, characterized in that the processing module is also used to save the results of the indicator statistics as periodic snapshot files and to provide the periodic snapshot files externally through BDE.

12. The device according to claim 7, characterized in that the device further comprises:

a display module for receiving a query condition entered by a user, accessing the relational database according to the query condition to obtain the results of the indicator statistics, and displaying the results of the indicator statistics to the user.