CN103617211A - Data import method for loading data into HBase

Info

Publication number: CN103617211A
Application number: CN201310584702.4A
Authority: CN
Inventors: 郭美思; 王秀娟; 吴楠
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2013-11-20
Filing date: 2013-11-20
Publication date: 2014-03-05

Abstract

The invention discloses a data import method for loading data into HBase. In the Region pre-allocation stage, the environment and configuration parameters of the cluster are set, and an HBase table is created using a purpose-written function that determines the number of Regions. Once Region pre-allocation is finished, a MapReduce program is written, making use of the analysis and processing capability and the parallelism of the distributed computing framework, to generate HFile files from the source data. Finally, the completebulkload command completes the import, so that the data is imported into the HBase table in a predetermined format. With this method, the generated HFile files can be loaded directly into a running HBase cluster, which reduces the network traffic generated by data transfer and HBase loading during data migration, improves data import efficiency, and saves CPU (central processing unit) and network resources.

Description

A data import method for loading data into HBase

Technical field

The present invention relates to a data import method for loading data into HBase.

Technical background

With the rapid development of network technology and the rapid growth of data volumes, traditional technologies have run into serious obstacles in analyzing and exploiting these huge data resources and are no longer adequate for big-data analysis. To meet the requirements of big-data analysis, Google proposed MapReduce, a programming model for large-scale data analysis and parallel computation. Among the technologies that big data requires, distributed file systems and distributed databases are both well suited to large data sets. HBase is a scalable distributed database that supports large-scale data and uses Hadoop HDFS as its file storage system. Because it offers good scalability, fault tolerance, and random read capability, and supports MapReduce parallel computation, it has been adopted by a growing number of companies. Research shows, however, that the data import tools bundled with HBase have certain limitations: they do not give the user full control over the loading process, and the expected format of the loaded data cannot be customized. An import method that loads data into HBase in a specific format is therefore important.

The bulk load feature currently bundled with HBase supports loading massive amounts of data into HBase efficiently. Bulk load is implemented by a MapReduce job that directly generates files in HBase's internal HFile format, forming the data of an HBase table, and then loads those data files directly into the running cluster. The simplest way to use the bulk load feature is the importtsv tool, a built-in tool that loads content from TSV files directly into HBase. It runs a MapReduce job that either writes the data from the TSV files directly into an HBase table or writes it to data files in HBase's own internal format.

Although the importtsv tool is very useful when text data needs to be imported into HBase, there are cases, such as importing data in other formats, in which the user may wish to generate the data programmatically, and MapReduce is the most effective way to process massive data. It may also be the only feasible way to load massive data into HBase. MapReduce can of course be used to import data into HBase, but a massive data set makes the MapReduce job very heavy. If it is handled improperly, the throughput of the MapReduce job at run time may be very low.

Summary of the invention

The technical problem to be solved by the present invention is: to design the number of Regions in the HBase table rationally, so that the data to be imported is evenly distributed across the cluster; then to implement the Map interface and the Reduce interface of the MapReduce distributed computing programming model so as to obtain HFile files in the expected specific format; and finally to use the completebulkload tool to load the files into HBase in the expected format.

In HBase, data merging is a frequently executed, write-intensive task, unless the internal data files of HBase can be generated and loaded directly. Thus, although HBase writes are normally fast, write operations may often be blocked if the merging process is not suitably configured. Another problem that a write-heavy task may cause is that the data is written to the same region server; this often happens when massive data is imported into a newly created HBase table. Once the data is concentrated on the same server, the whole cluster becomes unbalanced and the write speed drops significantly. Therefore, Regions are first pre-allocated. The main purpose is to lay out the cluster before the data is imported into HBase, so that the imported data is evenly distributed across the cluster. A MapReduce program then produces data files in the specific format. Finally, the HFile files are loaded directly into HBase. This approach must ensure that the Region pre-allocation is reasonable and that the MapReduce program is reasonably designed and written. The method improves import efficiency and supports parallel computation, and is therefore more efficient.

The technical solution adopted by the present invention is:

A data import method for loading data into HBase: first, in the Region pre-allocation stage, the environment and configuration parameters of the cluster are set, and the HBase table is created using a purpose-written function that determines the number of Regions. Once Region pre-allocation is finished, a MapReduce program is written, making use of the analysis and processing capability and the parallelism of the distributed computing framework, to generate HFile files from the source data. Finally, the completebulkload command completes the import, so that the data is imported into the HBase table in the predetermined format. This improves import efficiency. The import method is implemented mainly by the MapReduce module.

Every row of data in HBase belongs to a specific Region. A Region contains HBase rows sorted by row key and is managed by a RegionServer. After the HBase table is created, it starts out in a single Region, and all inserted data initially goes into that Region. When the data reaches a certain limit, the Region is split into two, and the resulting Regions can be distributed to other RegionServers to balance the load in the cluster. Therefore, when importing data, Regions are pre-allocated first, and a suitable algorithm distributes the data across the whole cluster to achieve load balancing and speed up data loading.
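The routing of rows to Regions can be illustrated with a small model. The sketch below is plain Java and does not call the HBase client API. It assumes that rows are ordered by row key, that each Region covers the key range beginning at its start key, and that the first Region starts at the empty key. The class name `RegionLookupSketch`, the split keys, and the sample row keys are hypothetical.

```java
import java.util.List;
import java.util.TreeMap;

// Toy model of how a pre-split table routes rows to Regions.
// Not HBase API: Regions are modeled as start keys in a sorted map.
class RegionLookupSketch {
    private final TreeMap<String, Integer> startKeyToRegion = new TreeMap<>();

    // splitKeys is expected in ascending order; n split keys produce n + 1 Regions.
    RegionLookupSketch(List<String> splitKeys) {
        startKeyToRegion.put("", 0); // the first Region starts at the empty key
        int id = 1;
        for (String key : splitKeys) {
            startKeyToRegion.put(key, id++);
        }
    }

    // A row belongs to the Region with the greatest start key <= rowKey.
    int regionFor(String rowKey) {
        return startKeyToRegion.floorEntry(rowKey).getValue();
    }

    int regionCount() {
        return startKeyToRegion.size();
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch(List.of("B", "C", "D"));
        for (String row : List.of("A11-0001", "B03-0002", "C07-0003", "Z12-0004")) {
            System.out.println(row + " -> region " + table.regionFor(row));
        }
    }
}
```

With no split keys the model has a single Region that receives every row, which corresponds to the unbalanced starting state described above; adding split keys spreads the same rows over several Regions from the outset.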

In writing the MapReduce program, the MapReduce framework is responsible for dividing the data, treating each storage block (Block) of a file as one split, and then extracts the set of key-value pairs <K1, V1> of the records in the split as the Map input. In the specified mapper class, each input row is converted into the specified format: the Map module converts the row data according to the key-value pair, generates the row key, and specifies the column family name and the column name. In the map method, a Put object is created, the converted data is added to the Put object with the Put.add() function, and the context.write() method is called to write the data to an intermediate file. Then an intermediate key-value pair <rowkey, put> is generated from the rowkey and the Put object, and the intermediate result is written to local disk. The Reduce module obtains the location of the intermediate results from the Master, reads the data through a remote interface from the disk of the TaskTracker that executed the Map task, and arranges the rows before writing the data so that the output meets the expected format, yielding the final output, the HFile files.

In the Reduce class, the reduce method iterates over these records in the map output format specified by the user program, and the processed result is written to the specified file; the output path is set with the setOutputPath function.

Region pre-allocation calculates the number of Regions from the volume of data to be imported into HBase and the scale of the distributed cluster, and then designs and allocates the Regions in advance according to the data volume. This greatly reduces the number of Region splits, or avoids splits altogether, and achieves load balancing during data import.
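The text does not give a formula for the Region count, so the following is only a back-of-the-envelope sketch of one way to derive it from data volume and cluster scale. The average row size, the target size per Region, and the rounding up to a multiple of the number of region servers are all assumptions chosen for illustration, as is the class name `RegionCountEstimate`.

```java
// Rough estimate of how many Regions to pre-allocate.
// All figures passed in are illustrative assumptions, not values from the text.
class RegionCountEstimate {
    // Regions needed so that each holds at most targetRegionBytes,
    // rounded up to a multiple of the number of region servers (a heuristic).
    static long estimate(long rowCount, long avgRowBytes,
                         long targetRegionBytes, int regionServers) {
        long totalBytes = Math.multiplyExact(rowCount, avgRowBytes);
        // Ceiling division: a partially filled Region still counts as one Region.
        long bySize = (totalBytes + targetRegionBytes - 1) / targetRegionBytes;
        long atLeast = Math.max(bySize, regionServers);
        long remainder = atLeast % regionServers;
        return remainder == 0 ? atLeast : atLeast + (regionServers - remainder);
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        // Hypothetical inputs: 10 billion rows of about 100 bytes, 4 GiB per Region, 8 servers.
        System.out.println(estimate(10_000_000_000L, 100, 4 * gib, 8));
    }
}
```

Rounding to a multiple of the server count is meant to let every server host the same number of Regions; a real deployment would also weigh memory per server and the configured maximum Region size.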

The MapReduce programming framework carries out the process that produces the HFile-format files. The Map module designs a reasonable rowkey according to the data format and then processes the data to obtain intermediate results. The Reducer module organizes the results into a reasonable data format and finally outputs the HFile files to the specified output path. This process can be configured with multiple map tasks, which improves processing efficiency and greatly improves performance. The MapReduce framework simplifies the writing of parallel programs and provides a convenient programming interface.
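As an illustration of designing the rowkey from the data format, the following sketch derives a letter-plus-month prefix from one input record. It assumes the layout of the sample record quoted in Embodiment 1 (`A75566620131107,121212,33333`): a first field that starts with one letter and ends with an eight-digit `yyyyMMdd` date. That layout, the class name `RowKeySketch`, and the `-` separator are assumptions for illustration; the text does not specify the exact rowkey.

```java
// Sketch of a rowkey design that places each record in a letter+month key range.
// The record layout is an assumption inferred from a single sample line.
class RowKeySketch {
    // Assumed first field: one letter, then digits, ending in a yyyyMMdd date.
    static String buildRowKey(String line) {
        String id = line.split(",", -1)[0];
        if (id.length() < 9 || !Character.isLetter(id.charAt(0))) {
            throw new IllegalArgumentException("unexpected record: " + line);
        }
        String date = id.substring(id.length() - 8); // last eight characters
        String month = date.substring(4, 6);         // the MM part of yyyyMMdd
        // Prefix = letter + month, so the key sorts into the matching pre-split range.
        return "" + id.charAt(0) + month + "-" + id;
    }

    public static void main(String[] args) {
        System.out.println(buildRowKey("A75566620131107,121212,33333"));
    }
}
```

Keeping the full original identifier after the prefix keeps rowkeys unique within a letter and month, while the prefix alone decides which pre-allocated range the row falls into.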

The import method is implemented by writing a MapReduce program. First, a Job instance is created in the main function, and its input path, output path, mapper class, reducer class, and the key and value types of the map output are set. The HBase configuration is then set, for example to specify the ZooKeeper nodes of the cluster. Next, the HBase table is created according to this configuration. Finally, the output is set to HFileOutputFormat so that HFile files are generated.

The beneficial effects of the present invention are:

The present invention pre-allocates the Regions before the data is transferred to HBase, which balances the load across the cluster, and then uses MapReduce programming to generate the HFile files, which provides parallel computing capability. The method can load the generated HFile files directly into a running HBase cluster. This reduces the network traffic generated by data transfer and HBase loading during data migration. The method also improves data import efficiency and saves CPU and network resources.

Brief description of the drawings

Fig. 1 is a flowchart of HBase data import;

Fig. 2 is a sequence diagram of Region pre-allocation in HBase data import.

Embodiments

The invention is described in detail below with reference to the accompanying drawings and the embodiments.

Embodiment 1:

In this embodiment, the number of Regions to pre-allocate is first calculated from the data to be imported and the cluster environment. The existing cluster consists of 8 servers with 96 GB of memory, running CentOS 6.3, with cluster components such as HDFS, MapReduce, and HBase installed according to their installation steps. The data file to be imported contains 10 billion records in the following format: A75566620131107,121212,33333. The flow of the HBase data import is shown in Fig. 1. First, the parameters of the cluster are set appropriately; then the Region pre-allocation algorithm is designed to ensure that the data load is balanced. Next, according to the number of pre-allocated Regions, the MapReduce program generates HFile files from the data. Finally, the completebulkload command loads the HFile files in the predetermined format into the HBase table, completing the data import.

Embodiment 2:

This embodiment concerns Region pre-allocation, as shown in Fig. 2. First the relevant parameters are configured; then the HBase table is created with the number of Regions given by the designed algorithm, which is implemented in the Getsplit function. Because the data to be imported contains a letter and a date, the algorithm designs the number of pre-allocated Regions from these fields while also taking the cluster environment and the data scale into account. Combinations of a letter and a number delimit the Region ranges: the letters range over A-Z and the numbers over 01-12, representing the 12 months. This gives 24*12=288 Regions, as stated, to load the data evenly; note that the full range A-Z contains 26 letters, which would instead give 26*12=312 letter-month combinations. Because the data for each letter and each month is uniform, the data in each Region is also uniform. This keeps the load across Regions balanced, with no Region that is especially heavily loaded and none that is especially lightly loaded.
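The letter-and-month scheme can be sketched as a split-key generator. The body of the Getsplit function is not given in the text, so the class `SplitKeySketch` below is a hypothetical stand-in written in plain Java. It assumes that each Region range starts at a three-character prefix such as `A01` and that the first range is covered by the Region that starts at the empty key, so n prefixes need n - 1 split keys.

```java
import java.util.ArrayList;
import java.util.List;

// Generates split keys for letter+month Region boundaries ("A01" .. "Z12").
// Hypothetical stand-in for the Getsplit function named in the text.
class SplitKeySketch {
    static List<String> splitKeys(char firstLetter, char lastLetter) {
        List<String> prefixes = new ArrayList<>();
        for (char letter = firstLetter; letter <= lastLetter; letter++) {
            for (int month = 1; month <= 12; month++) {
                // Zero-pad the month so that prefixes sort correctly as strings.
                prefixes.add("" + letter + (month < 10 ? "0" : "") + month);
            }
        }
        // The first prefix belongs to the Region that starts at the empty key,
        // so it is not needed as a split point.
        return new ArrayList<>(prefixes.subList(1, prefixes.size()));
    }

    public static void main(String[] args) {
        List<String> keys = splitKeys('A', 'Z');
        System.out.println(keys.size() + " split keys, first " + keys.get(0)
                + ", last " + keys.get(keys.size() - 1));
    }
}
```

For the full range A-Z the generator produces 312 prefixes; the figure of 288 Regions in the text corresponds to 24 letters. In an HBase client, keys like these would typically be converted to byte arrays and supplied as split keys when the table is created.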

Embodiment 3:

The data import method for loading data into HBase is implemented by writing a MapReduce program. First, a configuration object is created in the main function with new Configuration(), and a Job instance is then created from conf with new Job(). The input path, output path, mapper class, reducer class, and the key and value types of the map output are set on this instance. The HBase configuration is then set with the set() function, for example to specify the ZooKeeper nodes of the cluster. Next, the HBase table is created according to this configuration. Finally, the HFileOutputFormat.configureIncrementalLoad(job, htable) method configures the job to output HFile files, where htable is created with new HTable().

Embodiment 4:

In the specified mapper class, each input row is converted into the specified format. A reasonable rowkey is designed by operating on the row data: the letter and the digits are arranged together so that the rowkey corresponds to the pre-allocated Regions, and the column family name and the column name are specified at the same time. In the map method, a Put object is created, and the converted data is added to the Put object with the Put.add() method. The context.write() method is then called to write the data to an intermediate file. In the Reduce class, the reduce method iterates over these records in the map output format specified by the user program: using Iterator<Put> iter = puts.iterator(), the values are iterated and added to map, where TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR). Each row and kv is then written with context.write(row, kv). Finally, the processed result is written to the specified file; the output path can be set with the setOutputPath function. The completebulkload command then loads the HFile files in the predetermined format into the HBase table, completing the data import.
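The role of the TreeSet in the reduce step can be shown with a simplified model. The sketch below is plain Java and does not use the HBase classes: `Cell` is a hypothetical stand-in for KeyValue, and its comparator orders only by column family and qualifier, whereas the real KeyValue comparator considers more fields. The column names `cf`, `q1`, and `q2` are made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Toy model of the reduce step: gather all cells of one row in a TreeSet
// so that they are emitted in sorted order before being written out.
class SortedReduceSketch {
    // Simplified stand-in for HBase's KeyValue, not the real class.
    static final class Cell {
        final String family;
        final String qualifier;
        final String value;

        Cell(String family, String qualifier, String value) {
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }

        @Override
        public String toString() {
            return family + ":" + qualifier + "=" + value;
        }
    }

    // Orders cells by column family, then by qualifier.
    static final Comparator<Cell> COMPARATOR =
            Comparator.comparing((Cell c) -> c.family).thenComparing(c -> c.qualifier);

    // Each inner list models the cells carried by one Put for the same row.
    // Cells that compare equal collapse to the first one added.
    static List<Cell> reduce(List<List<Cell>> puts) {
        TreeSet<Cell> sorted = new TreeSet<>(COMPARATOR);
        for (List<Cell> put : puts) {
            sorted.addAll(put);
        }
        return new ArrayList<>(sorted); // iteration follows comparator order
    }

    public static void main(String[] args) {
        List<List<Cell>> puts = List.of(
                List.of(new Cell("cf", "q2", "33333")),
                List.of(new Cell("cf", "q1", "121212")));
        System.out.println(reduce(puts));
    }
}
```

Collecting into a sorted set before writing matters because the cells of a row can arrive in any order from the map side, while the output file expects them in key order.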

While the program is running, the monitoring interface and the logs generated by Hadoop or HBase can be used to identify bottlenecks and obstacles in the MapReduce data load on the cluster. Based on the messages shown in the logs, the corresponding configuration parameters and the environment parameters of the cluster can be adjusted, for example by changing the number of map tasks and reduce tasks so that the job runs more efficiently, or by adjusting the JVM stack size, the memory size, and so on. Suitable changes to the configuration parameters can improve CPU utilization and parallel computing capability, improve data import efficiency, and save network resources.

Claims

1. A data import method for loading data into HBase, characterized in that: first, in the Region pre-allocation stage, the environment and configuration parameters of the cluster are set, and the HBase table is created using a purpose-written function that determines the number of Regions; once Region pre-allocation is finished, a MapReduce program is written, making use of the analysis and processing capability and the parallelism of the distributed computing framework, to generate HFile files from the source data; finally, the completebulkload command completes the import, so that the data is imported into the HBase table in the predetermined format.

2. The data import method for loading data into HBase according to claim 1, characterized in that: after the HBase table is created, the table starts out in a single Region; all inserted data initially goes into this Region; when the data reaches a certain limit, the Region is split into two; and the resulting Regions are distributed to other RegionServers to balance the load in the cluster.

3. The data import method for loading data into HBase according to claim 1, characterized in that: in writing the MapReduce program, the MapReduce framework is responsible for dividing the data, treating each storage block (Block) of a file as one split, and then extracts the set of key-value pairs <K1, V1> of the records in the split as the Map input; in the specified mapper class, each input row is converted into the specified format, the Map module converting the row data according to the key-value pair, generating the row key, and specifying the column family name and the column name; in the map method, a Put object is created, the converted data is added to the Put object with the Put.add() function, and the context.write() method is called to write the data to an intermediate file; an intermediate key-value pair <rowkey, put> is then generated from the rowkey and the Put object, and the intermediate result is written to local disk; the Reduce module obtains the location of the intermediate results from the Master, reads the data through a remote interface from the disk of the TaskTracker that executed the Map task, and arranges the rows before writing the data so that the output meets the expected format, yielding the final output, the HFile files.

4. The data import method for loading data into HBase according to claim 1, characterized in that: the Region pre-allocation calculates the number of Regions from the volume of data to be imported into HBase and the scale of the distributed cluster, and then designs and allocates the Regions in advance according to the data volume, which greatly reduces the number of Region splits, or avoids splits altogether, and achieves load balancing during data import.

5. The data import method for loading data into HBase according to claim 3, characterized in that: the MapReduce programming framework carries out the process that produces the HFile-format files; the Map module designs a reasonable rowkey according to the data format and then processes the data to obtain intermediate results; and the Reducer module organizes the results into a reasonable data format and finally outputs the HFile files to the specified output path.

6. The data import method for loading data into HBase according to any one of the preceding claims, characterized in that the import method is implemented by writing a MapReduce program: first, a Job instance is created in the main function, and its input path, output path, mapper class, reducer class, and the key and value types of the map output are set; the HBase configuration is then set; next, the HBase table is created according to this configuration; finally, the output is set to HFileOutputFormat so that HFile files are generated.