CN104714983B - Method and device for generating a distributed index

Info

Publication number: CN104714983B
Application number: CN201310695615.6A
Authority: CN
Inventors: 韩丙卫
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2013-12-17
Filing date: 2013-12-17
Publication date: 2019-02-19
Anticipated expiration: 2033-12-17
Also published as: WO2014180411A1; CN104714983A

Abstract

The invention discloses a method and a device for generating a distributed index. In the method, the number of map operations in Hadoop is determined according to the data volume of the raw data; the data processed by each map operation is distributed to multiple reduce operations, and an index database corresponding to each reduce operation is generated, wherein the number of reduce operations and the correspondence between each reduce operation and one or more map operations are configured in advance; and the index databases corresponding to the reduce operations are merged. The technical solution provided by the invention makes it possible to index massive data efficiently and quickly.

Description

Method and device for generating a distributed index

Technical field

The present invention relates to the field of communications, and in particular to a method and a device for generating a distributed index.

Background art

With the arrival of the cloud era, big data has attracted more and more attention. The term is usually used to describe the large amounts of unstructured and semi-structured data that a company creates; loading such data into a relational database for analysis costs excessive time and money. Big data analysis is often associated with cloud computing, because real-time analysis of large data sets requires a framework such as MapReduce to distribute the work across tens, hundreds, or even thousands of computers. In the internet industry, big data generally refers to the user network behavior data that internet companies generate and accumulate in daily operation. The scale of this data is so large that it can no longer be measured in GB or TB.

Just how big is big data? In a single day, the content generated on the internet could fill 168 million DVDs; 294 billion emails are sent; 2 million community posts are published; 378,000 mobile phones are sold ...

By 2012, data volumes had risen from the TB (1 TB = 1024 GB) level to the PB (1 PB = 1024 TB), EB (1 EB = 1024 PB), and even ZB (1 ZB = 1024 EB) level. Research by International Data Corporation (IDC) shows that the volume of data generated worldwide was 0.49 ZB in 2008, 0.8 ZB in 2009, 1.2 ZB in 2010, and as much as 1.82 ZB in 2011, equivalent to more than 200 GB for every person in the world. As of 2012, all the printed material ever produced by humankind amounted to 200 PB, and all the words ever spoken in human history amounted to about 5 EB. Research by IBM shows that 90% of all the data obtained over the whole of human civilization was generated in the past two years. By 2020, the volume of data generated worldwide is expected to reach 44 times today's level.

In the big data era, how to quickly and effectively find the data a user cares about within big data has become an increasingly important problem. Creating indexes efficiently and quickly is the precondition for user search. In the related art, indexes are usually created in a single thread, which becomes a performance bottleneck when facing massive data: it places high demands on the system, and the system's capacity to scale is limited, so it can no longer meet users' need to retrieve data quickly and effectively from massive data.

Summary of the invention

The present invention provides a method and a device for generating a distributed index, so as to at least solve the problem in the related art that indexes cannot be created efficiently and quickly for massive data.

According to one aspect of the present invention, a method for generating a distributed index is provided.

The method for generating a distributed index according to the present invention includes: determining the number of map operations in Hadoop according to the data volume of the raw data; distributing the data processed by each map operation to multiple reduce operations, and generating an index database corresponding to each reduce operation, wherein the number of reduce operations and the correspondence between each reduce operation and one or more map operations are configured in advance; and merging the index databases corresponding to the reduce operations.

Preferably, generating the index database corresponding to each reduce operation includes: obtaining the type of the file system currently supported; determining, according to the type of the file system, the generation mode of the index database corresponding to each reduce operation; and generating the index database corresponding to each reduce operation according to the generation mode.

Preferably, generating the index database corresponding to each reduce operation according to the generation mode includes: when the type of the file system is the Hadoop Distributed File System (HDFS), generating the index database corresponding to each reduce operation on the local disk, and then uploading the index database generated on the local disk to HDFS; or, when the type of the file system is another distributed file system (DFS), other than HDFS, that supports sharing, generating the index database corresponding to each reduce operation directly in that shared DFS.

Preferably, merging the index databases corresponding to the reduce operations includes: when the type of the file system is HDFS, downloading the index database corresponding to each reduce operation from HDFS to the local disk; merging the index databases corresponding to the reduce operations on the local disk; and uploading the merged index database to HDFS and deleting the index databases corresponding to the reduce operations from the local disk.

Preferably, merging the index databases corresponding to the reduce operations includes: when the type of the file system is another DFS that supports sharing, merging the index databases corresponding to the reduce operations that were generated in that shared DFS; and deleting the index databases corresponding to the reduce operations that were generated in that shared DFS.

According to another aspect of the present invention, a device for generating a distributed index is provided.

The device for generating a distributed index according to the present invention includes: a determining module, configured to determine the number of map operations in Hadoop according to the data volume of the raw data; a generation module, configured to distribute the data processed by each map operation to multiple reduce operations and to generate an index database corresponding to each reduce operation, wherein the number of reduce operations and the correspondence between each reduce operation and one or more map operations are configured in advance; and a merging module, configured to merge the index databases corresponding to the reduce operations.

Preferably, the generation module includes: an acquiring unit, configured to obtain the type of the file system currently supported; a determination unit, configured to determine, according to the type of the file system, the generation mode of the index database corresponding to each reduce operation; and a generation unit, configured to generate the index database corresponding to each reduce operation according to the generation mode.

Preferably, the generation unit is configured to, when the type of the file system is the Hadoop Distributed File System (HDFS), generate the index database corresponding to each reduce operation on the local disk and then upload the index database generated on the local disk to HDFS; or, the generation unit is configured to, when the type of the file system is another distributed file system (DFS), other than HDFS, that supports sharing, generate the index database corresponding to each reduce operation directly in that shared DFS.

Preferably, the merging module includes: a download unit, configured to, when the type of the file system is HDFS, download the index database corresponding to each reduce operation from HDFS to the local disk; a first merging unit, configured to merge the index databases corresponding to the reduce operations on the local disk; and a first processing unit, configured to upload the merged index database to HDFS and to delete the index databases corresponding to the reduce operations from the local disk.

Preferably, the merging module includes: a second merging unit, configured to, when the type of the file system is another DFS that supports sharing, merge the index databases corresponding to the reduce operations that were generated in that shared DFS; and a second processing unit, configured to delete the index databases corresponding to the reduce operations that were generated in that shared DFS.

In the embodiments of the present invention, the number of map operations in Hadoop is determined according to the data volume of the raw data; the data processed by each map operation is distributed to multiple reduce operations, and an index database corresponding to each reduce operation is generated, the number of reduce operations and the correspondence between each reduce operation and one or more map operations being configured in advance; and the index databases corresponding to the reduce operations are merged. In other words, the raw data is processed with the map operations and reduce operations of Hadoop, an index database is generated for each reduce operation, and these index databases are then merged. This solves the problem in the related art that indexes cannot be created efficiently and quickly for massive data, and makes it possible to index massive data efficiently and quickly.

Brief description of the drawings

The drawings described here are provided for a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of a method for generating a distributed index according to an embodiment of the present invention;

Fig. 2 is a flowchart of a method for generating a distributed index according to a preferred embodiment of the present invention;

Fig. 3 is a structural block diagram of a device for generating a distributed index according to an embodiment of the present invention;

Fig. 4 is a structural block diagram of a device for generating a distributed index according to a preferred embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another.

Fig. 1 is a flowchart of a method for generating a distributed index according to an embodiment of the present invention. As shown in Fig. 1, the method may include the following processing steps:

Step S102: determine the number of map operations in Hadoop according to the data volume of the raw data;

Step S104: distribute the data processed by each map operation to multiple reduce operations, and generate an index database corresponding to each reduce operation, wherein the number of reduce operations and the correspondence between each reduce operation and one or more map operations are configured in advance;

Step S106: merge the index databases corresponding to the reduce operations.
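Steps S102 to S106 can be illustrated with a small single-process sketch. It is not the Hadoop implementation described in this document: the function names, the inverted-index structure, and the stand-in hash are all assumptions made for illustration, and the map and reduce operations run sequentially here rather than on a cluster.

```python
import math
from collections import defaultdict


def count_map_jobs(total_bytes, bytes_per_map):
    """S102: the number of map operations grows with the raw data volume."""
    return max(1, math.ceil(total_bytes / bytes_per_map))


def run_map(chunk):
    """One map operation: turn (doc_id, text) records into (term, doc_id) pairs."""
    return [(term, doc_id) for doc_id, text in chunk for term in text.lower().split()]


def build_indexes(pairs, reduce_num):
    """S104: route every pair to one of reduce_num reduce operations.

    Each reduce operation builds its own index (term -> set of doc ids).
    The sum of code points is only a stand-in for a stable hash (assumption).
    """
    indexes = [defaultdict(set) for _ in range(reduce_num)]
    for term, doc_id in pairs:
        slot = sum(map(ord, term)) % reduce_num
        indexes[slot][term].add(doc_id)
    return indexes


def merge_indexes(indexes):
    """S106: merge the per-reduce indexes into one complete index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, doc_ids in index.items():
            merged[term] |= doc_ids
    return {term: sorted(ids) for term, ids in merged.items()}


def generate_index(docs, bytes_per_map, reduce_num):
    """Run S102, S104 and S106 in order on a list of (doc_id, text) records."""
    total = sum(len(text.encode("utf-8")) for _, text in docs)
    map_jobs = count_map_jobs(total, bytes_per_map)
    # For simplicity the records are split evenly by count, not by byte offset.
    size = max(1, math.ceil(len(docs) / map_jobs))
    chunks = [docs[i:i + size] for i in range(0, len(docs), size)]
    pairs = [pair for chunk in chunks for pair in run_map(chunk)]
    return merge_indexes(build_indexes(pairs, reduce_num))
```

Calling `generate_index([(1, "big data index"), (2, "distributed index")], 16, 3)` yields one merged index in which the term `index` points at both documents, regardless of which reduce slot handled it.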

In the related art, indexes cannot be created efficiently and quickly for massive data. With the method shown in Fig. 1, the raw data is processed with the map operations and reduce operations of Hadoop, an index database is generated for each reduce operation, and these index databases are then merged. This solves the problem in the related art that indexes cannot be created efficiently and quickly for massive data, and makes it possible to index massive data efficiently and quickly.

Preferably, in step S104, generating the index database corresponding to each reduce operation may include the following operations:

Step S1: obtain the type of the file system currently supported;

Step S2: determine, according to the type of the file system, the generation mode of the index database corresponding to each reduce operation;

Step S3: generate the index database corresponding to each reduce operation according to the generation mode.

In a preferred embodiment, the size of the raw data to be acquired is first determined, and the data is divided into M parts (M is a positive integer), each part corresponding to one map operation. The data volume handled by each map operation can of course be configured dynamically. The map data processing plug-in is set up accordingly. In addition, the set of intermediate key-value pairs generated by each map operation is periodically written to the local disk, which is divided into N partitions (N is a positive integer set by the user), each partition corresponding to one reduce operation. Configuring the maximum number of reduce operations improves the efficiency of creating the distributed index, and the reduce data processing plug-in is set up according to the number of reduce operations configured by the user. In this preferred embodiment, index creation can support the Hadoop Distributed File System (HDFS) as well as other distributed file systems (DFS) that support sharing. Therefore, during index creation, the generation mode of the index database corresponding to each reduce operation can be determined according to the type of file system supported, and the index database corresponding to each reduce operation is then generated according to that mode.
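The division into M parts can be sketched as follows, assuming that the configurable per-map data volume is expressed in bytes and that the parts are contiguous byte ranges. Both are assumptions for illustration; the function name `plan_splits` is hypothetical.

```python
import math


def plan_splits(total_bytes, bytes_per_map):
    """Return one (offset, length) range per map operation.

    M = ceil(total_bytes / bytes_per_map); the last range may be shorter.
    """
    if bytes_per_map <= 0:
        raise ValueError("bytes_per_map must be positive")
    m = math.ceil(total_bytes / bytes_per_map)
    return [
        (i * bytes_per_map, min(bytes_per_map, total_bytes - i * bytes_per_map))
        for i in range(m)
    ]
```

For 250 bytes of raw data and 100 bytes per map operation this gives M = 3 ranges, the last one covering the remaining 50 bytes. Lowering `bytes_per_map` raises M, which is the sense in which the per-map data volume is configurable.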

Preferably, in step S3, generating the index database corresponding to each reduce operation according to the generation mode may include one of the following steps:

Step S31: when the type of the file system is the Hadoop Distributed File System (HDFS), generate the index database corresponding to each reduce operation on the local disk, and then upload the index database generated on the local disk to HDFS;

Step S32: when the type of the file system is another distributed file system (DFS), other than HDFS, that supports sharing, generate the index database corresponding to each reduce operation directly in that shared DFS.

In a preferred embodiment, if the type of the file system currently supported is HDFS, each reduce operation generates a temporary index database in the local file system (that is, on the local disk); then, in the final cleanup phase of the reduce operation, the temporary index database generated in the local file system is uploaded to HDFS. If the type of the file system currently supported is another DFS that supports sharing, the temporary index database can be generated directly in that DFS.
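The two generation modes can be sketched as below. This is only a simulation: an ordinary directory stands in for HDFS or the shared DFS, `shutil.copy` stands in for the upload, and the function name, the `fs_type` strings, and the tab-separated file layout are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def write_temp_index(reduce_id, postings, fs_type, shared_root):
    """Write one reduce operation's temporary index and return its final path.

    postings maps term -> list of doc ids. shared_root is a directory that
    stands in for HDFS or for the shared DFS (assumption for illustration).
    """
    name = f"index-{reduce_id:05d}.txt"
    body = "".join(
        f"{term}\t{','.join(str(i) for i in ids)}\n"
        for term, ids in sorted(postings.items())
    )
    target = Path(shared_root) / name
    if fs_type == "hdfs":
        # Mode S31: build on the local disk first, then upload during cleanup.
        with tempfile.TemporaryDirectory() as local_dir:
            local_path = Path(local_dir) / name
            local_path.write_text(body, encoding="utf-8")
            shutil.copy(local_path, target)  # stands in for the HDFS upload
    else:
        # Mode S32: a shared DFS can be written to directly.
        target.write_text(body, encoding="utf-8")
    return target
```

Both branches end with the same file in the shared location; the difference is only whether a local staging copy exists along the way.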

Preferably, in step S106, merging the index databases corresponding to the reduce operations may include the following operations:

Step S4: when the type of the file system is HDFS, download the index database corresponding to each reduce operation from HDFS to the local disk;

Step S5: merge the index databases corresponding to the reduce operations on the local disk;

Step S6: upload the merged index database to HDFS, and delete the index databases corresponding to the reduce operations from the local disk.

In a preferred embodiment, if the type of the file system currently supported is HDFS, first, the index master node of Hadoop downloads all temporary index databases from HDFS to its local file system; second, the index master node merges all the temporary index databases in the local file system into a complete index database; third, the index master node uploads the complete index database to HDFS; then, the index master node deletes each temporary index database from its local file system; finally, the index slave nodes of Hadoop download the complete index database from HDFS to their local file systems for use in retrieval.
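The master-node sequence for HDFS (download, merge locally, upload, delete the local copies) can be sketched as follows. As before, a plain directory stands in for HDFS, file copies stand in for download and upload, and the file naming and tab-separated layout are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def merge_via_local_disk(hdfs_dir, merged_name="index-merged.txt"):
    """Merge temporary index files through a local working directory."""
    hdfs_dir = Path(hdfs_dir)
    with tempfile.TemporaryDirectory() as local_name:
        local = Path(local_name)
        # 1. Download every temporary index (index-<digits>.txt) to local disk.
        for remote in sorted(hdfs_dir.glob("index-[0-9]*.txt")):
            shutil.copy(remote, local / remote.name)
        # 2. Merge the local copies into one term -> doc ids mapping.
        merged = {}
        for part in sorted(local.glob("index-[0-9]*.txt")):
            for line in part.read_text(encoding="utf-8").splitlines():
                term, ids = line.split("\t")
                merged.setdefault(term, set()).update(int(i) for i in ids.split(","))
        out = local / merged_name
        out.write_text(
            "".join(
                f"{term}\t{','.join(str(i) for i in sorted(ids))}\n"
                for term, ids in sorted(merged.items())
            ),
            encoding="utf-8",
        )
        # 3. Upload the complete index.
        shutil.copy(out, hdfs_dir / merged_name)
        # 4. Leaving the with-block deletes the local temporary copies.
    return hdfs_dir / merged_name
```

The slave nodes' final download of the complete index for retrieval is not modeled here.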

Preferably, in step S106, merging the index databases corresponding to the reduce operations may include the following steps:

Step S7: when the type of the file system is another DFS that supports sharing, merge the index databases corresponding to the reduce operations that were generated in that shared DFS;

Step S8: delete the index databases corresponding to the reduce operations that were generated in that shared DFS.

In a preferred embodiment, if the type of the file system currently supported is another DFS that supports sharing, the index master node of Hadoop first merges the temporary index databases in that DFS into a complete index database for use in retrieval; the index master node then deletes each temporary index database from the DFS.
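For a shared DFS no download or upload is needed, so the merge can run in place. A minimal sketch, assuming the shared DFS is mounted as an ordinary directory and that temporary indexes are tab-separated files named `index-<digits>.txt` (both hypothetical):

```python
from pathlib import Path


def merge_in_shared_dfs(dfs_dir, merged_name="index-merged.txt"):
    """Merge temporary index files in place, then delete them."""
    dfs_dir = Path(dfs_dir)
    parts = sorted(dfs_dir.glob("index-[0-9]*.txt"))
    merged = {}
    for part in parts:
        for line in part.read_text(encoding="utf-8").splitlines():
            term, ids = line.split("\t")
            merged.setdefault(term, set()).update(int(i) for i in ids.split(","))
    out = dfs_dir / merged_name
    out.write_text(
        "".join(
            f"{term}\t{','.join(str(i) for i in sorted(ids))}\n"
            for term, ids in sorted(merged.items())
        ),
        encoding="utf-8",
    )
    # Delete the temporary indexes only after the complete index is written.
    for part in parts:
        part.unlink()
    return out
```

Deleting the parts last means a failure during the merge leaves the temporary indexes intact; that ordering is a design choice of this sketch and follows the merge-then-delete order given in the text.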

The preferred implementation process above is described further below with reference to the preferred embodiment shown in Fig. 2.

Fig. 2 is a flowchart of a method for generating a distributed index according to a preferred embodiment of the present invention. As shown in Fig. 2, the process may include the following stages:

First stage: data acquisition, i.e., the map phase of Hadoop. Data acquisition is the preparatory stage that precedes index creation and supplies the data for it. Because the map phase of Hadoop is implemented in a distributed manner, data can be processed in parallel; the number of map operations is determined dynamically from the volume of data to be acquired. The map operations of Hadoop acquire and process text files or database files and generate the content of each field required to create the index (that is, a set of (key, value) pairs), which greatly improves data processing performance. Because acquisition supports plug-in processing, different processing modes can be customized according to the data volume.
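The plug-in idea can be pictured as a registry of parsers, each turning one input record into (key, value) pairs for the fields to be indexed. The registry, the parser names, the record layouts, and the field names `id`, `body`, and `title` below are all hypothetical; the document does not specify the plug-in interface.

```python
import csv
import io


def parse_text_line(line):
    """Hypothetical text plug-in: '<id><TAB><body>' -> field pairs."""
    doc_id, _, body = line.partition("\t")
    return [("id", doc_id), ("body", body.strip())]


def parse_csv_row(line):
    """Hypothetical database-export plug-in: CSV row '<id>,<title>'."""
    row = next(csv.reader(io.StringIO(line)))
    return [("id", row[0]), ("title", row[1])]


# Registry of acquisition plug-ins, keyed by source type (assumption).
PLUGINS = {"text": parse_text_line, "csv": parse_csv_row}


def map_records(lines, source_type):
    """One map operation: emit the (key, value) pairs for every input record."""
    parse = PLUGINS[source_type]
    for line in lines:
        yield from parse(line)
```

For example, `list(map_records(["7\thello world\n"], "text"))` produces the pairs `("id", "7")` and `("body", "hello world")`.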

Second stage: index creation, i.e., the reduce phase of Hadoop, which creates the distributed index databases. The number of reduce operations is configured to set reduceNum, the maximum number of reduce operations processed in parallel. The data generated in the data acquisition stage is assigned to the individual reduce operations by HashCode() % reduceNum for indexing, and each reduce operation generates its own temporary index database files.
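The HashCode() % reduceNum assignment can be sketched in Python by reproducing Java's `String.hashCode()` arithmetic. Two points are assumptions rather than statements from this document: the key is taken to be a string, and the sign bit is masked off before the modulo so that the result is never negative, which mirrors what Hadoop's default `HashPartitioner` does.

```python
def java_string_hash(text):
    """Java's String.hashCode(): h = 31*h + c with 32-bit signed overflow.

    Matches Java for characters in the Basic Multilingual Plane.
    """
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the 32-bit value as a signed integer.
    return h - 0x100000000 if h & 0x80000000 else h


def pick_reduce_job(key, reduce_num):
    """Return the index of the reduce operation that receives this key."""
    if reduce_num <= 0:
        raise ValueError("reduce_num must be positive")
    # Mask the sign bit so a negative hash cannot give a negative slot.
    return (java_string_hash(key) & 0x7FFFFFFF) % reduce_num
```

Because the slot depends only on the key, every occurrence of the same key lands in the same temporary index database, which is what allows each reduce operation to build its index independently.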

It should be noted that index creation can support the Hadoop Distributed File System (HDFS) as well as other distributed file systems (DFS) that support sharing.

Third stage: index merging. After each reduce operation in the index creation stage has generated its temporary index database, the index master node invokes index merging to merge the temporary index databases into one complete index database. During merging, the temporary index databases are read one by one and incorporated into a single master index database; finally, each temporary index database is deleted, and the master index database provides the retrieval service.

Fig. 3 is a structural block diagram of a device for generating a distributed index according to an embodiment of the present invention. As shown in Fig. 3, the device may include: a determining module 10, configured to determine the number of map operations in Hadoop according to the data volume of the raw data; a generation module 20, configured to distribute the data processed by each map operation to multiple reduce operations and to generate an index database corresponding to each reduce operation, wherein the number of reduce operations and the correspondence between each reduce operation and one or more map operations are configured in advance; and a merging module 30, configured to merge the index databases corresponding to the reduce operations.

The device shown in Fig. 3 solves the problem in the related art that indexes cannot be created efficiently and quickly for massive data, and makes it possible to index massive data efficiently and quickly.

Preferably, as shown in Fig. 4, the generation module 20 may include: an acquiring unit 200, configured to obtain the type of the file system currently supported; a determination unit 202, configured to determine, according to the type of the file system, the generation mode of the index database corresponding to each reduce operation; and a generation unit 204, configured to generate the index database corresponding to each reduce operation according to the generation mode.

Preferably, as shown in Fig. 4, the generation unit 204 is configured to, when the type of the file system is the Hadoop Distributed File System (HDFS), generate the index database corresponding to each reduce operation on the local disk and then upload the index database generated on the local disk to HDFS; or, the generation unit 204 is configured to, when the type of the file system is another distributed file system (DFS), other than HDFS, that supports sharing, generate the index database corresponding to each reduce operation directly in that shared DFS.

Preferably, as shown in Fig. 4, the merging module 30 may include: a download unit 300, configured to, when the type of the file system is HDFS, download the index database corresponding to each reduce operation from HDFS to the local disk; a first merging unit 302, configured to merge the index databases corresponding to the reduce operations on the local disk; and a first processing unit 304, configured to upload the merged index database to HDFS and to delete the index databases corresponding to the reduce operations from the local disk.

Preferably, as shown in Fig. 4, the merging module 30 may include: a second merging unit 306, configured to, when the type of the file system is another DFS that supports sharing, merge the index databases corresponding to the reduce operations that were generated in that shared DFS; and a second processing unit 308, configured to delete the index databases corresponding to the reduce operations that were generated in that shared DFS.

As can be seen from the above description, the embodiments achieve the following technical effects (it should be noted that these are effects that certain preferred embodiments can achieve): with the technical solution provided by the embodiments of the present invention, the raw data can be processed with the map-reduce programming model in Hadoop, an index database is generated for each reduce operation, and these index databases are then merged into one complete index database for use in retrieval. This solves the problem in the related art that indexes cannot be created efficiently and quickly for massive data, and makes it possible to index massive data efficiently and quickly.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; in some cases, the steps shown or described can be performed in an order different from the one given here. Alternatively, they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for generating a distributed index, characterized by comprising:

determining the number of map operations in Hadoop according to the data volume of the raw data;

distributing the data processed by each map operation to multiple reduce operations, and generating an index database corresponding to each reduce operation, wherein the number of the reduce operations and the correspondence between each reduce operation and one or more map operations are configured in advance;

merging the index databases corresponding to the reduce operations;

wherein generating the index database corresponding to each reduce operation comprises: obtaining the type of the file system currently supported; determining, according to the type of the file system, the generation mode of the index database corresponding to each reduce operation; and generating the index database corresponding to each reduce operation according to the generation mode.

2. The method according to claim 1, characterized in that generating the index database corresponding to each reduce operation according to the generation mode comprises:

when the type of the file system is the Hadoop Distributed File System (HDFS), generating the index database corresponding to each reduce operation on the local disk, and then uploading the index database generated on the local disk to the HDFS; or,

when the type of the file system is another distributed file system (DFS), other than the HDFS, that supports sharing, generating the index database corresponding to each reduce operation directly in that shared DFS.

3. The method according to claim 2, characterized in that merging the index databases corresponding to the reduce operations comprises:

when the type of the file system is the HDFS, downloading the index database corresponding to each reduce operation from the HDFS to the local disk;

merging the index databases corresponding to the reduce operations on the local disk;

uploading the merged index database to the HDFS, and deleting the index databases corresponding to the reduce operations from the local disk.

4. The method according to claim 2, characterized in that merging the index databases corresponding to the reduce operations comprises:

when the type of the file system is the other DFS that supports sharing, merging the index databases corresponding to the reduce operations that were generated in that shared DFS;

deleting the index databases corresponding to the reduce operations that were generated in that shared DFS.

5. A device for generating a distributed index, characterized by comprising:

a determining module, configured to determine the number of map operations in Hadoop according to the data volume of the raw data;

a generation module, configured to distribute the data processed by each map operation to multiple reduce operations and to generate an index database corresponding to each reduce operation, wherein the number of the reduce operations and the correspondence between each reduce operation and one or more map operations are configured in advance;

a merging module, configured to merge the index databases corresponding to the reduce operations;

wherein the generation module comprises: an acquiring unit, configured to obtain the type of the file system currently supported; a determination unit, configured to determine, according to the type of the file system, the generation mode of the index database corresponding to each reduce operation; and a generation unit, configured to generate the index database corresponding to each reduce operation according to the generation mode.

6. The device according to claim 5, characterized in that the generation unit is configured to, when the type of the file system is the Hadoop Distributed File System (HDFS), generate the index database corresponding to each reduce operation on the local disk and then upload the index database generated on the local disk to the HDFS; or, the generation unit is configured to, when the type of the file system is another distributed file system (DFS), other than the HDFS, that supports sharing, generate the index database corresponding to each reduce operation directly in that shared DFS.

7. The device according to claim 6, characterized in that the merging module comprises:

a download unit, configured to, when the type of the file system is the HDFS, download the index database corresponding to each reduce operation from the HDFS to the local disk;

a first merging unit, configured to merge the index databases corresponding to the reduce operations on the local disk;

a first processing unit, configured to upload the merged index database to the HDFS and to delete the index databases corresponding to the reduce operations from the local disk.

8. The device according to claim 6, characterized in that the merging module comprises:

a second merging unit, configured to, when the type of the file system is the other DFS that supports sharing, merge the index databases corresponding to the reduce operations that were generated in that shared DFS;

a second processing unit, configured to delete the index databases corresponding to the reduce operations that were generated in that shared DFS.