
CN102779134B - Lucene-based distributed search method

Info

Publication number: CN102779134B
Application number: CN201110122631.7A
Authority: CN
Inventors: 吴志祥; 张海龙; 马和平; 王专; 吴剑; 郭凤林; 王晓钟; 庞绍进
Original assignee: Tongcheng Network Technology Co Ltd
Current assignee: Tongcheng Network Technology Co Ltd
Priority date: 2011-05-12
Filing date: 2011-05-12
Publication date: 2015-05-13
Anticipated expiration: 2031-05-12
Also published as: CN102779134A

Abstract

The invention relates to a Lucene-based distributed search method comprising an indexing step and a search step. In the indexing step, at least one index host that builds the index is combined with at least two slave servers by means of a distributed file system. In the search step, a search engine is formed from at least one search host and at least two slave servers. This effectively solves the problems of poor single-machine search performance and of errors in individual indexes. At the same time, the system can be expanded effectively through the cooperation of multiple servers. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, so search is not affected.

Description

Lucene-based distributed search method

Technical field

The present invention relates to a search method, and in particular to a Lucene-based distributed search method.

Background art

Conventionally, the search, indexing, and index maintenance programs are all placed on a single server. This is convenient to configure, but it brings two problems: when search concurrency is high, the system cannot be scaled out; and when the volume of index data grows, index maintenance consumes a great deal of server performance, which affects search.

Overview of Lucene and its structure

i. What is Lucene

Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application but a framework for a full-text search engine, providing a complete query engine and indexing engine and a partial text analysis engine (for English and German). Lucene is currently a top-level open-source project in the Apache Jakarta family. Its author, Doug Cutting, is a veteran full-text indexing and retrieval expert.

ii. Basic structure of the Lucene system

The services Lucene provides actually comprise two parts: one in, one out. "In" means writing: the source you provide (essentially a string) is written to the index or deleted from it. "Out" means reading: a full-text search service is provided to users so that they can locate sources by keyword. The two flows, which also show the relationship between a search application and Lucene, are as follows:

Write flow: the source string is first processed by the analyzer, which includes tokenization into individual words. The required information from the source is then added to the Fields of a Document. Fields that need indexing are indexed, Fields that need storing are stored, and the index is written to storage, which can be memory or disk.

Read flow: the user provides search keywords, which are processed by the analyzer. The processed keywords are used to search the index and find the matching Documents. The user then extracts the needed Fields from the Documents found.
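The write and read flows can be illustrated with a minimal in-memory inverted index. This is only a sketch under stated assumptions: it is written in Python for brevity, it does not use Lucene's actual API, and the names `analyze`, `TinyIndex`, `add_document`, and `search` are hypothetical.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """In-memory stand-in for an index; not Lucene's actual API."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of documents containing it
        self.stored = {}                   # doc id -> stored fields

    def add_document(self, doc_id, fields, indexed=("title", "body")):
        # Write flow: store the fields, then analyze and index the chosen ones.
        self.stored[doc_id] = dict(fields)
        for name in indexed:
            for term in analyze(fields.get(name, "")):
                self.postings[term].add(doc_id)

    def search(self, query):
        # Read flow: analyze the query, then intersect the posting sets.
        terms = analyze(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


index = TinyIndex()
index.add_document(1, {"title": "Beijing hotel", "body": "Near the station"})
index.add_document(2, {"title": "Shanghai hotel", "body": "Near the river"})
print(index.search("beijing hotel"))      # ids of documents matching both terms
print(index.stored[1]["title"])           # extract a stored field from a hit
```

The same analyzer is applied on both the write side and the read side, which is what lets a query term match an indexed term.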

a) MapReduce:

The Hadoop map/reduce framework has a master/slave architecture. It consists of one master server (JobTracker) and several slave servers (TaskTrackers). The master server is the point of contact between users and the system. Users submit their custom map/reduce jobs to the master server. The master server places each job in a job queue and processes the queued tasks on a first-come, first-served basis. The master server assigns map or reduce operations to different slave servers. The slave servers execute the operations under the control of the master server, and data is also transferred between different slave servers during the map and reduce phases.
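A single-process sketch may help make the map, group, and reduce phases and the first-come, first-served job queue concrete. It is not Hadoop code: everything runs in one Python process, no tasks are sent to slave servers, and the names `map_words`, `reduce_counts`, and `run_job` are hypothetical.

```python
from collections import defaultdict, deque


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def reduce_counts(word, counts):
    """Reduce phase: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Run one job: map every record, group by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# The master keeps submitted jobs in a FIFO queue (first come, first served).
job_queue = deque()
job_queue.append(["hotel beijing", "hotel shanghai"])
job_queue.append(["index search index"])

results = []
while job_queue:
    results.append(run_job(job_queue.popleft(), map_words, reduce_counts))
print(results)
```

In a real cluster the grouping step is where data moves between slave servers, since all values for one key must reach the same reducer.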

b) Hadoop DFS

The Hadoop Distributed File System (HDFS) is designed to store large data files across the machines of a cluster. Its design derives from the Google File System (GFS). HDFS stores each file as a sequence of data blocks, and all blocks in a file except the last are the same size. For fault tolerance, these blocks are replicated into multiple copies. The block size and the number of replicas of each file can be configured by the administrator. In addition, note that files in HDFS are write-once, and at any point in time strictly only one thread is allowed to perform a write.
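The block layout described above can be sketched as follows. The splitting rule matches the description (equal-size blocks except the last); the round-robin replica placement is purely an assumption for illustration and is not the placement policy HDFS itself uses. The function names and node names are hypothetical.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; only the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    This placement rule is an assumption for illustration; it is not the
    policy HDFS itself uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c"], replication=2)
print([len(b) for b in blocks])
print(placement)
```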

c) Description of the distributed architecture

The search performance of the former standalone server has a limit, and later expansion is almost impossible. The existing Solr and Hibernate Search cannot meet current requirements, so a distributed parallel search framework based on Lucene + Hadoop + MapReduce has been developed to address low search performance under large data volumes.

Summary of the invention

The object of the present invention is to solve the above problems in the prior art by providing a Lucene-based distributed search method.

The object of the present invention is achieved through the following technical solution:

A Lucene-based distributed search method, comprising an indexing step and a search step.

The indexing step is as follows. Step (1): at least one index host that builds the index is combined with at least two slave servers by means of a distributed file system. Step (2): the index host stores no index content; all of it is stored on the slave servers. Each time an index is created or updated, the index host calculates which part of the index each slave server needs to build and then distributes the tasks to each slave server. Step (3): each slave server fetches the corresponding data from its data source and builds the index, and after finishing reports the indexing status back to the index host. Step (4): if index building fails in step (3), the index host re-sends a rebuild command to the corresponding slave server.

The search step is as follows. Step (1): a search engine is formed from at least one search host and at least two slave servers. Each slave server exposes a Remote Method Invocation (RMI) search interface, and the search host connects to the slave servers through that interface. Step (2): when a user issues a query, the search host determines, according to the index distribution strategy, which slave server holds the corresponding index data, and then sends the query command to that slave server. Step (3): the slave servers execute the query concurrently and return the results to the search host, which reduces (merges) the results and returns them to the user.

In the above Lucene-based distributed search method, deduplication is applied in step (3) of the indexing step: each slave server deduplicates its own data using the deduplication algorithm distributed by the index host, and once that is complete the index host performs deduplication across the multiple slave servers.

Further, in the above Lucene-based distributed search method, in step (4) of the indexing step, after all indexes have been built, the index host notifies the search host that the index has been updated, and the search host then reloads the index.

Further, in the above Lucene-based distributed search method, in the indexing step, the index host backs up the indexes on the slave servers to the corresponding backup servers according to a backup policy.

Further, in the above Lucene-based distributed search method, in the search step, the search host periodically checks whether each slave server can be queried normally.

Further, in the above Lucene-based distributed search method, in the search step, when a slave server cannot be reached or is powered off, the search host attempts the search on the backup server corresponding to that slave server.

Still further, in the above Lucene-based distributed search method, in the search step, when neither a slave server nor its corresponding backup server can be searched, the search host excludes that slave server from the search, returns the data found by the other slave servers, and sends an alert to the administrator so that the problem with that slave server can be investigated promptly.

The advantages of the technical solution of the present invention are mainly as follows: it effectively solves the problems of poor single-machine search performance and of errors in individual indexes. At the same time, the system can be expanded effectively through the cooperation of multiple servers. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, which ensures that search is unaffected.

Brief description of the drawings

The objects, advantages, and features of the present invention are illustrated and explained through the following non-limiting description of preferred embodiments. These embodiments are only typical examples of applying the technical solution of the present invention; all technical solutions formed by equivalent substitution or equivalent transformation fall within the scope of protection claimed by the present invention. Among the drawings:

Fig. 1 is a schematic diagram of a configuration that implements the Lucene-based distributed search method.

The reference numerals in the figure have the following meanings:

1 index host        2 slave server

3 backup server     4 search host

Embodiment

As shown in Fig. 1, the Lucene-based distributed search method is distinctive in that it comprises an indexing step and a search step.

Specifically, the indexing step of the present invention is as follows. First, at least one index host 1 (MasterIndex) that builds the index is combined with at least two slave servers 2 by means of a distributed file system (Hadoop Distributed File System, HDFS).

Next, index host 1 stores no index content; all of it is stored on the slave servers 2. Each time an index is created or updated, index host 1 calculates which part of the index each slave server 2 needs to build and then distributes the tasks to each slave server 2.
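The task-distribution calculation can be sketched as below. The patent does not specify how the index host decides which part each slave server builds, so the round-robin rule, the function name `plan_index_tasks`, and the server names are assumptions made only for illustration.

```python
def plan_index_tasks(partitions, slaves):
    """Decide which index partitions each slave server should build.

    Partitions are dealt out round-robin; the patent does not specify the
    calculation, so this rule is an assumption.
    """
    if len(slaves) < 2:
        raise ValueError("the method calls for at least two slave servers")
    plan = {slave: [] for slave in slaves}
    for i, partition in enumerate(sorted(partitions)):
        plan[slaves[i % len(slaves)]].append(partition)
    return plan


plan = plan_index_tasks(
    ["Beijing", "Shanghai", "Hong Kong", "Guangzhou"],
    ["slave-1", "slave-2"],
)
for slave, parts in plan.items():
    print(slave, "builds", parts)
```

The index host only produces the plan; the index data itself stays on the slave servers.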

Then each slave server 2 fetches the corresponding data from its data source and builds the index, and after finishing reports the indexing status back to index host 1. During this process, deduplication is applied: each slave server 2 deduplicates its own data using the deduplication algorithm distributed by index host 1, and once that is complete, index host 1 performs deduplication across the multiple slave servers 2. If index building fails during this process, index host 1 re-sends a rebuild command to the corresponding slave server 2. Further, once all indexes have been built, index host 1 notifies search host 4 that the index has been updated, and search host 4 then reloads the index.
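The two-stage deduplication and the rebuild-on-failure behavior might look roughly like the following. The patent names neither the deduplication algorithm nor a retry limit, so keying records on an `id` field, keeping the first copy seen, and the `max_attempts` parameter are all assumptions; the function names are hypothetical.

```python
def dedup_local(records, key):
    """Stage 1: a slave server removes duplicates within its own records."""
    seen, unique = set(), []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique


def dedup_global(per_slave, key):
    """Stage 2: the index host removes duplicates that span slave servers.

    The first slave (in iteration order) to report a key keeps the record.
    """
    seen, result = set(), {}
    for slave, records in per_slave.items():
        kept = []
        for record in records:
            k = key(record)
            if k not in seen:
                seen.add(k)
                kept.append(record)
        result[slave] = kept
    return result


def build_with_retry(build, slave, max_attempts=3):
    """Re-send the build command until the slave reports success."""
    for attempt in range(1, max_attempts + 1):
        if build(slave):
            return attempt
    raise RuntimeError(f"{slave} failed to build its index")


by_id = lambda record: record["id"]
local = {
    "slave-1": dedup_local([{"id": 1}, {"id": 1}, {"id": 2}], by_id),
    "slave-2": dedup_local([{"id": 2}, {"id": 3}], by_id),
}
print(dedup_global(local, by_id))

outcomes = iter([False, True])           # simulated: first build fails
print(build_with_retry(lambda slave: next(outcomes), "slave-1"))
```

Running the local pass first keeps the cross-server pass on the index host small, since it only has to compare records that are already unique within each slave server.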

Meanwhile, throughout the indexing step, index host 1 backs up the indexes on the slave servers 2 to the corresponding backup servers 3 according to a backup policy.

Further, the search step of the present invention is as follows. First, a search engine is formed from at least one search host 4 (MasterSearch) and at least two slave servers 2. Of course, for the convenience of overall search, the system can also be expanded to multiple search hosts 4 in later use. Each slave server 2 exposes a Remote Method Invocation (RMI) search interface, and search host 4 connects to the slave servers 2 through this interface.

Next, when a user issues a query, search host 4 determines, according to the index distribution strategy, which slave server 2 holds the corresponding index data, and then sends the query command to that slave server 2.

Then the slave servers 2 execute the query concurrently and return the results to search host 4, which reduces (merges) the results and returns them to the user.
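The route, query, and reduce sequence can be sketched as a scatter-gather call. This is a local simulation under stated assumptions: `query_slave` stands in for the remote search interface, the shard map and sample hits are invented, and ranking the merged results by a score is an assumption, since the patent only says the results are reduced.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard map and data; a real slave would query its own Lucene index.
SHARD_MAP = {"Beijing": "slave-1", "Shanghai": "slave-1",
             "Hong Kong": "slave-2", "Guangzhou": "slave-2"}
SLAVE_DATA = {
    "slave-1": [("Beijing", "Hotel A", 0.9), ("Shanghai", "Hotel B", 0.7)],
    "slave-2": [("Guangzhou", "Hotel C", 0.8), ("Hong Kong", "Hotel D", 0.6)],
}


def query_slave(slave, cities):
    """Stand-in for the remote search call exposed by one slave server."""
    return [hit for hit in SLAVE_DATA[slave] if hit[0] in cities]


def search(cities, limit=10):
    # Route: group the requested cities by the slave that holds their index.
    targets = {}
    for city in cities:
        targets.setdefault(SHARD_MAP[city], set()).add(city)
    if not targets:
        return []
    # Scatter: query the chosen slaves concurrently.
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        partials = list(pool.map(lambda item: query_slave(*item), targets.items()))
    # Reduce: merge the partial results and rank them by score (an assumption).
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda hit: hit[2], reverse=True)[:limit]


print(search(["Beijing", "Guangzhou"]))
```

Only the slave servers that hold relevant index data are contacted, so a query for a single city touches a single server.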

In a preferred embodiment of the present invention, the index distribution strategy allows the index to be partitioned horizontally. For example, the indexes for Beijing and Shanghai are placed on one slave server 2, and the indexes for Hong Kong and Guangzhou are placed on another slave server 2. Still further, in the search step, search host 4 periodically checks whether each slave server 2 can be queried normally, that is, it performs heartbeat detection.

Moreover, so that the continuity of search is not broken, in the search step, when a slave server 2 cannot be reached or is powered off, search host 4 attempts the search on the backup server 3 corresponding to that slave server 2. When neither the slave server 2 nor its corresponding backup server 3 can be searched, search host 4 excludes that slave server 2 from the search and returns the data found by the other slave servers 2, so that one unreachable slave server 2 does not affect the overall search function. An alert is sent to the administrator so that the problem with that slave server 2 can be investigated promptly.
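The fallback order described here (slave server, then its backup server, then exclusion plus an alert) can be sketched as follows. The callables, the use of `ConnectionError` to signal an unreachable server, and the `alert` callback are assumptions made for illustration; the patent does not say how failures are detected or how the administrator is notified.

```python
def search_with_failover(query, shards, alert):
    """Query every shard, falling back to its backup and skipping dead pairs.

    `shards` maps a shard name to a (primary, backup) pair of callables that
    either return a list of hits or raise ConnectionError.
    """
    results, shielded = [], []
    for name, (primary, backup) in shards.items():
        for server in (primary, backup):
            try:
                results.extend(server(query))
                break
            except ConnectionError:
                continue
        else:
            # Neither server answered: shield this shard and notify the admin.
            shielded.append(name)
            alert(f"shard {name}: primary and backup both unreachable")
    return results, shielded


def down(query):
    raise ConnectionError("server unreachable")


alerts = []
hits, shielded = search_with_failover(
    "hotel",
    {
        "slave-1": (lambda q: ["Hotel A"], down),   # primary healthy
        "slave-2": (down, lambda q: ["Hotel C"]),   # served by the backup
        "slave-3": (down, down),                    # both down: shielded
    },
    alerts.append,
)
print(hits, shielded, alerts)
```

The user still receives the hits from the healthy shards, while the shielded shard is reported rather than failing the whole query.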

In actual use of the present invention, when a user queries "hotels in Beijing", search host 4 determines, according to the index distribution strategy, that the Beijing index data resides on one slave server 2 (Slaver1). After slave server 2 (Slaver1) executes the query, it returns the data to search host 4, which processes the data further and then returns it to the user.

As the above description shows, the present invention effectively solves the problems of poor single-machine search performance and of errors in individual indexes. At the same time, the system can be expanded effectively through the cooperation of multiple servers. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, which ensures that search is unaffected.

Claims

1. A Lucene-based distributed search method, characterized in that it comprises an indexing step and a search step;

The indexing step is as follows: step (1), at least one index host that builds the index is combined with at least two slave servers by means of a distributed file system;

Step (2), the index host stores no index content, all of which is stored on the slave servers; each time an index is created or updated, the index host calculates which part of the index each slave server needs to build, and then distributes the tasks to each slave server;

Step (3), each slave server fetches the corresponding data from its data source and builds the index, and after finishing reports the indexing status back to the index host; deduplication is applied, in which each slave server deduplicates its own data using the deduplication algorithm distributed by the index host, and once that is complete the index host performs deduplication across the multiple slave servers;

Step (4), if index building fails in step (3), the index host re-sends a rebuild command to the corresponding slave server; after all indexes have been built, the index host notifies the search host that the index has been updated, and the search host then reloads the index;

In the indexing step, the index host backs up the indexes on the slave servers to the corresponding backup servers according to a backup policy;

The search step is as follows: step (1), a search engine is formed from at least one search host and at least two slave servers; each slave server exposes a Remote Method Invocation (RMI) search interface, and the search host connects to the slave servers through that interface;

Step (2), when a user issues a query, the search host determines, according to the index distribution strategy, which slave server holds the corresponding index data, and then sends the query command to that slave server;

Step (3), the slave servers execute the query concurrently and return the results to the search host, which reduces (merges) the results and returns them to the user;

In the search step, the search host periodically checks whether each slave server can be queried normally; when a slave server cannot be reached or is powered off, the search host attempts the search on the backup server corresponding to that slave server; when neither the slave server nor its corresponding backup server can be searched, the search host excludes that slave server from the search, returns the data found by the other slave servers, and sends an alert to the administrator so that the problem with that slave server can be investigated promptly.