Distributed search method based on Lucene
Technical field
The present invention relates to a search method, and in particular to a distributed search method based on Lucene.
Background art
Placing the search, indexing, and index maintenance programs on a single server is convenient to configure, but it brings problems: under high search concurrency the system cannot be scaled, and as the volume of index data grows, index maintenance consumes a great deal of server performance, which affects search.
Overview of Lucene: description and structure
i. What is Lucene
Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application but a full-text search engine framework that provides a complete query engine and indexing engine, and a partial text analysis engine (for two Western languages, English and German). Lucene is currently an open-source project in the Apache Jakarta family. Its author, Doug Cutting, is an experienced expert in full-text indexing and retrieval.
ii. Basic structure of the Lucene system
The service that Lucene provides actually consists of two parts: input and output. Input means writing: the source you provide (in essence, a string) is written into the index or deleted from it. Output means reading: a full-text search service is offered to users so that they can locate sources by keyword. The figure below shows the input and output paths, and also the relationship between the search application and Lucene:
Write flow: the source string is first processed by the analyzer, which includes tokenizing it into individual words; the required information from the source is added to the Fields of a Document; the Fields that need indexing are indexed and the Fields that need storing are stored; and the index is written to storage, which may be memory or disk.
Read flow: the user supplies search keywords, which are processed by the analyzer. The processed keywords are looked up in the index to find the corresponding Documents. The user then extracts the required Fields from the Documents found.
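The write flow and read flow above can be sketched with a toy in-memory inverted index. This is only an illustration of the idea and does not use Lucene's API; the class ToyIndex, its methods, and the simple ASCII-only analyzer are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy in-memory inverted index (hypothetical, not Lucene's API).
class ToyIndex {
    // Each document is a map from field name to field value.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // term -> ids of the documents whose indexed field contains the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in analyzer: lower-case, then split on anything that is not a-z or 0-9.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Write flow: store the whole document and index the terms of one field.
    int add(Map<String, String> doc, String indexedField) {
        int id = documents.size();
        documents.add(doc);
        String value = doc.get(indexedField);
        if (value != null) {
            for (String term : analyze(value)) {
                postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(id);
            }
        }
        return id;
    }

    // Read flow: analyze the keywords, collect documents matching any term,
    // and extract the requested field from each document found.
    List<String> search(String keywords, String wantedField) {
        Set<Integer> hits = new LinkedHashSet<>();
        for (String term : analyze(keywords)) {
            Set<Integer> ids = postings.get(term);
            if (ids != null) {
                hits.addAll(ids);
            }
        }
        List<String> values = new ArrayList<>();
        for (int id : hits) {
            values.add(documents.get(id).get(wantedField));
        }
        return values;
    }
}
```

Because the same analyzer runs on both the write path and the read path, a query such as "HOTEL" matches a document whose indexed field contains "Hotel".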
a) MapReduce:
The Hadoop map/reduce framework uses a master/slave architecture. It consists of one master server (JobTracker) and several slave servers (TaskTracker). The master server is the point of contact between the user and the system. The user submits user-defined map/reduce jobs to the master server. The master server places each job in a job queue and processes the queued tasks on a first-come, first-served basis. The master server assigns the map and reduce operations to different slave servers. The slave servers perform the operations under the control of the master server, and different slave servers also transfer data to one another during the map and reduce phases.
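The map and reduce phases described above can be illustrated with a single-process word count. This is a minimal sketch and does not use the Hadoop API: the class MapReduceSketch and its methods are hypothetical, and the run method merely stands in for the master server handing splits to slave servers.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of map and reduce (hypothetical, not the Hadoop API).
class MapReduceSketch {
    // Map phase: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the values emitted for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Stand-in for the master: one map task per split, then a single reduce.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String split : splits) {
            all.addAll(map(split));
        }
        return reduce(all);
    }
}
```

In a real cluster each map call would run on a different slave server and the intermediate pairs would be transferred between servers before the reduce phase.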
b) Hadoop DFS
The Hadoop Distributed File System (HDFS) is designed to store large data files across the machines of a cluster. Its design derives from the Google File System (GFS). HDFS stores each file as a sequence of data blocks; all blocks of a file except the last are the same size. For fault tolerance, the blocks are replicated in multiple copies. The block size and the number of replicas of each file can be configured by the administrator. Note also that files in HDFS are write-once, and at any point in time strictly only one thread is allowed to perform a write operation.
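The block layout described above comes down to simple arithmetic, sketched below. The class BlockLayout is hypothetical, and the 64 MB block size and replication factor of 3 used in the usage note are illustrative assumptions only, since both values are configurable.

```java
// Block arithmetic for a file stored as fixed-size blocks (hypothetical helper).
class BlockLayout {
    // Number of blocks: every block is full-size except possibly the last.
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Size of the last block; 0 for an empty file.
    static long lastBlockSize(long fileSize, long blockSize) {
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    // Raw storage consumed across the cluster once every block is replicated.
    static long rawStorage(long fileSize, int replicas) {
        return fileSize * replicas;
    }
}
```

For example, under these assumptions a 200 MB file with 64 MB blocks occupies four blocks, the last of which holds 8 MB, and with three replicas it consumes 600 MB of raw storage.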
c) Description of the distributed architecture
The search performance of the original single-server setup has a limit, and later expansion is almost impossible. Existing solutions such as Solr and Hibernate Search cannot satisfy present requirements. A distributed parallel search framework based on Lucene, Hadoop, and MapReduce has therefore been devised, which offers a solution to poor search performance under large data volumes.
Summary of the invention
The object of the present invention is to solve the above problems in the prior art by providing a distributed search method based on Lucene.
The object of the present invention is achieved through the following technical solution:
A distributed search method based on Lucene, comprising an index step and a search step.
The index step is as follows. Step 1: at least one index master, which builds the index, is combined with at least two dependent servers by means of a distributed file system. Step 2: the index master stores no index content; all index content is stored on the dependent servers. Each time an index is created or updated, the index master calculates which part of the index each dependent server needs to build, and then distributes the tasks to the dependent servers. Step 3: each dependent server fetches the corresponding data from its data source, builds it into an index, and after execution reports the indexing status back to the index master. Step 4: if index building fails in step 3, the index master resends a rebuild command to the corresponding dependent server.
The search step is as follows. Step 1: a search engine is formed from at least one search master and at least two dependent servers; the dependent servers expose an RMI search interface, and the search master connects to the dependent servers through this RMI search interface. Step 2: when a user submits a query, the search master determines from the index distribution strategy which dependent servers hold the corresponding index data, and then sends the query command to those dependent servers. Step 3: the dependent servers run the query concurrently and return their results to the search master; the search master reduces (merges) the results and then returns the result to the user.
In the above distributed search method based on Lucene, step 3 of the index step applies deduplication: each dependent server first deduplicates on its own according to the deduplication algorithm distributed by the index master, and once this is complete the index master deduplicates across the multiple dependent servers.
Further, in the above distributed search method based on Lucene, in step 4 of the index step, after all indexes have been built, the index master notifies the search master that the index has been updated, and the search master then reloads the index.
Further, in the above distributed search method based on Lucene, in the index step, the index master backs up the index on each dependent server to the corresponding backup server according to a backup policy.
Further, in the above distributed search method based on Lucene, in the search step, the search master periodically checks whether each dependent server can be queried normally.
Further, in the above distributed search method based on Lucene, in the search step, when a dependent server cannot be connected or is powered off, the search master attempts the search on the backup server corresponding to that dependent server.
Still further, in the above distributed search method based on Lucene, in the search step, when neither a dependent server nor its corresponding backup server can be searched, the search master shields searches against that dependent server, returns the data found by the other dependent servers, and sends a warning message to the administrator so that the problem with that dependent server can be investigated promptly.
The advantages of the technical solution of the present invention are mainly as follows: it effectively solves the problems of poor single-machine search performance and of indexing errors on an individual machine. Through the cooperation of multiple servers, the system can also be scaled out effectively. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, so search remains unaffected.
Description of drawings
The objects, advantages, and features of the present invention are illustrated and explained through the following non-limiting description of preferred embodiments. These embodiments are merely typical examples of applying the technical solution of the present invention; all technical solutions formed by equivalent replacement or equivalent transformation fall within the scope of protection claimed by the present invention. Among the drawings:
Fig. 1 is a schematic diagram of an implementation configuration of the distributed search method based on Lucene.
The reference numerals in the figure have the following meanings:
1 index master, 2 dependent server
3 backup server, 4 search master
Embodiment
As shown in Fig. 1, the distributed search method based on Lucene is distinctive in that it comprises an index step and a search step.
Specifically, the index step adopted by the present invention is as follows. First, at least one index master 1 (MasterIndex), which builds the index, is combined with at least two dependent servers 2 by means of a distributed file system (Hadoop Distributed File System, HDFS for short).
Next, the index master 1 stores no index content; all index content is stored on the dependent servers 2. Each time an index is created or updated, the index master 1 calculates which part of the index each dependent server 2 needs to build, and then distributes the tasks to the dependent servers 2.
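One way the MasterIndex could work out which part of the index each dependent server builds is to cut the source records into contiguous, near-equal ranges. The text does not specify the calculation, so the following is a sketch under that assumption; the classes IndexTask and IndexPlanner and the range-based split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// One unit of indexing work: the records [fromInclusive, toExclusive) for one server.
class IndexTask {
    final String server;
    final long fromInclusive;
    final long toExclusive;

    IndexTask(String server, long fromInclusive, long toExclusive) {
        this.server = server;
        this.fromInclusive = fromInclusive;
        this.toExclusive = toExclusive;
    }
}

// Hypothetical planner run on the index master.
class IndexPlanner {
    // Split totalRecords into contiguous ranges whose sizes differ by at most one.
    static List<IndexTask> plan(long totalRecords, List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one dependent server is required");
        }
        long base = totalRecords / servers.size();
        long extra = totalRecords % servers.size();
        List<IndexTask> tasks = new ArrayList<>();
        long start = 0;
        for (int i = 0; i < servers.size(); i++) {
            // The first 'extra' servers take one additional record each.
            long size = base + (i < extra ? 1 : 0);
            tasks.add(new IndexTask(servers.get(i), start, start + size));
            start += size;
        }
        return tasks;
    }
}
```

A range split keeps the plan easy to recompute when a rebuild command has to be resent for a single server; a split by business key, such as city, would follow the same shape.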
Then, each dependent server 2 fetches the corresponding data from its data source, builds it into an index, and after execution reports the indexing status back to the index master 1. During this process deduplication is applied: each dependent server 2 first deduplicates on its own according to the deduplication algorithm distributed by the index master 1, and once this is complete the index master 1 deduplicates across the multiple dependent servers 2. If index building fails during this process, the index master 1 resends a rebuild command to the corresponding dependent server 2. After all indexes have been built, the index master 1 notifies the search master 4 that the index has been updated, and the search master 4 then reloads the index.
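The two-stage deduplication can be sketched as follows. The text does not say what identifies a duplicate, so the sketch assumes each record carries a string key; the class TwoStageDedup, its methods, and the rule that the first server in iteration order keeps a shared key are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical two-stage deduplication keyed on a record key.
class TwoStageDedup {
    // Stage 1, run on each dependent server: drop repeated keys, keep first-seen order.
    static List<String> localDedup(List<String> keys) {
        return new ArrayList<>(new LinkedHashSet<>(keys));
    }

    // Stage 2, run on the index master: given each server's locally unique keys,
    // list per server the keys it must delete because an earlier server holds them.
    // "Earlier" follows the map's iteration order, so pass a LinkedHashMap.
    static Map<String, List<String>> crossServerDuplicates(Map<String, List<String>> keysByServer) {
        Set<String> seen = new HashSet<>();
        Map<String, List<String>> toDelete = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : keysByServer.entrySet()) {
            List<String> duplicates = new ArrayList<>();
            for (String key : entry.getValue()) {
                if (!seen.add(key)) {
                    duplicates.add(key);
                }
            }
            toDelete.put(entry.getKey(), duplicates);
        }
        return toDelete;
    }
}
```

Running the local stage first means the master only has to compare keys that are already unique within each server, which keeps the cross-server pass small.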
Meanwhile, throughout the index step, the index master 1 backs up the index on each dependent server 2 to the corresponding backup server 3 according to a backup policy.
Further, the search step adopted by the present invention is as follows. First, a search engine is formed from at least one search master 4 (MasterSearch) and at least two dependent servers 2. Of course, for the convenience of search as a whole, this can be extended to multiple search masters 4 in later use. The dependent servers 2 expose a Remote Method Invocation (RMI) search interface, and the search master 4 connects to the dependent servers 2 through this RMI search interface.
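An RMI search interface of the kind mentioned above might look like the following sketch, which uses the standard java.rmi classes. The interface name DependentSearchService, the method signature, the binding name "DependentSearch", and the in-memory substring matcher standing in for a real Lucene search are all assumptions; the text does not define the interface.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical remote interface exposed by each dependent server.
interface DependentSearchService extends Remote {
    List<String> search(String query, int maxHits) throws RemoteException;
}

// Stand-in implementation: case-insensitive substring match over stored titles.
class InMemoryDependentSearch implements DependentSearchService {
    private final List<String> titles;

    InMemoryDependentSearch(List<String> titles) {
        this.titles = new ArrayList<>(titles);
    }

    @Override
    public List<String> search(String query, int maxHits) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            if (hits.size() >= maxHits) {
                break;
            }
            if (title.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(title);
            }
        }
        return hits;
    }
}

// Publishes a service so that a search master can look it up by name.
class RmiPublisher {
    static Registry publish(DependentSearchService service, int registryPort) throws RemoteException {
        // Export on an anonymous port, then bind the stub in a new registry.
        Remote stub = UnicastRemoteObject.exportObject(service, 0);
        Registry registry = LocateRegistry.createRegistry(registryPort);
        registry.rebind("DependentSearch", stub);
        return registry;
    }
}
```

On the MasterSearch side, the stub would be obtained with LocateRegistry.getRegistry(host, port).lookup("DependentSearch") and cast to DependentSearchService, after which search is called like a local method.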
Next, when a user submits a query, the search master 4 determines from the index distribution strategy which dependent servers 2 hold the corresponding index data, and then sends the query command to those dependent servers 2.
Then, the dependent servers 2 run the query concurrently and return their results to the search master 4; the search master 4 reduces (merges) the results and then returns the result to the user.
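The concurrent query and the reduction of results on the MasterSearch can be sketched with a thread pool. The merge rule used here, sorting by a relevance score and keeping the top k hits, is an assumption, as are the classes Hit and ResultMerger; each Callable stands in for a remote call to one dependent server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// One search hit with a relevance score (hypothetical shape).
class Hit {
    final String docId;
    final double score;

    Hit(String docId, double score) {
        this.docId = docId;
        this.score = score;
    }
}

// Hypothetical reduction step run on the search master.
class ResultMerger {
    // Run all per-server queries concurrently, then keep the topK best-scoring hits.
    // Assumes topK >= 0.
    static List<Hit> searchAll(List<Callable<List<Hit>>> serverQueries, int topK) {
        if (serverQueries.isEmpty()) {
            return new ArrayList<>();
        }
        ExecutorService pool = Executors.newFixedThreadPool(serverQueries.size());
        try {
            List<Hit> merged = new ArrayList<>();
            // invokeAll blocks until every query has completed.
            for (Future<List<Hit>> future : pool.invokeAll(serverQueries)) {
                merged.addAll(future.get());
            }
            // Highest score first; the sort is stable, so ties keep server order.
            merged.sort((a, b) -> Double.compare(b.score, a.score));
            return new ArrayList<>(merged.subList(0, Math.min(topK, merged.size())));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while querying", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a dependent server query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch a single failed query aborts the whole search; handling an unreachable server gracefully is a separate concern.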
In a preferred embodiment of the present invention, the index distribution strategy allows the index to be partitioned horizontally: for example, the indexes for Beijing and Shanghai are placed on one dependent server 2, and the indexes for Hong Kong and Guangzhou are placed on another dependent server 2. Still further, in the search step, the search master 4 periodically checks whether each dependent server 2 can be queried normally, that is, it performs heartbeat detection.
Moreover, so that the continuity of search is not broken, in the search step, when a dependent server 2 cannot be connected or is powered off, the search master 4 attempts the search on the backup server 3 corresponding to that dependent server 2. When neither the dependent server 2 nor its corresponding backup server 3 can be searched, the search master 4 shields searches against that dependent server 2 and returns the data found by the other dependent servers 2, so that a dependent server 2 that cannot be connected does not affect the search function as a whole. A warning message is also sent to the administrator so that the problem with that dependent server 2 can be investigated promptly.
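The failover behaviour described above (try the dependent server, fall back to its backup server, otherwise shield it and raise an alert) can be sketched as follows. The classes FailoverSearch, Shard, and Outcome are hypothetical, each server is modelled as a function that throws a runtime exception when it is unreachable, and the alert is merely collected in a list rather than sent to an administrator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical failover logic run on the search master.
class FailoverSearch {
    // A dependent server and its backup, each modelled as query -> hits.
    static class Shard {
        final String name;
        final Function<String, List<String>> primary;
        final Function<String, List<String>> backup;

        Shard(String name, Function<String, List<String>> primary,
              Function<String, List<String>> backup) {
            this.name = name;
            this.primary = primary;
            this.backup = backup;
        }
    }

    // Hits gathered from reachable servers plus alerts for shielded ones.
    static class Outcome {
        final List<String> hits = new ArrayList<>();
        final List<String> alerts = new ArrayList<>();
    }

    static Outcome search(List<Shard> shards, String query) {
        Outcome outcome = new Outcome();
        for (Shard shard : shards) {
            try {
                outcome.hits.addAll(shard.primary.apply(query));
            } catch (RuntimeException primaryDown) {
                try {
                    // The dependent server is unreachable: try its backup server.
                    outcome.hits.addAll(shard.backup.apply(query));
                } catch (RuntimeException backupDown) {
                    // Both are unreachable: shield this shard and record an alert.
                    outcome.alerts.add(shard.name + ": dependent server and backup server both unavailable");
                }
            }
        }
        return outcome;
    }
}
```

The caller still receives the hits from every reachable server, so one failed pair degrades the result set instead of failing the whole query.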
In actual use of the present invention, when a user queries "hotels in Beijing", the search master 4 determines from the index distribution strategy that the Beijing index data resides on dependent server 2 (Slaver1); dependent server 2 (Slaver1) then runs the query and returns the data to the search master 4, which performs its calculation and returns the result to the user.
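The routing decision in this example can be sketched as a lookup from partition key to server. The class IndexDistributionPolicy and the name "Slaver2" are hypothetical; "Slaver1" and the city grouping follow the example in the text, and how the city is extracted from the user's query is left out of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical index distribution policy: one horizontal partition per city.
class IndexDistributionPolicy {
    private final Map<String, String> serverByCity = new LinkedHashMap<>();

    // Record that the index partitions for the given cities live on one server.
    void assign(String server, String... cities) {
        for (String city : cities) {
            serverByCity.put(city, server);
        }
    }

    // Returns the server holding the city's partition, or null if none is assigned.
    String serverFor(String city) {
        return serverByCity.get(city);
    }
}
```

With assign("Slaver1", "Beijing", "Shanghai") and assign("Slaver2", "Hong Kong", "Guangzhou"), a query about Beijing is sent only to Slaver1, so the other server does no work for it.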
As the above description shows, the present invention effectively solves the problems of poor single-machine search performance and of indexing errors on an individual machine. Through the cooperation of multiple servers, the system can also be scaled out effectively. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, so search remains unaffected.