CN102779134B - Lucene-based distributed search method - Google Patents

Info

Publication number
CN102779134B
CN102779134B CN201110122631.7A CN201110122631A CN102779134B CN 102779134 B CN102779134 B CN 102779134B CN 201110122631 A CN201110122631 A CN 201110122631A CN 102779134 B CN102779134 B CN 102779134B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110122631.7A
Other languages
Chinese (zh)
Other versions
CN102779134A (en)
Inventor
吴志祥
张海龙
马和平
王专
吴剑
郭凤林
王晓钟
庞绍进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongcheng Network Technology Co Ltd
Original Assignee
Tongcheng Network Technology Co Ltd
Application filed by Tongcheng Network Technology Co Ltd
Priority to CN201110122631.7A
Publication of CN102779134A
Application granted
Publication of CN102779134B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a Lucene-based distributed search method comprising an indexing step and a searching step. In the indexing step, at least one index host that builds the index is combined with at least two slave servers through a distributed file system. In the searching step, a search engine is formed from at least one search host and at least two slave servers. This effectively addresses the poor search performance of a single machine and the problem of individual index errors. Because multiple servers cooperate, the system can also be scaled out effectively. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, so search is not affected.

Description

Lucene-based distributed search method
Technical field
The present invention relates to a search method, and in particular to a distributed search method based on Lucene.
Background art
When the search, indexing and index maintenance programs are all placed on a single server, configuration is convenient, but the system cannot be scaled out when search concurrency is high, and as the volume of index data grows, index maintenance consumes a great deal of server performance and interferes with search.
Overview of Lucene and its structure
i. What is Lucene
Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application but a framework for building a full-text search engine: it provides a complete query engine and index engine, and part of a text analysis engine (for two languages, English and German). Lucene is currently a leading open source project in the Apache Jakarta family. Its author, Doug Cutting, is a veteran full-text indexing and retrieval expert.
ii. Basic structure of a Lucene system
The service Lucene provides actually consists of two parts: input and output. Input means writing: the source you provide (in essence, a string) is written into the index, or deleted from it. Output means reading: providing a full-text search service so that users can locate sources by keyword. The figure below illustrates the input and output flows and the relationship between a search application and Lucene:
Write flow: the source string is first processed by an analyzer, which includes tokenization into individual words. The required information from the source is then added to the Fields of a Document; the Fields that need indexing are indexed, the Fields that need storing are stored, and the index is written to storage, which may be memory or disk.
Read flow: the user supplies search keywords, which are processed by an analyzer. The index is searched with the processed keywords to find the matching Documents, and the user extracts the required Fields from the Documents found.
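The write and read flows can be illustrated with a minimal in-memory inverted index. This is only a sketch in Python, not Lucene's API: the names `analyze` and `TinyIndex`, the whitespace-style tokenizer and the AND semantics of the query are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Analyzer stand-in: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class TinyIndex:
    """Hypothetical toy index; real Lucene segments are far more elaborate."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> ids of matching documents
        self.stored = {}                  # doc id -> stored field values

    def add(self, doc_id, fields, indexed=(), stored=()):
        """Write flow: analyze the indexed fields, keep the stored fields verbatim."""
        for name in indexed:
            for term in analyze(fields[name]):
                self.postings[(name, term)].add(doc_id)
        self.stored[doc_id] = {name: fields[name] for name in stored}

    def search(self, field, keywords):
        """Read flow: analyze the query, intersect postings, return stored fields."""
        terms = analyze(keywords)
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get((field, t), set()) for t in terms))
        return [self.stored[i] for i in sorted(ids)]
```

A field can be indexed without being stored (searchable but not retrievable) or stored without being indexed, which mirrors the distinction the write flow draws between the two kinds of Field.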
a) MapReduce:
The Hadoop map/reduce framework has a master/slave architecture. It consists of one master server (JobTracker) and a number of slave servers (TaskTrackers). The master server is the point of contact between users and the system: users submit their self-defined map/reduce jobs to the master server. The master server places each job in a job queue and processes the queued tasks on a first-come, first-served basis. It distributes map and reduce operations to the different slave servers. The slave servers execute these operations under the control of the master server, and data is also transferred between different slave servers during the map and reduce phases.
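The job flow described above can be sketched as a toy, single-process simulation. This is not Hadoop's API: the `JobTracker` class below, its round-robin input split and the word-count job are hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict, deque


class JobTracker:
    """Toy master: queues jobs and runs them first come, first served."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.queue = deque()

    def submit(self, name, records, map_fn, reduce_fn):
        self.queue.append((name, records, map_fn, reduce_fn))

    def run_next(self):
        name, records, map_fn, reduce_fn = self.queue.popleft()
        # Map phase: each simulated worker gets a round-robin slice of the input.
        slices = [records[i::self.num_workers] for i in range(self.num_workers)]
        mapped = [pair for part in slices for rec in part for pair in map_fn(rec)]
        # Shuffle: group intermediate values by key (the data passed between workers).
        groups = defaultdict(list)
        for key, value in mapped:
            groups[key].append(value)
        # Reduce phase: one reduce call per key.
        return name, {key: reduce_fn(key, values) for key, values in groups.items()}


def word_map(line):
    """User-defined map: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


def word_reduce(word, counts):
    """User-defined reduce: sum the counts collected for one word."""
    return sum(counts)
```

Jobs come back in submission order, which is the first-come, first-served behaviour the text attributes to the master server.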
b) Hadoop DFS
The Hadoop Distributed File System (HDFS) is designed to store large data files across the machines of a cluster. Its design derives from the Google File System (GFS). HDFS stores each file as a sequence of data blocks, and all blocks of a file except the last have the same size. For fault tolerance, these blocks are replicated into multiple copies. The block size and the number of replicas of each file can be configured by the administrator. Note also that files in HDFS are write-once, and at any point in time strictly one thread is allowed to perform a write operation.
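The block layout can be made concrete with a short sketch. The functions `split_into_blocks` and `place_replicas` are hypothetical, and the round-robin placement is an assumption made for brevity; it is not how HDFS itself chooses replica locations.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; only the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, rotating round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }
```

With a block size of 3, an 8-byte file yields two full blocks and one 2-byte tail block, matching the rule that every block except the last has the same size.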
c) Description of the distributed architecture
The search performance of the original stand-alone server has a limit, and later expansion is almost impossible. Existing solutions such as Solr and Hibernate Search cannot meet current requirements. A distributed parallel search framework based on Lucene + Hadoop + MapReduce was therefore developed to address low search performance under large data volumes.
Summary of the invention
The object of the present invention is to solve the above problems in the prior art by providing a distributed search method based on Lucene.
The object of the present invention is achieved through the following technical solution:
A Lucene-based distributed search method, comprising an indexing step and a searching step.
The indexing step is as follows. Step 1: at least one index host, which sets up the index, is combined with at least two slave servers by means of a distributed file system. Step 2: the index host itself stores no index content; all index content is stored on the slave servers. Each time an index is created or updated, the index host calculates which part of the index each slave server needs to build and then distributes the tasks to the slave servers. Step 3: each slave server fetches the corresponding data from its data source, builds the index, and reports the indexing result back to the index host when finished. Step 4: if building the index fails in step 3, the index host sends a rebuild command to the corresponding slave server.
The searching step is as follows. Step 1: a search engine is formed from at least one search host and at least two slave servers. The slave servers expose a remote method invocation (RMI) search interface, and the search host connects to the slave servers through this interface. Step 2: when a user issues a query, the search host determines, according to the index distribution strategy, which slave servers hold the corresponding index data, and then sends the query command to those slave servers. Step 3: the slave servers execute the query concurrently and return their results to the search host, which reduces (merges) the results and returns them to the user.
In the above Lucene-based distributed search method, deduplication is applied in step 3 of the indexing step: each slave server first deduplicates on its own using the deduplication algorithm distributed by the index host, and after that the index host deduplicates across the multiple slave servers.
Further, in the above Lucene-based distributed search method, in step 4 of the indexing step, once all indexes have been built the index host notifies the search host that the index has been updated, and the search host then reloads the index.
Further, in the above Lucene-based distributed search method, in the indexing step the index host backs up the indexes on the slave servers to the corresponding backup servers according to a backup policy.
Further, in the above Lucene-based distributed search method, in the searching step the search host periodically checks whether each slave server can be queried normally.
Further, in the above Lucene-based distributed search method, in the searching step, when a slave server cannot be connected to or has lost power, the search host attempts the search on the backup server corresponding to that slave server.
Still further, in the above Lucene-based distributed search method, in the searching step, when neither a slave server nor its corresponding backup server can be searched, the search host shields the search on that slave server, returns the data found by the other slave servers, and sends an alert to the administrator so that the problem with that slave server can be investigated promptly.
The advantages of the technical solution of the present invention are mainly as follows: it effectively addresses the poor search performance of a single machine and the problem of individual index errors. Because multiple servers cooperate, the system can also be scaled out effectively. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, which ensures that search is unaffected.
Description of the drawings
The objects, advantages and features of the present invention are illustrated and explained below through a non-limiting description of a preferred embodiment. This embodiment is only a typical example of applying the technical solution of the present invention; all technical solutions formed by equivalent replacement or equivalent transformation fall within the scope of protection claimed by the present invention. Among the drawings,
Fig. 1 is a schematic diagram of a configuration implementing the Lucene-based distributed search method.
The reference numerals in the figure have the following meanings:
1 index host 2 slave server
3 backup server 4 search host
Embodiment
The Lucene-based distributed search method shown in Fig. 1 is distinctive in that it comprises an indexing step and a searching step.
Specifically, the indexing step of the present invention is as follows. First, at least one index host 1 (MasterIndex), which sets up the index, is combined with at least two slave servers 2 by means of a distributed file system (Hadoop Distributed File System, HDFS for short).
Next, the index host 1 itself stores no index content; all index content is stored on the slave servers 2. Each time an index is created or updated, the index host 1 calculates which part of the index each slave server 2 needs to build and then distributes the tasks to the slave servers 2.
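The text does not say how the index host calculates each slave server's share. As one possible reading, the sketch below assigns named index partitions to slave servers round-robin; the function `plan_index_tasks`, the partition names and the round-robin rule are all assumptions.

```python
def plan_index_tasks(partitions, slaves):
    """Index-host side: decide which slave builds which part of the index."""
    tasks = {slave: [] for slave in slaves}
    # Sort first so the plan is reproducible regardless of input order.
    for i, partition in enumerate(sorted(partitions)):
        tasks[slaves[i % len(slaves)]].append(partition)
    return tasks
```

Any deterministic rule would serve here; what matters for the later search step is that the same mapping is available to the search host when it decides where to send a query.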
Then, each slave server 2 fetches the corresponding data from its data source, builds the index, and reports the indexing result back to the index host 1 when finished. During this process deduplication is applied: each slave server 2 first deduplicates on its own using the deduplication algorithm distributed by the index host 1, and after that the index host 1 deduplicates across the multiple slave servers 2. If building the index fails during this process, the index host 1 sends a rebuild command to the corresponding slave server 2. Further, once all indexes have been built, the index host 1 notifies the search host 4 that the index has been updated, and the search host 4 then reloads the index.
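The two-stage deduplication and the rebuild-on-failure behaviour can be sketched as follows. The helper names, the key-based notion of a duplicate and the `max_attempts` limit are assumptions; the text neither specifies the deduplication algorithm nor bounds the number of rebuild commands.

```python
def dedupe(records, key):
    """Keep the first record seen for each key value."""
    seen, unique = set(), []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique


def two_stage_dedupe(per_slave_records, key):
    """Stage 1 runs on each slave; stage 2 runs on the index host across slaves."""
    local = [dedupe(records, key) for records in per_slave_records]
    return dedupe([record for part in local for record in part], key)


def build_with_retry(build, task, max_attempts=3):
    """Re-send the build command until the slave reports success."""
    for attempt in range(1, max_attempts + 1):
        if build(task):
            return attempt
    raise RuntimeError(f"index build failed after {max_attempts} attempts: {task}")
```

Deduplicating locally first keeps the cross-server pass on the index host small, since each slave has already removed its own repeats before reporting.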
Meanwhile, throughout the indexing step, the index host 1 backs up the indexes on the slave servers 2 to the corresponding backup servers 3 according to a backup policy.
Further, the searching step of the present invention is as follows. First, a search engine is formed from at least one search host 4 (MasterSearch) and at least two slave servers 2. Of course, for the convenience of overall search, the system can later be extended to multiple search hosts 4. The slave servers 2 expose a remote method invocation (Remote Method Invocation, RMI) search interface, and the search host 4 connects to the slave servers 2 through this interface.
Next, when a user issues a query, the search host 4 determines, according to the index distribution strategy, which slave servers 2 hold the corresponding index data, and then sends the query command to those slave servers 2.
Then, the slave servers 2 execute the query concurrently and return their results to the search host 4, which reduces (merges) the results and returns them to the user.
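The concurrent query and the merge on the search host follow a scatter-gather pattern, sketched below. The RMI calls are replaced by plain Python callables, and the hit format `(score, doc_id)` and the `top_k` cut-off are assumptions; the text only states that the search host reduces the results.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(query, slaves, top_k=10):
    """Query every slave concurrently, then merge the partial hit lists by score."""
    with ThreadPoolExecutor(max_workers=max(1, len(slaves))) as pool:
        # Each element of `slaves` stands in for a remote search interface.
        partial = list(pool.map(lambda search: search(query), slaves))
    hits = [hit for part in partial for hit in part]
    # Reduce step: keep the best-scoring hits across all slaves.
    return heapq.nlargest(top_k, hits, key=lambda hit: hit[0])
```

Because each slave only returns its own best hits, the search host merges short lists rather than scanning the index itself.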
In a preferred embodiment of the present invention, the index distribution strategy partitions the index horizontally: for example, the indexes for Beijing and Shanghai are placed on one slave server 2, and the indexes for Hong Kong and Guangzhou are placed on another slave server 2. Still further, in the searching step, the search host 4 periodically checks whether each slave server 2 can be queried normally, that is, it performs heartbeat detection.
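A horizontal partition of this kind amounts to a lookup table from partition key to slave server, and heartbeat detection to a periodic probe. The sketch below assumes the city is the partition key; `SHARD_MAP`, `route`, `heartbeat` and the `ping` callable are hypothetical names, and the scheduling of the periodic check is left out.

```python
SHARD_MAP = {  # hypothetical distribution strategy: city -> slave server
    "beijing": "slave1", "shanghai": "slave1",
    "hongkong": "slave2", "guangzhou": "slave2",
}


def route(cities, shard_map=SHARD_MAP):
    """Return the slaves that hold index data for the requested cities."""
    return sorted({shard_map[c] for c in cities if c in shard_map})


def heartbeat(slaves, ping):
    """Probe every slave once and record whether it answered."""
    status = {}
    for slave in slaves:
        try:
            status[slave] = bool(ping(slave))
        except Exception:
            # Any failure to answer counts as "not queryable".
            status[slave] = False
    return status
```

A query that mentions only Beijing is routed to a single slave, so the other slave does no work for it.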
Moreover, to keep search uninterrupted, in the searching step, when a slave server 2 cannot be connected to or has lost power, the search host 4 attempts the search on the backup server 3 corresponding to that slave server 2. When neither the slave server 2 nor its corresponding backup server 3 can be searched, the search host 4 shields the search on that slave server 2 and returns the data found by the other slave servers 2, so that a slave server 2 that cannot be reached does not affect the overall search function. An alert is sent to the administrator so that the problem with that slave server 2 can be investigated promptly.
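The failover behaviour can be sketched as a loop that tries the slave server, then its backup, and shields the shard if both fail. Servers are modelled as callables that raise `ConnectionError` when unreachable; that convention, the function name and the `alert` callback are assumptions.

```python
def search_with_failover(query, shards, alert):
    """shards maps a shard name to (primary, backup) search callables."""
    results, shielded = [], []
    for name, (primary, backup) in shards.items():
        for server in (primary, backup):
            try:
                results.extend(server(query))
                break  # this shard answered; skip its backup
            except ConnectionError:
                continue
        else:
            # Neither server answered: shield the shard and notify the administrator.
            shielded.append(name)
            alert(f"shard {name}: primary and backup both unreachable")
    return results, shielded
```

The caller still receives the hits from every reachable shard, together with the list of shielded shards, so a partial outage degrades the result set instead of failing the whole query.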
In actual use of the present invention, when a user queries "hotels in Beijing", the search host 4 determines, according to the index distribution strategy, that the Beijing index data resides on one slave server 2 (Slaver1). That slave server 2 (Slaver1) runs the query and returns the data to the search host 4, which processes the data further and returns it to the user.
As the above description shows, the present invention effectively addresses the poor search performance of a single machine and the problem of individual index errors. Because multiple servers cooperate, the system can also be scaled out effectively. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, which ensures that search is unaffected.

Claims (1)

1. A Lucene-based distributed search method, characterized by comprising an indexing step and a searching step;
The indexing step is: step 1, at least one index host, which sets up the index, is combined with at least two slave servers by means of a distributed file system;
Step 2, the index host stores no index content, all of which is stored on the slave servers; each time an index is created or updated, the index host calculates which part of the index each slave server needs to build and then distributes the tasks to the slave servers;
Step 3, each slave server fetches the corresponding data from its data source, builds the index, and reports the indexing result back to the index host when finished; deduplication is applied, in which each slave server first deduplicates on its own using the deduplication algorithm distributed by the index host, after which the index host deduplicates across the multiple slave servers;
Step 4, if building the index fails in step 3, the index host sends a rebuild command to the corresponding slave server; once all indexes have been built, the index host notifies the search host that the index has been updated, and the search host then reloads the index;
In the indexing step, the index host backs up the indexes on the slave servers to the corresponding backup servers according to a backup policy;
The searching step is: step 1, a search engine is formed from at least one search host and at least two slave servers; the slave servers expose a remote method invocation (RMI) search interface, and the search host connects to the slave servers through this interface;
Step 2, when a user issues a query, the search host determines, according to the index distribution strategy, which slave servers hold the corresponding index data, and then sends the query command to those slave servers;
Step 3, the slave servers execute the query concurrently and return their results to the search host, which reduces (merges) the results and returns them to the user;
In the searching step, the search host periodically checks whether each slave server can be queried normally; when a slave server cannot be connected to or has lost power, the search host attempts the search on the backup server corresponding to that slave server; when neither the slave server nor its corresponding backup server can be searched, the search host shields the search on that slave server, returns the data found by the other slave servers, and sends an alert to the administrator so that the problem with that slave server can be investigated promptly.
CN201110122631.7A 2011-05-12 2011-05-12 Lucene-based distributed search method Active CN102779134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110122631.7A CN102779134B (en) 2011-05-12 2011-05-12 Lucene-based distributed search method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110122631.7A CN102779134B (en) 2011-05-12 2011-05-12 Lucene-based distributed search method

Publications (2)

Publication Number Publication Date
CN102779134A CN102779134A (en) 2012-11-14
CN102779134B CN102779134B (en) 2015-05-13

Family

ID=47124051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110122631.7A Active CN102779134B (en) 2011-05-12 2011-05-12 Lucene-based distributed search method

Country Status (1)

Country Link
CN (1) CN102779134B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412933A (en) * 2013-08-20 2013-11-27 南京物联网应用研究院有限公司 Cloud search platform
CN106484694B (en) * 2015-08-25 2019-09-20 杭州华为数字技术有限公司 Full-text search method and system based on distributed data base
CN107066527B (en) * 2017-02-24 2019-10-29 湖南蚁坊软件股份有限公司 A kind of method and system of the caching index based on out-pile memory
CN107679248A (en) * 2017-10-30 2018-02-09 江苏鸿信***集成有限公司 A kind of intelligent data search method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101051309A (en) * 2006-04-06 2007-10-10 中国科学院计算技术研究所 Researching system and method used in digital labrary
CN101667179A (en) * 2008-09-03 2010-03-10 华为技术有限公司 Mobile search method and system, and method for synchronizing search capability of search server

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7865499B2 (en) * 2001-01-16 2011-01-04 Lakeside Software, Inc. System and method for managing information for a plurality of computer systems in a distributed network

Also Published As

Publication number Publication date
CN102779134A (en) 2012-11-14

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: TONGCHENG NETWORK TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: SUZHOU TONGCHENG TRAVEL NETWORK TECHNOLOGY CO., LTD.

Effective date: 20121212

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 215123 SUZHOU, JIANGSU PROVINCE TO: 215021 SUZHOU, JIANGSU PROVINCE

TA01 Transfer of patent application right

Effective date of registration: 20121212

Address after: Xinghu Street Industrial Park of Suzhou city in Jiangsu province 215021 Creative Industry Park 5 Building No. 328

Applicant after: Tongcheng Network Technology Co., Ltd.

Address before: Xinghu Street Industrial Park of Suzhou city in Jiangsu province 215123 Creative Industry Park 5 Building No. 328

Applicant before: Suzhou Tongcheng Travel Network Technology Co.,Ltd.

C14 Grant of patent or utility model
GR01 Patent grant