CN102779134A - Lucene-based distributed search method - Google Patents

Publication number
CN102779134A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101226317A
Other languages
Chinese (zh)
Other versions
CN102779134B (en)
Inventor
吴志祥
张海龙
马和平
王专
吴剑
郭凤林
王晓钟
庞绍进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongcheng Network Technology Co Ltd
Original Assignee
SUZHOU TONGCHENG TRAVEL NETWORK TECHNOLOGY CO LTD
Application filed by SUZHOU TONGCHENG TRAVEL NETWORK TECHNOLOGY CO LTD
Priority to CN201110122631.7A
Publication of CN102779134A
Application granted
Publication of CN102779134B
Status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a Lucene-based distributed search method comprising an indexing step and a searching step. In the indexing step, at least one index host that builds the index is combined with at least two slave servers through a distributed file system. In the searching step, a search engine is formed from at least one search host and at least two slave servers. This effectively solves the problems of poor single-machine search performance and errors in individual indexes. The cooperation of multiple servers also allows the system to scale effectively. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, so search is not affected.

Description

Lucene-based distributed search method
Technical field
The present invention relates to a search method, and in particular to a Lucene-based distributed search method.
Background art
When the search, indexing, and index-maintenance programs all run on a single server, configuration is convenient, but the arrangement cannot scale out under high search concurrency. As the volume of index data grows, index maintenance consumes a great deal of server performance, which degrades search.
Overview of Lucene and its structure
i. What is Lucene
Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application but a framework for full-text search engines, providing a complete query engine and indexing engine and a partial text-analysis engine (for English and German). Lucene is currently an open-source project in the Apache Jakarta family. Its author is Doug Cutting, an experienced full-text indexing and retrieval expert.
ii. Basic structure of the Lucene system
The service Lucene provides actually has two parts: input and output. Input means writing: the source you supply (essentially a string) is written into the index or deleted from it. Output means reading: a full-text search service is provided so that users can locate sources by keyword. The input and output flows, and the relationship between a search application and Lucene, are as follows:
Write flow: the source string is first processed by an analyzer, which includes tokenizing it into individual words. The required information from the source is added to the Fields of a Document; Fields that need indexing are indexed, Fields that need storing are stored, and the index is written to storage, which may be memory or disk.
Read flow: the user supplies search keywords, which are processed by the analyzer. The processed keywords are looked up in the index to find the matching Documents. The user then extracts the required Fields from the Documents found.
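The write and read flows can be illustrated with a toy in-memory inverted index. This is only a sketch in Python under simplifying assumptions: `analyze`, `ToyIndex`, and the field names are hypothetical and are not Lucene's API or its on-disk format.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

class ToyIndex:
    """In-memory stand-in for an index; not Lucene's actual data structures."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of documents containing it
        self.stored = {}                   # doc id -> stored fields

    def add_document(self, doc_id, fields, indexed=("title", "body")):
        # Write flow: analyze each indexed field, then record postings and stored fields.
        for name in indexed:
            for term in analyze(fields.get(name, "")):
                self.postings[term].add(doc_id)
        self.stored[doc_id] = dict(fields)

    def search(self, query):
        # Read flow: analyze the keywords, then intersect the posting sets.
        terms = analyze(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)

index = ToyIndex()
index.add_document(1, {"title": "Beijing hotel", "body": "Near the station"})
index.add_document(2, {"title": "Shanghai hotel", "body": "Near the river"})
hits = index.search("hotel beijing")
titles = [index.stored[d]["title"] for d in hits]   # extract the needed field
```

The same analyzer runs on both sides, which is why the query "hotel beijing" matches a title stored as "Beijing hotel".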
a) MapReduce
Hadoop's map/reduce framework has a master/slave architecture. It consists of one master server (JobTracker) and several slave servers (TaskTracker). The master server is the point of contact between users and the system. Users submit their self-defined map/reduce jobs to the master server. The master server places each job in a job queue and processes the queued jobs on a first-come, first-served basis. The master server assigns map and reduce operations to different slave servers. The slave servers execute operations under the master server's control, and data is also transferred between different slave servers during the map and reduce phases.
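As a rough illustration of the queueing and map/reduce behaviour described here, the following single-process Python sketch processes jobs in first-come, first-served order. The names `run_job` and `run_queue`, the round-robin assignment, and the word-count job are assumptions made for the example; this is not Hadoop's API, and nothing actually runs on remote machines.

```python
from collections import deque, defaultdict

def run_job(job, slaves, log):
    """Run one job: map each split, group the output by key, then reduce."""
    map_fn, reduce_fn, splits = job
    grouped = defaultdict(list)
    for i, split in enumerate(splits):
        slave = slaves[i % len(slaves)]          # round-robin assignment (an assumption)
        log.append((slave, split))               # record which slave got which split
        for key, value in map_fn(split):
            grouped[key].append(value)           # shuffle: group map output by key
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

def run_queue(jobs, slaves, log):
    """Process submitted jobs strictly first come, first served."""
    queue = deque(jobs)
    results = []
    while queue:
        results.append(run_job(queue.popleft(), slaves, log))
    return results

def word_map(line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return sum(counts)

log = []
jobs = [(word_map, word_reduce, ["a b a", "b c"])]
results = run_queue(jobs, ["slave-1", "slave-2"], log)
```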
b) Hadoop DFS
The Hadoop Distributed File System (HDFS) is designed to store large data files across the machines of a cluster. The design derives from the Google File System (GFS). HDFS stores each file as a sequence of data blocks; all blocks in a file except the last are the same size. For fault tolerance, each block is replicated into multiple copies. The block size and replication factor of each file can be configured by the administrator. Note also that files in HDFS are write-once, and strictly only one thread may write at any given time.
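The block layout can be sketched in a few lines of Python. The rotating placement below is an assumption chosen for brevity; it is not HDFS's actual placement policy, and `split_into_blocks` and `place_replicas` are hypothetical helper names.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file into fixed-size blocks; only the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes, replication: int):
    """Assign each block to `replication` distinct nodes, rotating the start node."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c"], replication=2)
```

With a 10-byte file and a 4-byte block size, the first two blocks are full and the last holds the remaining 2 bytes; each block ends up on two different nodes.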
c) Description of the distributed architecture
A single stand-alone server has a limit on search performance, and scaling it later is nearly impossible. Existing tools such as Solr and Hibernate Search cannot meet current requirements, so a distributed parallel search framework based on Lucene + Hadoop + MapReduce was developed to address poor search performance on large data volumes.
Summary of the invention
The object of the present invention is to solve the above problems in the prior art by providing a Lucene-based distributed search method.
This object is achieved by the following technical solution:
A Lucene-based distributed search method comprising an indexing step and a searching step.
The indexing step is as follows:
Step 1: at least one index host, which builds the index, is combined with at least two slave servers by means of a distributed file system.
Step 2: the index host stores no index content; all of it is stored on the slave servers. Each time an index is created or updated, the index host calculates which part of the index each slave server must build and then distributes the tasks to the slave servers.
Step 3: each slave server fetches the corresponding data from its data source, builds it into an index, and reports the indexing status back to the index host when finished.
Step 4: if index building fails in step 3, the index host resends the rebuild command to the corresponding slave server.
The searching step is as follows:
Step 1: a search engine is formed from at least one search host and at least two slave servers. Each slave server exposes an RMI search interface, and the search host connects to the slave servers through that interface.
Step 2: when a user queries, the search host determines from the index distribution strategy which slave servers hold the relevant index data and sends the query command to those slave servers.
Step 3: the slave servers execute the query concurrently and return their results to the search host, which reduces (merges) the results and returns them to the user.
In the above method, a deduplication process is used in step 3 of the indexing step: each slave server first deduplicates on its own according to the deduplication algorithm distributed by the index host, and once that is finished the index host deduplicates across the multiple slave servers.
Further, in step 4 of the indexing step, after all indexes have been built, the index host notifies the search host that the index has been updated, and the search host then reloads the index.
Further, in the indexing step, the index host backs up the indexes on the slave servers to the corresponding backup servers according to a backup policy.
Further, in the searching step, the search host periodically checks whether each slave server can answer queries normally.
Further, in the searching step, when a slave server cannot be reached or is powered off, the search host attempts the search on that slave server's backup server.
Still further, in the searching step, when neither a slave server nor its backup server can be searched, the search host excludes that slave server from the search, returns the data found by the other slave servers, and sends an alert to the administrator so that the problem with that slave server can be investigated promptly.
The advantages of the technical solution are mainly that it effectively solves the problems of poor single-machine search performance and of errors in individual indexes. In addition, the cooperation of multiple servers allows the system to scale effectively. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, so search is unaffected.
Brief description of the drawings
The objects, advantages, and features of the present invention are illustrated and explained through the following non-limiting description of preferred embodiments. These embodiments are only typical examples of applying the technical solution; all technical solutions formed by equivalent substitution or equivalent transformation fall within the scope of protection claimed. In the drawings:
Fig. 1 is a schematic diagram of a configuration implementing this Lucene-based distributed search method.
The reference numerals in the figure have the following meanings:
1 index host     2 slave server
3 backup server  4 search host
Embodiment
As shown in Fig. 1, the Lucene-based distributed search method is distinguished in that it comprises an indexing step and a searching step.
Specifically, the indexing step adopted by the invention is as follows. First, at least one index host 1 (MasterIndex), which builds the index, is combined with at least two slave servers 2 by means of a distributed file system (Hadoop Distributed File System, HDFS).
Next, index host 1 stores no index content; all of it is stored on the slave servers 2. Each time an index is created or updated, index host 1 calculates which part of the index each slave server 2 must build and then distributes the tasks to the slave servers 2.
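The text does not say how host 1 decides which part of the index each server 2 builds. One simple possibility, shown here purely as an assumption, is to cut the sorted record IDs into near-equal contiguous slices; `plan_index_tasks` is a hypothetical name.

```python
def plan_index_tasks(record_ids, slaves):
    """Split the records to be indexed into one contiguous slice per slave."""
    ids = sorted(record_ids)
    base, extra = divmod(len(ids), len(slaves))
    plan, start = {}, 0
    for i, slave in enumerate(slaves):
        size = base + (1 if i < extra else 0)   # spread the remainder over the first slaves
        plan[slave] = ids[start:start + size]
        start += size
    return plan

plan = plan_index_tasks(range(1, 8), ["slave-1", "slave-2", "slave-3"])
```

Seven records over three servers gives slices of 3, 2, and 2 records. A key-based split (for example by city) would fit the same interface.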
Then each slave server 2 fetches the corresponding data from its data source, builds it into an index, and reports the indexing status back to index host 1 when finished. During this stage a deduplication process is applied: each slave server 2 first deduplicates on its own according to the deduplication algorithm distributed by index host 1, and once that is finished index host 1 deduplicates across the slave servers 2. If index building fails during this process, index host 1 resends the rebuild command to the corresponding slave server 2. Once all indexes have been built, index host 1 notifies search host 4 that the index has been updated, and search host 4 then reloads the index.
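The build, report, rebuild, and two-level deduplication sequence might look like the following Python sketch. Everything here is an assumption for illustration: records are deduplicated by an `id` key, a failed build is signalled by `RuntimeError`, and the retry cap of three attempts is added only so the loop terminates (the text itself just says the rebuild command is sent again).

```python
def local_dedup(records, key):
    """Per-server pass: keep the first record seen for each key."""
    seen, unique = set(), []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique

def build_with_retry(build, slave, task, max_attempts=3):
    """Host-side loop: resend the build command until the server reports success."""
    for attempt in range(1, max_attempts + 1):
        try:
            return build(slave, task), attempt
        except RuntimeError:
            continue                      # failure reported: send the rebuild command again
    raise RuntimeError(f"{slave} failed after {max_attempts} attempts")

def global_dedup(per_slave, key):
    """Host-side pass: drop records whose key already appeared on an earlier server."""
    seen, result = set(), {}
    for slave, records in per_slave.items():
        kept = [r for r in records if key(r) not in seen]
        seen.update(key(r) for r in kept)
        result[slave] = kept
    return result

failures = {"slave-2": 1}                 # hypothetical: slave-2 fails its first attempt

def build(slave, task):
    if failures.get(slave, 0) > 0:
        failures[slave] -= 1
        raise RuntimeError("index build failed")
    return local_dedup(task, key=lambda r: r["id"])

tasks = {
    "slave-1": [{"id": 1}, {"id": 1}, {"id": 2}],
    "slave-2": [{"id": 2}, {"id": 3}],
}
built, attempts = {}, {}
for slave, task in tasks.items():
    built[slave], attempts[slave] = build_with_retry(build, slave, task)
final = global_dedup(built, key=lambda r: r["id"])
```

The local pass removes the repeated record on slave-1, and the cross-server pass removes record 2 from slave-2 because slave-1 already holds it.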
Throughout the indexing step, index host 1 also backs up the indexes on the slave servers 2 to the corresponding backup servers 3 according to a backup policy.
Further, the searching step adopted by the invention is as follows. First, a search engine is formed from at least one search host 4 (MasterSearch) and at least two slave servers 2. For convenience of the overall search, the system can later be expanded to multiple search hosts 4. Each slave server 2 exposes a Remote Method Invocation (RMI) search interface, and search host 4 connects to the slave servers 2 through that interface.
Next, when a user queries, search host 4 determines from the index distribution strategy which slave servers 2 hold the relevant index data and sends the query command to those slave servers 2.
Then the slave servers 2 execute the query concurrently and return their results to search host 4, which reduces (merges) the results and returns them to the user.
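The concurrent query and the reduce on host 4 follow a scatter-gather pattern, sketched below in Python. Plain callables stand in for the RMI search interface, and the `(score, title)` hit format and the top-N merge are assumptions; the text does not specify how results are ranked or merged.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def scatter_gather(query, slaves, top_n=3):
    """Send the query to every selected server at once, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(slaves)) as pool:
        partials = list(pool.map(lambda search: search(query), slaves))
    # Reduce step: flatten all hits and keep the top_n by descending score.
    all_hits = [hit for part in partials for hit in part]
    return heapq.nlargest(top_n, all_hits, key=lambda hit: hit[0])

def make_slave(docs):
    """Hypothetical server: returns (score, title) for titles containing the query."""
    def search(query):
        return [(score, title) for score, title in docs if query in title.lower()]
    return search

slave_1 = make_slave([(0.9, "Beijing hotel A"), (0.4, "Beijing hotel B")])
slave_2 = make_slave([(0.7, "Beijing hostel"), (0.8, "Shanghai hotel")])
merged = scatter_gather("beijing", [slave_1, slave_2], top_n=2)
```

Each server returns only its own hits; the host sees three matches in total and keeps the two with the highest scores.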
In a preferred embodiment, the index distribution strategy may partition the index horizontally; for example, the indexes for Beijing and Shanghai are placed on one slave server 2, and the indexes for Hong Kong and Guangzhou on another slave server 2. Still further, in the searching step, search host 4 periodically checks whether each slave server 2 can answer queries normally, that is, it performs heartbeat detection.
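A horizontal split by city and a heartbeat check can be sketched as follows in Python. The `PARTITIONS` table, the fallback to all servers when no city is recognised, and the timeout-based liveness rule are all assumptions made for the example.

```python
import time

PARTITIONS = {                       # hypothetical index distribution strategy
    "beijing": "slave-1", "shanghai": "slave-1",
    "hong kong": "slave-2", "guangzhou": "slave-2",
}

def route(query):
    """Return the servers whose partition key appears in the query, else all servers."""
    q = query.lower()
    targets = {slave for city, slave in PARTITIONS.items() if city in q}
    return sorted(targets or set(PARTITIONS.values()))

class HeartbeatMonitor:
    """Track the last successful probe per server; stale servers count as down."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout, self.clock, self.last_ok = timeout, clock, {}

    def probe(self, slave, ping):
        # `ping` stands in for a trivial test query against the server.
        try:
            if ping(slave):
                self.last_ok[slave] = self.clock()
        except ConnectionError:
            pass                     # leave last_ok unchanged so the server ages out

    def is_alive(self, slave):
        seen = self.last_ok.get(slave)
        return seen is not None and self.clock() - seen <= self.timeout
```

Calling `route("hotels in Beijing")` selects only `slave-1`, while a query that names no known city goes to both servers. The injectable `clock` exists only to make the monitor easy to test.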
Moreover, so that search continuity is not broken, in the searching step, when a slave server 2 cannot be reached or is powered off, search host 4 attempts the search on the backup server 3 corresponding to that slave server 2. When neither the slave server 2 nor its backup server 3 can be searched, search host 4 excludes that slave server 2 from the search and returns the data found by the other slave servers 2, so that one unreachable slave server 2 does not impair the whole search function. An alert is sent to the administrator so that the problem with that slave server 2 can be investigated promptly.
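The fallback order (server 2, then its backup server 3, then exclusion with an alert) can be expressed compactly in Python. In this sketch a server is any callable that returns hits or raises `ConnectionError`, and `alert` is any callable that accepts a message; both conventions are assumptions for illustration.

```python
def search_with_failover(query, shards, alert):
    """Query each shard's primary, then its backup; skip shards where both fail."""
    results, skipped = [], []
    for name, (primary, backup) in shards.items():
        for server in (primary, backup):
            try:
                results.extend(server(query))
                break                         # this shard answered; stop trying replicas
            except ConnectionError:
                continue
        else:
            skipped.append(name)              # neither primary nor backup responded
            alert(f"shard {name}: primary and backup both unreachable")
    return results, skipped

def up(hits):
    return lambda query: list(hits)

def down(query):
    raise ConnectionError("unreachable")

alerts = []
shards = {
    "slave-1": (down, up(["Beijing hotel A"])),   # primary down, backup answers
    "slave-2": (down, down),                      # both down: shard is skipped
    "slave-3": (up(["Guangzhou hotel"]), down),
}
results, skipped = search_with_failover("hotel", shards, alerts.append)
```

The user still receives the hits from the reachable shards, and exactly one alert is raised for the shard that could not be searched at all.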
In actual use, when a user queries "hotels in Beijing", search host 4 determines from the index distribution strategy that the Beijing index data resides on slave server 2 (Slaver1). Slave server 2 (Slaver1) runs the query and returns the data to search host 4, which processes the results and returns them to the user.
As the above description shows, the invention effectively solves the problems of poor single-machine search performance and of errors in individual indexes. In addition, the cooperation of multiple servers allows the system to scale effectively. More importantly, as the volume of index data grows, index maintenance no longer consumes excessive server performance, so search is unaffected.

Claims (7)

1. A Lucene-based distributed search method, characterized in that it comprises an indexing step and a searching step;
The indexing step is: step 1, at least one index host, which builds the index, is combined with at least two slave servers by means of a distributed file system;
Step 2, the index host stores no index content, all of which is stored on the slave servers; each time an index is created or updated, the index host calculates which part of the index each slave server must build and then distributes the tasks to the slave servers;
Step 3, each slave server fetches the corresponding data from its data source, builds it into an index, and reports the indexing status back to the index host when finished;
Step 4, if index building fails in step 3, the index host resends the rebuild command to the corresponding slave server;
The searching step is: step 1, a search engine is formed from at least one search host and at least two slave servers; each slave server exposes an RMI search interface, and the search host connects to the slave servers through that interface;
Step 2, when a user queries, the search host determines from the index distribution strategy which slave servers hold the relevant index data and sends the query command to those slave servers;
Step 3, the slave servers execute the query concurrently and return their results to the search host, which reduces (merges) the results and returns them to the user.
2. The Lucene-based distributed search method according to claim 1, characterized in that: a deduplication process is used in step 3 of the indexing step; each slave server first deduplicates on its own according to the deduplication algorithm distributed by the index host, and once that is finished the index host deduplicates across the multiple slave servers.
3. The Lucene-based distributed search method according to claim 1, characterized in that: in step 4 of the indexing step, after all indexes have been built, the index host notifies the search host that the index has been updated, and the search host then reloads the index.
4. The Lucene-based distributed search method according to claim 1, characterized in that: in the indexing step, the index host backs up the indexes on the slave servers to the corresponding backup servers according to a backup policy.
5. The Lucene-based distributed search method according to claim 1, characterized in that: in the searching step, the search host periodically checks whether each slave server can answer queries normally.
6. The Lucene-based distributed search method according to claim 1, characterized in that: in the searching step, when a slave server cannot be reached or is powered off, the search host attempts the search on that slave server's backup server.
7. The Lucene-based distributed search method according to claim 6, characterized in that: in the searching step, when neither a slave server nor its backup server can be searched, the search host excludes that slave server from the search, returns the data found by the other slave servers, and sends an alert to the administrator so that the problem with that slave server can be investigated promptly.
CN201110122631.7A 2011-05-12 2011-05-12 Lucene-based distributed search method Active CN102779134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110122631.7A CN102779134B (en) 2011-05-12 2011-05-12 Lucene-based distributed search method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110122631.7A CN102779134B (en) 2011-05-12 2011-05-12 Lucene-based distributed search method

Publications (2)

Publication Number Publication Date
CN102779134A (en) 2012-11-14
CN102779134B (en) 2015-05-13

Family

ID=47124051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110122631.7A Active CN102779134B (en) 2011-05-12 2011-05-12 Lucene-based distributed search method

Country Status (1)

Country Link
CN (1) CN102779134B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060089946A1 (en) * 2001-01-16 2006-04-27 Schumacher Michael K System and method for managing information for a plurality of computer systems in a distributed network
CN101051309A (en) * 2006-04-06 2007-10-10 中国科学院计算技术研究所 Researching system and method used in digital labrary
CN101667179A (en) * 2008-09-03 2010-03-10 华为技术有限公司 Mobile search method and system, and method for synchronizing search capability of search server

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412933A (en) * 2013-08-20 2013-11-27 南京物联网应用研究院有限公司 Cloud search platform
CN106484694A (en) * 2015-08-25 2017-03-08 杭州华为数字技术有限公司 Full-text search method based on distributed data base and system
CN106484694B (en) * 2015-08-25 2019-09-20 杭州华为数字技术有限公司 Full-text search method and system based on distributed data base
CN107066527A (en) * 2017-02-24 2017-08-18 湖南蚁坊软件股份有限公司 A kind of method and system of the caching index based on out-pile internal memory
CN107066527B (en) * 2017-02-24 2019-10-29 湖南蚁坊软件股份有限公司 A kind of method and system of the caching index based on out-pile memory
CN107679248A (en) * 2017-10-30 2018-02-09 江苏鸿信***集成有限公司 A kind of intelligent data search method

Also Published As

Publication number Publication date
CN102779134B (en) 2015-05-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: TONGCHENG NETWORK TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: SUZHOU TONGCHENG TRAVEL NETWORK TECHNOLOGY CO., LTD.

Effective date: 20121212

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 215123 SUZHOU, JIANGSU PROVINCE TO: 215021 SUZHOU, JIANGSU PROVINCE

TA01 Transfer of patent application right

Effective date of registration: 20121212

Address after: Xinghu Street Industrial Park of Suzhou city in Jiangsu province 215021 Creative Industry Park 5 Building No. 328

Applicant after: Tongcheng Network Technology Co., Ltd.

Address before: Xinghu Street Industrial Park of Suzhou city in Jiangsu province 215123 Creative Industry Park 5 Building No. 328

Applicant before: Suzhou Tongcheng Travel Network Technology Co.,Ltd.

C14 Grant of patent or utility model
GR01 Patent grant