CN103399915A

CN103399915A - Optimal reading method for index file of search engine

Info

Publication number: CN103399915A
Application number: CN2013103293419A
Authority: CN
Inventors: 姜贤武
Original assignee: BEIJING HUAYI INTERACTIVE TECHNOLOGY Co Ltd
Current assignee: BEIJING HUAYI INTERACTIVE TECHNOLOGY Co Ltd
Priority date: 2013-07-31
Filing date: 2013-07-31
Publication date: 2013-11-20

Abstract

The invention relates to an optimal reading method for an index file of a search engine. The method comprises the steps that an index database fragment file is merged; a virtual memory file system is established to store a merged index database file; the index database file is read from the virtual memory file system into memory to form a memory index directory, and an index operation is conducted on the memory index director. According to the optimal reading method of indexing the file for the search engine, by means of the index database file merging strategy and combined use of the memory index directory and a file system index directory, an index of the virtual memory file system is read into the memory to form the memory index directory, reading speed of the index file is improved, disc space can be effectively utilized, and server resources are saved.

Description

A kind of optimization read method of search engine index file

Technical field

The invention belongs to the Information Technology Agreement field, be specifically related to a kind of optimization read method of search engine index file.

Background technology

Flourish along with internet, people also increase severely to the demand of quantity of information thereupon, and the approach of people's obtaining information is also more and more.

Search engine, as the Core Feature of network information search, is brought into play huge effect in daily life.At present domestic have a lot of searching products, and domestic internet scale is big, and quantity of information is big has also brought no small challenge to search technique, how to accomplish quicker, more accurate, the resource-saving problem that also with regard to becoming search engine businessman, need to solve more.

Setting up index is one of search engine core technology, and the purpose of setting up index is the inquiry that can respond fast the user.The most frequently used index data structure of search engine is inverted entry, and the principle of inverted entry is in fact quite simple, for convenient, processes, and tends to a word and document code and is converted to digital form.

Inverted index also often is called as reverse indexing, inserts archives or reverse archives, it is a kind of indexing means, be used to be stored in the mapping in a document or the heavy memory location of one group of document of certain word under full-text search, it is data structure the most frequently used in DRS.Search engine is realized the most basic high speed retrieval by such inverted index file.

When index of reference is searched, at first the document that needs index is carried out pre-service, set up the index structure about these documents.The technology of index mainly contains three kinds: inverted index, suffix array and signature file.Wherein, Inverted Index Technique is widely used in current most information retrieval system, and it is very effective for the search of keyword, is also this technology of using in Lucene.Suffix array technology has very fast speed in phrase inquiry, but such data structure at structure and while safeguarding all more complicated some.The signature document technology is popular in the period eighties 20th century, but Inverted Index Technique had surmounted it gradually afterwards.

In the technology of search engine, index is a very complicated technology, and index file has following characteristics: 1) file is extremely large, is all generally the TB level files; 2) file is read-only does not write, and index only is used for inquiring about, and only has read-only operation; 3) replacing of index file is not frequent; 4) file need to read fast.Usually wish that index file has following character:

1) index file is deposited as far as possible continuously on disk

Index file is with machine-readable part substantially all in internal memory, and inverted list is in disk, and inverted list is order and deposits, so index file can be regarded the situation that order is read as.

Even if a large file is sequentially accessed, but due to file system, in distribution, be to be distributed in different disk blocks, also can cause looking the access of order, be random access in fact, the seek time of disk is inevitable, the optimization that disk reads, buffer memory for example, look ahead etc. and all can't enjoy.

2) the making cost of index file is controlled, makes mistakes and can make up

Excessive index file, access efficiency are not high, and the reconstruction cost is high, and cost is high if data exception is reformed.Therefore index layering, the making of index segmentation is very important.

3) index file can support to read fast (with machine-readable, order is read)

Prior art adopts the mode of mmap to read usually, and it reads process is that a file or other object map are advanced internal memory.File is mapped on a plurality of pages; if the big or small sum of the big or small not all page of file; the space that last page is not used will zero clearing; the protection of internal memory is take page as base unit; even mapped file only has a byte-sized, kernel also can distribute for mapping the internal memory of a page size.During less than a page size, process can conduct interviews to a page size that starts from mmap () return address, and can not make mistakes when mapped file; But if the address space beyond a page is conducted interviews, generation leads to errors.Therefore, the effective address space size that can be used for interprocess communication can not surpass file size and a page size and, so the waste that may produce memory headroom during file reading.What the present invention was namely emphasis for the data directory file reads, effectively saves the improvement project of utilizing disk space and proposing fast.

Summary of the invention

On the basis of prior art, the object of the present invention is to provide a kind of method that promotes the reading speed of index file,, to realize reading fast of data directory file, can effectively utilize disk space and save server resource.

For achieving the above object, technical scheme of the present invention is:

A kind of optimization read method of search engine index file, the steps include:

1) merge the index database clip file;

2) create the virtual memory file system to store the index database file after merging;

3) the index database file is read internal memory and forms the internal memory index list from the virtual memory file system, this internal memory index list is carried out index operation.

Further, buffer zone is set be used for merges the index database clip file in internal memory, and the size by adjusting buffer zone and toward the frequency of writing index file on disk, improve index creation and search speed.

Further, adjust the size of described buffer zone by merging the factor, minimum document merging number and these three key parameters of maximum document merging number.

Further, merge the frequency of index database file for once a day.

Further, adopt linux operating system, adopt tmpfs virtual memory file system.Described tmpfs uses physical memory, perhaps uses exchange partition.

Further, adopt windows operating system, by internal memory virtual disk mode, create the virtual memory file system.

The present invention is by index database Piece file mergence strategy, and internal memory index list and file system index list are combined with, the index of virtual memory file system is read in internal memory and forms the internal memory index list, improve the reading speed of index file, can effectively utilize disk space and save server resource.

Description of drawings

Fig. 1 is the flow chart of steps of the optimization read method of search engine index file of the present invention.

Fig. 2 is the workflow diagram of full-text index.

Embodiment

Below embodiments of the present invention are described further.

Fig. 1 is the flow chart of steps of the optimization read method of search engine index file of the present invention, specifically comprises:

1. merging index database

Reverse indexing is other all information of this keyword that oppositely obtain according to keyword, and such as the file at this keyword place, the number of times that occurs in file and line number etc., these information are exactly the information that the user will use while searching this keyword.Be put in disk file with reverse indexing, the index new along with passage of time can add, and this process is exactly increment index.The increment index process occurs in " index creation " this link.In general, " index creation " process comprises " initialization establishment " and " increment establishment ".The strategy that index creation adopts is at first to carry out once " initialization establishment ", then carries out unlimited " increment establishment ", and reason is to reduce disk operating, shortens update time, also facilitates the index database expansion of the image height order of magnitude, but increases like this maintenance difficulties.Fig. 2 is the workflow diagram of full-text index (reverse indexing), and wherein, the IndexWriter assembly is used for the index file read-write, the IndexSearcher assembly is used for inquiry, the TopDocsCollector assembly represents to return user's collection of document, and Query is query grammar, carries out query parse.This flow process brief description is: document is added into the IndexWriter assembly to create reverse indexing, and the IndexSearcher assembly reads index by function, then by the TopDocsCollector assembly, returns to user's collection of document.

When needing a large amount of file of index, the bottleneck of Index process is in write the process of index file on disk.Through after a while, the index in internal memory can be larger, if nonjoinder to hard disk, may cause Out of Memory use, and the process that therefore need to merge.In order to address this problem, Lucene holds a block buffer in internal memory, and provide size that merging indexing means (as described in hypomere) adjusts buffer zone and toward the frequency of writing index file on disk, raising index creation and search speed, reduce memory cost.

The change frequency of index database is less, so be difficult for changing, there is no need frequent merging, and consolidation strategy of the present invention is to merge once every day.Can set this merging cycle according to different concrete conditions and actual needs, the attribution data that increases so is in up-to-date index word bank, and the data of revising and deleting may belong to any index word bank.Increase merges the factor (mergeFactor) and minimum document merging number (minMergeDocs) helps to improve performance, reduces the index time.Merging the factor is that index merges the parameter of adjusting buffer zone, and this parameter has determined the frequency that can deposit how many documents and the index block on disk is merged into a large index block in the index block of Lucene.Such as, be 10 if merge the value of the factor, all documents all must be write in a new index block on disk when the number of files in internal memory reaches 10 so.Minimum document merges this parameter of number also can affect the performance of index.It has determined how many number of files in the internal memory reaches at least and they could be write back disk.The default value of this parameter is 10, if you have enough internal memories, that so this value is established as far as possible largerly will improve the index performance significantly.Maximum document merges number this parameter and has determined the number of files of the maximum in an index block.Its default value is Integer.MAX_VALUE, and this parameter is set to larger value can improve index efficiency and retrieval rate, because the default value of this parameter is the maximal value of integer, so generally do not need to change this parameter.The merging factor that IndexWrite assembly by Lucence provides, minimum document merge number and maximum document and merge these three key parameters of number and adjust the size of buffer zone, merge index, again index is write on disk when index in internal memory reaches certain quantity, utilized the hardware resource of machine to improve the efficiency of index.

2. create the virtual memory file system

The virtual memory file system is used for storing the index database file after merging, forms the file system index list.If linux operating system can adopt tmpfs virtual memory file system, tmpfs can use physical memory, also can use exchange partition.If windows also has similar program, by internal memory virtual disk mode.

3. the index database file is read internal memory and forms the internal memory index list from the virtual memory file system, this internal memory index list is carried out index operation.

The operating speed of internal memory index list is very fast, thus the present invention in the operation index the index library file from the virtual memory file system is loaded into internal memory, form the internal memory index list, write back again the virtual memory file system after operation is completed.

Adopt method of the present invention, the index of virtual memory file system is read in internal memory the operating speed that forms the internal memory index list very fast, but the disk space of system can increase to some extent, increasing disk space can solve by enlarging disk.Method of the present invention forms the internal memory index list by index database Piece file mergence strategy, establishment virtual memory file system, accelerate the reading speed of index,, by said method, utilize Piece file mergence strategy, machine hardware resource can bring the reading speed of 10%-20% and the lifting on search speed.

Above embodiment is only in order to technical scheme of the present invention to be described but not be limited; those of ordinary skill in the art can modify or be equal to replacement technical scheme of the present invention; and not breaking away from the spirit and scope of the present invention, protection scope of the present invention should be as the criterion so that claim is described.

Claims

1. the optimization read method of a search engine index file, its step comprises:

1) merge the index database clip file;

3) the index database file is read internal memory and forms the internal memory index list from the virtual memory file system, to this internal memory index order

Index operation is carried out in record.

2. the method for claim 1 is characterized in that: buffer zone is set is used for merging the index database clip file in internal memory, and the size by adjusting buffer zone and toward the frequency of writing index file on disk, improve index creation and search speed.

3. method as claimed in claim 2, is characterized in that: the size of adjusting described buffer zone by merging the factor, minimum document merging number and these three key parameters of maximum document merging number.

4. method as claimed in claim 2, is characterized in that: merge the frequency of index database file for once a day.

5. the method for claim 1, is characterized in that: adopt linux operating system, adopt tmpfs virtual memory file system.

6. method as claimed in claim 5 is characterized in that: described tmpfs uses physical memory, perhaps uses exchange partition.

7. the method for claim 1, is characterized in that: adopt windows operating system, by internal memory virtual disk mode, create the virtual memory file system.