CN103399915A - Optimal reading method for index file of search engine - Google Patents

Optimal reading method for index file of search engine

Info

Publication number
CN103399915A
CN103399915A CN2013103293419A CN201310329341A
Authority
CN
China
Prior art keywords
index
file
memory
internal memory
merging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103293419A
Other languages
Chinese (zh)
Inventor
姜贤武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING HUAYI INTERACTIVE TECHNOLOGY Co Ltd
Original Assignee
BEIJING HUAYI INTERACTIVE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING HUAYI INTERACTIVE TECHNOLOGY Co Ltd filed Critical BEIJING HUAYI INTERACTIVE TECHNOLOGY Co Ltd
Priority to CN2013103293419A priority Critical patent/CN103399915A/en
Publication of CN103399915A publication Critical patent/CN103399915A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an optimized reading method for the index files of a search engine. The method comprises the steps of: merging the fragment files of an index database; creating a virtual-memory file system to store the merged index database file; and reading the index database file from the virtual-memory file system into memory to form a memory index directory, on which index operations are then performed. By combining an index-file merging strategy with the joint use of a memory index directory and a file-system index directory, the method reads the index held in the virtual-memory file system into memory to form the memory index directory, thereby improving the reading speed of index files, making effective use of disk space, and saving server resources.

Description

An optimized reading method for search engine index files
Technical field
The invention belongs to the field of information technology, and specifically relates to an optimized reading method for search engine index files.
Background technology
With the rapid development of the Internet, people's demand for information has grown sharply, and the channels through which they obtain information have multiplied.
As the core function of network information retrieval, search engines play an enormous role in daily life. There are currently many domestic search products, and the large scale of the domestic Internet and its huge volume of information pose no small challenge to search technology: how to make search faster, more accurate, and more resource-efficient is a problem that search engine vendors need to solve.
Building an index is one of the core technologies of a search engine; the purpose of building an index is to respond quickly to user queries. The most commonly used index data structure in search engines is the inverted list. The principle of an inverted list is in fact quite simple: for convenience of processing, terms and document identifiers are usually converted into numeric form.
An inverted index, also called a reverse index, postings file, or inverted file, is an indexing method used in full-text search to store a mapping from a term to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems, and it is through such inverted index files that a search engine achieves its most basic high-speed retrieval.
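As a minimal illustration of the inverted-list idea described above (an independent sketch, not the patent's implementation), the following Python snippet maps each term to the documents containing it, together with the term positions inside each document:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} -- a tiny postings structure."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {
    1: "search engine index file",
    2: "index file read speed",
}
index = build_inverted_index(docs)
# Terms point back to the documents (and positions) that contain them.
print(sorted(index["index"]))   # doc ids containing "index"
print(index["file"][2])         # positions of "file" in doc 2
```

Looking up a keyword is then a dictionary access rather than a scan over all documents, which is exactly what makes inverted files the basis of high-speed retrieval.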
Before searching with an index, the documents to be indexed are first preprocessed and an index structure is built for them. There are three main indexing techniques: inverted indexes, suffix arrays, and signature files. Among them, the inverted index is widely used in most current information retrieval systems; it is very effective for keyword search and is the technique used in Lucene. Suffix arrays are very fast for phrase queries, but such data structures are more complicated to build and maintain. Signature files were popular in the 1980s, but inverted indexes gradually surpassed them.
In search engine technology, indexing is a very complex technique. Index files have the following characteristics: 1) the files are extremely large, generally at the terabyte level; 2) the files are read-only: an index is used only for queries, so only read operations occur; 3) index files are replaced infrequently; 4) the files need to be read quickly. An index file is therefore usually expected to have the following properties:
1) The index file is stored as contiguously as possible on disk.
The randomly read parts of an index file are essentially all in memory, while the posting lists reside on disk and are stored sequentially, so reading an index file can be regarded as sequential reading.
However, even if a large file is accessed sequentially, the file system may have allocated it across different disk blocks, so accesses that look sequential are in fact random: disk seek time is then unavoidable, and optimizations for disk reads such as caching and prefetching cannot take effect.
2) The cost of building the index file is controllable, and errors can be remedied.
An oversized index file has low access efficiency and a high rebuilding cost; if the data is corrupted, redoing the work is expensive. Layered and segmented index construction is therefore very important.
3) The index file supports fast reads (both random and sequential).
The prior art usually reads the index by means of mmap, which maps a file or other object into memory. A file is mapped onto a number of pages; if the file size is not an exact multiple of the page size, the unused part of the last page is zero-filled. Memory protection is performed at page granularity: even if the mapped file is only one byte in size, the kernel still allocates a full page of memory for the mapping. When the mapped file is smaller than one page, the process can access the whole page starting from the address returned by mmap() without error, but accessing an address beyond that page raises an error. The effective address space usable for the mapping therefore cannot exceed the sum of the file size and one page, so reading files this way may waste memory. The present invention is an improvement proposed precisely for fast reading of index data files while making effective use of disk space.
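The page-granularity behavior described above can be observed directly with Python's `mmap` module (a small demonstration, not part of the patented method): the mapping length follows the file size, but the kernel's protection and allocation unit is `mmap.PAGESIZE`.

```python
import mmap
import os
import tempfile

# Write a file much smaller than one page, then map it.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"tiny index stub")           # 15 bytes

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The mapping length equals the file size, but the kernel still
    # backs it with at least one whole page of memory.
    print(len(mm), mmap.PAGESIZE)         # e.g. 15 4096
    assert len(mm) == 15
    mm.close()
os.remove(path)
```

For a one-byte file the process thus "pays" a full page, which is the memory waste the description refers to when many small index files are mapped.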
Summary of the invention
On the basis of the prior art, the object of the present invention is to provide a method that improves the reading speed of index files, so as to achieve fast reading of index data files, make effective use of disk space, and save server resources.
To achieve the above object, the technical scheme of the present invention is:
An optimized reading method for search engine index files, the steps of which are:
1) merging the fragment files of the index database;
2) creating a virtual-memory file system to store the merged index database file;
3) reading the index database file from the virtual-memory file system into memory to form a memory index directory, and performing index operations on this memory index directory.
Further, a buffer is set up in memory for merging the fragment files of the index database, and index creation and search speed are improved by adjusting the size of the buffer and the frequency with which the index file is written to disk.
Further, the size of said buffer is adjusted through three key parameters: the merge factor, the minimum document merge number, and the maximum document merge number.
Further, the index database files are merged once a day.
Further, a Linux operating system is used with the tmpfs virtual-memory file system. Said tmpfs uses physical memory, or alternatively a swap partition.
Further, a Windows operating system is used, and the virtual-memory file system is created by means of a RAM disk.
Through the index-file merging strategy and the combined use of the memory index directory and the file-system index directory, the present invention reads the index in the virtual-memory file system into memory to form the memory index directory, improving the reading speed of index files, making effective use of disk space, and saving server resources.
Description of drawings
Fig. 1 is a flow chart of the steps of the optimized reading method for search engine index files of the present invention.
Fig. 2 is the workflow diagram of full-text indexing.
Embodiment
Embodiments of the present invention are described further below.
Fig. 1 is a flow chart of the steps of the optimized reading method for search engine index files of the present invention, which specifically comprises:
1. Merging the index database
A reverse index obtains, from a keyword, all the other information about that keyword, such as the files in which it appears and the number of occurrences and line numbers within each file; this is the information the user wants when searching for that keyword. The reverse index is stored in disk files, and new index entries are added as time passes; this process is incremental indexing, and it occurs in the "index creation" stage. In general, "index creation" comprises "initial creation" and "incremental creation". The strategy adopted is to perform "initial creation" once and then perform "incremental creation" indefinitely; this reduces disk operations, shortens update time, and eases expansion of the index database by orders of magnitude, although it increases maintenance difficulty. Fig. 2 is the workflow diagram of full-text (reverse) indexing, in which the IndexWriter component reads and writes index files, the IndexSearcher component performs queries, the TopDocsCollector component represents the document collection returned to the user, and Query is the query syntax, which is parsed during querying. Briefly, the flow is: documents are added to the IndexWriter component to create the reverse index; the IndexSearcher component reads the index through function calls; and the TopDocsCollector component then returns the document collection to the user.
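The writer/searcher/collector flow just described can be sketched in plain Python. The class names below deliberately mirror the Lucene components named in the text, but this is an illustrative stand-in, not the Lucene API:

```python
class IndexWriterSketch:
    """Builds the reverse index (stands in for Lucene's IndexWriter)."""
    def __init__(self):
        self.index = {}          # term -> set of doc ids
        self.docs = {}           # doc id -> original text
    def add_document(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(doc_id)

class IndexSearcherSketch:
    """Reads the index and returns the matching document collection
    (the TopDocsCollector role is folded into the return value)."""
    def __init__(self, writer):
        self.writer = writer
    def search(self, query_term):
        hits = sorted(self.writer.index.get(query_term.lower(), set()))
        return [self.writer.docs[d] for d in hits]

w = IndexWriterSketch()
w.add_document(1, "fast index read")
w.add_document(2, "index merge strategy")
s = IndexSearcherSketch(w)
print(s.search("index"))   # both documents match
```

Adding a document later through the same writer is exactly the "incremental creation" step: the new postings simply join the existing term sets.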
When a large number of files need to be indexed, the bottleneck of the indexing process lies in writing the index files to disk. After a while, the index held in memory grows large, and if it is not merged out to disk, memory may be exhausted; a merging process is therefore needed. To address this, Lucene keeps a buffer in memory and provides merge-tuning parameters (described in the next paragraph) to adjust the size of the buffer and the frequency with which index files are written to disk, improving index creation and search speed while reducing memory overhead.
The index database changes infrequently, so frequent merging is unnecessary; the merging strategy of the present invention is to merge once a day. This merging cycle can be set according to the concrete situation and actual needs: newly added data belongs to the latest index sub-library, while modified and deleted data may belong to any index sub-library. Increasing the merge factor (mergeFactor) and the minimum document merge number (minMergeDocs) helps improve performance and reduce indexing time. The merge factor is the parameter by which index merging adjusts the buffer; it determines how many documents can be held in a Lucene index block and how often the index blocks on disk are merged into one large index block. For example, if the merge factor is 10, then when the number of documents in memory reaches 10, they must all be written to a new index block on disk. The minimum document merge number also affects indexing performance: it determines how many documents must accumulate in memory before they are written back to disk. Its default value is 10; if enough memory is available, setting it as large as possible markedly improves indexing performance. The maximum document merge number determines the maximum number of documents in one index block. Its default value is Integer.MAX_VALUE; a larger value can improve indexing efficiency and retrieval speed, but since the default is already the maximum integer value, this parameter generally does not need to be changed. Through the merge factor, minimum document merge number, and maximum document merge number provided by Lucene's IndexWriter component, the buffer size is adjusted and the index is merged; the in-memory index is written to disk once it reaches a certain size, using the machine's hardware resources to improve indexing efficiency.
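The interaction of these parameters can be modeled with a toy simulation. The class below is a deliberate simplification of the mergeFactor/minMergeDocs/maxMergeDocs semantics described above, not Lucene's actual merge policy: documents accumulate in a memory buffer, are flushed to a disk "segment" when the buffer reaches the merge factor, and segments are merged into one larger segment once merge-factor-many of them exist.

```python
class MergeBuffer:
    """Toy model of Lucene-style merge tuning (hypothetical simplification)."""
    def __init__(self, merge_factor=10, max_merge_docs=10**9):
        self.merge_factor = merge_factor
        self.max_merge_docs = max_merge_docs
        self.buffer = []
        self.segments = []       # each segment is a list of documents
        self.flushes = 0         # number of buffer-to-disk flushes

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.merge_factor:
            self.flush()

    def flush(self):
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = []
            self.flushes += 1
        # Merge small segments into one large one, capped by max_merge_docs.
        if len(self.segments) >= self.merge_factor:
            merged = [d for seg in self.segments for d in seg]
            if len(merged) <= self.max_merge_docs:
                self.segments = [merged]

mb = MergeBuffer(merge_factor=10)
for i in range(100):                  # 100 docs, merge factor 10
    mb.add(f"doc{i}")
print(mb.flushes, len(mb.segments))   # 10 flushes, merged into 1 segment
```

A larger merge factor means fewer, larger flushes (fewer disk writes, more memory used), which is the trade-off the description exploits.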
2. Creating the virtual-memory file system
The virtual-memory file system stores the merged index database file and forms the file-system index directory. On a Linux operating system, the tmpfs virtual-memory file system can be used; tmpfs can use physical memory or a swap partition. Windows has similar facilities, creating the virtual-memory file system by means of a RAM disk.
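On most Linux systems a tmpfs instance is already mounted at /dev/shm, so a memory-backed file system can be used through the ordinary file API. The sketch below assumes that mount point and falls back to a normal temporary directory elsewhere (a portability assumption, not part of the patented method):

```python
import os
import tempfile

# /dev/shm is a tmpfs mount on most Linux systems; fall back to an
# ordinary temporary directory on other platforms (assumption).
base = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
index_path = os.path.join(base, "merged_index.bin")  # hypothetical file name

with open(index_path, "wb") as f:      # store the merged index file
    f.write(b"\x00" * 1024)
size = os.path.getsize(index_path)
print(base, size)                      # the file lives in the chosen fs
os.remove(index_path)
```

Reads from a tmpfs-backed file hit RAM rather than disk, which is the property the method relies on for fast index reads.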
3. Reading the index database file from the virtual-memory file system into memory to form the memory index directory, and performing index operations on this memory index directory.
Because a memory index directory operates very fast, the present invention loads the index database file from the virtual-memory file system into memory when operating on the index, forming the memory index directory, and writes it back to the virtual-memory file system after the operation is completed.
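The load/operate/write-back cycle can be sketched as follows. A plain temporary directory stands in for the virtual-memory file system, and a dict serialized as JSON stands in for the index database file (both are illustrative assumptions):

```python
import json
import os
import tempfile

def load_index(path):
    """Read the index file from the (virtual-memory) file system into a
    memory index directory -- here just a dict, for illustration."""
    with open(path) as f:
        return json.load(f)

def store_index(path, index):
    """Write the memory index directory back to the file system."""
    with open(path, "w") as f:
        json.dump(index, f)

# A temp directory stands in for the virtual-memory file system.
path = os.path.join(tempfile.mkdtemp(), "index.json")
store_index(path, {"engine": [1], "index": [1, 2]})

idx = load_index(path)           # 1) load into memory
idx["merge"] = [2]               # 2) operate on the memory index
store_index(path, idx)           # 3) write back after the operation

print(sorted(load_index(path)))  # ['engine', 'index', 'merge']
```

All index operations thus run against the in-memory copy, and the file-system copy is only touched at load and write-back time.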
With the method of the present invention, reading the index of the virtual-memory file system into memory to form the memory index directory makes index operations very fast, although the system's disk-space usage increases somewhat; this can be addressed by enlarging the disk. By merging the index database files and creating a virtual-memory file system to form the memory index directory, the method accelerates index reading; using the file-merging strategy together with the machine's hardware resources in this way can bring a 10%-20% improvement in reading speed and search speed.
The above embodiment is intended only to illustrate, not to limit, the technical scheme of the present invention. Those of ordinary skill in the art may modify the technical scheme of the present invention or replace it with equivalents without departing from its spirit and scope; the protection scope of the present invention shall be defined by the claims.

Claims (7)

1. An optimized reading method for search engine index files, the steps of which comprise:
1) merging the fragment files of the index database;
2) creating a virtual-memory file system to store the merged index database file;
3) reading the index database file from the virtual-memory file system into memory to form a memory index directory, and performing index operations on this memory index directory.
2. the method for claim 1 is characterized in that: buffer zone is set is used for merging the index database clip file in internal memory, and the size by adjusting buffer zone and toward the frequency of writing index file on disk, improve index creation and search speed.
3. method as claimed in claim 2, is characterized in that: the size of adjusting described buffer zone by merging the factor, minimum document merging number and these three key parameters of maximum document merging number.
4. method as claimed in claim 2, is characterized in that: merge the frequency of index database file for once a day.
5. the method for claim 1, is characterized in that: adopt linux operating system, adopt tmpfs virtual memory file system.
6. method as claimed in claim 5 is characterized in that: described tmpfs uses physical memory, perhaps uses exchange partition.
7. the method for claim 1, is characterized in that: adopt windows operating system, by internal memory virtual disk mode, create the virtual memory file system.
CN2013103293419A 2013-07-31 2013-07-31 Optimal reading method for index file of search engine Pending CN103399915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103293419A CN103399915A (en) 2013-07-31 2013-07-31 Optimal reading method for index file of search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103293419A CN103399915A (en) 2013-07-31 2013-07-31 Optimal reading method for index file of search engine

Publications (1)

Publication Number Publication Date
CN103399915A true CN103399915A (en) 2013-11-20

Family

ID=49563543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103293419A Pending CN103399915A (en) 2013-07-31 2013-07-31 Optimal reading method for index file of search engine

Country Status (1)

Country Link
CN (1) CN103399915A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617133A (en) * 2013-12-06 2014-03-05 北京奇虎科技有限公司 Method and device for compressing virtual memory in Windows system
CN105426124A (en) * 2015-11-06 2016-03-23 江苏省电力公司扬州供电公司 RFS-based fast F-IO read-write system and method
CN107066527A (en) * 2017-02-24 2017-08-18 湖南蚁坊软件股份有限公司 A kind of method and system of the caching index based on out-pile internal memory
CN107992569A (en) * 2017-11-29 2018-05-04 北京小度信息科技有限公司 Data access method, device, electronic equipment and computer-readable recording medium
CN110109927A (en) * 2019-04-25 2019-08-09 上海新炬网络技术有限公司 Oracle database data processing method based on LSM tree
CN112748866A (en) * 2019-10-31 2021-05-04 北京沃东天骏信息技术有限公司 Method and device for processing incremental index data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1794239A (en) * 2005-12-30 2006-06-28 张天山 Automatic generating system of template network station possessing searching function and its method
CN102129459A (en) * 2011-03-10 2011-07-20 成都四方信息技术有限公司 Omnibearing enterprise data exchange high-speed engine
CN102890682A (en) * 2011-07-21 2013-01-23 腾讯科技(深圳)有限公司 Method for creating index, searching method, device and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1794239A (en) * 2005-12-30 2006-06-28 张天山 Automatic generating system of template network station possessing searching function and its method
CN102129459A (en) * 2011-03-10 2011-07-20 成都四方信息技术有限公司 Omnibearing enterprise data exchange high-speed engine
CN102890682A (en) * 2011-07-21 2013-01-23 腾讯科技(深圳)有限公司 Method for creating index, searching method, device and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FX_SKY: ""Lucene3.6总结篇"", 《CSDN博客》 *
HUANGXC: ""nutch Lucene实现全文索引"", 《CSDN:BLOG.CSDN.NET/HUANGXC/ARTICLE/DETAILS/2197962》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617133A (en) * 2013-12-06 2014-03-05 北京奇虎科技有限公司 Method and device for compressing virtual memory in Windows system
CN103617133B (en) * 2013-12-06 2017-08-25 北京奇虎科技有限公司 Virtual memory compression method and device in a kind of Windows systems
CN105426124A (en) * 2015-11-06 2016-03-23 江苏省电力公司扬州供电公司 RFS-based fast F-IO read-write system and method
CN107066527A (en) * 2017-02-24 2017-08-18 湖南蚁坊软件股份有限公司 A kind of method and system of the caching index based on out-pile internal memory
CN107066527B (en) * 2017-02-24 2019-10-29 湖南蚁坊软件股份有限公司 A kind of method and system of the caching index based on out-pile memory
CN107992569A (en) * 2017-11-29 2018-05-04 北京小度信息科技有限公司 Data access method, device, electronic equipment and computer-readable recording medium
CN110109927A (en) * 2019-04-25 2019-08-09 上海新炬网络技术有限公司 Oracle database data processing method based on LSM tree
CN112748866A (en) * 2019-10-31 2021-05-04 北京沃东天骏信息技术有限公司 Method and device for processing incremental index data

Similar Documents

Publication Publication Date Title
CN109213772B (en) Data storage method and NVMe storage system
CN110825748B (en) High-performance and easily-expandable key value storage method by utilizing differentiated indexing mechanism
CN102890722B (en) Indexing method applied to time sequence historical database
CN103399915A (en) Optimal reading method for index file of search engine
US20190042571A1 (en) Update-Insert for Key-Value Storage Interface
CN103229173B (en) Metadata management method and system
CN104346357B (en) The file access method and system of a kind of built-in terminal
CN100468402C (en) Sort data storage and split catalog inquiry method based on catalog tree
CN105975587B (en) A kind of high performance memory database index organization and access method
CN105677826A (en) Resource management method for massive unstructured data
CN103530387A (en) Improved method aimed at small files of HDFS
CN105787093B (en) A kind of construction method of the log file system based on LSM-Tree structure
CN105045850B (en) Junk data recovery method in cloud storage log file system
CN103186350A (en) Hybrid storage system and hot spot data block migration method
CN103577123A (en) Small file optimization storage method based on HDFS
CN104679898A (en) Big data access method
CN105117417A (en) Read-optimized memory database Trie tree index method
CN104778270A (en) Storage method for multiple files
CN103488710B (en) The non-fixed-length data method of efficient storage in big data page
CN103176754A (en) Reading and storing method for massive amounts of small files
CN106502587A (en) Data in magnetic disk management method and magnetic disk control unit
CN102737133B (en) A kind of method of real-time search
CN103885887B (en) User data storage method, read method and system
CN102541985A (en) Organization method of client directory cache in distributed file system
CN101986649B (en) Shared data center used in telecommunication industry billing system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131120