CN103399915A - Optimal reading method for index file of search engine - Google Patents
- Publication number
- CN103399915A CN103399915A CN2013103293419A CN201310329341A CN103399915A CN 103399915 A CN103399915 A CN 103399915A CN 2013103293419 A CN2013103293419 A CN 2013103293419A CN 201310329341 A CN201310329341 A CN 201310329341A CN 103399915 A CN103399915 A CN 103399915A
- Authority
- CN
- China
- Prior art keywords
- index
- file
- memory
- internal memory
- merging
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an optimized reading method for a search engine index file. The method comprises the steps of: merging index database segment files; creating a virtual memory file system to store the merged index database file; and reading the index database file from the virtual memory file system into memory to form a memory index directory, on which index operations are performed. By combining an index database file merging strategy with the joint use of a memory index directory and a file system index directory, the method reads the index in the virtual memory file system into memory to form the memory index directory, which improves the reading speed of the index file, makes effective use of disk space, and saves server resources.
Description
Technical field
The invention belongs to the field of information technology, and specifically relates to an optimized reading method for a search engine index file.
Background technology
With the rapid development of the Internet, people's demand for information has surged, and the channels through which people obtain information have multiplied.
Search engines, as the core facility for retrieving information on the network, play a huge role in daily life. There are currently many domestic search products, and the large scale of the domestic Internet and its sheer volume of information pose no small challenge to search technology; how to make search faster, more accurate, and more resource-efficient has become a problem that search engine vendors need to solve.
Building an index is one of the core technologies of a search engine; the purpose of building an index is to respond quickly to user queries. The most commonly used index data structure in search engines is the inverted index. Its principle is in fact quite simple: for convenience of processing, words and document identifiers are usually converted to numeric form.
An inverted index (also called an inverted file or postings file) is an indexing method used, under full-text search, to store the mapping from a word to its storage locations in a document or a set of documents; it is the most commonly used data structure in information retrieval systems. It is through such inverted index files that search engines achieve their most basic high-speed retrieval.
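The inverted-index principle described above can be sketched in a few lines (a toy illustration; the function name and document texts below are invented for this example, not taken from the patent):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["search engine index", "index file merge", "virtual memory file system"]
index = build_inverted_index(docs)
print(index["index"])  # documents containing "index" -> [0, 1]
print(index["file"])   # documents containing "file"  -> [1, 2]
```

Looking a term up in this mapping directly yields the postings list of documents, which is why keyword search over an inverted index is fast.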
When searching with an index, the documents to be indexed are first preprocessed and an index structure over them is built. There are three main indexing techniques: inverted indexes, suffix arrays, and signature files. The inverted index is widely used in most current information retrieval systems; it is very effective for keyword search and is also the technique used in Lucene. Suffix arrays are very fast for phrase queries, but such data structures are more complicated to build and maintain. Signature files were popular in the 1980s, but inverted indexes gradually surpassed them.
In search engine technology, indexing is a very complicated matter. Index files have the following characteristics: 1) the files are extremely large, generally at the TB level; 2) the files are read-only: an index is used only for queries, so only read operations occur; 3) index files are not replaced frequently; 4) the files need to be read quickly. An index file is therefore usually expected to have the following properties:
1) The index file is stored as contiguously as possible on disk.
The randomly read parts of the index file reside essentially entirely in memory, while the posting lists reside on disk and are stored sequentially, so reading the index file can be regarded as sequential reading.
Even when a large file is accessed sequentially, the file system may have scattered it across different disk blocks during allocation, so access that looks sequential is in fact random. Disk seek time then becomes unavoidable, and disk read optimizations such as caching and prefetching cannot take effect.
2) The cost of building the index file is controllable, and errors can be recovered from.
An oversized index file has low access efficiency and a high rebuilding cost, and redoing it after a data anomaly is expensive. Layering and segmenting the index is therefore very important.
3) The index file supports fast reading (both random reads and sequential reads).
The prior art usually reads the index via mmap, which maps a file or other object into memory. A file is mapped onto a number of pages; if the file size is not a whole multiple of the page size, the unused remainder of the last page is zero-filled. Memory protection is applied at page granularity, so even if the mapped file is only one byte, the kernel allocates a full page of memory for the mapping. When the mapped file is smaller than one page, the process can access the whole page starting at the address returned by mmap() without error; accessing the address space beyond that page, however, causes an error. The effective address space usable for the mapping therefore cannot exceed the file size rounded up to a page boundary, so reading a file this way may waste memory. The present invention is precisely an improvement proposed for reading data index files quickly while making effective use of disk space.
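The page-granularity behaviour described above can be illustrated as follows (a minimal sketch assuming a POSIX system; Python's `mmap` module exposes the system page size as `mmap.PAGESIZE`, and the arithmetic shows how much of the last page a small file leaves unused):

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # typically 4096 bytes

# Create a file far smaller than one page.
fd, path = tempfile.mkstemp()
os.write(fd, b"tiny index segment")  # 18 bytes
os.close(fd)

size = os.path.getsize(path)
# The kernel maps whole pages, so even an 18-byte file occupies one full
# page of address space; the rest of that page is zero-filled.
mapped_pages = (size + PAGE - 1) // PAGE
wasted_bytes = mapped_pages * PAGE - size
print(mapped_pages, wasted_bytes)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        assert mm[:4] == b"tiny"  # reads within the file succeed
os.unlink(path)
```

For TB-scale index files split into many small segments, this per-file rounding is the memory waste the description refers to.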
Summary of the invention
Building on the prior art, the object of the present invention is to provide a method for improving the reading speed of index files, so as to achieve fast reading of data index files, make effective use of disk space, and save server resources.
To achieve the above object, the technical scheme of the present invention is as follows.
An optimized reading method for a search engine index file, the steps of which are:
1) merging the index database segment files;
2) creating a virtual memory file system to store the merged index database file;
3) reading the index database file from the virtual memory file system into memory to form a memory index directory, and performing index operations on this memory index directory.
Further, a buffer is set up in memory for merging the index database segment files, and index creation and search speed are improved by adjusting the size of the buffer and the frequency with which the index file is written to disk.
Further, the size of said buffer is adjusted through three key parameters: the merge factor, the minimum document merge number, and the maximum document merge number.
Further, the index database files are merged once a day.
Further, a linux operating system is used with the tmpfs virtual memory file system. Said tmpfs uses either physical memory or a swap partition.
Further, a windows operating system is used, and the virtual memory file system is created by means of a memory-backed virtual disk.
Through the index database file merging strategy and the combined use of a memory index directory and a file system index directory, the present invention reads the index in the virtual memory file system into memory to form the memory index directory, improving the reading speed of the index file, making effective use of disk space, and saving server resources.
Description of drawings
Fig. 1 is a flow chart of the steps of the optimized reading method for a search engine index file of the present invention.
Fig. 2 is the workflow diagram of full-text indexing.
Embodiment
Embodiments of the present invention are further described below.
Fig. 1 is a flow chart of the steps of the optimized reading method for a search engine index file of the present invention, which specifically comprises the following.
1. Merging the index database
An inverted index obtains, from a keyword, all other information about that keyword, such as the files in which it appears and the number of occurrences and line numbers within each file; this is exactly the information a user needs when searching for that keyword. The inverted index is stored in disk files, and new index entries are added as time passes; this process is incremental indexing. Incremental indexing happens in the "index creation" stage. In general, "index creation" comprises "initial creation" and "incremental creation". The strategy adopted for index creation is to perform "initial creation" once and then perform unlimited "incremental creation"; the reasons are to reduce disk operations, shorten update time, and facilitate expanding the index database by orders of magnitude, though this increases maintenance difficulty. Fig. 2 is the workflow diagram of full-text indexing (inverted indexing), in which the IndexWriter component reads and writes the index file, the IndexSearcher component performs queries, the TopDocsCollector component represents the document collection returned to the user, and Query is the query grammar that performs query parsing. In brief: documents are added to the IndexWriter component to create the inverted index; the IndexSearcher component reads the index through function calls; and the TopDocsCollector component then returns the document collection to the user.
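The Fig. 2 flow can be mimicked in plain code. IndexWriter, IndexSearcher and TopDocsCollector are the real Lucene component names mentioned above; the classes below are simplified stand-ins written for illustration and are not Lucene's actual API:

```python
class IndexWriter:
    """Simplified stand-in for Lucene's IndexWriter: builds an inverted index."""
    def __init__(self):
        self.index = {}

    def add_document(self, doc_id, text):
        # Adding a document updates the inverted index term by term.
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(doc_id)


class IndexSearcher:
    """Simplified stand-in for Lucene's IndexSearcher: reads the index."""
    def __init__(self, index):
        self.index = index

    def search(self, query):
        # TopDocsCollector analogue: return the matching document ids.
        return sorted(self.index.get(query.lower(), set()))


writer = IndexWriter()
writer.add_document(0, "optimal reading of index files")
writer.add_document(1, "virtual memory file system")

searcher = IndexSearcher(writer.index)
print(searcher.search("index"))   # -> [0]
print(searcher.search("memory"))  # -> [1]
```

The division of labour mirrors the description: one component writes the index, another reads it, and the search result set is handed back to the user.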
When a large number of files need to be indexed, the bottleneck of the indexing process lies in writing the index file to disk. Over time the index in memory grows larger, and if it is not merged to the hard disk, memory may be exhausted, so a merging process is needed. To address this problem, Lucene holds a buffer in memory and provides merge-tuning parameters (described in the next paragraph) for adjusting the size of the buffer and the frequency with which the index file is written to disk, improving index creation and search speed while reducing memory overhead.
The index database changes infrequently, so frequent merging is unnecessary; the merging strategy of the present invention is to merge once a day. This merging cycle can be set according to the specific situation and actual needs: newly added data belongs to the latest index sub-database, while modified and deleted data may belong to any index sub-database. Increasing the merge factor (mergeFactor) and the minimum document merge number (minMergeDocs) helps improve performance and reduce indexing time. The merge factor is the parameter through which index merging adjusts the buffer: it determines how many documents a Lucene index block can hold and how often the index blocks on disk are merged into one large index block. For example, if the merge factor is 10, then once the number of documents in memory reaches 10, they must all be written to a new index block on disk. The minimum document merge number also affects indexing performance: it determines how many documents must accumulate in memory before they are written back to disk. Its default value is 10; with enough memory, setting it as large as possible significantly improves indexing performance. The maximum document merge number determines the maximum number of documents in an index block. Its default value is Integer.MAX_VALUE; setting it larger can improve indexing efficiency and retrieval speed, but since the default is already the maximum integer value, this parameter generally does not need to be changed. Through these three key parameters of the IndexWriter component provided by Lucene (merge factor, minimum document merge number, and maximum document merge number), the size of the buffer is adjusted, the index is merged, and the index in memory is written to disk once it reaches a certain quantity, making use of the machine's hardware resources to improve indexing efficiency.
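The buffering policy above can be simulated as follows. This is a sketch: mergeFactor is the Lucene parameter named in the text, while the function itself is an invented illustration of flushing a segment to "disk" every mergeFactor documents:

```python
def index_documents(docs, merge_factor=10):
    """Buffer documents in memory and flush a new segment every
    `merge_factor` documents, as described for Lucene's mergeFactor.
    Returns the list of flushed segments (lists stand in for on-disk blocks)."""
    segments, buffer = [], []
    for doc in docs:
        buffer.append(doc)
        if len(buffer) >= merge_factor:
            # Buffer is full: write a new index block to disk.
            segments.append(list(buffer))
            buffer.clear()
    if buffer:
        # Flush whatever remains at the end.
        segments.append(list(buffer))
    return segments


segments = index_documents([f"doc{i}" for i in range(25)], merge_factor=10)
print([len(s) for s in segments])  # -> [10, 10, 5]
```

A larger merge factor means fewer, larger flushes (faster indexing, more memory used), which is the trade-off the parameter description above is making.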
2. Creating the virtual memory file system
The virtual memory file system stores the merged index database file, forming the file system index directory. On a linux operating system, the tmpfs virtual memory file system can be used; tmpfs can use physical memory or a swap partition. Windows has similar facilities that create a virtual disk backed by memory.
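One possible realization of this step, sketched under the assumption that a tmpfs mount such as /dev/shm is available (on linux a dedicated mount could also be created, e.g. with `mount -t tmpfs`); the helper name below is invented for illustration:

```python
import os
import shutil
import tempfile

# /dev/shm is a tmpfs mount on most linux systems; fall back to an ordinary
# temp directory elsewhere (the code still runs, only the RAM backing is lost).
TMPFS = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()

def publish_index(merged_index_path):
    """Copy a merged index database file into the virtual memory file system."""
    dest = os.path.join(TMPFS, "vmfs_" + os.path.basename(merged_index_path))
    shutil.copy2(merged_index_path, dest)
    return dest

# Demo: merging produced one index file; publish it to the tmpfs-backed path.
with tempfile.NamedTemporaryFile(delete=False, suffix=".idx") as src:
    src.write(b"term -> [doc ids]")
dest = publish_index(src.name)
with open(dest, "rb") as f:
    data = f.read()  # reads are now served from memory-backed storage
print(data)
os.unlink(src.name)
os.unlink(dest)
```

Because tmpfs pages live in RAM (or swap), subsequent reads of the published index file avoid disk seeks entirely.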
3. Reading the index database file from the virtual memory file system into memory to form the memory index directory, and performing index operations on this memory index directory.
Operations on the memory index directory are very fast, so when operating on the index, the present invention loads the index library file from the virtual memory file system into memory to form the memory index directory, and writes it back to the virtual memory file system after the operation is completed.
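A minimal sketch of this load, operate, write-back cycle (JSON stands in for the real index file format, and all names here are illustrative):

```python
import json
import os
import tempfile

def load_memory_index(vmfs_path):
    """Read the index file from the virtual memory file system into RAM."""
    with open(vmfs_path) as f:
        return json.load(f)

def write_back(index, vmfs_path):
    """Persist the in-memory index directory back to the VMFS after updates."""
    with open(vmfs_path, "w") as f:
        json.dump(index, f)

# Demo with a toy index file.
path = os.path.join(tempfile.gettempdir(), "index.json")
write_back({"engine": [0], "index": [0, 1]}, path)

mem_index = load_memory_index(path)   # form the memory index directory
mem_index["file"] = [1]               # index operation performed in memory
write_back(mem_index, path)           # write back once the operation completes

print(sorted(load_memory_index(path)))  # -> ['engine', 'file', 'index']
os.unlink(path)
```

All lookups and updates happen against the in-memory structure; the VMFS copy is touched only at load and write-back time, which matches the cycle described above.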
With the method of the present invention, reading the index in the virtual memory file system into memory to form the memory index directory makes operations very fast, though the system's disk usage increases somewhat; this can be addressed by enlarging the disk. By means of the index database file merging strategy and the creation of a virtual memory file system to form the memory index directory, the method accelerates index reading; using the file merging strategy and the machine's hardware resources in this way can bring a 10%-20% improvement in index reading speed and search speed.
The above embodiments are intended only to illustrate, not to limit, the technical scheme of the present invention. Those of ordinary skill in the art may modify the technical scheme of the present invention or replace it with equivalents without departing from its spirit and scope; the protection scope of the present invention shall be defined by the claims.
Claims (7)
1. An optimized reading method for a search engine index file, the steps of which comprise:
1) merging the index database segment files;
2) creating a virtual memory file system to store the merged index database file;
3) reading the index database file from the virtual memory file system into memory to form a memory index directory, and performing index operations on this memory index directory.
2. the method for claim 1 is characterized in that: buffer zone is set is used for merging the index database clip file in internal memory, and the size by adjusting buffer zone and toward the frequency of writing index file on disk, improve index creation and search speed.
3. method as claimed in claim 2, is characterized in that: the size of adjusting described buffer zone by merging the factor, minimum document merging number and these three key parameters of maximum document merging number.
4. method as claimed in claim 2, is characterized in that: merge the frequency of index database file for once a day.
5. the method for claim 1, is characterized in that: adopt linux operating system, adopt tmpfs virtual memory file system.
6. method as claimed in claim 5 is characterized in that: described tmpfs uses physical memory, perhaps uses exchange partition.
7. the method for claim 1, is characterized in that: adopt windows operating system, by internal memory virtual disk mode, create the virtual memory file system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013103293419A CN103399915A (en) | 2013-07-31 | 2013-07-31 | Optimal reading method for index file of search engine |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103399915A true CN103399915A (en) | 2013-11-20 |
Family
ID=49563543
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2013103293419A Pending CN103399915A (en) | 2013-07-31 | 2013-07-31 | Optimal reading method for index file of search engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103399915A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103617133A (en) * | 2013-12-06 | 2014-03-05 | 北京奇虎科技有限公司 | Method and device for compressing virtual memory in Windows system |
CN105426124A (en) * | 2015-11-06 | 2016-03-23 | 江苏省电力公司扬州供电公司 | RFS-based fast F-IO read-write system and method |
CN107066527A (en) * | 2017-02-24 | 2017-08-18 | 湖南蚁坊软件股份有限公司 | A kind of method and system of the caching index based on out-pile internal memory |
CN107992569A (en) * | 2017-11-29 | 2018-05-04 | 北京小度信息科技有限公司 | Data access method, device, electronic equipment and computer-readable recording medium |
CN110109927A (en) * | 2019-04-25 | 2019-08-09 | 上海新炬网络技术有限公司 | Oracle database data processing method based on LSM tree |
CN112748866A (en) * | 2019-10-31 | 2021-05-04 | 北京沃东天骏信息技术有限公司 | Method and device for processing incremental index data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1794239A (en) * | 2005-12-30 | 2006-06-28 | 张天山 | Automatic generating system of template network station possessing searching function and its method |
CN102129459A (en) * | 2011-03-10 | 2011-07-20 | 成都四方信息技术有限公司 | Omnibearing enterprise data exchange high-speed engine |
CN102890682A (en) * | 2011-07-21 | 2013-01-23 | 腾讯科技(深圳)有限公司 | Method for creating index, searching method, device and system |
Non-Patent Citations (2)
Title |
---|
FX_SKY: "Lucene3.6总结篇" ["Lucene 3.6 Summary"], CSDN Blog *
HUANGXC: "nutch Lucene实现全文索引" ["Implementing full-text indexing with Nutch and Lucene"], CSDN: BLOG.CSDN.NET/HUANGXC/ARTICLE/DETAILS/2197962 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20131120 |