CN104166649B - Caching method and equipment for search engine - Google Patents


Info

Publication number
CN104166649B
Authority
CN
China
Prior art keywords
list
query
score
scoring
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310182204.7A
Other languages
Chinese (zh)
Other versions
CN104166649A (en)
Inventor
宋华青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba (Shanghai) Co., Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201310182204.7A priority Critical patent/CN104166649B/en
Publication of CN104166649A publication Critical patent/CN104166649A/en
Application granted granted Critical
Publication of CN104166649B publication Critical patent/CN104166649B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a caching method and equipment for a search engine. The method comprises the following steps: receiving a query and the elements obtained according to the query; generating a key value of the query based on the query; finding the list corresponding to the key value in a cache; scoring and storing the elements; and updating the cache. By caching the score and status information of each element, the technical scheme reduces the number of times the MLR is called, which lowers the engine's computational load and improves its timeliness. The application also improves the cache hit rate by refining the composition of each query and normalizing the queries.

Description

Caching method and equipment for search engine
Technical Field
The present application relates to the field of search engines, and in particular, to a caching method and apparatus for a search engine.
Background
A search engine is a system that collects information from the Internet with specific computer programs according to certain policies, organizes and processes that information, provides a retrieval service to users, and displays the information relevant to a user's search.
As a search engine grows, its data volume increases and its business requirements become more complex. Accordingly, the engine's computing module, the machine learning ranking (MLR) module, also becomes more complex. The computing module typically involves multiple algorithm models, requires a large amount of computation, and consumes substantial server CPU resources, so its performance problems become increasingly prominent.
The data referred to herein is also called a retrieved element (document); in a real engine, an element may be a basic unit such as a web page or a product.
Therefore, to improve query efficiency and reduce computation, an existing search engine can cache the result of one user access; when the same access arrives again, the corresponding result is read directly from the cache and returned to the user, which reduces the load on the engine and improves response time.
However, the cache hit rate of this approach is relatively low: once the user slightly changes the composition of a query, the cached result is no longer hit. In addition, this caching mode sacrifices a certain amount of timeliness, so a cache hit may return a result that differs from the real situation.
Disclosure of Invention
The present application mainly aims to provide a new technical solution for search engine caching that solves the above problems in the prior art, wherein:
according to a first aspect of the present application, there is provided a caching method for a search engine, comprising the steps of: receiving a query and elements obtained according to the query; generating a key value of the query based on the query; finding a list corresponding to the key value in a cache; scoring and storing the elements; and updating the cache.
According to a second aspect of the present application, there is provided a caching apparatus for a search engine, comprising: receiving means for receiving the query and the elements obtained from the query; generating means for generating a key value of a query based on the query; the searching device is used for finding the list corresponding to the key value in the cache; the scoring device is used for scoring and storing the elements; and an updating means for updating the cache.
Compared with the prior art, the technical scheme of the application reduces the calling times of the MLR by caching the score and the state information of each element, thereby reducing the calculated amount of the engine and improving the timeliness of the engine. The present application also improves cache hit rate by refining the composition of each query and normalizing the queries.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 schematically illustrates an overall flow diagram of a caching method for a search engine as proposed herein;
FIG. 2 schematically illustrates a data structure in a cache according to an embodiment of the present application;
FIG. 3 schematically shows a detailed flowchart of the scoring and saving steps for an element according to one embodiment of the present application;
FIG. 4 is a schematic diagram illustrating the technical effect of the method of the present application;
fig. 5 schematically shows a block diagram of a cache device for a search engine according to an embodiment of the present application.
In the drawings, the same reference numerals are used to designate the same or similar components.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings and specific embodiments.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
In the following description, references to "one embodiment," "an embodiment," "one example," "an example," and the like indicate that the embodiment or example so described may include a particular feature, structure, characteristic, property, element, or limitation, but not every embodiment or example necessarily includes it. Moreover, repeated use of the phrase "according to an embodiment of the present application", while it may refer to the same embodiment, does not necessarily do so.
Certain features that are well known to those skilled in the art have been omitted from the following description for the sake of simplicity.
Fig. 1 schematically shows an overall flow chart of a caching method 100 for a search engine proposed by the present application.
In step 101, a query and the elements obtained according to the query are received. The query referred to herein is a user query whose intent has been identified by some algorithm, so as to distinguish the needs of different users. An element (document) as used herein is an item that the search engine finds in response to the query and that is relevant to the query.
In step 102, a key value of a query is generated based on the query. Specifically, step 102 further comprises the steps of:
the query is refined. Parameters associated with the score are identified by loading a predetermined configuration including the name of the parameters associated with the score. For example, there are three queries:
Query1:keyword=mp3&addr=Beijing&tab=pop&stat=cat
Query2:tab=normal&keyword=MP3&addr=Hangzhou&stat=attr
Query3:tab=pop&keyword=mP3
wherein keyword denotes the query word, addr the filter address, tab whether the search is a normal search or a popularity search, and stat the statistics item.
The parameters in the preset configuration are keyword and tab, so in this example keyword and tab are treated as the parameters associated with scoring.
Refining these three queries then yields:
Query1:keyword=mp3&tab=pop
Query2:tab=normal&keyword=MP3
Query3:tab=pop&keyword=mP3
The refined parameters are then case-converted, sorted, and spliced together to form a new query; this is normalization. For the above three refined queries, the normalized new queries are:
Query1:keyword=mp3&tab=pop
Query2:keyword=mp3&tab=normal
Query3:keyword=mp3&tab=pop
Finally, the new query is signed to generate a cache key. Many signature algorithms are available; here, signing with the MD5 algorithm gives the following cache key for each query:
Query1:8c2a5244cd5e650b9cb259de4351a887
Query2:e9aee0a751b60863f67a80b3b9f323b8
Query3:8c2a5244cd5e650b9cb259de4351a887
Through the steps of refining the query composition and normalizing the query, the cache hit rate can be improved: Query1 and Query3 now map to the same cache key.
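The refinement, normalization, and signing steps above can be sketched as follows. This is a minimal illustration: the parameter whitelist and the use of MD5 follow the example in the text, while the function name and the parsing details are assumptions.

```python
import hashlib

# Parameters associated with scoring, per the example's preset configuration.
SCORE_PARAMS = {"keyword", "tab"}

def cache_key(query: str, score_params=SCORE_PARAMS) -> str:
    """Refine, normalize, and sign a query string like 'keyword=mp3&tab=pop'."""
    pairs = [p.split("=", 1) for p in query.split("&")]
    # Refine: keep only the parameters related to scoring.
    kept = [(k, v) for k, v in pairs if k in score_params]
    # Normalize: case-convert the values, sort the parameters, and re-splice.
    normalized = "&".join(f"{k}={v.lower()}" for k, v in sorted(kept))
    # Sign: the MD5 digest of the normalized query is the cache key.
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

q1 = "keyword=mp3&addr=Beijing&tab=pop&stat=cat"
q2 = "tab=normal&keyword=MP3&addr=Hangzhou&stat=attr"
q3 = "tab=pop&keyword=mP3"
assert cache_key(q1) == cache_key(q3)   # Query1 and Query3 share one key
assert cache_key(q1) != cache_key(q2)   # Query2 remains distinct
```

As in the text's example, Query1 and Query3 collapse onto a single key, so a result cached for one is hit by the other.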
In step 103, the list corresponding to the key is found in the cache. Fig. 2 schematically illustrates the data structure in a cache according to an embodiment of the present application. As Fig. 2 shows, the storage structure of queried data in the cache adopts a two-level index. In the cache 200, one user query corresponds to exactly one key 201, and the key 201 locates a list 202 of items, where the list 202 contains a plurality of items 203 and 204, and each item comprises an element identification number, an element score, and status information. The identification number is the unique label of an element, the element score is the value assigned to the element by the machine learning ranking (MLR) module, and the status information records the state at the time the element was scored. The status information comprises two pieces of information: first, the state of the element, which can be understood as the element's update time when it entered the cache; second, the information of the item, which can be understood as the time the item was put into the cache. There is an upper limit on the number of items each list can hold. According to an embodiment of the invention, each element carries a timestamp recording when the element was updated. If an element is updated at time t1 and not updated thereafter, and a query hits the element at time t2, then t1 is the element's update time when entering the cache and t2 is the time the corresponding item is put into the cache. The item's status information can therefore be stored as (t1, t2).
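As a sketch, the two-level structure of Fig. 2 might be modeled like this. The field names and the example timestamps are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class Item:
    """One item in a key's list (items 203/204 in Fig. 2)."""
    doc_id: str    # element identification number: the element's unique label
    score: float   # element score assigned by the MLR module
    t1: float      # the element's update time when it entered the cache
    t2: float      # the time this item was put into the cache

# Two-level index: one cache key (201) maps to one list of items (202).
cache = {
    "8c2a5244cd5e650b9cb259de4351a887": [
        Item("doc1", 900.0, t1=10.0, t2=12.0),
        Item("doc2", 1000.0, t1=11.0, t2=12.0),
    ]
}
```

Looking up a query is then two steps: the normalized query's MD5 key selects a list, and the element id selects an item within that list.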
In step 104, the elements are scored and saved. Fig. 3 schematically shows a detailed flowchart of the scoring and saving steps for elements according to an embodiment of the present application, and in conjunction with fig. 3, step 104 may further include the following steps 301 and 303.
In step 301, it is determined whether the element is in the item list. If the element is in the list, the method proceeds to step 302, otherwise to step 303.
In step 302, it is checked whether the score of the element is valid. If valid, the element is not re-scored, i.e. its original score is kept unchanged and the method of Fig. 3 ends directly; otherwise step 303 is entered. Suppose an access hits an element whose item status information is (t1, t2) at time t3; the following two checks are performed:
1) the element's latest update time is compared against the element state recorded in the item; if it is still t1, the element has not been updated between t1 and t3;
2) it is checked whether the interval between t2 in the item status information and t3 exceeds a preset threshold; if not, the item is considered valid.
If the element is not updated and the item is valid during the time period from t1 to t3, the score for the element is considered valid.
In step 303, a machine learning ranking Module (MLR) is invoked to re-score the element and save the score. The method of fig. 3 then ends.
It should be noted that every element obtained by the query is judged, scored, and saved according to the steps described here.
As the flow in Fig. 3 shows, when elements are scored, the CPU-intensive MLR module is called only for elements that are not currently in the item list or whose status information is invalid; elements that are in the item list with valid status information are not re-scored. This reduces the number of MLR invocations, lowers the CPU load, and improves response speed.
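Steps 301 to 303 can be sketched as follows, with items kept in a dict keyed by element id. The threshold value is a hypothetical number; the patent does not give a concrete one.

```python
VALIDITY_THRESHOLD = 300.0  # seconds; the actual preset threshold is an assumption

def score_valid(item: dict, element_update_time: float, t3: float) -> bool:
    # Check 1: the element was not updated between t1 and t3.
    not_updated = element_update_time == item["t1"]
    # Check 2: the item has been in the cache no longer than the threshold.
    fresh = (t3 - item["t2"]) <= VALIDITY_THRESHOLD
    return not_updated and fresh

def score_element(items: dict, doc_id: str, element_update_time: float,
                  t3: float, call_mlr) -> float:
    """Steps 301-303: call the MLR only for elements missing from the
    list or whose cached score is no longer valid."""
    item = items.get(doc_id)
    if item is not None and score_valid(item, element_update_time, t3):
        return item["score"]              # step 302: keep the original score
    score = call_mlr(doc_id)              # step 303: re-score via the MLR
    items[doc_id] = {"score": score, "t1": element_update_time, "t2": t3}
    return score

# A hypothetical MLR stub that records how often it is invoked.
calls = []
mlr = lambda doc_id: calls.append(doc_id) or 100.0

items = {"doc1": {"score": 900.0, "t1": 10.0, "t2": 12.0}}
s = score_element(items, "doc1", element_update_time=10.0, t3=20.0, call_mlr=mlr)
assert s == 900.0 and calls == []            # valid cached score: MLR not called
s = score_element(items, "doc2", element_update_time=15.0, t3=20.0, call_mlr=mlr)
assert s == 100.0 and calls == ["doc2"]      # new element: MLR called once
```

The stub makes the saving concrete: of the two elements, only the one absent from the list costs an MLR call.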
In step 105, the cache is updated. Step 105 further comprises:
the state information of the element is updated. For the re-scored items, the state information of the re-scored items needs to be updated and saved correspondingly.
Items with high element scores are retained in the list, up to the upper limit on the number of items the list can hold. Because the number of elements under an item list can be large, it is impossible to store them all in the cache; based on the latest scoring information and the upper limit, the higher-scoring elements are kept in the list and the lower-scoring elements are removed. For example, if the upper limit is 100, only the 100 highest-scoring elements are retained, while the others are moved out of the list.
Whether the list remains in the cache is decided by a least recently used (LRU) rule: after the items in the corresponding list have been updated, the least recently used lists are moved out of the cache according to the LRU rule.
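The two retention rules of step 105 can be sketched with an OrderedDict standing in for the LRU structure. Of the two limits below, only the per-list limit of 100 comes from the text's example; the cache capacity of 1000 lists is an assumption.

```python
from collections import OrderedDict

MAX_ITEMS = 100   # upper limit on items per list (from the text's example)
MAX_LISTS = 1000  # cache capacity in lists (an illustrative assumption)

def update_cache(cache: OrderedDict, key: str, items: dict) -> None:
    """Keep only the highest-scoring items, then apply LRU at list level."""
    # Retain the top MAX_ITEMS items by score; lower-scoring items drop out.
    top = dict(sorted(items.items(), key=lambda kv: kv[1]["score"],
                      reverse=True)[:MAX_ITEMS])
    cache[key] = top
    cache.move_to_end(key)             # this key's list is now most recently used
    while len(cache) > MAX_LISTS:
        cache.popitem(last=False)      # evict the least recently used list

cache = OrderedDict()
items = {f"doc{i}": {"score": float(i)} for i in range(150)}
update_cache(cache, "k1", items)
assert len(cache["k1"]) == 100                              # upper limit enforced
assert "doc149" in cache["k1"] and "doc0" not in cache["k1"]  # low scores dropped
```

An OrderedDict gives LRU order for free: touched lists move to the tail, and eviction pops from the head.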
Fig. 4 schematically illustrates the technical effect of the method of the present application; the effect is explained below with reference to Fig. 4.
In the figure, query indicates that the query of the user is "cell phone", doc1 and doc2 are two elements obtained according to the query, and the numbers (900, 1000 and 1100) following doc1 and doc2 are the current scores of the elements.
As shown in FIG. 4A, when the user queries "cell phone" for the first time, the computing module MLR is called to obtain the scores of doc1 and doc2, the scores are cached, and the computing result is returned to the user. This query invokes the MLR twice.
As shown in FIG. 4B, when the user queries "cell phone" for the second time, it is not necessary to call the computing module MLR to obtain the scores of doc1 and doc2, and the scores are directly read out from the cache and returned to the user. This query does not invoke the MLR.
As shown in fig. 4C, if the content of doc2 is updated by an external module after the second "cell phone" query and before the third, the caching method of the present application does not call the computing module MLR to recalculate the score of doc2, because no "cell phone" query has arrived yet. The change to doc2 is therefore not immediately reflected in the cache: until the next query arrives, the method keeps an element's score unchanged even if the element's content has been updated.
As shown in fig. 4D, when the user queries "cell phone" again after doc2 has been updated by the external module, the cache determines from doc2's status information that doc2 has changed, so the computing module MLR is called to recalculate and store its score. The status information of doc1 is unchanged and doc1 is still in the item list, so the MLR is not called for it; its cached score is returned to the user directly. For this query, the computing module MLR is thus called only once, saving one invocation.
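The scenario of Fig. 4 can be replayed with a toy cache that, in simplified form, stores each score together with the element's update time at scoring. The per-item freshness threshold of step 302 is omitted here for brevity; only the staleness check against the element's update time is modeled.

```python
# Simulate the four stages of Fig. 4 and count MLR invocations.
mlr_calls = 0
def mlr(doc_id):
    global mlr_calls
    mlr_calls += 1
    return {"doc1": 900.0, "doc2": 1000.0}.get(doc_id, 0.0)

cache = {}                                 # doc_id -> (score, update time at scoring)
doc_update = {"doc1": 1.0, "doc2": 1.0}    # each element's latest update time

def query(docs):
    results = {}
    for d in docs:
        cached = cache.get(d)
        if cached is not None and cached[1] == doc_update[d]:
            results[d] = cached[0]         # cache hit: no MLR call
        else:
            score = mlr(d)                 # miss or stale: call the MLR
            cache[d] = (score, doc_update[d])
            results[d] = score
    return results

query(["doc1", "doc2"]); assert mlr_calls == 2   # Fig. 4A: first query, MLR twice
query(["doc1", "doc2"]); assert mlr_calls == 2   # Fig. 4B: fully cached, no calls
doc_update["doc2"] = 2.0                          # Fig. 4C: external update of doc2
assert cache["doc2"][0] == 1000.0                 # cache unchanged until next query
query(["doc1", "doc2"]); assert mlr_calls == 3   # Fig. 4D: only doc2 re-scored
```

The counter reproduces the figure's tally: two MLR calls for the first query, none for the second, and exactly one for the query after doc2 changes.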
The application also provides a cache device for the search engine. Fig. 5 schematically shows a block diagram of a cache device 500 for a search engine according to an embodiment of the present application. According to an embodiment of the present application, the apparatus 500 may comprise: a receiving device 501 for receiving the query and the elements obtained from the query, and a generating device 502 for generating a key value of the query based on the query; the searching means 503 is configured to find a list corresponding to the key value in the cache; a scoring device 504, configured to score and store the elements; and an updating device 505 for updating the cache.
According to an embodiment of the present application, the generating means 502 may further include: a filtering device for selecting the parameters related to scoring from the query by loading a preset configuration; a splicing device for splicing all the parameters related to scoring into a new query; and a signature device for signing the new query to generate its key value.
According to another embodiment of the present application, the scoring device 504 may further include: element position checking means for checking whether the element is in the list; and first re-scoring means for re-scoring and saving said elements not in said list; score validity checking means for checking whether the score of the element is valid; and second re-scoring means for re-scoring and saving the elements for which the score is invalid.
According to one embodiment of the present application, a list contains a plurality of items, including an element identification number, an element score, and status information.
According to one embodiment of the present application, there is an upper limit on the number of items that a list can hold.
According to still another embodiment of the present application, the updating apparatus 505 may further include: state updating means for updating state information of the element; element retaining means for retaining items in the list having high element scores according to an upper limit; list retaining means for deciding whether the list is retained in the cache according to a least recently used rule.
According to one embodiment of the application, the scores of the elements obtained by the query are kept unchanged before the next query comes.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (10)

1. A caching method for a search engine, comprising the steps of:
receiving user queries and elements obtained according to the user queries, wherein one element is a piece of search data which is found by a search engine according to the user queries and is relevant to the user queries;
generating a key value of the user query based on the user query;
finding a list corresponding to the key value in a cache, wherein the list comprises a plurality of items, each item comprises an element identification number, an element score and state information, the state information comprises the updating time of the element entering the cache and the time of the item being placed in the cache, and the state information is used for determining whether the element score is valid or not;
scoring and storing the elements; and
updating the cache;
wherein the step of scoring and saving the elements further comprises:
checking whether the element is in the list;
if the element is not in the list, adding the element not in the list to the list, re-scoring the element not in the list and saving the score;
if the element is in the list, checking whether the score of the element is valid;
if the scores of the elements are invalid, re-scoring the elements with invalid scores and storing the scores;
if the score of the element is valid, keeping the original score of the element unchanged.
2. The caching method for a search engine of claim 1, wherein the step of generating a key value for a user query further comprises:
selecting, by loading a preset configuration, the parameters related to scoring from the query;
composing all parameters related to scoring into a new query; and
the new query is signed to generate its key.
3. The caching method for a search engine according to claim 1, wherein: there is an upper limit to the number of items that the list can hold.
4. The caching method for a search engine according to claim 3, wherein the updating the cache step further comprises:
updating state information of the element;
reserving items with high element scores in the list according to the upper limit; and
and determining whether the list is kept in the cache according to the least recently used rule.
5. The caching method for a search engine according to claim 1, wherein: and before the next query comes, keeping the scores of the elements obtained by the query unchanged.
6. A caching apparatus for a search engine, comprising:
receiving means for receiving a user query and elements obtained from the user query, wherein one of the elements is a piece of search data related to the user query, which is found by a search engine according to the user query;
generating means for generating a key value of a user query based on the user query;
the searching device is used for finding a list corresponding to the key value in the cache, wherein the list comprises a plurality of items, each item comprises an element identification number, an element score and state information, the state information comprises the updating time of the element entering the cache and the time of the item being put into the cache, and the state information is used for determining whether the element score is valid or not;
the scoring device is used for scoring and storing the elements; and
updating means for updating the cache;
wherein the scoring device further comprises:
element position checking means for checking whether the element is in the list;
first re-scoring means for adding said element not in said list to said list if said element is not in said list, re-scoring said element not in said list and saving a score;
score validity checking means for checking whether the score of the element is valid if the element is in the list; and
second re-scoring means for re-scoring the element whose score is not valid and saving the score if the score of the element is not valid; if the score of the element is valid, keeping the original score of the element unchanged.
7. The caching apparatus for a search engine according to claim 6, wherein the generating means further comprises:
the filtering device is used for selecting the parameters related to scoring from the query by loading a preset configuration;
the splicing device is used for splicing all parameters related to scoring into a new query; and
and the signature device is used for signing the new query to generate a key value of the new query.
8. The caching apparatus for a search engine according to claim 6, wherein: there is an upper limit to the number of items that the list can hold.
9. The caching apparatus for a search engine according to claim 8, wherein the updating means further comprises:
state updating means for updating state information of the element;
element retaining means for retaining items with high scores of elements in the list according to the upper limit; and
list retaining means for deciding whether said list is retained in the cache according to least recently used rules.
10. The caching apparatus for a search engine according to claim 6, wherein: and before the next query comes, keeping the scores of the elements obtained by the query unchanged.
CN201310182204.7A 2013-05-16 2013-05-16 Caching method and equipment for search engine Active CN104166649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310182204.7A CN104166649B (en) 2013-05-16 2013-05-16 Caching method and equipment for search engine


Publications (2)

Publication Number Publication Date
CN104166649A CN104166649A (en) 2014-11-26
CN104166649B true CN104166649B (en) 2020-03-20

Family

ID=51910468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310182204.7A Active CN104166649B (en) 2013-05-16 2013-05-16 Caching method and equipment for search engine

Country Status (1)

Country Link
CN (1) CN104166649B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145495B (en) * 2016-03-01 2020-12-29 创新先进技术有限公司 Method and device for dynamically adjusting parameter rules
US10901897B2 (en) * 2018-01-16 2021-01-26 Marvell Israel (M.I.S.L.) Ltd. Method and apparatus for search engine cache
CN112507199B (en) * 2020-12-22 2022-02-25 北京百度网讯科技有限公司 Method and apparatus for optimizing a search system
CN115905323B (en) * 2023-01-09 2023-08-18 北京创新乐知网络技术有限公司 Searching method, device, equipment and medium suitable for various searching strategies
CN116910100B (en) * 2023-09-08 2023-11-28 湖南立人科技有限公司 Cache data processing method for low-code platform

Citations (2)

Publication number Priority date Publication date Assignee Title
CN102479207A (en) * 2010-11-29 2012-05-30 阿里巴巴集团控股有限公司 Information search method, system and device
CN102930054A (en) * 2012-11-19 2013-02-13 北京奇虎科技有限公司 Data search method and data search system

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8612423B2 (en) * 2010-10-29 2013-12-17 Microsoft Corporation Search cache for document search


Also Published As

Publication number Publication date
CN104166649A (en) 2014-11-26


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211116

Address after: Room 3921, floor 3, No. 2879, Longteng Avenue, Xuhui District, Shanghai

Patentee after: Alibaba (Shanghai) Co., Ltd

Address before: P.O. Box 847, 4th floor, Grand Cayman capital building, British Cayman Islands

Patentee before: Alibaba Group Holdings Limited