CN103365910B - Method and system for information retrieval - Google Patents

Method and system for information retrieval

Publication number
CN103365910B
Authority
CN
China
Prior art keywords
query
mapping
frequency
word list
extended
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210099720.9A
Other languages
Chinese (zh)
Other versions
CN103365910A (en
Inventor
姚伶伶
赫南
王迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201210099720.9A
Publication of CN103365910A
Application granted
Publication of CN103365910B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and system for information retrieval. The method comprises the steps of (1) performing a secondary mapping process on a basic QA word list on the basis of query expansion to generate a QA word list of secondary mapping, wherein the basic QA word list comprises mappings from high-frequency queries to keywords, the first-level mapping in the QA word list of secondary mapping is a mapping from extended queries to high-frequency queries, and the second-level mapping is a mapping from high-frequency queries to keywords; and (2) searching the QA word list of secondary mapping according to the query obtained from an information retrieval request to obtain the keyword hit by the query, extracting the posted internet information corresponding to the keyword, and using it as the retrieval result. The method and system can improve the coverage of the posted internet information by information retrieval results.

Description

Information retrieval method and system
Technical Field
The invention relates to the technical field of internet, in particular to a method and a system for information retrieval.
Background
In existing information retrieval and distribution systems, retrieval follows the conventional method of web search, that is, an AND operation over the core morphemes of the retrieval string (query). For example, if a retrieval string contains the core morphemes A, B and C, the search is performed as the AND of A, B and C: only internet published information that matches all three core morphemes at the same time is returned as the search result.
The above retrieval method may result in a large number of matching failures. The current practice is therefore to expand the matching end. Offline, high-frequency queries (i.e. queries whose occurrence frequency is higher than a certain threshold) are screened out of the user retrieval log (query log) for a certain time window; the web search results of these queries are obtained and their features are analyzed by a semantic analysis service module; at the same time, an initial keyword candidate list, containing the keywords used to match the query, is generated for each screened query by combining the query expansion results and keyword expansion. The query-keyword mapping subsystem then calculates various features that measure the relevance of each query-keyword pair, including various text similarities, semantic similarities and the like. Finally, the relevance of each query-keyword pair is predicted from these features, and the candidate keywords are screened and ranked by relevance score to obtain the final keyword mapping table of the query, namely the QA (Query Analysis) word list. The QA word list is a hash table from query to keyword: the left key is a high-frequency query counted from the query log in a certain time window, and the right key is the keyword or keyword series, in the database of internet release information, that is textually similar to the query and to which the high-frequency query is mapped. That is, the QA word list maintains the mapping relation between high-frequency queries and keywords.
When the query analysis is carried out at the retrieval end and the Internet release information is matched, the keyword corresponding to the query is searched from the QA word list, and then the corresponding Internet release information is found in the keyword-Internet release information index to serve as a retrieval result.
However, in the existing retrieval method and system, the query can match corresponding keywords only if the query accurately hits the QA vocabulary, and the correlation between the queries is not fully utilized, so that the coverage rate of the retrieval result on internet published information is low.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide an information retrieval method and system, so as to fully utilize the correlation between queries and improve the coverage rate of the information retrieval result on the internet published information.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the invention provides an information retrieval method, which comprises the following steps:
performing a secondary mapping process on the basic query analysis (QA) word list based on the expansion of the retrieval string (query) to generate a QA word list of secondary mapping; the basic QA word list comprises mapping from high-frequency query to key words, the first-level mapping in the QA word list of the second-level mapping is mapping from extended query to high-frequency query, and the second-level mapping is mapping from high-frequency query to key words;
and searching the QA word list of the secondary mapping according to the retrieval string in the acquired information retrieval request to obtain the keyword hit by the retrieval string, and extracting the Internet release information corresponding to the keyword as a retrieval result.
Preferably, the extension of the query is specifically as follows:
and according to the retrieval log, obtaining a plurality of query related series by adopting query expansion based on session, and/or query expansion based on internet published information mutual clicking, and/or query expansion based on related searching.
Preferably, the query-based extension performs a secondary mapping process on the basic QA vocabulary to generate a secondary mapped QA vocabulary, specifically:
for each query correlation series obtained by query expansion, when judging that the query correlation series has a high-frequency query which is the same as that in the basic QA word list, adding other queries except the high-frequency query in the query correlation series as an expanded query of the high-frequency query, and generating an initial first-level mapping from the expanded query to the high-frequency query;
calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query, filtering the extended queries with the similarity smaller than a preset threshold, and reserving the extended queries with the similarity larger than or equal to the preset threshold to obtain the final first-level mapping;
and generating a QA word list of the secondary mapping according to the final primary mapping and the basic QA word list.
Preferably, the method further comprises: calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query according to a correlation logistic regression model, specifically:
receiving a standard set of manual labeling, wherein the standard set comprises mapping from an expanded query to a high-frequency query of the manual labeling;
calculating a characteristic value of mapping from each extended query to a high-frequency query in the standard set, and randomly dividing the standard set into a training set and a check set;
performing correlation logistic regression model training by using the training set to obtain an initial logistic regression model for evaluating the correlation between the extended query and the high-frequency query, evaluating the initial logistic regression model by using the check set, and optimizing feature selection according to an evaluation result to obtain a final correlation logistic regression model;
according to the final correlation logistic regression model, calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query according to the following formula:
wherein $q_1$ denotes the extended query, $q_2$ denotes the high-frequency query, $n$ denotes the total number of features, $f_i(q_1, q_2)$ denotes the i-th feature value of the mapping from the extended query to the high-frequency query, and $w_i$ denotes the weight of the i-th feature.
Preferably, the feature values include a text similarity feature value and a category similarity feature value between the extended query and the corresponding high-frequency query, and the text similarity feature value includes at least one of:
the Tanimoto coefficient between the extended query and the corresponding high-frequency query, the literal similarity, the common phrase (term) rate, the edit distance and the longest common substring.
Preferably, the searching of the QA vocabulary of the secondary mapping is performed according to the search string in the acquired information search request to obtain the keyword hit by the search string, specifically:
and searching a first-level mapping in the QA word list of the second-level mapping according to the retrieval string in the information retrieval request, acquiring a high-frequency query corresponding to the extended query matched with the retrieval string, and extracting the keyword corresponding to the high-frequency query in the second-level mapping as a hit keyword.
Preferably, the method further comprises:
according to the retrieval string in the acquired information retrieval request, firstly searching a basic QA word list, if the high-frequency query in the basic QA word list is matched, extracting the key words corresponding to the high-frequency query in the basic QA word list as hit key words, and not searching the QA word list of the secondary mapping;
and if the QA word list is not matched with the high-frequency query in the basic QA word list, searching the QA word list of the secondary mapping.
The invention also provides an information retrieval system, which comprises:
the second-level mapping word list generating module is used for performing a secondary mapping process on the basic query analysis (QA) word list based on the expansion of the retrieval string (query) to generate a QA word list of secondary mapping; the basic QA word list comprises mapping from high-frequency query to key words, the first-level mapping in the QA word list of the second-level mapping is mapping from extended query to high-frequency query, and the second-level mapping is mapping from high-frequency query to key words;
and the information retrieval module is used for searching the QA word list of the secondary mapping according to the retrieval string in the acquired information retrieval request to obtain the keyword hit by the retrieval string and extracting the Internet release information corresponding to the keyword as a retrieval result.
Preferably, the secondary mapping vocabulary generating module is further configured to obtain a plurality of query related series by using a query expansion based on session, and/or a query expansion based on internet published information click-through, and/or a query expansion based on related search according to the retrieval log.
Preferably, the second level mapping vocabulary generation module is further configured to,
for each query correlation series obtained by query expansion, when judging that the query correlation series has a high-frequency query which is the same as that in the basic QA word list, adding other queries except the high-frequency query in the query correlation series as an expanded query of the high-frequency query, and generating an initial first-level mapping from the expanded query to the high-frequency query;
calculating the similarity between each extended query in the initial first-level mapping and the high-frequency query, filtering the extended queries with the similarity smaller than a preset threshold, and reserving the extended queries with the similarity larger than or equal to the preset threshold to obtain the final first-level mapping;
and generating a QA word list of the secondary mapping according to the final primary mapping and the basic QA word list.
Preferably, the secondary mapping vocabulary generating module is further configured to calculate, according to a correlation logistic regression model, a similarity between each extended query and the high-frequency query in the initial first-level mapping, specifically:
receiving a standard set of manual labeling, wherein the standard set comprises mapping from an expanded query to a high-frequency query of the manual labeling;
calculating a characteristic value of mapping from each extended query to a high-frequency query in the standard set, and randomly dividing the standard set into a training set and a check set;
performing correlation logistic regression model training by using the training set to obtain an initial logistic regression model for evaluating the correlation between the extended query and the high-frequency query, evaluating the initial logistic regression model by using the check set, and optimizing feature selection according to an evaluation result to obtain a final correlation logistic regression model;
according to the final correlation logistic regression model, calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query according to the following formula:
wherein $q_1$ denotes the extended query, $q_2$ denotes the high-frequency query, $n$ denotes the total number of features, $f_i(q_1, q_2)$ denotes the i-th feature value of the extended query and high-frequency query pair, and $w_i$ denotes the weight of the i-th feature.
Preferably, the feature values include a text similarity feature value and a category similarity feature value between the extended query and the corresponding high-frequency query, and the text similarity feature value includes at least one of:
the Tanimoto coefficient between the extended query and the corresponding high-frequency query, the literal similarity, the common phrase (term) rate, the edit distance and the longest common substring.
Preferably, the information retrieval module is further configured to search a first-level mapping in the QA vocabulary of the second-level mapping according to the retrieval string in the information retrieval request, obtain a high-frequency query corresponding to the extended query matched with the retrieval string, and extract a keyword corresponding to the high-frequency query in the second-level mapping as a hit keyword.
Preferably, the information retrieval module is further configured to,
according to the retrieval string in the acquired information retrieval request, firstly searching a basic QA word list, if the high-frequency query in the basic QA word list is matched, extracting the key words corresponding to the high-frequency query in the basic QA word list as hit key words, and not searching the QA word list of the secondary mapping;
and if the QA word list is not matched with the high-frequency query in the basic QA word list, searching the QA word list of the secondary mapping.
The method and the system for information retrieval enrich the left key entries of the QA word list, can more fully utilize the basic QA word list, improve the coverage rate of information retrieval on internet release information, improve the accuracy rate of information retrieval and improve the retrieval performance.
Drawings
FIG. 1 is a flow chart of a method for information retrieval according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the structure of the QA vocabulary of the secondary mapping in the embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process of calculating similarity between each extended query and a corresponding high-frequency query in the initial first-level mapping according to a correlation logistic regression model in the embodiment of the present invention;
FIG. 4 is a diagram illustrating a specific implementation of step 101 shown in FIG. 1;
FIG. 5 is a diagram illustrating a specific implementation of step 102 shown in FIG. 1;
FIG. 6 is a schematic structural diagram of an information retrieval system according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further elaborated below with reference to the drawings and the specific embodiments.
An information retrieval method provided by the embodiment of the invention is shown in fig. 1, and mainly comprises the following steps:
Step 101, performing a secondary mapping process on a basic QA word list based on the query expansion to generate a QA word list of secondary mapping; the basic QA word list comprises mapping from high-frequency query to key words, the first-level mapping in the QA word list of the second-level mapping is mapping from extended query to high-frequency query, and the second-level mapping is mapping from high-frequency query to key words.
The basic QA vocabulary refers to a hash vocabulary from query to keyword, the left key of the vocabulary is a high-frequency query counted by a query log in a certain time window, the right key is a keyword or a keyword series with similar semantics with the query text in a database of internet release information mapped by the high-frequency query, namely the basic QA vocabulary maintains the mapping relation between the high-frequency query and the keyword (or the keyword series). The basic QA vocabulary may be obtained by offline processing through a special QBM (search string keyword merge) module.
The internet distribution information is information distributed by an internet information distributor through an information search distribution system, and includes: geographic information, biographical information, merchant information, and the like. These internet published information is stored by means of a special database.
After the query is expanded, a plurality of query related series are obtained, and the query expansion can adopt at least one of the following modes:
firstly, according to a query log in a certain time window, adopting query expansion based on session;
secondly, according to the query log in a certain time window, query expansion based on internet release information mutual clicking is adopted;
and thirdly, according to the query log in a certain time window, query expansion based on related search is adopted.
The query expansion based on session mainly comprises the following operations: firstly, the retrieval log is normalized and noise vocabulary is filtered out; then, the queries searched by the same user within a continuous period of time are merged into a query series, and the frequency with which each query appears in one day's log and the frequency with which each pair of queries appears in the same query series in one day's log are counted; next, the daily query series and frequency statistics are combined over a large time interval (for example, 1 month), the likelihood ratio feature value LLR between queries is calculated using a likelihood ratio formula, and the query expansion result is filtered by this feature value (for example, query correlation series whose LLR is smaller than a preset threshold are filtered out); finally, the query expansion results of multiple days are merged and sorted by likelihood ratio feature value to obtain the query correlation series. The likelihood ratio formula is as follows:
$$\mathrm{LLR} = \log b(c_{12};\, c_1,\, p) + \log b(c_2 - c_{12};\, N - c_1,\, p) - \log b(c_{12};\, c_1,\, p_1) - \log b(c_2 - c_{12};\, N - c_1,\, p_2)$$
wherein
$c_1$ is the total frequency of occurrence of query1 in the large time interval, $c_2$ is the total frequency of occurrence of query2 in the large time interval, $c_{12}$ is the total frequency with which query1 and query2 occur simultaneously in one query correlation series, and $N$ is the total frequency of all queries in the large time interval.
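The likelihood ratio can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the text does not define b, p, p1 or p2, so the sketch assumes the usual likelihood-ratio setup for co-occurrence, with b(k; n, x) = x^k (1 - x)^(n - k), p = c2 / N, p1 = c12 / c1 and p2 = (c2 - c12) / (N - c1). The function names `log_b` and `llr` are hypothetical.

```python
import math


def log_b(k: float, n: float, x: float) -> float:
    """Log of the binomial likelihood x**k * (1 - x)**(n - k).

    Assumption: the binomial coefficient is left out because it cancels
    between the matching terms of the ratio.
    """
    eps = 1e-12
    x = min(max(x, eps), 1.0 - eps)  # keep log() finite at x = 0 or x = 1
    return k * math.log(x) + (n - k) * math.log(1.0 - x)


def llr(c1: int, c2: int, c12: int, n_total: int) -> float:
    """Likelihood ratio value for query1/query2, term by term as in the formula.

    Requires c1 > 0 and n_total > c1.
    """
    p = c2 / n_total                 # assumed: query2 rate under independence
    p1 = c12 / c1                    # assumed: query2 rate given query1
    p2 = (c2 - c12) / (n_total - c1)  # assumed: query2 rate without query1
    return (log_b(c12, c1, p) + log_b(c2 - c12, n_total - c1, p)
            - log_b(c12, c1, p1) - log_b(c2 - c12, n_total - c1, p2))
```

Under these assumptions the value is 0 when the two queries co-occur exactly as often as independence predicts and becomes more negative as the association grows; the quantity -2·LLR is the form commonly used as an association score, so the sign convention applied when thresholding should be checked against the intended filtering rule.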
The query expansion based on internet published information mutual clicking mainly comprises the following operations: because different queries which trigger the display of the same internet published information in the information retrieval and publishing system may be connected, if the internet published information is clicked together, the different queries may have the same intention; therefore, based on the click log of the internet published information, different queries triggering the display of the same internet published information can be aggregated together to form a query related series. For example: and if the same internet release information exists in the internet release information displayed in the search of the queryA and the search of the queryB and the same internet release information is clicked by the user, the queryA and the queryB are considered to be related, so that the queryA and the queryB are aggregated into a query related series.
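The co-click aggregation described above can be sketched as a simple grouping step. The log format (pairs of query and clicked item identifier) and the name `coclick_series` are assumptions made for illustration; the source does not specify the click log layout.

```python
from collections import defaultdict


def coclick_series(click_log):
    """Group queries that led to a click on the same published item.

    click_log: iterable of (query, info_id) pairs, one per click
    (hypothetical format). Returns a list of query correlation series.
    """
    queries_by_item = defaultdict(set)
    for query, info_id in click_log:
        queries_by_item[info_id].add(query)
    # An item clicked under a single query relates that query to nothing.
    return [sorted(qs) for qs in queries_by_item.values() if len(qs) > 1]
```

For example, a log in which queryA and queryB both produced a click on the same item yields one series containing both queries, while a query whose clicked items are shared with no other query produces no series.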
The query expansion based on the related search mainly comprises the following operations: when the search engine responds to the query request of the user, the search engine can "guess" the possible retrieval intention of the user and automatically perform some expansion aiming at the retrieval query; for example: the user searches for "Liu De Hua", and the search engine returns a natural result and simultaneously presents the related retrieval queries to the user, such as "Liu De Hua movie", "Liu De Hua concert", "Liu De Hua microblog", and the like; the user searches for "rose", and the search engine returns a natural result and simultaneously presents the related retrieval queries to the user, such as "fresh flowers", "white roses", "blue roses", "yellow roses" and the like. By utilizing this intelligent prompt of the search engine, the high-frequency query can be expanded to obtain the corresponding query correlation series.
Performing a secondary mapping process on the basic QA word list based on the query expansion to generate a QA word list of secondary mapping, which specifically comprises the following steps:
for each query correlation series obtained by query expansion, when judging that the query correlation series has a high-frequency query which is the same as that in a basic QA word list, adding other queries except the high-frequency query in the query correlation series as an expanded query of the high-frequency query, and generating an initial first-level mapping from the expanded query to the high-frequency query;
calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query, filtering the extended queries with the similarity smaller than a preset threshold, and reserving the extended queries with the similarity larger than or equal to the preset threshold to obtain the final first-level mapping;
and generating a QA word list of the secondary mapping according to the final first-level mapping and the basic QA word list.
Referring to fig. 2, in the QA vocabulary of the secondary mapping, the first-level mapping is a mapping from an extended query to a high-frequency query: its left key is the extended query and its right key is the high-frequency query. The second-level mapping is the mapping from the high-frequency query to the keyword (or the keyword series): its left key is the high-frequency query and its right key is the keyword (or the keyword series). The basic QA vocabulary is used as the second-level mapping. The QA vocabulary of the secondary mapping needs to ensure that no left key of the first-level mapping appears among the left keys of the second-level mapping, and that every right key of the first-level mapping appears among the left keys of the second-level mapping.
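The generation steps and the two key constraints can be sketched as below. This is an illustrative sketch only: the function name, the dictionary representation, and the tie-breaking rule (when an extended query is related to several high-frequency queries, keep the one with the highest similarity) are assumptions not stated in the source. The `similarity` argument stands for whatever similarity measure is used.

```python
def build_first_level(base_qa, related_series, similarity, threshold):
    """Build the first-level mapping (extended query -> high-frequency query).

    base_qa: dict mapping high-frequency query -> list of keywords
             (the basic QA vocabulary, reused as the second-level mapping).
    related_series: list of query correlation series (lists of queries).
    similarity: callable (extended_query, high_freq_query) -> float.
    """
    best = {}  # extended query -> (high-frequency query, score)
    for series in related_series:
        for hf in (q for q in series if q in base_qa):
            for q in series:
                if q in base_qa:
                    continue  # a first-level left key must not be a second-level left key
                score = similarity(q, hf)
                if score < threshold:
                    continue  # filtered out: similarity below the preset threshold
                if q not in best or score > best[q][1]:
                    best[q] = (hf, score)  # assumed tie-break: keep the best match
    return {q: hf for q, (hf, _) in best.items()}
```

Because every right key is taken from `base_qa`, the second constraint (each first-level right key is a second-level left key) holds by construction.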
It should be noted that, in the embodiment of the present invention, the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query may be calculated according to a correlation logistic regression model, and of course, the method for calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query in the embodiment of the present invention is not limited thereto, and any method capable of calculating the above similarity in practical application should fall within the scope of the embodiment of the present invention.
The specific operation process of calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query according to the correlation logistic regression model, as shown in fig. 3, specifically includes:
receiving a standard set of manual labeling, wherein the standard set comprises mapping from an expanded query to a high-frequency query of the manual labeling;
calculating a characteristic value of mapping from each extended query to a high-frequency query in the standard set, and randomly dividing the standard set into a training set and a check set;
performing correlation logistic regression model training by using the training set to obtain an initial logistic regression model for evaluating the correlation between the extended query and the high-frequency query, then evaluating the initial logistic regression model by using the check set, and optimizing feature selection (such as adding features, deleting features, performing feature combination and the like) according to an evaluation result to obtain a final correlation logistic regression model;
according to the final correlation logistic regression model, calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query by the following formula (substituting the characteristic value of each extended query to high-frequency query mapping in the initial first-level mapping into the following formula):
wherein $q_1$ denotes the extended query, $q_2$ denotes the high-frequency query, $n$ denotes the total number of features, $f_i(q_1, q_2)$ denotes the i-th feature value of the mapping from the extended query to the high-frequency query, and $w_i$ denotes the weight of the i-th feature.
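As an illustration of how such a score could be evaluated, the sketch below assumes the standard logistic (sigmoid) form applied to the weighted sum of the n feature values, which is consistent with the symbols defined above but is not confirmed by this text. The `bias` term and the function name `relevance_score` are likewise assumptions.

```python
import math


def relevance_score(features, weights, bias=0.0):
    """Sigmoid of the weighted feature sum (assumed logistic regression form).

    features: [f_1(q1, q2), ..., f_n(q1, q2)] for one extended/high-frequency pair.
    weights:  [w_1, ..., w_n] learned on the training set.
    """
    z = bias + sum(w * f for w, f in zip(weights, features))
    # Evaluate the sigmoid in a form that cannot overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A pair would then be kept in the final first-level mapping when its score is greater than or equal to the preset threshold.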
The feature values comprise text similarity feature values and category similarity feature values between the extended query and the corresponding high-frequency query, and the text similarity feature values comprise at least one of the following: the Tanimoto coefficient between the extended query and the corresponding high-frequency query, the literal similarity, the common phrase (term) rate, the edit distance and the longest common substring.
Wherein the Tanimoto coefficient is computed between A and B, where A and B represent any two queries;
the literal similarity is computed between A and B, where A and B represent any two queries;
the common term rate = 2 × (the number of phrases shared by A and B after word segmentation) / (the sum of the numbers of phrases of A and of B after word segmentation), where A and B represent any two queries;
the edit distance, also called the Levenshtein distance, refers to the minimum number of editing operations required for converting one string into another string;
the longest common substring: if a sequence S is a subsequence of both of two known character sequences (e.g., A and B) and is the longest of all sequences that meet this condition, S is referred to as the longest common subsequence of the two known character sequences, and it can be used to describe the similarity between the two character sequences.
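Four of these text features can be sketched in Python as follows. The sketch makes several assumptions: queries are already segmented into term lists, the Tanimoto coefficient takes its set form (shared terms divided by all distinct terms), the common term rate counts distinct terms, and the last feature follows the subsequence definition given above. The literal similarity is left out because its formula is not available in this text. All function names are hypothetical.

```python
def tanimoto(a_terms, b_terms):
    """Set-form Tanimoto coefficient: |A ∩ B| / |A ∪ B| (assumed form)."""
    a, b = set(a_terms), set(b_terms)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def common_term_rate(a_terms, b_terms):
    """2 * shared terms / (terms in A + terms in B), over distinct terms."""
    a, b = set(a_terms), set(b_terms)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def edit_distance(s, t):
    """Levenshtein distance with insert, delete and substitute operations."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]


def lcs_length(s, t):
    """Length of the longest common subsequence of s and t."""
    prev = [0] * (len(t) + 1)
    for cs in s:
        cur = [0]
        for j, ct in enumerate(t, start=1):
            cur.append(prev[j - 1] + 1 if cs == ct else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For instance, the segmented queries `["white", "rose"]` and `["blue", "rose"]` share one of three distinct terms, giving a Tanimoto coefficient of 1/3 and a common term rate of 0.5.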
Referring to the schematic diagram shown in fig. 4, step 101 may be implemented as follows: according to the query log in a certain time window, query expansion based on session, query expansion based on internet published information mutual clicking, and query expansion based on related search are performed, and the expansion results are combined to obtain a plurality of query correlation series; then, for each query correlation series, the secondary mapping process is performed based on the basic QA word list to generate the QA word list of the secondary mapping. The secondary mapping process requires a correlation logistic regression model, whose specific implementation is described in the foregoing description.
Step 102, searching the QA word list of the secondary mapping according to the retrieval string in the acquired information retrieval request to obtain the keyword hit by the retrieval string, and extracting the Internet release information corresponding to the keyword as a retrieval result.
The specific operation of QA word table lookup of the secondary mapping is as follows: and searching a first-level mapping in a QA word list of the second-level mapping according to a retrieval string in the information retrieval request, acquiring a high-frequency query corresponding to the extended query matched with the retrieval string, and extracting a keyword corresponding to the high-frequency query in the second-level mapping as a hit keyword.
As a preferred embodiment of the present invention, a basic QA vocabulary can be searched first according to a search string in an acquired information search request, and if a high-frequency query in the basic QA vocabulary is matched, a keyword corresponding to the high-frequency query in the basic QA vocabulary is extracted as a hit keyword, and the search of the QA vocabulary of the secondary mapping is not performed; and if the high-frequency query in the basic QA word list is not matched, searching the QA word list of the secondary mapping. If the search string in the information search request does not hit the corresponding keyword in the basic QA word list and the QA word list of the secondary mapping, other feasible methods for hitting the keyword can be selected to continue. The specific operation process is shown in fig. 5.
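The lookup order of this preferred embodiment can be sketched as below. The dictionary-based data structures, the `info_index` name for the keyword-to-published-information index, and the empty-list return used to signal "fall back to another method" are assumptions made for illustration.

```python
def retrieve(query, base_qa, first_level, info_index):
    """Look up a retrieval string: basic QA vocabulary first, then the first-level mapping.

    base_qa:     dict high-frequency query -> list of keywords.
    first_level: dict extended query -> high-frequency query.
    info_index:  dict keyword -> list of published-information ids (hypothetical).
    """
    if query in base_qa:
        keywords = base_qa[query]            # direct hit, secondary mapping not consulted
    else:
        hf = first_level.get(query)          # extended query -> high-frequency query
        if hf is None or hf not in base_qa:
            return []                        # no hit: caller may try another method
        keywords = base_qa[hf]               # high-frequency query -> keywords
    results = []
    for kw in keywords:
        for item in info_index.get(kw, []):
            if item not in results:          # keep first occurrence, preserve order
                results.append(item)
    return results
```

With `base_qa = {"rose": ["rose delivery"]}` and `first_level = {"white rose": "rose"}`, both "rose" and "white rose" reach the items indexed under "rose delivery", whereas an unseen query returns an empty list.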
Corresponding to the above information retrieval method, an embodiment of the present invention further provides an information retrieval system, as shown in fig. 6, which mainly includes: a secondary mapping word list generating module 10 and an information retrieval module 20; wherein,
a secondary mapping word list generating module 10, configured to perform a secondary mapping process on the basic QA word list based on the query expansion, and generate a secondary mapping QA word list; the basic QA word list comprises mapping from high-frequency query to key words, the first-level mapping in the QA word list of the second-level mapping is mapping from extended query to high-frequency query, and the second-level mapping is mapping from high-frequency query to key words;
and the information retrieval module 20 is configured to search the QA word list of the secondary mapping according to the retrieval string in the acquired information retrieval request to obtain the keyword hit by the retrieval string, and to extract the Internet published information corresponding to the keyword as the retrieval result.
Preferably, the secondary mapping word list generating module 10 is further configured to obtain, according to the retrieval log, a plurality of query correlation series by using session-based query expansion, and/or query expansion based on Internet published information click-through, and/or query expansion based on related searches.
Preferably, the secondary mapping word list generating module 10 may be further configured to: for each query correlation series obtained by query expansion, when it is determined that the series contains a high-frequency query that also appears in the basic QA word list, add the other queries in the series as extended queries of that high-frequency query, and generate an initial first-level mapping from the extended queries to the high-frequency query; calculate the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query, filter out the extended queries whose similarity is smaller than a preset threshold, and retain the extended queries whose similarity is greater than or equal to the preset threshold to obtain the final first-level mapping; and generate the QA word list of the secondary mapping according to the final first-level mapping and the basic QA word list.
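The construction of the first-level mapping described above can be sketched as follows. The function names, the toy word-overlap similarity, and the tie-breaking rule (when an extended query is related to several high-frequency queries, the best-scoring one is kept) are assumptions for illustration only; the text itself scores pairs with a logistic regression model.

```python
from typing import Callable, Dict, Iterable, List, Set


def build_first_level(
    series_list: Iterable[List[str]],
    high_freq_queries: Set[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, str]:
    """Map each retained extended query to one high-frequency query."""
    first_level: Dict[str, str] = {}
    best_score: Dict[str, float] = {}
    for series in series_list:
        # High-frequency queries of the basic QA word list found in this series.
        anchors = [q for q in series if q in high_freq_queries]
        for hf in anchors:
            for q in series:
                if q == hf:
                    continue
                score = similarity(q, hf)
                # Filter out candidates below the preset threshold.
                if score < threshold:
                    continue
                # Assumption: keep only the best-scoring target per extended query.
                if score > best_score.get(q, float("-inf")):
                    best_score[q] = score
                    first_level[q] = hf
    return first_level


def word_overlap(a: str, b: str) -> float:
    """Toy stand-in for the similarity model: shared words / all words."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if (wa or wb) else 0.0


series_list = [["cheap flights", "low cost flights", "hotel deals"]]
mapping = build_first_level(series_list, {"cheap flights"}, word_overlap, 0.2)
```

Here "low cost flights" shares one of four distinct words with "cheap flights" (similarity 0.25) and is retained, while "hotel deals" shares none and is filtered out.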
Preferably, the secondary mapping vocabulary generating module 10 may be further configured to calculate, according to the correlation logistic regression model, a similarity between each extended query in the initial first-level mapping and the high-frequency query, specifically:
receiving a standard set of manual labeling, wherein the standard set comprises mapping from an expanded query to a high-frequency query of the manual labeling;
calculating a characteristic value of mapping from each extended query to a high-frequency query in the standard set, and randomly dividing the standard set into a training set and a check set;
performing correlation logistic regression model training by using the training set to obtain an initial logistic regression model for evaluating the correlation between the extended query and the high-frequency query, evaluating the initial logistic regression model by using the check set, and optimizing feature selection according to an evaluation result to obtain a final correlation logistic regression model;
according to the final correlation logistic regression model, calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query according to the following formula:
$$\mathrm{Score}(q_1, q_2) = \frac{1}{1 + e^{-\sum_{i=1}^{n} w_i f_i(q_1, q_2)}}$$
where $q_1$ denotes the extended query, $q_2$ denotes the high-frequency query, $n$ denotes the total number of features, $f_i(q_1, q_2)$ denotes the i-th feature value of the pair of extended query and high-frequency query, and $w_i$ denotes the weight of the i-th feature.
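The steps above (random split of the labeled standard set, training, checking, and scoring with the logistic function 1 / (1 + exp(-Σ w_i f_i))) can be sketched in plain Python. This is a hedged illustration under several assumptions not found in the text: training uses simple stochastic gradient descent, a constant feature of 1.0 plays the role of a bias term, and the function names, hyperparameters, and toy data are all hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

# (feature values f_i(q1, q2), manual label: 1 = related, 0 = unrelated)
Example = Tuple[List[float], int]


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Logistic score 1 / (1 + exp(-sum_i w_i * f_i)), computed stably."""
    z = sum(w * f for w, f in zip(weights, features))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def split_standard_set(
    standard_set: Sequence[Example], check_fraction: float, seed: int
) -> Tuple[List[Example], List[Example]]:
    """Randomly divide the standard set into a training set and a check set."""
    items = list(standard_set)
    random.Random(seed).shuffle(items)
    n_check = int(len(items) * check_fraction)
    return items[n_check:], items[:n_check]


def train(training_set: Sequence[Example], n_features: int,
          lr: float = 0.5, epochs: int = 500) -> List[float]:
    """Fit the weights w_i by stochastic gradient descent on the log loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, label in training_set:
            error = score(weights, features) - label
            for i, f in enumerate(features):
                weights[i] -= lr * error * f
    return weights


def accuracy(weights: Sequence[float], check_set: Sequence[Example],
             threshold: float = 0.5) -> float:
    """Fraction of check-set pairs whose thresholded score matches the label."""
    if not check_set:
        return 0.0
    correct = sum((score(weights, f) >= threshold) == bool(y) for f, y in check_set)
    return correct / len(check_set)


# Toy standard set: features are [constant 1.0, a text-similarity value].
standard_set: List[Example] = [
    ([1.0, 0.90], 1), ([1.0, 0.80], 1), ([1.0, 0.70], 1), ([1.0, 0.85], 1),
    ([1.0, 0.10], 0), ([1.0, 0.20], 0), ([1.0, 0.30], 0), ([1.0, 0.15], 0),
]
training_set, check_set = split_standard_set(standard_set, 0.25, seed=0)
weights = train(training_set, n_features=2)
print("check-set accuracy:", accuracy(weights, check_set))
```

The feature-selection step in the text would correspond to repeating the train and check cycle with different feature subsets and keeping the subset that evaluates best; that loop is omitted here.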
The feature values comprise text similarity feature values and category similarity feature values between the extended query and the corresponding high-frequency query, and the text similarity feature values comprise at least one of the following: the Tanimoto coefficient between the extended query and the corresponding high-frequency query, the literal similarity, the common term rate, the edit distance, and the longest common substring.
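Three of the text similarity features named above have standard definitions and can be sketched directly. The sketch assumes whitespace tokenization for the Tanimoto coefficient (Chinese queries would need a word segmenter instead); the literal similarity and common term rate are not defined in the text and are therefore not shown.

```python
def tanimoto(a: str, b: str) -> float:
    """Tanimoto (Jaccard) coefficient over the sets of whitespace-separated terms."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by a and b (first one found)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, 1):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```

For use as model features, the edit distance and the substring length would typically be normalized by query length so that all features share a comparable range; that normalization is a design choice the text does not specify.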
Preferably, the information retrieval module 20 is further configured to search a first-level mapping in the QA vocabulary of the second-level mapping according to the retrieval string in the information retrieval request, obtain a high-frequency query corresponding to the extended query matching the retrieval string, and extract a keyword corresponding to the high-frequency query in the second-level mapping as a hit keyword.
Preferably, the information retrieval module 20 may be further configured to, according to a retrieval string in the acquired information retrieval request, first search the basic QA vocabulary, and if a high-frequency query in the basic QA vocabulary is matched, extract a keyword corresponding to the high-frequency query in the basic QA vocabulary as a hit keyword, and not perform a search of the QA vocabulary of the secondary mapping;
and if no high-frequency query in the basic QA word list is matched, search the QA word list of the secondary mapping.
In addition, as a preferred embodiment of the present invention, the information retrieval system may further include a real-time retrieval string rewriting module (not shown in Fig. 6) connected to the information retrieval module 20. For a retrieval string that fails to hit a keyword by the above method, this module modifies the retrieval string appropriately (for example, by deleting several core elements in the retrieval string) and feeds it back into the information retrieval module 20 for a new round of retrieval, and so on until a keyword is hit.
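The rewrite-and-retry loop can be sketched as follows. The text does not specify how the retrieval string is modified or how the loop is bounded, so the rewriting policy (`drop_last_term`), the `max_rounds` limit, and all names here are hypothetical choices made to keep the example terminating and runnable.

```python
from typing import Callable, List, Optional


def retrieve_with_rewriting(
    query: str,
    lookup: Callable[[str], Optional[List[str]]],
    rewrite: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> Optional[List[str]]:
    """Retry the lookup on rewritten retrieval strings until a keyword is hit."""
    current = query
    for _ in range(max_rounds + 1):
        keywords = lookup(current)
        if keywords:
            return keywords
        rewritten = rewrite(current)
        # Stop when the string can no longer be modified.
        if rewritten is None or rewritten == current:
            break
        current = rewritten
    return None


def drop_last_term(query: str) -> Optional[str]:
    """Hypothetical rewriting policy: delete the last whitespace-separated term."""
    terms = query.split()
    return " ".join(terms[:-1]) if len(terms) > 1 else None


table = {"cheap flights": ["airfare"]}
hit = retrieve_with_rewriting("cheap flights to paris", table.get, drop_last_term)
```

In this example the original string and "cheap flights to" both miss, and the third attempt, "cheap flights", hits.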
In summary, the embodiment of the present invention builds a network of related queries within the query set by means of session-based query expansion, query expansion based on Internet published information click-through, query expansion based on related searches, and the like, and then checks the correlation between queries to extract high-quality associated queries. In a specific implementation, a hash map data structure is used to represent the association: the right key (map value) is a high-frequency query screened from the user retrieval log within a certain time window, and the left key (map key) is an extended query related to that high-frequency query. The mapping between high-frequency queries and keywords can be obtained through offline processing by the QBM module. Together these form the QA word list of the secondary mapping, which amounts to an expansion of the left keys of the original basic QA word list: the first-level mapping is the mapping from extended query to high-frequency query, and the second-level mapping is the mapping from high-frequency query to keyword. The embodiment of the invention thus enriches the left-key entries of the QA word list, makes fuller use of the basic QA word list, and improves the coverage of Internet published information.
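The effect of enriching the left keys can be illustrated with a small hit-rate comparison. This is only a sketch with made-up data: the function name `hit_rate` and the sample queries are assumptions, and the fraction of retrieval strings that hit a keyword is used here as a simple proxy for the coverage discussed in the text.

```python
from typing import Dict, List, Optional, Sequence


def hit_rate(
    queries: Sequence[str],
    basic_qa: Dict[str, List[str]],
    first_level: Optional[Dict[str, str]] = None,
) -> float:
    """Fraction of retrieval strings that reach a keyword entry."""
    first_level = first_level or {}
    if not queries:
        return 0.0
    hits = sum(
        1 for q in queries
        if q in basic_qa or first_level.get(q) in basic_qa
    )
    return hits / len(queries)


# Hash maps as described: map key = extended query, map value = high-frequency query.
basic_qa = {"cheap flights": ["flight tickets", "airfare"]}
first_level = {
    "low cost flights": "cheap flights",
    "budget airfare": "cheap flights",
}
log = ["cheap flights", "low cost flights", "budget airfare", "weather today"]

basic_only = hit_rate(log, basic_qa)               # one of four strings hits
two_level = hit_rate(log, basic_qa, first_level)   # three of four strings hit
```

With the basic table alone only the exact high-frequency query hits; adding the first-level mapping lets the two extended queries reuse the same keyword entry without duplicating it.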
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (12)

1. A method for information retrieval, the method comprising:
performing a secondary mapping process on a basic query analysis (QA) word list based on expansion of retrieval strings (queries) to generate a QA word list of secondary mapping; the basic QA word list comprises mapping from high-frequency query to keyword, the first-level mapping in the QA word list of the secondary mapping is mapping from extended query to high-frequency query, and the second-level mapping is mapping from high-frequency query to keyword;
according to the retrieval string in the acquired information retrieval request, searching the QA word list of the secondary mapping to obtain the keyword hit by the retrieval string, and extracting the Internet release information corresponding to the keyword as a retrieval result;
the query-based expansion carries out a secondary mapping process on the basic QA word list to generate a QA word list of secondary mapping, and the method specifically comprises the following steps:
for each query correlation series obtained by query expansion, when judging that the query correlation series has a high-frequency query which is the same as that in the basic QA word list, adding other queries except the high-frequency query in the query correlation series as an expanded query of the high-frequency query, and generating an initial first-level mapping from the expanded query to the high-frequency query;
calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query, filtering the extended queries with the similarity smaller than a preset threshold, and reserving the extended queries with the similarity larger than or equal to the preset threshold to obtain the final first-level mapping;
and generating a QA word list of the secondary mapping according to the final primary mapping and the basic QA word list.
2. The information retrieval method of claim 1, wherein the query is extended specifically by:
and according to the retrieval log, obtaining a plurality of query related series by adopting query expansion based on session, and/or query expansion based on internet published information mutual clicking, and/or query expansion based on related searching.
3. The method of claim 1, further comprising: calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query according to a correlation logistic regression model, specifically:
receiving a standard set of manual labeling, wherein the standard set comprises mapping from an expanded query to a high-frequency query of the manual labeling;
calculating a characteristic value of mapping from each extended query to a high-frequency query in the standard set, and randomly dividing the standard set into a training set and a check set;
performing correlation logistic regression model training by using the training set to obtain an initial logistic regression model for evaluating the correlation between the extended query and the high-frequency query, evaluating the initial logistic regression model by using the check set, and optimizing feature selection according to an evaluation result to obtain a final correlation logistic regression model;
according to the final correlation logistic regression model, calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query according to the following formula:
$$\mathrm{Score}(q_1, q_2) = \frac{1}{1 + e^{-\sum_{i=1}^{n} w_i f_i(q_1, q_2)}}$$
wherein $q_1$ denotes the extended query, $q_2$ denotes the high-frequency query, $n$ denotes the total number of features, $f_i(q_1, q_2)$ denotes the i-th feature value of the mapping from extended query to high-frequency query, and $w_i$ denotes the weight of the i-th feature.
4. The information retrieval method of claim 3, wherein the feature values comprise text similarity feature values and category similarity feature values between the extended query and the corresponding high-frequency query, and the text similarity feature values comprise at least one of the following:
the Tanimoto coefficient between the extended query and the corresponding high-frequency query, the literal similarity, the common term rate, the edit distance, and the longest common substring.
5. The method according to any one of claims 1 to 4, wherein the search of the QA vocabulary of the secondary mapping is performed according to the search string in the acquired information search request, and the keyword hit by the search string is obtained, specifically:
and searching a first-level mapping in the QA word list of the second-level mapping according to the retrieval string in the information retrieval request, acquiring a high-frequency query corresponding to the extended query matched with the retrieval string, and extracting the keyword corresponding to the high-frequency query in the second-level mapping as a hit keyword.
6. The method of claim 5, further comprising:
according to the retrieval string in the acquired information retrieval request, firstly searching a basic QA word list, if the high-frequency query in the basic QA word list is matched, extracting the key words corresponding to the high-frequency query in the basic QA word list as hit key words, and not searching the QA word list of the secondary mapping;
and if no high-frequency query in the basic QA word list is matched, searching the QA word list of the secondary mapping.
7. A system for information retrieval, the system comprising:
the secondary mapping word list generating module is used for performing a secondary mapping process on a basic query analysis (QA) word list based on expansion of retrieval strings (queries) to generate a QA word list of secondary mapping; the basic QA word list comprises mapping from high-frequency query to keyword, the first-level mapping in the QA word list of the secondary mapping is mapping from extended query to high-frequency query, and the second-level mapping is mapping from high-frequency query to keyword;
the information retrieval module is used for searching the QA word list of the secondary mapping according to the retrieval string in the acquired information retrieval request to obtain the keyword hit by the retrieval string and extracting the Internet release information corresponding to the keyword as a retrieval result;
the secondary mapping vocabulary generation module is further operable to,
for each query correlation series obtained by query expansion, when judging that the query correlation series has a high-frequency query which is the same as that in the basic QA word list, adding other queries except the high-frequency query in the query correlation series as an expanded query of the high-frequency query, and generating an initial first-level mapping from the expanded query to the high-frequency query;
calculating the similarity between each extended query in the initial first-level mapping and the high-frequency query, filtering the extended queries with the similarity smaller than a preset threshold, and reserving the extended queries with the similarity larger than or equal to the preset threshold to obtain the final first-level mapping;
and generating a QA word list of the secondary mapping according to the final primary mapping and the basic QA word list.
8. The information retrieval system of claim 7, wherein the secondary mapping vocabulary generation module is further configured to obtain a plurality of query correlation series by using a query expansion based on session, and/or a query expansion based on internet published information click-through, and/or a query expansion based on correlation search according to the retrieval log.
9. The information retrieval system as recited in claim 7, wherein the secondary mapping vocabulary generation module is further configured to calculate, according to a correlation logistic regression model, a similarity between each extended query and a high-frequency query in the initial first-level mapping, specifically:
receiving a standard set of manual labeling, wherein the standard set comprises mapping from an expanded query to a high-frequency query of the manual labeling;
calculating a characteristic value of mapping from each extended query to a high-frequency query in the standard set, and randomly dividing the standard set into a training set and a check set;
performing correlation logistic regression model training by using the training set to obtain an initial logistic regression model for evaluating the correlation between the extended query and the high-frequency query, evaluating the initial logistic regression model by using the check set, and optimizing feature selection according to an evaluation result to obtain a final correlation logistic regression model;
according to the final correlation logistic regression model, calculating the similarity between each extended query in the initial first-level mapping and the corresponding high-frequency query according to the following formula:
$$\mathrm{Score}(q_1, q_2) = \frac{1}{1 + e^{-\sum_{i=1}^{n} w_i f_i(q_1, q_2)}}$$
wherein $q_1$ denotes the extended query, $q_2$ denotes the high-frequency query, $n$ denotes the total number of features, $f_i(q_1, q_2)$ denotes the i-th feature value of the pair of extended query and high-frequency query, and $w_i$ denotes the weight of the i-th feature.
10. The information retrieval system of claim 9, wherein the feature values comprise text similarity feature values and category similarity feature values between the extended query and the corresponding high frequency query, and the text similarity feature values comprise at least one of:
the Tanimoto coefficient between the extended query and the corresponding high-frequency query, the literal similarity, the common term rate, the edit distance, and the longest common substring.
11. The system according to any one of claims 7 to 10, wherein the information retrieval module is further configured to search a first-level mapping in the QA vocabulary of the second-level mapping according to a retrieval string in the information retrieval request, obtain a high-frequency query corresponding to an extended query matching the retrieval string, and extract a keyword corresponding to the high-frequency query in the second-level mapping as a hit keyword.
12. The information retrieval system of claim 11, wherein the information retrieval module is further configured to,
according to the retrieval string in the acquired information retrieval request, firstly searching a basic QA word list, if the high-frequency query in the basic QA word list is matched, extracting the key words corresponding to the high-frequency query in the basic QA word list as hit key words, and not searching the QA word list of the secondary mapping;
and if no high-frequency query in the basic QA word list is matched, searching the QA word list of the secondary mapping.
CN201210099720.9A 2012-04-06 2012-04-06 Method and system for information retrieval Active CN103365910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210099720.9A CN103365910B (en) 2012-04-06 2012-04-06 Method and system for information retrieval

Publications (2)

Publication Number Publication Date
CN103365910A CN103365910A (en) 2013-10-23
CN103365910B true CN103365910B (en) 2017-02-15

Family

ID=49367274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210099720.9A Active CN103365910B (en) 2012-04-06 2012-04-06 Method and system for information retrieval

Country Status (1)

Country Link
CN (1) CN103365910B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794139B (en) * 2014-01-22 2019-09-20 腾讯科技(北京)有限公司 Information retrieval method, apparatus and system
CN104142993B (en) * 2014-07-30 2017-08-29 东软集团股份有限公司 Complicated snort rule classifications method and system based on depth characteristic
CN105574028B (en) * 2014-10-15 2020-08-11 腾讯科技(深圳)有限公司 Information retrieval method and device
CN105354216B (en) * 2015-09-28 2018-09-07 哈尔滨工业大学 A kind of Chinese microblog topic information processing method
CN106844406B (en) * 2015-12-07 2021-03-02 腾讯科技(深圳)有限公司 Search method and search device
CN105631025B (en) * 2015-12-29 2021-09-28 腾讯科技(深圳)有限公司 Normalization processing method and device for query tag
CN107679186B (en) * 2017-09-30 2021-12-21 北京奇虎科技有限公司 Method and device for searching entity based on entity library
CN110110035A (en) * 2018-01-24 2019-08-09 北京京东尚科信息技术有限公司 Data processing method and device and computer readable storage medium
CN108874885A (en) * 2018-05-08 2018-11-23 苏州显知禾创科技服务有限公司 A kind of patent data management system
CN109725901B (en) * 2018-05-31 2024-03-29 中国平安人寿保险股份有限公司 Front-end code development method, device, equipment and computer storage medium
CN109033457A (en) * 2018-08-29 2018-12-18 广州中赢财富信息科技有限公司 The associated auditing method of Various database and system
CN109829115B (en) * 2019-02-14 2020-02-04 上海晓材科技有限公司 Search engine keyword optimization method
CN111859042A (en) * 2020-07-30 2020-10-30 上海妙一生物科技有限公司 Retrieval method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281523A (en) * 2007-04-25 2008-10-08 北大方正集团有限公司 Method and device for enquire enquiry extending as well as related searching word stock
CN101467125A (en) * 2006-04-19 2009-06-24 谷歌公司 Processing of query terms
CN102054007A (en) * 2009-11-10 2011-05-11 北大方正集团有限公司 Searching method and searching device
CN102346756A (en) * 2010-12-24 2012-02-08 镇江诺尼基智能技术有限公司 Device failure solution knowledge management and search system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7747600B2 (en) * 2007-06-13 2010-06-29 Microsoft Corporation Multi-level search

Also Published As

Publication number Publication date
CN103365910A (en) 2013-10-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant