CN108491462B - Semantic query expansion method and device based on word2vec - Google Patents

Semantic query expansion method and device based on word2vec

Info

Publication number
CN108491462B
Authority
CN
China
Prior art keywords
query
word
expansion
words
similarity
Prior art date
Legal status
Active
Application number
CN201810179478.3A
Other languages
Chinese (zh)
Other versions
CN108491462A (en)
Inventor
章露露
贾连印
李孟娟
丁家满
李晓武
陈文焰
吕晓伟
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201810179478.3A priority Critical patent/CN108491462B/en
Publication of CN108491462A publication Critical patent/CN108491462A/en
Application granted granted Critical
Publication of CN108491462B publication Critical patent/CN108491462B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a word2vec-based semantic query expansion method and device, belonging to the technical field of information retrieval. The method comprises the following steps: preprocessing of the user-given query: performing word segmentation on the query, removing stop words and stemming; selecting an expansion word candidate set: selecting initial expansion words with the word2vec tool; establishing an expansion word list: filtering the expansion word candidate set and building the actual expansion word list; expansion retrieval: matching the user query and its expansion words against the index set, returning the relevant documents and ranking them. The invention provides an expansion-word-oriented query vector generation method for filtering candidate expansion words and constructing the expansion word list, so that the correlation between the expansion words and the whole query is better reflected and the query expansion effect is further improved.

Description

Semantic query expansion method and device based on word2vec
Technical Field
The invention relates to a semantic query expansion method and device based on word2vec, and belongs to the technical field of information retrieval.
Background
Query expansion is an important issue in the field of information retrieval. In current information retrieval models and systems, information is stored in the form of characters, words or phrases, and when a user issues a query, relevant documents can be retrieved only if the query words appear in those documents. In natural language, however, the same concept can be expressed in many different ways. For example, when searching for "automobile", documents containing car, sedan, Ford and so on are highly relevant to the user's original query but, without expansion, cannot be retrieved because the wording differs, so the user cannot obtain a satisfactory result. Because of this term-mismatch problem, users sometimes have to reformulate their query terms to find the required information; to reduce this burden, the information retrieval system should automatically select other terms related to the query to assist it, that is, the term-mismatch problem is addressed through query expansion.
When a user submits a query, a search engine generally treats query expansion as an indispensable module for improving the user's retrieval satisfaction. The commonly used query expansion methods mainly include the following:
1. Query expansion based on a semantic knowledge dictionary:
This kind of method selects, with the help of semantic knowledge dictionaries such as WordNet, HowNet or other synonym thesauri, words that have a certain semantic relation to the query words (typically hypernyms, hyponyms and synonyms) for expansion. It depends heavily on a complete semantic system and is independent of the corpus to be retrieved, so the selected expansion words usually fail to reflect the characteristics of the corpus and do not yield a good retrieval effect.
2. Query expansion based on global analysis:
Global analysis first performs correlation analysis on the words or phrases in all documents and computes the degree of association of each pair of words, then adds the words most strongly associated with the query words to the initial query to generate a new query. Its advantage is that relations between words can be mined to the greatest extent; in particular, once the dictionary has been built, query expansion can be carried out efficiently. Its disadvantage is that when the document set is large, building a complete word-relationship dictionary is often infeasible in either time or space, and the cost of updating is even greater if the document set changes.
3. Query expansion based on local analysis:
Local analysis solves the expansion problem with a two-pass retrieval method: the originally given query is used directly for a first retrieval to obtain the n documents most relevant to it, which serve as the source of expansion words; the words in these documents most relevant to the original query are then added to the initial query to build a new query. Two forms of feedback are used. Relevance feedback requires the user to judge the results of the initial retrieval, and the documents the user considers relevant become the source of the expansion words; pseudo-relevance feedback requires no interaction with the user and directly treats the top n returned documents as relevant. Although local analysis is currently the most widely applied query expansion approach, its drawback is that when the top-ranked documents of the initial retrieval are not actually relevant, a large number of irrelevant words are easily added to the query, causing query drift.
With the introduction of semantic models such as word2vec and GloVe, word embedding technology has attracted much attention in recent years in many fields of natural language processing. The word vectors obtained by training the models provided by word2vec and GloVe reflect semantic and syntactic relations in natural language, and the similarity between terms can be judged by computing the cosine of the angle between their word vectors, which makes them well suited to query expansion.
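For two terms w_1 and w_2 with word vectors vec(w_1) and vec(w_2), this similarity is the standard cosine measure:

sim(w_1, w_2) = \cos(vec(w_1), vec(w_2)) = \frac{vec(w_1) \cdot vec(w_2)}{\lVert vec(w_1) \rVert \, \lVert vec(w_2) \rVert}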
Some research on word2vec-based query expansion has already been carried out, but most of this work has the following two main defects:
(1) when constructing the expansion word list, only words related to individual query words are selected as expansion words, and relevance to the whole query is not considered;
(2) even the work that does consider relevance to the entire query mostly treats the query vector as fixed for all expansion words, so the query vector is usually a simple sum or average of the individual query word vectors.
In general, however, for an expansion word of a query word q, the influence of the other query words on that expansion word should not be as strong as the influence of q itself. The idea of generating different query vectors by taking different words of the query as the central word has been widely applied in other word-embedding-based areas of information retrieval, such as word sense disambiguation, with good results, but it has not been effectively applied to query expansion.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a semantic query expansion method and device based on word2vec, with the aim of constructing an expansion word list of higher query relevance so as to return the documents relevant to the user query more comprehensively.
The technical scheme of the invention is as follows: a semantic query expansion method based on word2vec comprises the following steps:
a query and document preprocessing step: segmenting the query submitted by the user, removing stop words, extracting the user's query keywords and stemming them to form a query Q; performing the same preprocessing on the document collection to obtain a document set D;
an expansion word candidate set selection step: for the preprocessed query Q, obtaining the n terms most similar to each query keyword using word vectors trained with the word2vec model, to form an expansion word candidate set C;
an expansion word list construction step: calculating the similarity of each term in C to the whole query, and selecting the k expansion words with the highest similarity to construct an expansion word list T;
a document set inverted index construction step: building an inverted index for the preprocessed document set D;
an expansion retrieval step: calculating the relevance between the expanded query and the documents in the corresponding inverted lists, and ranking the documents by relevance.
The query and document preprocessing step specifically comprises the following steps:
(1) performing word segmentation on the query submitted by the user at space and punctuation characters;
(2) removing stop words after segmentation, filtering out words that do not express concepts;
(3) after removing the stop words, performing stemming to generate the query Q;
(4) performing the same preprocessing on the document collection to generate a new document set D.
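A minimal sketch of this preprocessing step, assuming English text, NLTK's stop word list and the Porter stemmer (the invention does not prescribe a particular stop word list or stemmer, and the noun filtering shown in Example 1 below is omitted here):

```python
import re
from nltk.corpus import stopwords        # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Segment at spaces/punctuation, remove stop words, stem the remaining words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())        # word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop word removal
    return [STEMMER.stem(t) for t in tokens]               # stemming

Q = preprocess("problems associated with high speed aircraft")   # query string from Example 1
```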
The expansion word candidate set selection step specifically comprises the following steps:
(1) given a corpus, training word vectors with the training model provided by word2vec; the word vectors are a set of multi-dimensional real-valued vectors that reflect the semantic and syntactic relations in natural language, so the similarity between terms can be judged by computing the cosine between their word vectors;
(2) after the word vectors are obtained, for each keyword q_i in Q, obtaining the n words most similar to q_i by computing the cosine similarity of the word vectors; these words constitute the expansion word candidate set C of the query.
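As an illustration, the candidate set C can be obtained with the gensim implementation of word2vec (version 4.x assumed); the toy corpus below stands in for the Wikipedia corpus of the embodiment, and all parameter values except the 200 dimensions and the CBOW choice are illustrative:

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized sentences; the embodiment trains on a preprocessed Wikipedia dump.
corpus = [
    ["problem", "limit", "velocity", "performance", "helicopter", "resistance"],
    ["altitude", "speed", "fly", "aircraft", "slender", "shape"],
    ["airplane", "sky", "row"],
    ["fly", "problem"],
]

model = Word2Vec(sentences=corpus, vector_size=200, window=5,
                 min_count=1, sg=0, workers=1)    # sg=0 selects the CBOW model

def candidate_set(Q, n=10):
    """For each keyword q_i in Q, keep the n words with the highest cosine similarity."""
    return {q: model.wv.most_similar(q, topn=n) for q in Q if q in model.wv}

C = candidate_set(["problem", "speed", "aircraft"], n=3)    # n=3 for the tiny vocabulary
```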
The expansion word list construction step specifically comprises the following steps:
(1) for the query Q formed by the above processing, for each keyword q_i in Q, generating the query vector of Q oriented to q_i, denoted Q_{q_i}, according to the following formula:
[formula image: definition of the expansion-word-oriented query vector Q_{q_i}]
where vec(q_i) denotes the word vector of query term q_i, and sim(q_i, q_j) denotes the similarity between q_i and q_j;
(2) for each candidate expansion word t of q_i, calculating the similarity of t to the query Q according to the following formula:
[formula image: similarity sim(t, Q) computed between vec(t) and Q_{q_i}]
For the candidate expansion words of different query words, different query vectors Q_{q_i} are used to calculate the similarity between the expansion word and the query Q; the invention therefore calls this way of generating the query vector Q_{q_i} an expansion-word-oriented query vector generation method and, correspondingly, calls Q_{q_i} an expansion-word-oriented query vector;
(3) calculating, according to the above model, the similarity of the expansion words of each query word to the whole query Q, then re-ranking the expansion words by this similarity, and returning the k expansion words with the highest similarity as the final expansion word set T;
(4) generating the expanded query Q_exp = Q ∪ T.
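The two formulas above appear only as images in the published document. One formulation consistent with the surrounding description, stated here as an assumption rather than a reproduction of the original equations, weights every query word by its similarity to q_i (so that q_i itself, with sim(q_i, q_i) = 1, carries the largest weight) and scores each candidate t by the cosine against that vector:

Q_{q_i} = \sum_{q_j \in Q} sim(q_i, q_j) \, vec(q_j)

sim(t, Q) = \cos\big(vec(t), Q_{q_i}\big) = \frac{vec(t) \cdot Q_{q_i}}{\lVert vec(t) \rVert \, \lVert Q_{q_i} \rVert}

Under this reading, the candidates of each query word are scored against a query vector in which that query word dominates, which matches the motivation given in the background section.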
The document set inverted index construction step specifically comprises the following steps:
(1) collecting all words of the preprocessed document set D, removing duplicates, and generating a document word set V;
(2) for each term v in V, constructing an inverted list composed of the IDs d_id of all documents d containing v (where d ∈ D) and the number of occurrences tf_{v,d} of v in d, each item in the list being represented as a 2-tuple <d_id, tf_{v,d}>; the set of all inverted lists forms the inverted index set I;
(3) for each term v, counting the number m of documents in which it appears, and calculating the idf score of v according to the following formula:
[formula image: idf score of v in terms of |D| and m]
where |D| represents the total number of documents in D.
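A compact sketch of this indexing step; since the idf formula itself is shown only as an image in the published document, the common form idf_v = log(|D| / m) is assumed here:

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(D):
    """D is a list of token lists. Returns the inverted index and the idf scores."""
    index = defaultdict(list)                    # term v -> list of (d_id, tf_{v,d})
    for d_id, doc in enumerate(D):
        for term, tf in Counter(doc).items():
            index[term].append((d_id, tf))
    idf = {term: math.log(len(D) / len(postings))    # assumed: idf_v = log(|D| / m)
           for term, postings in index.items()}
    return dict(index), idf

D = [["problem", "limit", "velocity", "performance", "helicopter", "resistance"],
     ["altitude", "speed", "fly", "aircraft", "slender", "shape"],
     ["airplane", "sky", "row"],
     ["fly", "problem"]]
index, idf = build_inverted_index(D)             # e.g. index["velocity"] == [(0, 1)]
```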
The expansion retrieval step specifically comprises the following steps:
(1) for Q_exp, querying the inverted index set I to obtain the inverted lists corresponding to its keywords, and denoting the set of these inverted lists as I_{Q_exp};
(2) for each document d appearing in I_{Q_exp}, accumulating its tf-idf scores over the lists in I_{Q_exp} to obtain the relevance R(Q_exp, d) between Q_exp and document d; the formula for calculating R(Q_exp, d) is as follows:
[formula image: R(Q_exp, d) as a λ-weighted accumulation of tf-idf scores]
where λ is an adjustment parameter controlling the relative weight of the query words and the expansion words when calculating the relevance;
(3) ranking these documents by relevance, thereby returning the N documents most relevant to the original query.
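A sketch of this retrieval step under one plausible reading of the λ-weighted tf-idf accumulation (the exact formula is given only as an image): query words are weighted by λ and expansion words by (1 - λ), which is an assumption. It reuses index and idf from the indexing sketch above; with that toy document set and λ = 0.6 it yields the ranking D1 > D0 > D2 > D3 reported in the example further below.

```python
from collections import defaultdict

def expanded_retrieval(Q, T, index, idf, lam=0.6, N=3):
    """Score documents against Q_exp = Q ∪ T and return the N most relevant ones."""
    scores = defaultdict(float)
    for term, weight in [(q, lam) for q in Q] + [(t, 1.0 - lam) for t in T]:
        for d_id, tf in index.get(term, []):
            scores[d_id] += weight * tf * idf[term]    # assumed λ-weighted tf-idf sum
    return sorted(scores, key=scores.get, reverse=True)[:N]

top_docs = expanded_retrieval(Q=["problem", "speed", "aircraft"],
                              T=["helicopter", "airplane", "velocity", "altitude"],
                              index=index, idf=idf, lam=0.6, N=3)    # -> [1, 0, 2]
```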
A semantic query expansion device based on word2vec comprises the following components:
a query and document set preprocessing module, used for performing word segmentation, stop word removal, stemming and other processing on the query submitted by the user and on the document collection, to form a query Q and a document set D;
an expansion word candidate set selection module, used for obtaining, for each keyword in the query Q, the n most similar terms using word vectors trained with the word2vec model, to form an expansion word candidate set C;
an expansion word list construction module, used for calculating the similarity of each term in the expansion word candidate set to the whole query and selecting the expansion words with the highest similarity to construct an expansion word list T;
a document set inverted index module, used for building an inverted index for the preprocessed document set D;
and an expansion retrieval module, used for calculating the relevance between the expanded query and the documents in the corresponding inverted lists to obtain the relevant documents.
The beneficial effects of the invention are as follows: a semantic query expansion method based on word2vec is provided that takes into account the similarity of the expansion words to the whole query; an expansion-word-oriented query vector generation method is introduced that generates different query vectors for the expansion words of different query words, yielding an expansion word set with higher query relevance and hence a better query expansion effect.
Drawings
FIG. 1 is a functional block diagram of the word2vec-based semantic query expansion of the present invention;
FIG. 2 is a diagram of an expanded word candidate set for each keyword in a query set in accordance with the present invention;
FIG. 3 is a diagram of an inverted index set of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific embodiments.
Embodiment 1: as shown in FIG. 1 to FIG. 3, a semantic query expansion method based on word2vec comprises:
A query and document preprocessing step:
(1) performing word segmentation on the query submitted by the user at space and punctuation characters;
(2) removing stop words after segmentation, filtering out words that do not express concepts;
(3) after removing the stop words, performing stemming to generate the query Q;
(4) performing the same preprocessing on the document collection to generate a new document set D.
Example 1: query preprocessing. Assume the user submits the query "problems associated with high speed aircraft".
(1) First, the query submitted by the user is segmented; the segmented query is represented as: {problems, associated, with, high, speed, aircraft};
(2) stop words are removed and the nouns in the query are kept to form the final query, represented as: {problems, speed, aircraft};
(3) the keywords in the query are stemmed, reducing plural nouns to their stems; the stemmed query keyword set is Q = {problem, speed, aircraft}.
Example 2: document set preprocessing. Assume a document set consisting of the following four documents:
D0="The main problem limiting the high velocity performance of helicopter is resistance"
D1="high altitude and high speed flying aircraft are often more slender shape"
D2="There are many airplanes in the sky that make up a row"
D3="whether to fly today is a problem"
All words in each string are found according to spaces and separators, stop words are removed and stemming is performed, forming the new document set as follows:
D0="problem,limit,velocity,performance,helicopter,resistance"
D1="altitude,speed,fly,aircraft,slender,shape"
D2="airplane,sky,row"
D3="fly,problem"
Expansion word candidate set selection:
(1) a Wikipedia corpus is selected, and a 200-dimensional word vector file is trained with the CBOW model provided by word2vec;
(2) after the word vectors are obtained, for each keyword in Q, the n most similar words are obtained by computing the cosine similarity of the word vectors and are used as the expansion word candidate set of the query.
For each keyword in the query Q = {problem, speed, aircraft}, the 10 semantically most related expansion words are selected using the trained word vectors; the resulting expansion word candidate sets are shown in FIG. 2.
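If the 200-dimensional vectors trained on Wikipedia have been saved in word2vec text format, the candidate sets of FIG. 2 would be produced along the following lines (the file name is hypothetical):

```python
from gensim.models import KeyedVectors

# "wiki_cbow_200d.txt" is a hypothetical name for the trained 200-dimensional vector file.
wv = KeyedVectors.load_word2vec_format("wiki_cbow_200d.txt", binary=False)

for keyword in ["problem", "speed", "aircraft"]:
    print(keyword, wv.most_similar(keyword, topn=10))    # 10 nearest words by cosine
```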
Constructing the expansion word list T:
(1) for each keyword q_i in Q, the query vector Q_{q_i} of Q oriented to q_i is generated according to the following formula:
[formula image: definition of the expansion-word-oriented query vector Q_{q_i}]
where vec(q_i) denotes the word vector of query term q_i, and sim(q_i, q_j) denotes the similarity between q_i and q_j.
(2) For each candidate expansion word t of q_i, the similarity of t to the query Q is calculated according to the following formula:
[formula image: similarity sim(t, Q) computed between vec(t) and Q_{q_i}]
(3) The similarity of the expansion words of each query word to the whole query Q is calculated according to this model, the expansion words are then re-ranked by similarity, and the k expansion words with the highest similarity are returned as the final expansion word set T;
(4) the expanded query Q_exp = Q ∪ T is generated.
Example:
(1) First, the 200-dimensional word vector of each keyword in the query Q is obtained from the trained word vectors:
vec(problem) = [0.29686138, 1.71120727, ..., -0.6585713, -1.86508703]
vec(speed) = [-2.00363445, 1.05960512, ..., -0.475373, -4.39991331]
vec(aircraft) = [-3.54158616, 3.28720021, ..., -2.34602952, -3.29022384]
Then the expansion-word-oriented query vector of each keyword in Q is calculated; the computation is as follows:
[computation images: Q_{problem}, Q_{speed} and Q_{aircraft}]
(2) Taking the keyword aircraft in the query Q as an example, i.e. q_3 = aircraft, the similarity of each of its candidate expansion words t to the query Q is calculated:
[computation images: similarity scores of the candidate expansion words of aircraft]
(3) By analogy, the similarity between each expansion word in FIG. 2 and the original query Q is calculated, the expansion words in the candidate set are then ranked by similarity, and the k expansion words most similar to the query Q are obtained; taking k = 4 as an example, the finally obtained expansion word list T is:
T = {helicopter, airplane, velocity, altitude}
(4) The query words and the expansion words are merged to obtain the expanded query Q_exp:
Q_exp = Q ∪ T
      = {problem, speed, aircraft} ∪ {helicopter, airplane, velocity, altitude}
      = {problem, speed, aircraft, helicopter, airplane, velocity, altitude}
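The re-ranking of steps (1) to (3) can be sketched as follows, again under the assumed form of the expansion-word-oriented query vector stated earlier in the description; the helper is illustrative, and whether it reproduces exactly the list T above depends on the trained vectors:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_expansion_list(Q, C, wv, k=4):
    """Re-rank all candidates in C against the expansion-word-oriented query vectors.

    Q  : query keywords, e.g. ["problem", "speed", "aircraft"]
    C  : dict mapping each keyword q_i to its candidate expansion words (FIG. 2)
    wv : mapping from word to trained vector (e.g. gensim KeyedVectors)
    """
    scored = {}
    for qi, candidates in C.items():
        # Assumed form: Q_{q_i} = sum_j sim(q_i, q_j) * vec(q_j), with sim(q_i, q_i) = 1.
        q_vec = sum(cosine(wv[qi], wv[qj]) * wv[qj] for qj in Q)
        for t in candidates:
            scored[t] = max(scored.get(t, -1.0), cosine(wv[t], q_vec))
    return sorted(scored, key=scored.get, reverse=True)[:k]

# T = build_expansion_list(["problem", "speed", "aircraft"], C, wv, k=4)
# Q_exp = ["problem", "speed", "aircraft"] + T
```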
The establishment of the document set inverted index comprises the following steps:
(1) the distinct terms in the preprocessed document set D are collected to generate a vocabulary V;
(2) for each term v in V, an inverted list is constructed, composed of the IDs d_id of all documents d containing v (where d ∈ D) and the number of occurrences tf_{v,d} of v in d, each item in the list being represented as a 2-tuple <d_id, tf_{v,d}>; the set of all inverted lists forms the inverted index set I;
(3) for each term v, the number m of documents in which it appears is counted, and the idf score of v is calculated according to the following formula:
[formula image: idf score of v in terms of |D| and m]
where |D| represents the total number of documents in D.
Example:
(1) After preprocessing by word segmentation, stop word removal and so on, the following document set D is obtained:
D0="problem,limit,velocity,performance,helicopter,resistance"
D1="altitude,speed,fly,aircraft,slender,shape"
D2="airplane,sky,row"
D3="fly,problem"
The distinct terms in D are collected to generate the vocabulary V:
V = {altitude, speed, fly, aircraft, slender, shape, problem, limit, velocity, performance, helicopter, resistance, airplane, sky, row}
(2) Taking the word velocity in the vocabulary V as an example, the document set D is traversed to find the document containing velocity, namely D0; its ID D0 is recorded and its number of occurrences in document D0 is counted, which is 1, so the inverted list of velocity is expressed as <D0, 1>. The inverted lists of all terms in V are built analogously, forming the inverted index set I;
(3) For each word v in V, the number m of documents in which it appears (i.e. the length of the inverted list of v) is counted, and the idf score is calculated:
if v = velocity, the length of its inverted list is 1, i.e. only one document in the document set contains velocity, so m = 1, and the idf score of velocity is calculated as:
[formula image: idf of velocity with |D| = 4 and m = 1]
The idf scores of all words are calculated in the same way and recorded in the index; the final inverted index set I is shown in FIG. 3.
Expansion retrieval:
(1) for Q_exp, the inverted index set I is queried to obtain the inverted lists corresponding to its keywords, and the set of these inverted lists is denoted I_{Q_exp};
(2) for each document d appearing in I_{Q_exp}, its tf-idf scores over the lists in I_{Q_exp} are accumulated to obtain the relevance R(Q_exp, d) between Q_exp and document d; the formula for calculating R(Q_exp, d) is as follows:
[formula image: R(Q_exp, d) as a λ-weighted accumulation of tf-idf scores]
where λ is an adjustment parameter controlling the relative weight of the query words and the expansion words when calculating the relevance;
(3) these documents are ranked by relevance, thereby returning the N documents most relevant to the original query.
Example:
(1) For the Q_exp generated above, the inverted index set of FIG. 3 is queried to obtain the inverted lists corresponding to all keywords in Q_exp, and their union I_{Q_exp} is taken:
I_{Q_exp} = I(problem) ∪ I(speed) ∪ ...... ∪ I(airplane) ∪ I(altitude)
          = {D0, D3} ∪ {D1} ∪ ...... ∪ {D2} ∪ {D1}
          = {D0, D1, D2, D3}
(2) For documents D0, D1, D2 and D3, the relevance R(Q_exp, d) of Q_exp to each document is calculated, with the adjustment parameter λ = 0.6; the computation is as follows:
[computation images: R(Q_exp, D0), R(Q_exp, D1), R(Q_exp, D2) and R(Q_exp, D3)]
(3) The documents are sorted by relevance, giving D1 > D0 > D2 > D3; if N = 3, documents D1, D0 and D2 are returned.
Example 2: a semantic query expansion device based on word2vec comprises the following components:
the query and document set preprocessing module is used for carrying out word segmentation, word stem removal, word stem reduction and other processing on the document set and the query submitted by the user to form a query Q and a document set D;
the expansion word candidate set selection module is used for calculating and acquiring n most similar terms of each query keyword by using a word vector trained based on a word2vec model for each keyword in the query Q to form an expansion word candidate set C;
the expansion word list construction module is used for calculating the similarity of each term in the expansion word candidate set and the whole query, and selecting some expansion words with higher similarity to construct an expansion word list T;
the document set reverse index module is used for establishing a reverse index for the document set D after the preprocessing;
and the extended retrieval module is used for calculating the relevance between the extended query and the documents in the corresponding inverted index to obtain the relevant documents.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (5)

1. A semantic query expansion method based on word2vec, characterized in that the method comprises the following steps:
(1) query and document preprocessing: segmenting the query submitted by the user, removing stop words, extracting the user's query keywords and stemming them to form a query Q; performing the same preprocessing on the document collection to obtain a document set D;
(2) expansion word candidate set selection: for the preprocessed query Q, obtaining the n terms most similar to each query keyword using word vectors trained with the word2vec model, to form an expansion word candidate set C;
this step specifically comprises:
given a corpus, training word vectors with the training model provided by word2vec, the word vectors being a set of multi-dimensional real-valued vectors that reflect the semantic and syntactic relations in natural language, so that the similarity between terms can be judged by computing the cosine between their word vectors;
after the word vectors are obtained, for each keyword q_i in Q, obtaining the n words most similar to q_i by computing the cosine similarity of the word vectors, these words forming the expansion word candidate set C of the query;
(3) expansion word list construction: calculating the similarity of each term in C to the whole query Q, and selecting the k expansion words with the highest similarity to construct an expansion word list T;
this step specifically comprises:
(a) for the query Q formed by the above processing, for each query word q_i in Q, generating the query vector of Q oriented to q_i, denoted Q_{q_i}, according to the following formula:
[formula image: definition of the expansion-word-oriented query vector Q_{q_i}]
where vec(q_i) denotes the word vector of query term q_i, and sim(q_i, q_j) denotes the similarity between q_i and q_j;
(b) for each candidate expansion word t of q_i, calculating the similarity of t to the query Q according to the following formula:
[formula image: similarity sim(t, Q) computed between vec(t) and Q_{q_i}]
for the candidate expansion words of different query words, different query vectors Q_{q_i} are used to calculate the similarity between the expansion word and the query Q, so this way of generating the query vector Q_{q_i} is called an expansion-word-oriented query vector generation method and, correspondingly, Q_{q_i} is called an expansion-word-oriented query vector;
(c) calculating, according to the above model, the similarity of the expansion words of each query word to the whole query Q, then re-ranking the expansion words by this similarity, and returning the k expansion words with the highest similarity as the final expansion word list T;
(d) generating the expanded query Q_exp = Q ∪ T;
(4) document set inverted index construction: building an inverted index for the preprocessed document set D;
(5) expansion retrieval: calculating the relevance between the expanded query Q_exp and the documents in the corresponding inverted lists, and ranking the documents by relevance.
2. The semantic query expansion method based on word2vec according to claim 1, characterized in that the query and document preprocessing step specifically comprises the following steps:
(1) performing word segmentation on the query submitted by the user at space and punctuation characters;
(2) removing stop words after segmentation, filtering out words that do not express concepts;
(3) after removing the stop words, performing stemming to generate the query Q;
(4) performing the same preprocessing on the document collection to generate a new document set D.
3. The semantic query expansion method based on word2vec according to claim 1, characterized in that the document set inverted index construction specifically comprises the following steps:
(1) collecting all words of the preprocessed document set D, removing duplicates, and generating a document word set V;
(2) for each term v in V, constructing an inverted list composed of the IDs d_id of all documents d containing v (where d ∈ D) and the number of occurrences tf_{v,d} of v in d, each item in the list being represented as a 2-tuple <d_id, tf_{v,d}>; the set of all inverted lists forms the inverted index set I;
(3) for each term v, counting the number m of documents in which it appears, and calculating the idf score of v according to the following formula:
[formula image: idf score of v in terms of |D| and m]
where |D| represents the total number of documents in D.
4. The semantic query expansion method based on word2vec according to claim 1, characterized in that the expansion retrieval specifically comprises the following steps:
(1) for Q_exp, querying the inverted index set I to obtain the inverted lists corresponding to its keywords, and denoting the set of these inverted lists as I_{Q_exp};
(2) for each document d appearing in I_{Q_exp}, accumulating its tf-idf scores over the lists in I_{Q_exp} to obtain the relevance R(Q_exp, d) between Q_exp and document d; the formula for calculating R(Q_exp, d) is as follows:
[formula image: R(Q_exp, d) as a λ-weighted accumulation of tf-idf scores]
where λ is an adjustment parameter controlling the relative weight of the query words and the expansion words when calculating the relevance;
(3) ranking these documents by relevance, thereby returning the N documents most relevant to the original query.
5. A semantic query expansion device based on word2vec, characterized by comprising:
a query and document set preprocessing module, used for performing word segmentation, stop word removal, stemming and other processing on the query submitted by the user and on the document collection, to form a query Q and a document set D;
an expansion word candidate set selection module, used for obtaining, for each keyword in the query Q, the n most similar terms using word vectors trained with the word2vec model, to form an expansion word candidate set C;
this specifically comprises:
given a corpus, training word vectors with the training model provided by word2vec, the word vectors being a set of multi-dimensional real-valued vectors that reflect the semantic and syntactic relations in natural language, so that the similarity between terms can be judged by computing the cosine between their word vectors;
after the word vectors are obtained, for each keyword q_i in Q, obtaining the n words most similar to q_i by computing the cosine similarity of the word vectors, these words forming the expansion word candidate set C of the query;
an expansion word list construction module, used for calculating the similarity of each term in the expansion word candidate set to the whole query and selecting the expansion words with the highest similarity to construct an expansion word list T;
this specifically comprises:
(a) for the query Q formed by the above processing, for each query word q_i in Q, generating the query vector of Q oriented to q_i, denoted Q_{q_i}, according to the following formula:
[formula image: definition of the expansion-word-oriented query vector Q_{q_i}]
where vec(q_i) denotes the word vector of query term q_i, and sim(q_i, q_j) denotes the similarity between q_i and q_j;
(b) for each candidate expansion word t of q_i, calculating the similarity of t to the query Q according to the following formula:
[formula image: similarity sim(t, Q) computed between vec(t) and Q_{q_i}]
for the candidate expansion words of different query words, different query vectors Q_{q_i} are used to calculate the similarity between the expansion word and the query Q, so this way of generating the query vector Q_{q_i} is called an expansion-word-oriented query vector generation method and, correspondingly, Q_{q_i} is called an expansion-word-oriented query vector;
(c) calculating, according to the above model, the similarity of the expansion words of each query word to the whole query Q, then re-ranking the expansion words by this similarity, and returning the k expansion words with the highest similarity as the final expansion word list T;
(d) generating the expanded query Q_exp = Q ∪ T;
a document set inverted index module, used for building an inverted index for the preprocessed document set D;
and an expansion retrieval module, used for calculating the relevance between the expanded query and the documents in the corresponding inverted lists to obtain the relevant documents.
CN201810179478.3A 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec Active CN108491462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810179478.3A CN108491462B (en) 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810179478.3A CN108491462B (en) 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec

Publications (2)

Publication Number Publication Date
CN108491462A CN108491462A (en) 2018-09-04
CN108491462B true CN108491462B (en) 2021-09-14

Family

ID=63341204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810179478.3A Active CN108491462B (en) 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec

Country Status (1)

Country Link
CN (1) CN108491462B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063203B (en) * 2018-09-14 2020-07-24 河海大学 Query term expansion method based on personalized model
CN109284397A (en) * 2018-09-27 2019-01-29 深圳大学 A kind of construction method of domain lexicon, device, equipment and storage medium
CN109446399A (en) * 2018-10-16 2019-03-08 北京信息科技大学 A kind of video display entity search method
CN109885766A (en) * 2019-02-11 2019-06-14 武汉理工大学 A kind of books recommended method and system based on book review
CN110008407B (en) * 2019-04-09 2021-05-04 苏州浪潮智能科技有限公司 Information retrieval method and device
CN110196977B (en) * 2019-05-31 2023-06-09 广西南宁市博睿通软件技术有限公司 Intelligent warning condition supervision processing system and method
CN110188204B (en) * 2019-06-11 2022-10-04 腾讯科技(深圳)有限公司 Extended corpus mining method and device, server and storage medium
CN110489526A (en) * 2019-08-13 2019-11-22 上海市儿童医院 A kind of term extended method, device and storage medium for medical retrieval
DE102019212421A1 (en) * 2019-08-20 2021-02-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method and device for identifying similar documents
CN110909116B (en) * 2019-11-28 2022-12-23 中国人民解放军军事科学院军事科学信息研究中心 Entity set expansion method and system for social media
CN111897928A (en) * 2020-08-04 2020-11-06 广西财经学院 Chinese query expansion method for embedding expansion words into query words and counting expansion word union
CN112199461B (en) * 2020-09-17 2022-05-31 暨南大学 Document retrieval method, device, medium and equipment based on block index structure
CN112836008B (en) * 2021-02-07 2023-03-21 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data
CN113033197A (en) * 2021-03-24 2021-06-25 中新国际联合研究院 Building construction contract rule query method and device
CN112949304A (en) * 2021-03-24 2021-06-11 中新国际联合研究院 Construction case knowledge reuse query method and device
CN114723008A (en) * 2022-04-01 2022-07-08 北京健康之家科技有限公司 Language representation model training method, device, equipment, medium and user response method
CN116340470B (en) * 2023-05-30 2023-09-15 环球数科集团有限公司 Keyword associated retrieval system based on AIGC

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
CN107391671A (en) * 2017-07-21 2017-11-24 华中科技大学 A kind of document leakage detection method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778161B (en) * 2015-04-30 2017-07-07 车智互联(北京)科技有限公司 Based on Word2Vec and Query log extracting keywords methods
EP3232336A4 (en) * 2015-12-01 2018-03-21 Huawei Technologies Co., Ltd. Method and device for recognizing stop word
US9798820B1 (en) * 2016-10-28 2017-10-24 Searchmetrics Gmbh Classification of keywords

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
CN107391671A (en) * 2017-07-21 2017-11-24 华中科技大学 A kind of document leakage detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Personalized Search Ranking Based on a User Interest Model; 徐康; China Master's Theses Full-text Database, Information Science and Technology; 20151015; pp. I138-583 *

Also Published As

Publication number Publication date
CN108491462A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
CN108491462B (en) Semantic query expansion method and device based on word2vec
CN109101479B (en) Clustering method and device for Chinese sentences
JP5203934B2 (en) Propose and refine user input based on original user input
CN103136352A (en) Full-text retrieval system based on two-level semantic analysis
CN108509521B (en) Image retrieval method for automatically generating text index
CN102214189B (en) Data mining-based word usage knowledge acquisition system and method
CN108920599B (en) Question-answering system answer accurate positioning and extraction method based on knowledge ontology base
CN111611356A (en) Information searching method and device, electronic equipment and readable storage medium
KR100396826B1 (en) Term-based cluster management system and method for query processing in information retrieval
CN112948543A (en) Multi-language multi-document abstract extraction method based on weighted TextRank
CN109614493B (en) Text abbreviation recognition method and system based on supervision word vector
Zhang et al. Research on keyword extraction of Word2vec model in Chinese corpus
CN111488429A (en) Short text clustering system based on search engine and short text clustering method thereof
Huang et al. An approach on Chinese microblog entity linking combining *** encyclopaedia and word2vec
da Silva et al. Query Expansion in Text Information Retrieval with Local Context and Distributional Model.
Pourvali A new graph based text segmentation using Wikipedia for automatic text summarization
Li et al. Complex query recognition based on dynamic learning mechanism
JP2023031294A (en) Computer-implemented method, computer program and computer system (specificity ranking of text elements and applications thereof)
CN111209737B (en) Method for screening out noise document and computer readable storage medium
Bradford Use of latent semantic indexing to identify name variants in large data collections
Berenguer et al. Towards a tabular open data search engine for public sector information
CN106708808B (en) Information mining method and device
Papagiannopoulou et al. Unsupervised keyphrase extraction based on outlier detection
Reddy et al. Cross lingual information retrieval using search engine and data mining
Rei et al. Parser lexicalisation through self-learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant