CN108491462B - Semantic query expansion method and device based on word2vec - Google Patents

Semantic query expansion method and device based on word2vec

Info

Publication number
CN108491462B
Authority
CN
China
Prior art keywords
query
word
expansion
words
similarity
Prior art date
Legal status
Active
Application number
CN201810179478.3A
Other languages
Chinese (zh)
Other versions
CN108491462A (en)
Inventor
章露露
贾连印
李孟娟
丁家满
李晓武
陈文焰
吕晓伟
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201810179478.3A priority Critical patent/CN108491462B/en
Publication of CN108491462A publication Critical patent/CN108491462A/en
Application granted granted Critical
Publication of CN108491462B publication Critical patent/CN108491462B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a word2vec-based semantic query expansion method and device, belonging to the technical field of information retrieval. The method comprises the following steps: preprocessing of the user-given query: performing word segmentation on the query, removing stop words and stemming; selecting an expansion word candidate set: selecting initial expansion words with the word2vec tool; establishing an expansion word list: filtering the expansion word candidate set and building the actual expansion word list; expansion retrieval: matching the user query and its expansion words against the index set, returning the relevant documents and ranking them. The invention provides an expansion-word-oriented query vector generation method for filtering candidate expansion words and constructing the expansion word list, so that the correlation between the expansion words and the whole query is better reflected and the query expansion effect is further improved.

Description

Semantic query expansion method and device based on word2vec
Technical Field
The invention relates to a semantic query expansion method and device based on word2vec, and belongs to the technical field of information retrieval.
Background
Query expansion is an important issue in the field of information retrieval. In current information retrieval models and systems, information is stored in the form of characters, words or phrases, and when a user issues a query, relevant documents can be retrieved only if the query words appear in those documents. In natural language, however, the same concept can be expressed in many different ways. For example, when searching for "automobile", documents containing car, sedan, Ford and so on are highly relevant to the user's original query but, without expansion, cannot be retrieved because the wording differs, so the user cannot obtain a satisfactory result. Because of this term-mismatch problem, users sometimes have to reformulate their query terms to find the required information; to reduce this burden, the information retrieval system should automatically select other terms related to the query to assist it, that is, the term-mismatch problem is addressed through query expansion.
When a user submits a query, a search engine generally treats query expansion as an indispensable module for improving the user's retrieval satisfaction. The commonly used query expansion methods mainly include the following:
1. Query expansion based on a semantic knowledge dictionary:
This kind of method selects, with the help of semantic knowledge dictionaries such as WordNet, HowNet or other synonym thesauri, words that have a certain semantic relation to the query words (typically hypernyms, hyponyms and synonyms) for expansion. It depends heavily on a complete semantic system and is independent of the corpus to be retrieved, so the selected expansion words usually fail to reflect the characteristics of the corpus and do not yield a good retrieval effect.
2. Query expansion based on global analysis:
Global analysis first performs correlation analysis on the words or phrases in all documents and computes the degree of association of each pair of words, then adds the words most strongly associated with the query words to the initial query to generate a new query. Its advantage is that relations between words can be mined to the greatest extent; in particular, once the dictionary has been built, query expansion can be carried out efficiently. Its disadvantage is that when the document set is large, building a complete word-relationship dictionary is often infeasible in either time or space, and the cost of updating is even greater if the document set changes.
3. Query expansion based on local analysis:
Local analysis solves the expansion problem with a two-pass retrieval method: the originally given query is used directly for a first retrieval to obtain the n documents most relevant to it, which serve as the source of expansion words; the words in these documents most relevant to the original query are then added to the initial query to build a new query. Two forms of feedback are used. Relevance feedback requires the user to judge the results of the initial retrieval, and the documents the user considers relevant become the source of the expansion words; pseudo-relevance feedback requires no interaction with the user and directly treats the top n returned documents as relevant. Although local analysis is currently the most widely applied query expansion approach, its drawback is that when the top-ranked documents of the initial retrieval are not actually relevant, a large number of irrelevant words are easily added to the query, causing query drift.
With the introduction of semantic models such as word2vec and GloVe, word embedding technology has attracted much attention in recent years in many fields of natural language processing. The word vectors obtained by training the models provided by word2vec and GloVe reflect semantic and syntactic relations in natural language, and the similarity between terms can be judged by computing the cosine of the angle between their word vectors, which makes them well suited to query expansion.
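For two terms w_1 and w_2 with word vectors vec(w_1) and vec(w_2), this similarity is the standard cosine measure:

sim(w_1, w_2) = \cos(vec(w_1), vec(w_2)) = \frac{vec(w_1) \cdot vec(w_2)}{\lVert vec(w_1) \rVert \, \lVert vec(w_2) \rVert}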
Some research on word2vec-based query expansion has already been carried out, but most of this work has the following two main defects:
(1) when constructing the expansion word list, only words related to individual query words are selected as expansion words, and relevance to the whole query is not considered;
(2) even the work that does consider relevance to the entire query mostly treats the query vector as fixed for all expansion words, so the query vector is usually a simple sum or average of the individual query word vectors.
In general, however, for an expansion word of a query word q, the influence of the other query words on that expansion word should not be as strong as the influence of q itself. The idea of generating different query vectors by taking different words of the query as the central word has been widely applied in other word-embedding-based areas of information retrieval, such as word sense disambiguation, with good results, but it has not been effectively applied to query expansion.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a semantic query expansion method and device based on word2vec, with the aim of constructing an expansion word list of higher query relevance so as to return the documents relevant to the user query more comprehensively.
The technical scheme of the invention is as follows: a semantic query expansion method based on word2vec comprises the following steps:
a query and document preprocessing step: segmenting the query submitted by the user, removing stop words, extracting the user's query keywords and stemming them to form a query Q; performing the same preprocessing on the document collection to obtain a document set D;
an expansion word candidate set selection step: for the preprocessed query Q, obtaining the n terms most similar to each query keyword using word vectors trained with the word2vec model, to form an expansion word candidate set C;
an expansion word list construction step: calculating the similarity of each term in C to the whole query, and selecting the k expansion words with the highest similarity to construct an expansion word list T;
a document set inverted index construction step: building an inverted index for the preprocessed document set D;
an expansion retrieval step: calculating the relevance between the expanded query and the documents in the corresponding inverted lists, and ranking the documents by relevance.
The query and document preprocessing step specifically comprises the following steps:
(1) performing word segmentation on the query submitted by the user at space and punctuation characters;
(2) removing stop words after segmentation, filtering out words that do not express concepts;
(3) after removing the stop words, performing stemming to generate the query Q;
(4) performing the same preprocessing on the document collection to generate a new document set D.
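A minimal sketch of this preprocessing step, assuming English text, NLTK's stop word list and the Porter stemmer (the invention does not prescribe a particular stop word list or stemmer, and the noun filtering shown in Example 1 below is omitted here):

```python
import re
from nltk.corpus import stopwords        # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Segment at spaces/punctuation, remove stop words, stem the remaining words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())        # word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop word removal
    return [STEMMER.stem(t) for t in tokens]               # stemming

Q = preprocess("problems associated with high speed aircraft")   # query string from Example 1
```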
The expansion word candidate set selection step specifically comprises the following steps:
(1) given a corpus, training word vectors with the training model provided by word2vec; the word vectors are a set of multi-dimensional real-valued vectors that reflect the semantic and syntactic relations in natural language, so the similarity between terms can be judged by computing the cosine between their word vectors;
(2) after the word vectors are obtained, for each keyword q_i in Q, obtaining the n words most similar to q_i by computing the cosine similarity of the word vectors; these words constitute the expansion word candidate set C of the query.
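As an illustration, the candidate set C can be obtained with the gensim implementation of word2vec (version 4.x assumed); the toy corpus below stands in for the Wikipedia corpus of the embodiment, and all parameter values except the 200 dimensions and the CBOW choice are illustrative:

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized sentences; the embodiment trains on a preprocessed Wikipedia dump.
corpus = [
    ["problem", "limit", "velocity", "performance", "helicopter", "resistance"],
    ["altitude", "speed", "fly", "aircraft", "slender", "shape"],
    ["airplane", "sky", "row"],
    ["fly", "problem"],
]

model = Word2Vec(sentences=corpus, vector_size=200, window=5,
                 min_count=1, sg=0, workers=1)    # sg=0 selects the CBOW model

def candidate_set(Q, n=10):
    """For each keyword q_i in Q, keep the n words with the highest cosine similarity."""
    return {q: model.wv.most_similar(q, topn=n) for q in Q if q in model.wv}

C = candidate_set(["problem", "speed", "aircraft"], n=3)    # n=3 for the tiny vocabulary
```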
The expansion word list construction step specifically comprises the following steps:
(1) for the query Q formed by the above processing, for each keyword q_i in Q, generating the query vector of Q oriented to q_i, denoted Q_{q_i}, according to the following formula:
[formula image: definition of the expansion-word-oriented query vector Q_{q_i}]
where vec(q_i) denotes the word vector of query term q_i, and sim(q_i, q_j) denotes the similarity between q_i and q_j;
(2) for each candidate expansion word t of q_i, calculating the similarity of t to the query Q according to the following formula:
[formula image: similarity sim(t, Q) computed between vec(t) and Q_{q_i}]
For the candidate expansion words of different query words, different query vectors Q_{q_i} are used to calculate the similarity between the expansion word and the query Q; the invention therefore calls this way of generating the query vector Q_{q_i} an expansion-word-oriented query vector generation method and, correspondingly, calls Q_{q_i} an expansion-word-oriented query vector;
(3) calculating, according to the above model, the similarity of the expansion words of each query word to the whole query Q, then re-ranking the expansion words by this similarity, and returning the k expansion words with the highest similarity as the final expansion word set T;
(4) generating the expanded query Q_exp = Q ∪ T.
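The two formulas above appear only as images in the published document. One formulation consistent with the surrounding description, stated here as an assumption rather than a reproduction of the original equations, weights every query word by its similarity to q_i (so that q_i itself, with sim(q_i, q_i) = 1, carries the largest weight) and scores each candidate t by the cosine against that vector:

Q_{q_i} = \sum_{q_j \in Q} sim(q_i, q_j) \, vec(q_j)

sim(t, Q) = \cos\big(vec(t), Q_{q_i}\big) = \frac{vec(t) \cdot Q_{q_i}}{\lVert vec(t) \rVert \, \lVert Q_{q_i} \rVert}

Under this reading, the candidates of each query word are scored against a query vector in which that query word dominates, which matches the motivation given in the background section.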
The document set inverted index construction step specifically comprises the following steps:
(1) collecting all words of the preprocessed document set D, removing duplicates, and generating a document word set V;
(2) for each term v in V, constructing an inverted list composed of the IDs d_id of all documents d containing v (where d ∈ D) and the number of occurrences tf_{v,d} of v in d, each item in the list being represented as a 2-tuple <d_id, tf_{v,d}>; the set of all inverted lists forms the inverted index set I;
(3) for each term v, counting the number m of documents in which it appears, and calculating the idf score of v according to the following formula:
[formula image: idf score of v in terms of |D| and m]
where |D| represents the total number of documents in D.
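A compact sketch of this indexing step; since the idf formula itself is shown only as an image in the published document, the common form idf_v = log(|D| / m) is assumed here:

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(D):
    """D is a list of token lists. Returns the inverted index and the idf scores."""
    index = defaultdict(list)                    # term v -> list of (d_id, tf_{v,d})
    for d_id, doc in enumerate(D):
        for term, tf in Counter(doc).items():
            index[term].append((d_id, tf))
    idf = {term: math.log(len(D) / len(postings))    # assumed: idf_v = log(|D| / m)
           for term, postings in index.items()}
    return dict(index), idf

D = [["problem", "limit", "velocity", "performance", "helicopter", "resistance"],
     ["altitude", "speed", "fly", "aircraft", "slender", "shape"],
     ["airplane", "sky", "row"],
     ["fly", "problem"]]
index, idf = build_inverted_index(D)             # e.g. index["velocity"] == [(0, 1)]
```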
The expansion retrieval step specifically comprises the following steps:
(1) for Q_exp, querying the inverted index set I to obtain the inverted lists corresponding to its keywords, and denoting the set of these inverted lists as I_{Q_exp};
(2) for each document d appearing in I_{Q_exp}, accumulating its tf-idf scores over the lists in I_{Q_exp} to obtain the relevance R(Q_exp, d) between Q_exp and document d; the formula for calculating R(Q_exp, d) is as follows:
[formula image: R(Q_exp, d) as a λ-weighted accumulation of tf-idf scores]
where λ is an adjustment parameter controlling the relative weight of the query words and the expansion words when calculating the relevance;
(3) ranking these documents by relevance, thereby returning the N documents most relevant to the original query.
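A sketch of this retrieval step under one plausible reading of the λ-weighted tf-idf accumulation (the exact formula is given only as an image): query words are weighted by λ and expansion words by (1 - λ), which is an assumption. It reuses index and idf from the indexing sketch above; with that toy document set and λ = 0.6 it yields the ranking D1 > D0 > D2 > D3 reported in the example further below.

```python
from collections import defaultdict

def expanded_retrieval(Q, T, index, idf, lam=0.6, N=3):
    """Score documents against Q_exp = Q ∪ T and return the N most relevant ones."""
    scores = defaultdict(float)
    for term, weight in [(q, lam) for q in Q] + [(t, 1.0 - lam) for t in T]:
        for d_id, tf in index.get(term, []):
            scores[d_id] += weight * tf * idf[term]    # assumed λ-weighted tf-idf sum
    return sorted(scores, key=scores.get, reverse=True)[:N]

top_docs = expanded_retrieval(Q=["problem", "speed", "aircraft"],
                              T=["helicopter", "airplane", "velocity", "altitude"],
                              index=index, idf=idf, lam=0.6, N=3)    # -> [1, 0, 2]
```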
A semantic query expansion device based on word2vec comprises the following components:
a query and document set preprocessing module, used for performing word segmentation, stop word removal, stemming and other processing on the query submitted by the user and on the document collection, to form a query Q and a document set D;
an expansion word candidate set selection module, used for obtaining, for each keyword in the query Q, the n most similar terms using word vectors trained with the word2vec model, to form an expansion word candidate set C;
an expansion word list construction module, used for calculating the similarity of each term in the expansion word candidate set to the whole query and selecting the expansion words with the highest similarity to construct an expansion word list T;
a document set inverted index module, used for building an inverted index for the preprocessed document set D;
and an expansion retrieval module, used for calculating the relevance between the expanded query and the documents in the corresponding inverted lists to obtain the relevant documents.
The beneficial effects of the invention are as follows: a semantic query expansion method based on word2vec is provided that takes into account the similarity of the expansion words to the whole query; an expansion-word-oriented query vector generation method is introduced that generates different query vectors for the expansion words of different query words, yielding an expansion word set with higher query relevance and hence a better query expansion effect.
Drawings
FIG. 1 is a functional block diagram of the word2vec-based semantic query expansion of the present invention;
FIG. 2 is a diagram of an expanded word candidate set for each keyword in a query set in accordance with the present invention;
FIG. 3 is a diagram of an inverted index set of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific embodiments.
Embodiment 1: as shown in FIG. 1 to FIG. 3, a semantic query expansion method based on word2vec comprises:
A query and document preprocessing step:
(1) performing word segmentation on the query submitted by the user at space and punctuation characters;
(2) removing stop words after segmentation, filtering out words that do not express concepts;
(3) after removing the stop words, performing stemming to generate the query Q;
(4) performing the same preprocessing on the document collection to generate a new document set D.
Example 1: query preprocessing. Assume the user submits the query "problems associated with high speed aircraft".
(1) First, the query submitted by the user is segmented; the segmented query is represented as: {problems, associated, with, high, speed, aircraft};
(2) stop words are removed and the nouns in the query are kept to form the final query, represented as: {problems, speed, aircraft};
(3) the keywords in the query are stemmed, reducing plural nouns to their stems; the stemmed query keyword set is Q = {problem, speed, aircraft}.
Example 2: document set preprocessing. Assume a document set consisting of the following four documents:
D0="The main problem limiting the high velocity performance of helicopter is resistance"
D1="high altitude and high speed flying aircraft are often more slender shape"
D2="There are many airplanes in the sky that make up a row"
D3="whether to fly today is a problem"
All words in each string are found according to spaces and separators, stop words are removed and stemming is performed, forming the new document set as follows:
D0="problem,limit,velocity,performance,helicopter,resistance"
D1="altitude,speed,fly,aircraft,slender,shape"
D2="airplane,sky,row"
D3="fly,problem"
Expansion word candidate set selection:
(1) a Wikipedia corpus is selected, and a 200-dimensional word vector file is trained with the CBOW model provided by word2vec;
(2) after the word vectors are obtained, for each keyword in Q, the n most similar words are obtained by computing the cosine similarity of the word vectors and are used as the expansion word candidate set of the query.
For each keyword in the query Q = {problem, speed, aircraft}, the 10 semantically most related expansion words are selected using the trained word vectors; the resulting expansion word candidate sets are shown in FIG. 2.
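If the 200-dimensional vectors trained on Wikipedia have been saved in word2vec text format, the candidate sets of FIG. 2 would be produced along the following lines (the file name is hypothetical):

```python
from gensim.models import KeyedVectors

# "wiki_cbow_200d.txt" is a hypothetical name for the trained 200-dimensional vector file.
wv = KeyedVectors.load_word2vec_format("wiki_cbow_200d.txt", binary=False)

for keyword in ["problem", "speed", "aircraft"]:
    print(keyword, wv.most_similar(keyword, topn=10))    # 10 nearest words by cosine
```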
Constructing the expansion word list T:
(1) for each keyword q_i in Q, the query vector Q_{q_i} of Q oriented to q_i is generated according to the following formula:
[formula image: definition of the expansion-word-oriented query vector Q_{q_i}]
where vec(q_i) denotes the word vector of query term q_i, and sim(q_i, q_j) denotes the similarity between q_i and q_j.
(2) For each candidate expansion word t of q_i, the similarity of t to the query Q is calculated according to the following formula:
[formula image: similarity sim(t, Q) computed between vec(t) and Q_{q_i}]
(3) The similarity of the expansion words of each query word to the whole query Q is calculated according to this model, the expansion words are then re-ranked by similarity, and the k expansion words with the highest similarity are returned as the final expansion word set T;
(4) the expanded query Q_exp = Q ∪ T is generated.
Example:
(1) First, the 200-dimensional word vector of each keyword in the query Q is obtained from the trained word vectors:
vec(problem) = [0.29686138, 1.71120727, ..., -0.6585713, -1.86508703]
vec(speed) = [-2.00363445, 1.05960512, ..., -0.475373, -4.39991331]
vec(aircraft) = [-3.54158616, 3.28720021, ..., -2.34602952, -3.29022384]
Then the expansion-word-oriented query vector of each keyword in Q is calculated; the computation is as follows:
[computation images: Q_{problem}, Q_{speed} and Q_{aircraft}]
(2) Taking the keyword aircraft in the query Q as an example, i.e. q_3 = aircraft, the similarity of each of its candidate expansion words t to the query Q is calculated:
[computation images: similarity scores of the candidate expansion words of aircraft]
(3) By analogy, the similarity between each expansion word in FIG. 2 and the original query Q is calculated, the expansion words in the candidate set are then ranked by similarity, and the k expansion words most similar to the query Q are obtained; taking k = 4 as an example, the finally obtained expansion word list T is:
T = {helicopter, airplane, velocity, altitude}
(4) The query words and the expansion words are merged to obtain the expanded query Q_exp:
Q_exp = Q ∪ T
      = {problem, speed, aircraft} ∪ {helicopter, airplane, velocity, altitude}
      = {problem, speed, aircraft, helicopter, airplane, velocity, altitude}
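The re-ranking of steps (1) to (3) can be sketched as follows, again under the assumed form of the expansion-word-oriented query vector stated earlier in the description; the helper is illustrative, and whether it reproduces exactly the list T above depends on the trained vectors:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_expansion_list(Q, C, wv, k=4):
    """Re-rank all candidates in C against the expansion-word-oriented query vectors.

    Q  : query keywords, e.g. ["problem", "speed", "aircraft"]
    C  : dict mapping each keyword q_i to its candidate expansion words (FIG. 2)
    wv : mapping from word to trained vector (e.g. gensim KeyedVectors)
    """
    scored = {}
    for qi, candidates in C.items():
        # Assumed form: Q_{q_i} = sum_j sim(q_i, q_j) * vec(q_j), with sim(q_i, q_i) = 1.
        q_vec = sum(cosine(wv[qi], wv[qj]) * wv[qj] for qj in Q)
        for t in candidates:
            scored[t] = max(scored.get(t, -1.0), cosine(wv[t], q_vec))
    return sorted(scored, key=scored.get, reverse=True)[:k]

# T = build_expansion_list(["problem", "speed", "aircraft"], C, wv, k=4)
# Q_exp = ["problem", "speed", "aircraft"] + T
```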
The establishment of the document set inverted index comprises the following steps:
(1) the distinct terms in the preprocessed document set D are collected to generate a vocabulary V;
(2) for each term v in V, an inverted list is constructed, composed of the IDs d_id of all documents d containing v (where d ∈ D) and the number of occurrences tf_{v,d} of v in d, each item in the list being represented as a 2-tuple <d_id, tf_{v,d}>; the set of all inverted lists forms the inverted index set I;
(3) for each term v, the number m of documents in which it appears is counted, and the idf score of v is calculated according to the following formula:
[formula image: idf score of v in terms of |D| and m]
where |D| represents the total number of documents in D.
Example:
(1) After preprocessing by word segmentation, stop word removal and so on, the following document set D is obtained:
D0="problem,limit,velocity,performance,helicopter,resistance"
D1="altitude,speed,fly,aircraft,slender,shape"
D2="airplane,sky,row"
D3="fly,problem"
The distinct terms in D are collected to generate the vocabulary V:
V = {altitude, speed, fly, aircraft, slender, shape, problem, limit, velocity, performance, helicopter, resistance, airplane, sky, row}
(2) Taking the word velocity in the vocabulary V as an example, the document set D is traversed to find the document containing velocity, namely D0; its ID D0 is recorded and its number of occurrences in document D0 is counted, which is 1, so the inverted list of velocity is expressed as <D0, 1>. The inverted lists of all terms in V are built analogously, forming the inverted index set I;
(3) For each word v in V, the number m of documents in which it appears (i.e. the length of the inverted list of v) is counted, and the idf score is calculated:
if v = velocity, the length of its inverted list is 1, i.e. only one document in the document set contains velocity, so m = 1, and the idf score of velocity is calculated as:
[formula image: idf of velocity with |D| = 4 and m = 1]
The idf scores of all words are calculated in the same way and recorded in the index; the final inverted index set I is shown in FIG. 3.
Expansion retrieval:
(1) for Q_exp, the inverted index set I is queried to obtain the inverted lists corresponding to its keywords, and the set of these inverted lists is denoted I_{Q_exp};
(2) for each document d appearing in I_{Q_exp}, its tf-idf scores over the lists in I_{Q_exp} are accumulated to obtain the relevance R(Q_exp, d) between Q_exp and document d; the formula for calculating R(Q_exp, d) is as follows:
[formula image: R(Q_exp, d) as a λ-weighted accumulation of tf-idf scores]
where λ is an adjustment parameter controlling the relative weight of the query words and the expansion words when calculating the relevance;
(3) these documents are ranked by relevance, thereby returning the N documents most relevant to the original query.
Example:
(1) For the Q_exp generated above, the inverted index set of FIG. 3 is queried to obtain the inverted lists corresponding to all keywords in Q_exp, and their union I_{Q_exp} is taken:
I_{Q_exp} = I(problem) ∪ I(speed) ∪ ...... ∪ I(airplane) ∪ I(altitude)
          = {D0, D3} ∪ {D1} ∪ ...... ∪ {D2} ∪ {D1}
          = {D0, D1, D2, D3}
(2) For documents D0, D1, D2 and D3, the relevance R(Q_exp, d) of Q_exp to each document is calculated, with the adjustment parameter λ = 0.6; the computation is as follows:
[computation images: R(Q_exp, D0), R(Q_exp, D1), R(Q_exp, D2) and R(Q_exp, D3)]
(3) The documents are sorted by relevance, giving D1 > D0 > D2 > D3; if N = 3, documents D1, D0 and D2 are returned.
Example 2: a semantic query expansion device based on word2vec comprises the following components:
the query and document set preprocessing module is used for carrying out word segmentation, word stem removal, word stem reduction and other processing on the document set and the query submitted by the user to form a query Q and a document set D;
the expansion word candidate set selection module is used for calculating and acquiring n most similar terms of each query keyword by using a word vector trained based on a word2vec model for each keyword in the query Q to form an expansion word candidate set C;
the expansion word list construction module is used for calculating the similarity of each term in the expansion word candidate set and the whole query, and selecting some expansion words with higher similarity to construct an expansion word list T;
the document set reverse index module is used for establishing a reverse index for the document set D after the preprocessing;
and the extended retrieval module is used for calculating the relevance between the extended query and the documents in the corresponding inverted index to obtain the relevant documents.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (5)

1. A semantic query expansion method based on word2vec, characterized in that the method comprises the following steps:
(1) query and document preprocessing: segmenting the query submitted by the user, removing stop words, extracting the user's query keywords and stemming them to form a query Q; performing the same preprocessing on the document collection to obtain a document set D;
(2) expansion word candidate set selection: for the preprocessed query Q, obtaining the n terms most similar to each query keyword using word vectors trained with the word2vec model, to form an expansion word candidate set C;
this step specifically comprises:
given a corpus, training word vectors with the training model provided by word2vec, the word vectors being a set of multi-dimensional real-valued vectors that reflect the semantic and syntactic relations in natural language, so that the similarity between terms can be judged by computing the cosine between their word vectors;
after the word vectors are obtained, for each keyword q_i in Q, obtaining the n words most similar to q_i by computing the cosine similarity of the word vectors, these words forming the expansion word candidate set C of the query;
(3) expansion word list construction: calculating the similarity of each term in C to the whole query Q, and selecting the k expansion words with the highest similarity to construct an expansion word list T;
this step specifically comprises:
(a) for the query Q formed by the above processing, for each query word q_i in Q, generating the query vector of Q oriented to q_i, denoted Q_{q_i}, according to the following formula:
[formula image: definition of the expansion-word-oriented query vector Q_{q_i}]
where vec(q_i) denotes the word vector of query term q_i, and sim(q_i, q_j) denotes the similarity between q_i and q_j;
(b) for each candidate expansion word t of q_i, calculating the similarity of t to the query Q according to the following formula:
[formula image: similarity sim(t, Q) computed between vec(t) and Q_{q_i}]
for the candidate expansion words of different query words, different query vectors Q_{q_i} are used to calculate the similarity between the expansion word and the query Q, so this way of generating the query vector Q_{q_i} is called an expansion-word-oriented query vector generation method and, correspondingly, Q_{q_i} is called an expansion-word-oriented query vector;
(c) calculating, according to the above model, the similarity of the expansion words of each query word to the whole query Q, then re-ranking the expansion words by this similarity, and returning the k expansion words with the highest similarity as the final expansion word list T;
(d) generating the expanded query Q_exp = Q ∪ T;
(4) document set inverted index construction: building an inverted index for the preprocessed document set D;
(5) expansion retrieval: calculating the relevance between the expanded query Q_exp and the documents in the corresponding inverted lists, and ranking the documents by relevance.
2. The semantic query expansion method based on word2vec according to claim 1, characterized in that the query and document preprocessing step specifically comprises the following steps:
(1) performing word segmentation on the query submitted by the user at space and punctuation characters;
(2) removing stop words after segmentation, filtering out words that do not express concepts;
(3) after removing the stop words, performing stemming to generate the query Q;
(4) performing the same preprocessing on the document collection to generate a new document set D.
3. The semantic query expansion method based on word2vec according to claim 1, characterized in that the document set inverted index construction specifically comprises the following steps:
(1) collecting all words of the preprocessed document set D, removing duplicates, and generating a document word set V;
(2) for each term v in V, constructing an inverted list composed of the IDs d_id of all documents d containing v (where d ∈ D) and the number of occurrences tf_{v,d} of v in d, each item in the list being represented as a 2-tuple <d_id, tf_{v,d}>; the set of all inverted lists forms the inverted index set I;
(3) for each term v, counting the number m of documents in which it appears, and calculating the idf score of v according to the following formula:
[formula image: idf score of v in terms of |D| and m]
where |D| represents the total number of documents in D.
4. The semantic query expansion method based on word2vec according to claim 1, characterized in that the expansion retrieval specifically comprises the following steps:
(1) for Q_exp, querying the inverted index set I to obtain the inverted lists corresponding to its keywords, and denoting the set of these inverted lists as I_{Q_exp};
(2) for each document d appearing in I_{Q_exp}, accumulating its tf-idf scores over the lists in I_{Q_exp} to obtain the relevance R(Q_exp, d) between Q_exp and document d; the formula for calculating R(Q_exp, d) is as follows:
[formula image: R(Q_exp, d) as a λ-weighted accumulation of tf-idf scores]
where λ is an adjustment parameter controlling the relative weight of the query words and the expansion words when calculating the relevance;
(3) ranking these documents by relevance, thereby returning the N documents most relevant to the original query.
5. A semantic query expansion device based on word2vec, characterized by comprising:
a query and document set preprocessing module, used for performing word segmentation, stop word removal, stemming and other processing on the query submitted by the user and on the document collection, to form a query Q and a document set D;
an expansion word candidate set selection module, used for obtaining, for each keyword in the query Q, the n most similar terms using word vectors trained with the word2vec model, to form an expansion word candidate set C;
this specifically comprises:
given a corpus, training word vectors with the training model provided by word2vec, the word vectors being a set of multi-dimensional real-valued vectors that reflect the semantic and syntactic relations in natural language, so that the similarity between terms can be judged by computing the cosine between their word vectors;
after the word vectors are obtained, for each keyword q_i in Q, obtaining the n words most similar to q_i by computing the cosine similarity of the word vectors, these words forming the expansion word candidate set C of the query;
an expansion word list construction module, used for calculating the similarity of each term in the expansion word candidate set to the whole query and selecting the expansion words with the highest similarity to construct an expansion word list T;
this specifically comprises:
(a) for the query Q formed by the above processing, for each query word q_i in Q, generating the query vector of Q oriented to q_i, denoted Q_{q_i}, according to the following formula:
[formula image: definition of the expansion-word-oriented query vector Q_{q_i}]
where vec(q_i) denotes the word vector of query term q_i, and sim(q_i, q_j) denotes the similarity between q_i and q_j;
(b) for each candidate expansion word t of q_i, calculating the similarity of t to the query Q according to the following formula:
[formula image: similarity sim(t, Q) computed between vec(t) and Q_{q_i}]
for the candidate expansion words of different query words, different query vectors Q_{q_i} are used to calculate the similarity between the expansion word and the query Q, so this way of generating the query vector Q_{q_i} is called an expansion-word-oriented query vector generation method and, correspondingly, Q_{q_i} is called an expansion-word-oriented query vector;
(c) calculating, according to the above model, the similarity of the expansion words of each query word to the whole query Q, then re-ranking the expansion words by this similarity, and returning the k expansion words with the highest similarity as the final expansion word list T;
(d) generating the expanded query Q_exp = Q ∪ T;
a document set inverted index module, used for building an inverted index for the preprocessed document set D;
and an expansion retrieval module, used for calculating the relevance between the expanded query and the documents in the corresponding inverted lists to obtain the relevant documents.
CN201810179478.3A 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec Active CN108491462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810179478.3A CN108491462B (en) 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810179478.3A CN108491462B (en) 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec

Publications (2)

Publication Number Publication Date
CN108491462A CN108491462A (en) 2018-09-04
CN108491462B true CN108491462B (en) 2021-09-14

Family

ID=63341204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810179478.3A Active CN108491462B (en) 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec

Country Status (1)

Country Link
CN (1) CN108491462B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063203B (en) * 2018-09-14 2020-07-24 河海大学 Query term expansion method based on personalized model
CN109284397A (en) * 2018-09-27 2019-01-29 深圳大学 A kind of construction method of domain lexicon, device, equipment and storage medium
CN109446399A (en) * 2018-10-16 2019-03-08 北京信息科技大学 A kind of video display entity search method
CN109885766A (en) * 2019-02-11 2019-06-14 武汉理工大学 A kind of books recommended method and system based on book review
CN110008407B (en) * 2019-04-09 2021-05-04 苏州浪潮智能科技有限公司 Information retrieval method and device
CN110196977B (en) * 2019-05-31 2023-06-09 广西南宁市博睿通软件技术有限公司 Intelligent warning condition supervision processing system and method
CN110188204B (en) * 2019-06-11 2022-10-04 腾讯科技(深圳)有限公司 Extended corpus mining method and device, server and storage medium
CN110489526A (en) * 2019-08-13 2019-11-22 上海市儿童医院 A kind of term extended method, device and storage medium for medical retrieval
DE102019212421A1 (en) * 2019-08-20 2021-02-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method and device for identifying similar documents
CN110909116B (en) * 2019-11-28 2022-12-23 中国人民解放军军事科学院军事科学信息研究中心 Entity set expansion method and system for social media
CN111897928A (en) * 2020-08-04 2020-11-06 广西财经学院 Chinese query expansion method for embedding expansion words into query words and counting expansion word union
CN112199461B (en) * 2020-09-17 2022-05-31 暨南大学 Document retrieval method, device, medium and equipment based on block index structure
CN112836008B (en) * 2021-02-07 2023-03-21 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data
CN113033197A (en) * 2021-03-24 2021-06-25 中新国际联合研究院 Building construction contract rule query method and device
CN112949304A (en) * 2021-03-24 2021-06-11 中新国际联合研究院 Construction case knowledge reuse query method and device
CN114723008A (en) * 2022-04-01 2022-07-08 北京健康之家科技有限公司 Language representation model training method, device, equipment, medium and user response method
CN116340470B (en) * 2023-05-30 2023-09-15 环球数科集团有限公司 Keyword associated retrieval system based on AIGC

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
CN107391671A (en) * 2017-07-21 2017-11-24 华中科技大学 A kind of document leakage detection method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778161B (en) * 2015-04-30 2017-07-07 车智互联(北京)科技有限公司 Based on Word2Vec and Query log extracting keywords methods
EP3232336A4 (en) * 2015-12-01 2018-03-21 Huawei Technologies Co., Ltd. Method and device for recognizing stop word
US9798820B1 (en) * 2016-10-28 2017-10-24 Searchmetrics Gmbh Classification of keywords

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
CN107391671A (en) * 2017-07-21 2017-11-24 华中科技大学 A kind of document leakage detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Personalized Search Ranking Based on a User Interest Model; 徐康; China Master's Theses Full-text Database, Information Science and Technology; 20151015; pp. I138-583 *

Also Published As

Publication number Publication date
CN108491462A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
CN108491462B (en) Semantic query expansion method and device based on word2vec
CN109101479B (en) Clustering method and device for Chinese sentences
JP5203934B2 (en) Propose and refine user input based on original user input
CN103136352A (en) Full-text retrieval system based on two-level semantic analysis
CN108509521B (en) Image retrieval method for automatically generating text index
CN102214189B (en) Data mining-based word usage knowledge acquisition system and method
CN108920599B (en) Question-answering system answer accurate positioning and extraction method based on knowledge ontology base
CN111611356A (en) Information searching method and device, electronic equipment and readable storage medium
KR100396826B1 (en) Term-based cluster management system and method for query processing in information retrieval
CN112948543A (en) Multi-language multi-document abstract extraction method based on weighted TextRank
CN109614493B (en) Text abbreviation recognition method and system based on supervision word vector
Zhang et al. Research on keyword extraction of Word2vec model in Chinese corpus
CN111488429A (en) Short text clustering system based on search engine and short text clustering method thereof
Huang et al. An approach on Chinese microblog entity linking combining *** encyclopaedia and word2vec
da Silva et al. Query Expansion in Text Information Retrieval with Local Context and Distributional Model.
Pourvali A new graph based text segmentation using Wikipedia for automatic text summarization
Li et al. Complex query recognition based on dynamic learning mechanism
JP2023031294A (en) Computer-implemented method, computer program and computer system (specificity ranking of text elements and applications thereof)
CN111209737B (en) Method for screening out noise document and computer readable storage medium
Bradford Use of latent semantic indexing to identify name variants in large data collections
Berenguer et al. Towards a tabular open data search engine for public sector information
CN106708808B (en) Information mining method and device
Papagiannopoulou et al. Unsupervised keyphrase extraction based on outlier detection
Reddy et al. Cross lingual information retrieval using search engine and data mining
Rei et al. Parser lexicalisation through self-learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant