CN111897926A - Chinese query expansion method integrating deep learning and expansion word mining intersection - Google Patents
- Publication number
- CN111897926A CN111897926A CN202010774430.4A CN202010774430A CN111897926A CN 111897926 A CN111897926 A CN 111897926A CN 202010774430 A CN202010774430 A CN 202010774430A CN 111897926 A CN111897926 A CN 111897926A
- Authority
- CN
- China
- Prior art keywords
- word
- expansion
- chinese
- pseudo
- weight
- Prior art date
- Legal status (assumed by Google; not a legal conclusion)
- Withdrawn
Classifications
- G06F16/3338 — Query expansion
- G06F16/3325 — Reformulation based on results of preceding query
- G06F16/3334 — Selection or weighting of terms from queries, including natural language queries
- G06F16/3335 — Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/334 — Query execution
(all under G06F — Electric digital data processing; G06F16/00 — Information retrieval; G06F16/33 — Querying)
Abstract
The invention provides a Chinese query expansion method that fuses deep learning with expansion word mining by intersection. First, a deep learning tool performs word-embedding semantic learning training on the initial retrieval document set to obtain a word-embedding expansion word set rich in contextual semantic information. Next, a pseudo-relevance feedback expansion word mining method based on Copulas theory mines association rule patterns from the top-ranked pseudo-relevance feedback documents of the initial retrieval to obtain a rule expansion word set carrying statistically derived associations between feature words. Finally, the word-embedding expansion word set and the rule expansion word set are intersected to obtain the final expansion word set, improving the quality of the expansion words. By integrating deep learning with expansion word mining, the method extracts high-quality expansion words related to the original query, which mitigates query topic drift and word mismatch, improves text retrieval performance, and offers good application value and promotion prospects.
Description
Technical Field
The invention relates to a Chinese query expansion method integrating deep learning and expansion word mining, belonging to the technical field of information retrieval.
Background
Query expansion is one of the key technologies for addressing query topic drift and word mismatch in information retrieval. It modifies the weights of the original query or adds words related to it, producing a new, longer query that describes the semantics or topic implied by the original query more completely and accurately, compensates for deficiencies in the user's query, and improves the retrieval performance of an information retrieval system. The core problems of query expansion are the source of the expansion terms and the design of the expansion model. With the development of network technology and the arrival of the big data era, users' demands on information retrieval keep growing; accurately retrieving the required information from massive data has made query expansion a research hotspot in the field of information retrieval.
In recent decades, researchers have studied query expansion models from different perspectives and methods and obtained abundant results; relevance feedback expansion based on association pattern mining and the recently emerging query expansion based on deep learning have drawn particular attention from researchers at home and abroad. For example, Bouziri et al. proposed expansion word mining based on supervised learning (see: Bouziri A, Latiri C, Gaussier E, et al. Learning query expansion from association rules between terms [C]. Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), Lisbon, Portugal, 2015: 525-530.) and an expansion word mining method based on a learning-to-rank model. Kuzi et al. (see: Kuzi S, Shtok A, Kurland O. Query expansion using word embeddings [C]. Proceedings of the 25th ACM International Conference on Information and Knowledge Management. New York: ACM Press, 2016: 1929-1932.) applied the CBOW model of the deep learning tool Word2vec to train word vectors on the retrieval corpus and selected feature words semantically related to the query to realize query expansion.
Experimental results show that these query expansion methods are effective and improve information retrieval performance. However, existing query expansion methods have not fully solved query topic drift, word mismatch, and related technical problems in information retrieval. Moreover, expansion words derived from association patterns carry statistically derived feature word associations but lack the semantic information of the document context, while word-embedding expansion words derived from word-vector semantic learning training carry rich contextual semantics but no statistically derived feature word associations.
To exploit the complementary strengths of rule expansion words and word-embedding expansion words and offset their respective deficiencies, the invention integrates deep learning with Copulas-theory-based expansion word mining and proposes a Chinese query expansion method fusing the two by intersection. The method can alleviate query topic drift and word mismatch in text retrieval systems, improve text retrieval performance, and has good application value and broad promotion prospects.
Disclosure of Invention
The invention aims to provide a Chinese query expansion method fusing deep learning and expansion word mining by intersection for use in the field of information retrieval, such as practical Chinese search engines and web retrieval systems, to improve and enhance the query performance of information retrieval systems and reduce query topic drift and word mismatch.
The invention adopts the following specific technical scheme:
a Chinese query expansion method integrating deep learning and expanded word mining intersection comprises the following steps:
Step 1, search the Chinese document set with the original query to obtain the initial retrieval document set, and perform Chinese word segmentation and stop word removal preprocessing on the initial retrieval document set.
Step 2, performing word embedding semantic learning training on the initial inspection document set by using a deep learning tool to obtain a feature word and word embedding vector set, and specifically comprising the following steps:
(2.1) Perform word-embedding semantic learning training on the initial retrieval pseudo-relevance feedback document set using the Skip-gram model of Google's open-source word vector tool word2vec (see https://code.google.com/p/word2vec/) to obtain the word-embedding vector set of the initial retrieval document feature words.
(2.2) In the word-embedding vector set of the initial retrieval document feature words, compute the word-vector cosine similarity VCos(q_i, cet_j) between each query term q_i (q_i ∈ Q, where Q = (q_1, q_2, …, q_n) is the original query term set and 1 ≤ i ≤ n) and every word-embedding candidate expansion word (cet_1, cet_2, …, cet_m), as shown in formula (1), where 1 ≤ j ≤ m. The word-embedding candidate expansion words are the non-query terms in the word-embedding vector set.
In the formula (1), vcetjIndicating that the jth word is embedded into the candidate expansion word cetjWord vector value of, vqiRepresenting the ith query term qiThe word vector value of.
(2.3) Given a minimum vector cosine similarity threshold minqvcos, for each query term q_i extract the word-embedding candidate expansion words with VCos(q_i, cet_j) ≥ minqvcos as the word-embedding expansion words of q_i (q_i et_1, q_i et_2, …, q_i et_p1). Merge the word-embedding expansion words of all query terms q_1, q_2, …, q_n, remove duplicates, and obtain the final word-embedding expansion word set ET_WE (Expansion Term from Word Embedding) of the original query term set Q; compute the word-embedding expansion word weight w_WEET, then go to step 3. ET_WE is shown as formula (2):
The weight w_WEET of a word-embedding expansion word is the vector cosine similarity between the query term and the expansion word, as shown in formula (3); when a word appears repeatedly, its word-embedding expansion word weight equals the cumulative sum of its vector similarities.
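The selection and weighting of steps (2.2)–(2.3) can be sketched in Python. This is a minimal illustration with toy two-dimensional vectors; names such as `word_embedding_expansion` are illustrative, not from the patent:

```python
import math

def vcos(v1, v2):
    # Formula (1): cosine similarity of two word vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

def word_embedding_expansion(query_vecs, candidate_vecs, minqvcos):
    """Keep candidates with VCos >= minqvcos for some query term;
    a repeated word's weight is the cumulative sum of its similarities."""
    weights = {}
    for q, qv in query_vecs.items():
        for cet, cv in candidate_vecs.items():
            sim = vcos(qv, cv)
            if sim >= minqvcos:
                weights[cet] = weights.get(cet, 0.0) + sim
    return weights

# hypothetical toy vectors
queries = {"q1": [1.0, 0.0], "q2": [0.8, 0.6]}
cands = {"cet1": [0.9, 0.1], "cet2": [0.0, 1.0]}
et_we = word_embedding_expansion(queries, cands, minqvcos=0.7)
# cet1 passes the threshold for both q1 and q2, so its weight accumulates;
# cet2 falls below minqvcos for every query term and is discarded
```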
Step 3, adopting a Copulas theory-based pseudo-related feedback expansion word mining method to mine rule expansion words in the initial detection pseudo-related feedback document set, and establishing a rule expansion word set, wherein the method specifically comprises the following steps:
(3.1) Extract the top m documents of the initial retrieval document set to construct the initial retrieval pseudo-relevance feedback document set; perform Chinese word segmentation, Chinese stop word removal, and feature word extraction preprocessing on it; compute the feature word weights; and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
The invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see: Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern Information Retrieval. Translated by Wang Zhijin et al. China Machine Press, 2005: 21-22.) to compute the feature word weights.
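As a rough illustration of the TF-IDF weighting mentioned above (the patent does not specify the exact variant; the common tf × log(N/df) form is assumed here):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Per-document feature word weights: w(t, d) = tf(t, d) * log(N / df(t)).
    This is one common TF-IDF variant, assumed for illustration."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for d in docs:
        df.update(set(d))
    out = []
    for d in docs:
        tf = Counter(d)                  # raw term frequency in this doc
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out

# toy segmented pseudo-relevance feedback documents
docs = [["信息", "检索", "扩展"], ["信息", "查询"], ["扩展", "词"]]
w = tfidf_weights(docs)
```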
(3.2) Take the feature words in the Chinese feature word library as the 1_candidate item sets C1.
(3.3) Compute the Copulas-based support CSup(C1) of C1. If CSup(C1) ≥ the minimum support threshold ms, take C1 as a 1_frequent item set L1 and add it to the frequent item set collection FIS (Frequent ItemSet).
CSup (Copulas-based Support) denotes the support based on Copulas theory. CSup(C1) is calculated as shown in equation (4). In equation (4), C1_Count denotes the frequency with which the 1_candidate set C1 occurs in the pseudo-relevance feedback Chinese document library, AllDoc_Count denotes the total number of documents in that library, C1_Weight denotes the item set weight of C1 in that library, and AllItems_Weight denotes the cumulative weight sum of all Chinese feature words in that library.
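Equation (4) combines a frequency-based support and a weight-based support. The text names the four quantities but the combining copula itself is not reproduced here, so the sketch below uses the simple product copula C(u, v) = u·v purely as an illustrative assumption, not the patent's actual formula:

```python
def csup(count, all_doc_count, weight, all_items_weight):
    """Copulas-based support CSup, sketched with a product copula.
    ASSUMPTION: the marginal supports are combined as u * v; the patent's
    equation (4) may use a different copula."""
    u = count / all_doc_count        # frequency-based marginal support
    v = weight / all_items_weight    # weight-based marginal support
    return u * v

# toy numbers: the item set occurs in 6 of 20 feedback documents and
# carries weight 3.0 out of a total feature word weight of 40.0
s = csup(6, 20, 3.0, 40.0)   # 0.3 * 0.075 = 0.0225
```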
(3.4) Using the self-join method, join the (k−1)_frequent item sets L_{k−1} to derive the k_candidate sets C_k, where k ≥ 2.
The self-join method adopts the candidate set join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).
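The Apriori-style self-join of step (3.4) can be sketched as follows, with item sets kept as sorted tuples; the subset-pruning step is part of standard Apriori candidate generation:

```python
def self_join(frequent_km1):
    """Apriori candidate generation: join two (k-1)_frequent item sets
    that agree on their first k-2 items, then prune any candidate with
    an infrequent (k-1)-subset."""
    cands = set()
    items = sorted(frequent_km1)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if a[:-1] == b[:-1]:          # first k-2 items agree
                cands.add(a + (b[-1],))
    fk = set(frequent_km1)
    return {c for c in cands
            if all(c[:m] + c[m + 1:] in fk for m in range(len(c)))}

# toy 2_frequent item sets over items a, b, c
l2 = {("a", "b"), ("a", "c"), ("b", "c")}
c3 = self_join(l2)   # the single 3_candidate ("a", "b", "c")
```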
(3.5) When mining reaches a 2_candidate set C2: if C2 does not contain an original query term, delete C2; if C2 contains an original query term, retain it and go to step (3.6). When mining reaches a k_candidate set Ck with k ≥ 3, go directly to step (3.6).
(3.6) Compute the Copulas-based support CSup(Ck) of Ck. If CSup(Ck) ≥ ms, then Ck is a k_frequent item set Lk; add it to FIS and go to step (3.7). Otherwise go directly to step (3.7).
CSup(Ck) is calculated as shown in equation (5). In equation (5), Ck_Count denotes the frequency with which the k_candidate set Ck occurs in the pseudo-relevance feedback Chinese document library, and Ck_Weight denotes the item set weight of Ck in that library. AllDoc_Count and AllItems_Weight are defined as in equation (4).
(3.7) Increment k by 1 and return to step (3.4), continuing the steps in order until Lk is an empty set; frequent item set mining then finishes, and the process goes to step (3.8).
(3.8) Take a k_frequent item set Lk (k ≥ 2) out of FIS.
(3.9) Extract proper subset item sets Etj and Qi of Lk such that Qi ∩ Etj = ∅ and Qi ∪ Etj = Lk, where Etj is a proper subset item set containing no query terms, Qi is a proper subset item set containing query terms, and Q is the original query term set.
(3.10) Compute the Copulas-based confidence CConf(Qi→Etj) of the association rule Qi→Etj. If CConf(Qi→Etj) ≥ the minimum confidence threshold mc, add Qi→Etj to the association rule set AR (Association Rule). Then return to step (3.9), extract other proper subset item sets Etj and Qi from Lk, and proceed as before, looping until every proper subset item set of Lk has been taken exactly once. Then return to step (3.8), start a new round of association rule pattern mining by taking another Lk from FIS, and proceed as before, looping until every k_frequent item set Lk in FIS has been taken exactly once. Association rule pattern mining is then finished, and the process goes to step (3.11).
CConf (Copulas-based Confidence) denotes the confidence based on Copulas theory; CConf(Qi→Etj) is expressed by equation (6). In equation (6), Qi_Count denotes the frequency with which the proper subset item set Qi occurs in the pseudo-relevance feedback Chinese document library, Qi_Weight denotes the item set weight of Qi in that library, (Qi∪Etj)_Count denotes the frequency with which the item set (Qi∪Etj) occurs in that library, and (Qi∪Etj)_Weight denotes the item set weight of (Qi∪Etj) in that library. AllDoc_Count and AllItems_Weight are defined as in equation (4).
(3.11) Extract the association rule consequents Etj from the association rule set AR as rule expansion words to obtain the rule expansion word set ET_AR (Expansion Term from Association Rules); compute the rule expansion word weight wEt, then go to step 4.
The ET _ AR is shown as formula (7):
In formula (7), Ret_i denotes the i-th rule expansion word.
The rule expansion word weight wEtThe calculation formula is shown in formula (8):
in the formula (8), max () represents the maximum value of the confidence of the association rule, and when the same rule expansion word appears in a plurality of association rule patterns at the same time, the maximum value of the confidence is taken as the weight of the rule expansion word.
Step 4, performing intersection fusion on the rule expansion word set and the word embedding expansion word set to obtain a final expansion word, and realizing query expansion, wherein the specific steps are as follows:
(4.1) Perform an intersection operation on the rule expansion word set ET_AR and the word-embedding expansion word set ET_WE to obtain the final expansion word set ETS_Q (Expansion Term Set for Query Q) of the original query term set Q, and compute the final expansion word weight w(ETi).
The final expansion word set ETS_Q is calculated as shown in equation (9):

ETS_Q = ET_AR ∩ ET_WE    (9)
The final expansion word weight w(ETi) is calculated as shown in equation (10):

w(ETi) = wEt + wWEET    (10)
(4.2) Combine the final expansion words with the original query into a new query and search the Chinese documents again, realizing query expansion.
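Step 4 reduces to a dictionary intersection with additive weights, per equations (9) and (10); a minimal sketch with hypothetical word sets:

```python
def fuse(et_ar, et_we):
    """Equation (9): ETS_Q = ET_AR ∩ ET_WE;
    equation (10): w(ETi) = wEt + wWEET for each surviving word."""
    return {t: et_ar[t] + et_we[t] for t in et_ar.keys() & et_we.keys()}

et_ar = {"检索": 0.8, "文档": 0.6}   # rule words with rule confidences
et_we = {"检索": 0.9, "语义": 0.7}   # embedding words with cosine weights
ets_q = fuse(et_ar, et_we)          # only "检索" survives, weight 0.8 + 0.9
```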
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a Chinese query expansion method fusing deep learning and expansion word mining by intersection. A deep learning tool performs word-embedding semantic learning training on the initial retrieval document set to obtain a word-embedding expansion word set rich in contextual semantic information; a Copulas-theory-based pseudo-relevance feedback expansion word mining method mines association rule patterns from the top-ranked pseudo-relevance feedback documents to obtain a rule expansion word set carrying statistically derived feature word associations; and the two sets are intersected to obtain the final expansion word set, improving expansion word quality. Experimental results show that the method suppresses query topic drift and word mismatch, improves information retrieval performance, achieves higher retrieval performance than comparable recent methods, and has good application value and promotion prospects.
(2) Four similar query expansion methods from recent years were selected as comparison methods, with the Chinese corpus of the international standard dataset NTCIR-5 CLIR as experimental data. The results show that, compared with baseline retrieval, the MAP of the proposed method improves by up to 27.87% on average, a larger average improvement than that of the comparison methods (18.21%). The effect is significant, indicating that the method's retrieval performance surpasses both the baseline and the comparison methods; it can improve information retrieval performance, reduce query drift and word mismatch, and has high application value and broad promotion prospects.
Drawings
FIG. 1 is a general flow diagram of the method for expanding Chinese queries by merging deep learning and expanded word mining intersections according to the present invention.
Detailed Description
Firstly, in order to better explain the technical scheme of the invention, the related concepts related to the invention are introduced as follows:
1. item set
In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an item set, and the number of items in an item set is called the item set length. A k_item set is an item set containing k items, k being its length.
2. Antecedent and consequent of an association rule
Let x and y be arbitrary feature word item sets. An implication of the form x → y is called an association rule, where x is called the rule antecedent and y the rule consequent.
3. Rule expansion word
A rule expansion word is an expansion word drawn from the consequent item set of an association rule whose antecedent item set is the original query term set.
4. Rule expansion word weight calculation
The confidence of an association rule whose antecedent item set consists of the original query terms is taken as the rule expansion word weight wEt.
The expansion word weight wEt is calculated as shown in formula (11). In formula (11), for the association rule Qi→Etj, Qi is an item set containing query terms (the rule antecedent) and Etj is an item set containing expansion words but no query terms (the rule consequent); AllDoc_Count denotes the total number of documents in the pseudo-relevance feedback Chinese document library; AllItems_Weight denotes the cumulative weight sum of all Chinese feature words in that library; Qi_Count denotes the frequency with which the item set Qi occurs in that library; Qi_Weight denotes the item set weight of Qi; (Qi∪Etj)_Count denotes the frequency with which the item set (Qi∪Etj) occurs in that library; and (Qi∪Etj)_Weight denotes the item set weight of (Qi∪Etj). max() denotes the maximum of the association rule confidences; when the same rule expansion word appears in several association rule patterns simultaneously, the maximum confidence is taken as its weight.
5. Word embedding expansion word and weight thereof
Word-embedding expansion words are derived from the word-embedding vector set, as follows. In the word-embedding vector set, compute the word-vector cosine similarity VCos(q_i, cet_j) between each query term q_i of the original query term set Q (q_i ∈ Q, Q = (q_1, q_2, …, q_n), 1 ≤ i ≤ n) and every word-embedding candidate expansion word (cet_1, cet_2, …, cet_m). Given a minimum similarity threshold minqvcos, extract for each query term q_i the candidates whose word-vector cosine similarity is no less than minqvcos as the word-embedding expansion words of q_i (q_i et_1, q_i et_2, …, q_i et_p1); merge the word-embedding expansion words of all query terms q_1, q_2, …, q_n and remove duplicates to obtain the final word-embedding expansion word set ET_WE (Expansion Term from Word Embedding) of the original query term set Q. The word-embedding candidate expansion words are the non-query terms in the word-embedding vector set.
The word embedding extension word ET _ WE is shown as equation (12):
The vector cosine similarity VCos(q_i, cet_j) is calculated as shown in equation (13):

VCos(q_i, cet_j) = (v_{q_i} · v_{cet_j}) / (‖v_{q_i}‖ ‖v_{cet_j}‖)    (13)

In equation (13), v_{cet_j} denotes the word vector of the j-th word-embedding candidate expansion word cet_j, and v_{q_i} denotes the word vector of the i-th query term q_i.
The vector cosine similarity between the query term and the word-embedding expansion word is taken as the word-embedding expansion word weight. The weight w_WEET is shown in formula (14); when a word appears repeatedly, the word-embedding expansion word weight equals the cumulative sum of its vector similarities.
6. Support degree and confidence degree based on Copulas theory
Copula function theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8: 229-231.) describes the dependence between variables and can combine and connect distributions of arbitrary form into an effective multivariate distribution function. Drawing on Copulas function theory, the invention proposes the Copulas-based support CSup (Copulas-based Support) and the Copulas-based confidence CConf (Copulas-based Confidence), described in detail below.
The Copulas-based support CSup(T1∪T2) of a feature word item set (T1∪T2) is calculated as shown in equation (15). In equation (15), (T1∪T2)_Count denotes the frequency with which the item set (T1∪T2) occurs in the pseudo-relevance feedback Chinese document library, (T1∪T2)_Weight denotes the item set weight of (T1∪T2) in that library, AllDoc_Count denotes the total number of documents in that library, and AllItems_Weight denotes the cumulative weight sum of all Chinese feature words in that library.
The Copulas-based confidence CConf(T1→T2) of an association rule (T1→T2) is calculated as shown in equation (16). In equation (16), T1_Count denotes the frequency with which the item set T1 occurs in the pseudo-relevance feedback Chinese document library, T1_Weight denotes the item set weight of T1 in that library, (T1∪T2)_Count and (T1∪T2)_Weight denote the frequency and item set weight of (T1∪T2) in that library, and AllDoc_Count and AllItems_Weight are defined as in equation (15).
The invention is further explained below by referring to the drawings and specific comparative experiments.
As shown in FIG. 1, the method for expanding Chinese queries by fusion of deep learning and expanded word mining intersection of the present invention comprises the following steps:
Step 1, search the Chinese document set with the original query to obtain the initial retrieval document set, and perform Chinese word segmentation and stop word removal preprocessing on the initial retrieval document set.
Step 2, performing word embedding semantic learning training on the initial inspection document set by using a deep learning tool to obtain a feature word and word embedding vector set, and specifically comprising the following steps:
and (2.1) performing word embedding semantic learning training on the primary detection pseudo-related feedback document set by adopting a Skip-gram model of a deep learning tool Google open source word vector tool word2vec to obtain a word embedding vector set of the primary detection document characteristic words.
(2.2) In the word-embedding vector set of the initial retrieval document feature words, compute the word-vector cosine similarity VCos(q_i, cet_j) between each query term q_i (q_i ∈ Q, where Q = (q_1, q_2, …, q_n) is the original query term set and 1 ≤ i ≤ n) and every word-embedding candidate expansion word (cet_1, cet_2, …, cet_m), as shown in formula (1), where 1 ≤ j ≤ m. The word-embedding candidate expansion words are the non-query terms in the word-embedding vector set.
In formula (1), v_{cet_j} denotes the word vector of the j-th word-embedding candidate expansion word cet_j, and v_{q_i} denotes the word vector of the i-th query term q_i.
(2.3) Given a minimum vector cosine similarity threshold minqvcos, for each query term q_i extract the word-embedding candidate expansion words with VCos(q_i, cet_j) ≥ minqvcos as the word-embedding expansion words of q_i (q_i et_1, q_i et_2, …, q_i et_p1). Merge the word-embedding expansion words of all query terms q_1, q_2, …, q_n, remove duplicates, and obtain the final word-embedding expansion word set ET_WE (Expansion Term from Word Embedding) of the original query term set Q; compute the word-embedding expansion word weight w_WEET, then go to step 3. ET_WE is shown as formula (2):
The weight w_WEET of a word-embedding expansion word is the vector cosine similarity between the query term and the expansion word; when a word appears repeatedly, its word-embedding expansion word weight equals the cumulative sum of its vector similarities.
Step 3, adopting a Copulas theory-based pseudo-related feedback expansion word mining method to mine rule expansion words in the initial detection pseudo-related feedback document set, and establishing a rule expansion word set, wherein the method specifically comprises the following steps:
(3.1) Extract the top m documents of the initial retrieval document set to construct the initial retrieval pseudo-relevance feedback document set; perform Chinese word segmentation, Chinese stop word removal, and feature word extraction preprocessing on it; compute the feature word weights; and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
The invention adopts TF-IDF weighting technology to calculate the weight of the feature words.
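As a minimal sketch of the TF-IDF weighting step, assuming the classic tf(t, d) · log(N / df(t)) variant (the text names TF-IDF but does not fix an exact formula):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    # docs: list of token lists (the pseudo-relevance feedback library after
    # Chinese word segmentation and stop word removal).
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency of each feature word
    weights = []
    for d in docs:
        tf = Counter(d)
        # tf * idf; words appearing in every document get weight 0.
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["查询", "扩展", "扩展"], ["查询", "检索"]]
w = tfidf_weights(docs)
```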
(3.2) Take the feature words in the Chinese feature word library as the 1_candidate itemsets C1.
(3.3) Compute the Copulas-theory-based support CSup(C1) of C1; if CSup(C1) ≥ the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS (Frequent ItemSet).
CSup (Copulas-based Support) denotes support based on Copulas theory. CSup(C1) is calculated as shown in formula (4):
In formula (4), C1_Count denotes the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, AllDoc(count) denotes the total number of documents in that library, C1_Weight denotes the itemset weight of C1 in the library, and AllItems(weight) denotes the cumulative sum of the weights of all Chinese feature words in the library.
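Since the body of formula (4) is not reproduced in this text, the following sketch only fixes the two components it names, the frequency ratio C1_Count / AllDoc(count) and the weight ratio C1_Weight / AllItems(weight), and combines them through a pluggable copula function. The product (independence) copula used as the default is an assumption, not the patent's actual formula.

```python
def csup(count, all_doc, weight, all_items, copula=lambda u, v: u * v):
    # Copulas-based support CSup. The combining copula is pluggable because
    # the equation body is missing here; C(u, v) = u * v is a placeholder.
    u = count / all_doc      # frequency-based support component
    v = weight / all_items   # weight-based support component
    return copula(u, v)

# Itemset seen in 3 of 10 feedback documents, carrying 1.5 of 6.0 total weight.
s = csup(count=3, all_doc=10, weight=1.5, all_items=6.0)
```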
(3.4) Use the self-join method to join the (k−1)_frequent itemsets Lk-1 and derive the k_candidate itemsets Ck, where k ≥ 2.
The self-join method uses the candidate itemset join method given in the Apriori algorithm.
(3.5) When mining reaches a 2_candidate itemset C2: if C2 does not contain an original query term, delete C2; if C2 contains an original query term, retain it and proceed to step (3.6). When mining reaches a k_candidate itemset Ck with k ≥ 3, proceed directly to step (3.6).
(3.6) Compute the Copulas-theory-based support CSup(Ck) of Ck; if CSup(Ck) ≥ ms, then Ck is a k_frequent itemset Lk: add it to FIS and proceed to step (3.7); otherwise proceed directly to step (3.7).
CSup(Ck) is calculated as shown in formula (5):
In formula (5), Ck_Count denotes the frequency with which the k_candidate itemset Ck occurs in the pseudo-relevance feedback Chinese document library, and Ck_Weight denotes the itemset weight of Ck in that library. AllDoc(count) and AllItems(weight) are defined as in formula (4).
(3.7) Increment k by 1 and return to step (3.4) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then finishes, and the procedure moves to step (3.8).
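Steps (3.2)–(3.7) can be sketched as an Apriori-style level-wise loop. How the itemset weight is aggregated, and which copula combines the two support components, are not spelled out in this text, so both choices below are labeled assumptions in the code.

```python
from itertools import combinations

def mine_frequent_itemsets(docs, weights, query_terms, ms):
    # docs: list of frozensets of feature words; weights: per-document
    # feature word weight dicts; query_terms: original query term set.
    all_doc = len(docs)
    all_items = sum(sum(w.values()) for w in weights)

    def csup(itemset):
        # Assumption: itemset weight = summed weights of the itemset's words
        # over the documents containing it; copula = product copula.
        hits = [i for i, d in enumerate(docs) if itemset <= d]
        count = len(hits)
        weight = sum(weights[i][t] for i in hits for t in itemset)
        return (count / all_doc) * (weight / all_items)

    # 1_candidates -> 1_frequent itemsets L1 (steps 3.2-3.3).
    level = [frozenset([t]) for t in {t for d in docs for t in d}]
    level = [c for c in level if csup(c) >= ms]
    fis = list(level)
    k = 2
    while level:
        # Simplified self-join: unions of (k-1)_frequent itemset pairs (3.4).
        cands = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        if k == 2:
            # Step (3.5): a 2_candidate must contain an original query term.
            cands = {c for c in cands if c & query_terms}
        level = [c for c in cands if csup(c) >= ms]  # step (3.6)
        fis.extend(level)
        k += 1  # step (3.7)
    return fis

docs = [frozenset({"q", "a"}), frozenset({"q", "a"}), frozenset({"b"})]
weights = [{"q": 1.0, "a": 1.0}, {"q": 1.0, "a": 1.0}, {"b": 1.0}]
fis = mine_frequent_itemsets(docs, weights, frozenset({"q"}), ms=0.1)
```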
(3.8) taking out k _ frequent item set L from FISkAnd k is more than or equal to 2.
(3.9) extraction of LkIs set of proper subset entries EtjAnd QiAnd is andQi∪Etj=Lk,et (E) describedjFor a proper subset of terms set without query terms, said QiThe method comprises the steps of setting a proper subset item set containing query terms, wherein Q is an original query term set.
(3.10) calculate association rule Q based on Copulas theoryi→EtjConfidence of (CConf) (Q)i→Etj) If CConf (Q)i→ETj) If the confidence coefficient is more than or equal to the minimum confidence coefficient threshold mc, Q is addedi→EtjAdding into the association rule set AR (Association rule), then, proceeding to step (3.9), from LkTo re-extract the other proper subset item sets EtjAnd QiSequentially proceeding the next steps, and circulating the steps until LkIf and only if all proper subset entries in the set are retrieved once, then proceed to step (3.8), perform a new round of association rule pattern mining, and retrieve any other L from the FISkThen, the subsequent steps are performed sequentially, and the process is circulated until all k _ frequent item sets L in the FISkIf and only if all are taken out once, then the association rule pattern mining is finished, and the process goes to the following step (3.11).
CConf (Copulas-based Confidence) denotes confidence based on Copulas theory. CConf(Qi→Etj) is given by formula (6):
In formula (6), Qi_Count denotes the frequency with which the proper subset itemset Qi occurs in the pseudo-relevance feedback Chinese document library, Qi_Weight denotes the itemset weight of Qi in that library, (Qi∪Etj)_Count denotes the frequency with which the itemset (Qi∪Etj) occurs in the library, and (Qi∪Etj)_Weight denotes the itemset weight of (Qi∪Etj) in the library. AllDoc(count) and AllItems(weight) are defined as in formula (4).
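In the same hedged spirit as the support sketch, one possible reading of formula (6) divides a combined (count, weight) measure of Qi ∪ Etj by that of Qi. This is an assumption only, since the formula body is not reproduced in the text; with the product copula the AllDoc/AllItems normalizers cancel in the ratio, so raw counts and weights are used here.

```python
def cconf(q_count, q_weight, qe_count, qe_weight, copula=lambda u, v: u * v):
    # Copulas-based confidence CConf(Qi -> Etj), sketched as the ratio of the
    # combined measure of (Qi U Etj) to that of Qi. Assumption, not formula (6).
    num = copula(qe_count, qe_weight)
    den = copula(q_count, q_weight)
    return num / den if den else 0.0

# Qi occurs in 4 documents with itemset weight 2.0; Qi U Etj in 2 with weight 1.0.
c = cconf(q_count=4, q_weight=2.0, qe_count=2, qe_weight=1.0)
```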
(3.11) Extract the association rule consequents Etj from the association rule set AR as rule expansion words to obtain the rule expansion word set ET_AR (Expansion Terms from Association Rules), compute the rule expansion word weight wEt, and then proceed to step 4.
The ET _ AR is shown as formula (7):
In formula (7), Reti denotes the i-th rule expansion word.
The rule expansion word weight wEt is calculated as shown in formula (8):
In formula (8), max() takes the maximum confidence over the association rules: when the same rule expansion word appears in several association rule patterns simultaneously, the maximum confidence is taken as that rule expansion word's weight.
Step 4, performing intersection fusion on the rule expansion word set and the word embedding expansion word set to obtain a final expansion word, and realizing query expansion, wherein the specific steps are as follows:
(4.1) Perform the intersection operation on the rule expansion word set ET_AR and the word embedding expansion word set ET_WE to obtain the final expansion word set ETS_Q (Expansion Term Set for Query Q) of the original query term set Q, and compute the final expansion word weight w(ETi).
The final extended word set ETS _ Q is calculated as shown in equation (9):
The final expansion word weight w(ETi) is calculated as shown in formula (10):
w(ETi) = wEt + wWEET    (10)
(4.2) Combine the final expansion words with the original query into a new query and retrieve the Chinese documents again, thereby realizing query expansion.
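Step 4's intersection fusion and weight formula (10) reduce to a few lines; the input dictionaries below are illustrative values, not results from the patent.

```python
def fuse(et_ar, et_we):
    # Formula (9)/(10): the final expansion word set is the intersection of the
    # rule expansion words and the word embedding expansion words, and the
    # final weight is w(ETi) = wEt + wWEET.
    return {t: et_ar[t] + et_we[t] for t in et_ar.keys() & et_we.keys()}

et_ar = {"ret1": 0.8, "ret2": 0.5}   # rule expansion words with wEt
et_we = {"ret1": 1.2, "cet9": 0.9}   # word embedding expansion words with wWEET
final = fuse(et_ar, et_we)
```

Only "ret1" survives the intersection, with weight 0.8 + 1.2 = 2.0.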
Experimental design and results:
We compared the method of the present invention with similar prior-art methods in retrieval experiments to demonstrate its effectiveness.
1. Experimental environment and experimental data:
The experimental corpus is the NTCIR-5 CLIR Chinese text standard corpus (see http://research.nii.ac.jp/ntcir/data/data-en.html), containing 901,446 traditional Chinese documents (converted to simplified Chinese for the experiments) distributed over 8 data sets as shown in Table 1. The NTCIR-5 CLIR corpus provides 50 Chinese queries, 4 types of query topics, and result sets under 2 evaluation standards, namely Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant, and partially relevant to the query). The retrieval experiments use the Description (Desc for short, a long query) and Title (a short query) query topics. The evaluation index for the retrieval results is MAP (Mean Average Precision).
TABLE 1 original corpus and its quantity
2. The reference retrieval and comparison method comprises the following steps:
the experimental basic retrieval environment is built by Lucene.
The reference retrieval is a retrieval result obtained by submitting an original query to Lucene.
The comparative method is described as follows:
Comparative method 1: the word-vector-based query expansion method of the literature (see: Kan Linyuan, Ko Jiu, et al. Study of a word vector method for patent query expansion [J]. Computer Science and Exploration, 2018, 12(6): 972-980.), with parameters α = 0.1 and k = 60.
Comparative method 2: rule expansion words mined with the weighted association pattern mining technique of the literature (see: Huang Mingxuan. Chinese-English cross-language query expansion [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-.), with parameters c = 0.1 and mi = 0.0001; the experimental results are the averages over ms = 0.004, 0.005, 0.006, and 0.007.
The Skip-gram model word embedding semantic learning training parameters used by the invention are: batch_size = 128, embedding_size = 300, skip_window = 2, num_skip = 4, num_sampled = 64.
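These parameter names match the TensorFlow word2vec tutorial implementation. A roughly equivalent gensim configuration is sketched below, with the caveat that the mapping (especially batch_size to batch_words, and num_skip, which has no gensim counterpart) is an assumption.

```python
# Assumed mapping of the reported skip-gram hyperparameters onto gensim.
gensim_params = dict(
    vector_size=300,   # embedding_size = 300
    window=2,          # skip_window = 2
    sg=1,              # Skip-gram model (not CBOW)
    negative=64,       # num_sampled = 64 (negative sampling)
    batch_words=128,   # batch_size = 128 (approximate correspondence)
)

def train_skipgram(tokenized_sentences):
    # tokenized_sentences: lists of words from the initially retrieved
    # documents after Chinese word segmentation.
    from gensim.models import Word2Vec  # optional dependency
    return Word2Vec(sentences=tokenized_sentences, min_count=1, **gensim_params)
```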
3. The experimental methods and results are as follows:
The MAP averages obtained by the method of the invention are shown in Tables 2 and 3, where the average amplification (%) is the overall average improvement of the invention's retrieval results over the 8 data sets relative to the baseline retrieval and the comparative expansion methods.
The average amplification is computed as follows: first compute, on each data set, the improvement of the invention's retrieval result relative to the baseline retrieval and comparative expansion methods; then sum the improvements over the data sets and divide by 8, obtaining the overall average improvement of the invention's retrieval results relative to the other methods.
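The average amplification computation described above is, in code (the MAP values are illustrative, not the patent's results):

```python
def average_amplification(ours, baseline):
    # Per-dataset amplification (%) = (our MAP - baseline MAP) / baseline MAP * 100;
    # the total average amplification sums these over the data sets and divides
    # by the number of data sets (8 in the experiments above).
    pct = [(o - b) / b * 100 for o, b in zip(ours, baseline)]
    return sum(pct) / len(pct)

avg = average_amplification([0.30, 0.24], [0.25, 0.20])
```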
TABLE 2 MAP value of search performance (Title query) for the method of the present invention and the reference search and comparison method
TABLE 3 Search performance MAP values of the method of the present invention and the reference retrieval and comparison methods (Desc query)
Tables 2 and 3 show that, compared with the reference retrieval, the MAP values of the retrieval results of the method of the invention improve markedly, with an average amplification of 27.87%; the MAP values of the method are also mostly higher than those of the comparison methods, indicating that its expansion retrieval performance exceeds both the reference retrieval and the similar comparison methods. The experimental results show that the method is effective and can genuinely improve information retrieval performance, with high application value and broad prospects for adoption.
Claims (7)
1. A Chinese query expansion method integrating deep learning and expanded word mining intersection is characterized by comprising the following steps:
step 1, searching a Chinese document set for original query to obtain a primary check document set, and performing Chinese word segmentation and stop word removal preprocessing on the primary check document set;
step 2, performing word embedding semantic learning training on the initial inspection document set by using a deep learning tool to obtain a feature word and word embedding vector set, and specifically comprising the following steps:
(2.1) performing word embedding semantic learning training on the primary detection pseudo-related feedback document set by adopting a deep learning tool to obtain a word embedding vector set of the primary detection document feature words;
(2.2) in the word embedding vector set of the feature words of the initially retrieved documents, calculating the word vector cosine similarity VCos(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = {q1, q2, …, qn}, 1 ≤ i ≤ n) and every word embedding candidate expansion word (cet1, cet2, …, cetm), where 1 ≤ j ≤ m; the word embedding candidate expansion words are the non-query terms in the word embedding vector set;
(2.3) given a minimum vector cosine similarity threshold minqvcos, extracting for each query term qi the word embedding candidate expansion words whose VCos(qi, cetj) ≥ minqvcos as the word embedding expansion words of qi (qiet1, qiet2, …, qietp1), merging the word embedding expansion words of all query terms q1, q2, …, qn and removing duplicates to obtain the final word embedding expansion word set ET_WE of the original query term set Q, calculating the word embedding expansion word weight wWEET, and then proceeding to step 3;
step 3, adopting a Copulas theory-based pseudo-related feedback expansion word mining method to mine rule expansion words in the initial detection pseudo-related feedback document set, and establishing a rule expansion word set, wherein the method specifically comprises the following steps:
(3.1) extracting the first m documents from the initially retrieved document set to construct the pseudo-relevance feedback document set, performing Chinese word segmentation, Chinese stop word removal and feature word extraction preprocessing on it, calculating the feature word weights, and finally constructing the pseudo-relevance feedback Chinese document library and the Chinese feature word library;
(3.2) taking the feature words in the Chinese feature word library as the 1_candidate itemsets C1;
(3.3) calculating the Copulas-theory-based support CSup(C1) of C1; if CSup(C1) ≥ the minimum support threshold ms, taking C1 as a 1_frequent itemset L1 and adding it to the frequent itemset set FIS;
(3.4) using the self-join method to join the (k−1)_frequent itemsets Lk-1 and derive the k_candidate itemsets Ck, where k ≥ 2;
(3.5) when mining reaches a 2_candidate itemset C2: if C2 does not contain an original query term, deleting C2; if C2 contains an original query term, retaining it and proceeding to step (3.6); when mining reaches a k_candidate itemset Ck with k ≥ 3, proceeding directly to step (3.6);
(3.6) calculating the Copulas-theory-based support CSup(Ck) of Ck; if CSup(Ck) ≥ ms, then Ck is a k_frequent itemset Lk: adding it to FIS and proceeding to step (3.7); otherwise proceeding directly to step (3.7);
(3.7) incrementing k by 1 and returning to step (3.4) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then finishes and the procedure moves to step (3.8);
(3.8) taking a k_frequent itemset Lk out of FIS, where k ≥ 2;
(3.9) extracting the proper subset itemsets Etj and Qi of Lk, with Qi ∩ Etj = ∅ and Qi ∪ Etj = Lk, where Etj is a proper subset itemset containing no query terms, Qi is a proper subset itemset containing query terms, and Q is the original query term set;
(3.10) calculating the Copulas-theory-based confidence CConf(Qi→Etj) of the association rule Qi→Etj; if CConf(Qi→Etj) ≥ the minimum confidence threshold mc, adding Qi→Etj to the association rule set AR; then returning to step (3.9), extracting another pair of proper subset itemsets Etj and Qi from Lk and carrying out the subsequent steps in turn, repeating until each proper subset itemset of Lk has been taken out exactly once; then returning to step (3.8) for a new round of association rule pattern mining, taking another Lk out of FIS and carrying out the subsequent steps in turn, repeating until every k_frequent itemset Lk in FIS has been taken out exactly once, at which point association rule pattern mining finishes and the procedure moves to step (3.11);
(3.11) extracting the association rule consequents Etj from the association rule set AR as rule expansion words to obtain the rule expansion word set ET_AR, calculating the rule expansion word weight wEt, and then proceeding to step 4;
step 4, performing intersection fusion on the rule expansion word set and the word embedding expansion word set to obtain a final expansion word, and realizing query expansion, wherein the specific steps are as follows:
(4.1) performing the intersection operation on the rule expansion word set ET_AR and the word embedding expansion word set ET_WE to obtain the final expansion word set ETS_Q of the original query term set Q, and calculating the final expansion word weight w(ETi);
And (4.2) combining the final expansion word with the original query into a new query, and searching the Chinese document again to realize query expansion.
2. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein:
in the step (2.2), the word vector cosine similarity VCos(qi, cetj) is calculated according to formula (1):
in formula (1), vcetj denotes the word vector value of the j-th word embedding candidate expansion word cetj, and vqi denotes the word vector value of the i-th query term qi;
in the step (2.3), the final word embedding expansion word set ET_WE of the original query term set Q is shown in formula (2);
the word embedding expansion word weight wWEET is the vector cosine similarity between the query term and the word embedding expansion word, as shown in formula (3); when a duplicated word occurs, the weight of that word embedding expansion word equals the cumulative sum of the duplicated word's vector similarities;
3. the method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein:
in the step (3.3), CSup(C1) is calculated as shown in formula (4):
in formula (4), C1_Count denotes the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, AllDoc(count) denotes the total number of documents in that library, C1_Weight denotes the itemset weight of C1 in the library, and AllItems(weight) denotes the cumulative sum of the weights of all Chinese feature words in the library;
in the step (3.6), CSup(Ck) is calculated as shown in formula (5):
in formula (5), Ck_Count denotes the frequency with which the k_candidate itemset Ck occurs in the pseudo-relevance feedback Chinese document library, and Ck_Weight denotes the itemset weight of Ck in that library; AllDoc(count) and AllItems(weight) are defined as in formula (4);
in the step (3.10), CConf(Qi→Etj) is given by formula (6):
in formula (6), Qi_Count denotes the frequency with which the proper subset itemset Qi occurs in the pseudo-relevance feedback Chinese document library, Qi_Weight denotes the itemset weight of Qi in that library, (Qi∪Etj)_Count denotes the frequency with which the itemset (Qi∪Etj) occurs in the library, and (Qi∪Etj)_Weight denotes the itemset weight of (Qi∪Etj) in the library; AllDoc(count) and AllItems(weight) are defined as in formula (4);
in the step (3.11), ET_AR is shown in formula (7):
in formula (7), Reti denotes the i-th rule expansion word;
the rule expansion word weight wEt is calculated as shown in formula (8):
in formula (8), max() takes the maximum confidence over the association rules: when the same rule expansion word appears in several association rule patterns simultaneously, the maximum confidence is taken as that rule expansion word's weight.
4. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein:
in the step (4.1), the final extended word set ETS _ Q is calculated as shown in equation (9):
the final expansion word weight w(ETi) is calculated as shown in formula (10):
w(ETi) = wEt + wWEET    (10).
5. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein: in the step (2.1), the deep learning tool adopts the Skip-gram model of Google's open-source word vector tool word2vec.
6. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein: in the step (3.1), a TF-IDF weighting technology is adopted to calculate the weight of the feature words.
7. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein: in the step (3.4), the self-join method adopts the candidate itemset join method given in the Apriori algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010774430.4A CN111897926A (en) | 2020-08-04 | 2020-08-04 | Chinese query expansion method integrating deep learning and expansion word mining intersection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111897926A true CN111897926A (en) | 2020-11-06 |
Family
ID=73245586
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010774430.4A Withdrawn CN111897926A (en) | 2020-08-04 | 2020-08-04 | Chinese query expansion method integrating deep learning and expansion word mining intersection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111897926A (en) |
Cited By (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN112765966A (en) * | 2021-04-06 | 2021-05-07 | 腾讯科技(深圳)有限公司 | Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment
CN112765966B (en) * | 2021-04-06 | 2021-07-23 | 腾讯科技(深圳)有限公司 | Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment
CN114036516A (en) * | 2021-10-27 | 2022-02-11 | 西安电子科技大学 | Unknown sensitive function discovery method based on two-stage analogy reasoning
-
2020
- 2020-08-04 CN CN202010774430.4A patent/CN111897926A/en not_active Withdrawn
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wen et al. | Research on keyword extraction based on word2vec weighted textrank | |
Mahata et al. | Theme-weighted ranking of keywords from text documents using phrase embeddings | |
CN104182527A (en) | Partial-sequence itemset based Chinese-English test word association rule mining method and system | |
CN111897926A (en) | Chinese query expansion method integrating deep learning and expansion word mining intersection | |
CN111753066A (en) | Method, device and equipment for expanding technical background text | |
CN111897922A (en) | Chinese query expansion method based on pattern mining and word vector similarity calculation | |
CN109739953B (en) | Text retrieval method based on chi-square analysis-confidence framework and back-part expansion | |
CN109726263B (en) | Cross-language post-translation hybrid expansion method based on feature word weighted association pattern mining | |
CN109684463B (en) | Cross-language post-translation and front-part extension method based on weight comparison and mining | |
CN111723179A (en) | Feedback model information retrieval method, system and medium based on concept map | |
CN111897928A (en) | Chinese query expansion method for embedding expansion words into query words and counting expansion word union | |
CN111897924A (en) | Text retrieval method based on association rule and word vector fusion expansion | |
CN111897921A (en) | Text retrieval method based on word vector learning and mode mining fusion expansion | |
Bouziri et al. | Learning query expansion from association rules between terms | |
Heidary et al. | Automatic text summarization using genetic algorithm and repetitive patterns | |
CN111897927B (en) | Chinese query expansion method integrating Copulas theory and association rule mining | |
Li et al. | Deep learning and semantic concept space are used in query expansion | |
CN108416442B (en) | Chinese word matrix weighting association rule mining method based on item frequency and weight | |
CN111897919A (en) | Text retrieval method based on Copulas function and pseudo-correlation feedback rule expansion | |
CN109684464B (en) | Cross-language query expansion method for realizing rule back-part mining through weight comparison | |
CN111897925B (en) | Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning | |
CN109684465B (en) | Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison | |
Wu et al. | Beyond greedy search: pruned exhaustive search for diversified result ranking | |
CN111897923A (en) | Text retrieval method based on intersection expansion of word vector and association mode | |
CN113064978A (en) | Project construction period rationality judgment method and device based on feature word matching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20201106 |