CN111897923A - Text retrieval method based on intersection expansion of word vector and association mode - Google Patents


Info

Publication number
CN111897923A
Authority
CN
China
Prior art keywords
word
expansion
word vector
rule
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010774137.8A
Other languages
Chinese (zh)
Inventor
黄名选 (Huang Mingxuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics filed Critical Guangxi University of Finance and Economics
Priority to CN202010774137.8A
Publication of CN111897923A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/331: Query processing (G06F16/3331)
    • G06F16/332: Query formulation
    • G06F16/3325: Reformulation based on results of preceding query
    • G06F16/3332: Query translation
    • G06F16/3334: Selection or weighting of terms from queries, including natural language queries
    • G06F16/3335: Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3338: Query expansion
    • G06F16/334: Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text retrieval method based on the intersection expansion of word vectors and association modes. First, a user query retrieves a Chinese document set to obtain an initially retrieved document set. Rule expansion word mining and word vector semantic learning training are then performed on this set, yielding a rule expansion word set and a word vector expansion word set respectively: the rule expansion words carry association information between feature words derived from statistical analysis, while the word vector expansion words carry rich contextual semantic information. The two sets are fused by intersection to obtain the final expansion word set, improving the quality of the expansion words and realizing query expansion. Experimental results show that the method can reduce query topic drift and word mismatch in information retrieval, improve information retrieval performance, outperform similar comparison methods of recent years, and has good application value and promotion prospects.

Description

Text retrieval method based on intersection expansion of word vector and association mode
Technical Field
The invention relates to a text retrieval method based on the intersection expansion of word vectors and association modes, and belongs to the technical field of information retrieval.
Background
In the face of the massive Internet information resources of the big data era, how network users can accurately and efficiently retrieve the information they need from the big data of the web has long been a concern of the information retrieval field in both academia and industry. Query expansion is one of the core key technologies for solving word mismatch and query topic drift in information retrieval: it adds other feature words related to the semantics of the original query, compensating for the lack of semantic information caused by an overly simple original query, so as to improve information retrieval performance. Information retrieval methods based on query expansion have drawn attention from scholars at home and abroad. For example, Pan et al. (see: Min Pan, Jimmy Huang, Tingting He, et al. A Simple Kernel Co-occurrence-based Enhancement for Pseudo-Relevance Feedback [J]. Journal of the Association for Information Science and Technology (JASIST), 2020, 71(3): 264-281.) used a pseudo-relevance feedback query expansion method based on kernel co-occurrence semantics in information retrieval, with experiments showing the validity of the method; Latiri et al. (see: Latiri C, Haddad H, Hamrouni T. Towards an effective automatic query expansion process using an association rule mining approach [J]. Journal of Intelligent Information Systems, 2012, 39(1): 209-247.) mined association rules for automatic query expansion; Huang et al. (see: 黄名选, et al. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining [J]. Journal of Software, 2009, 20(7): 1854-.) proposed a pseudo-relevance feedback query expansion method mined from matrix-weighted association rules; and word-vector approaches (see: Research on a word vector method for patent query expansion [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980.) realize query expansion in information retrieval by selecting expansion words through word vector cosine similarity. Experimental results show that these methods can improve retrieval performance.
However, existing query expansion methods have not yet fully solved technical problems such as query topic drift and word mismatch in information retrieval. To better improve information retrieval performance and effectively suppress query topic drift and word mismatch, the invention provides a text retrieval method based on the intersection expansion of word vectors and association modes, which improves text information retrieval performance and has good application value and broad promotion prospects.
Disclosure of Invention
The invention aims to provide a text retrieval method based on the intersection expansion of word vectors and association modes for use in the information retrieval field, for example in Chinese web information retrieval systems or search engines; it can improve the query performance of an information retrieval system and reduce query topic drift and word mismatch in information retrieval.
The invention adopts the following specific technical scheme:
a text retrieval method based on intersection expansion of word vectors and association modes comprises the following steps:
Step 1, a Chinese user query retrieves the Chinese document set to obtain the initially retrieved documents, from which an initially retrieved document set is constructed.
Step 2, extract m documents from the initially retrieved document set to construct an initial pseudo-relevance feedback document set, perform Chinese word segmentation and stopword removal on it, extract the Chinese feature words and calculate the feature word weights, and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
The invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see: Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al.; Chinese translation by Wang Zhijin et al. Modern Information Retrieval. China Machine Press, 2005: 21-22.) to calculate the feature word weights.
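As an illustration of this weighting step, the following Python sketch computes one common TF-IDF variant for already-segmented documents. It is a minimal sketch only, since the patent specifies the technique by citation rather than by an exact formula, and the toy corpus is a placeholder:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute a TF-IDF weight for every feature word of every document.

    docs: list of documents, each a list of already-segmented,
    stopword-free Chinese feature words.
    Returns one {word: weight} dict per document.
    """
    n_docs = len(docs)
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            w: (1 + math.log(tf[w])) * math.log(n_docs / df[w])
            for w in tf
        })
    return weighted

# Toy usage with three tiny "documents".
docs = [["查询", "扩展", "检索"], ["词", "向量", "检索"], ["查询", "词", "向量"]]
print(tfidf_weights(docs)[0])
```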
Step 3, mine rule expansion words in the initial pseudo-relevance feedback document set using the Copulas-theory-based support and confidence, and construct the rule expansion word set. The specific steps are as follows:
(3.1) Mine the 1_candidate itemsets C1: extract feature words directly from the Chinese feature word library as the 1_candidate itemsets C1.

(3.2) Mine the 1_frequent itemsets L1: calculate the Copulas-function-based support Copulas_S(C1) of each C1, take each C1 whose Copulas_S(C1) is not lower than the minimum support threshold ms as a 1_frequent itemset L1, and add it to the frequent itemset set FI (Frequent Itemset).
Copulas_S (Copulas-based support) denotes the support based on the Copulas function. Copulas_S(C1) is calculated as shown in equation (1):

Copulas_S(C1) = C(Frequency[C1]/SumCount, Weight[C1]/SumWeight)    (1)

where C(u, v) denotes the Copulas function coupling its two arguments (the concrete form of equation (1) appears only as an image in the original publication). In equation (1), Frequency[C1] represents the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, SumCount represents the total number of documents in that library, Weight[C1] represents the itemset weight of C1 in that library, and SumWeight represents the accumulated weight of all Chinese feature words in that library.
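A minimal sketch of this support computation follows, assuming for illustration the simple product copula C(u, v) = u*v (the patent's concrete copula is given only in the original equation image):

```python
def copulas_support(frequency, sum_count, weight, sum_weight,
                    copula=lambda u, v: u * v):
    """Copulas-based support of an itemset.

    Couples the frequency marginal u = frequency / sum_count with the
    weight marginal v = weight / sum_weight through a copula C(u, v).
    The product copula used by default is an assumption for
    illustration; the patent's actual copula may differ.
    """
    u = frequency / sum_count
    v = weight / sum_weight
    return copula(u, v)

# Toy usage: an itemset seen in 12 of 50 feedback documents, with
# accumulated weight 3.4 against a corpus weight sum of 96.0.
print(copulas_support(12, 50, 3.4, 96.0))  # ~0.0085 with the product copula
```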
(3.3) Generate the k_candidate itemsets Ck, k ≥ 2: derive the k_candidate itemsets Ck by self-joining the (k-1)_frequent itemsets Lk-1.

The self-join adopts the candidate itemset join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).

(3.4) Prune the 2_candidate itemsets C2: if a C2 does not contain any original query term, delete it; if it contains an original query term, keep it and proceed with it to step (3.5).
(3.5) Mine the k_frequent itemsets Lk, k ≥ 2: calculate the Copulas-function-based support Copulas_S(Ck) of each Ck, take each Ck whose Copulas_S(Ck) is not lower than the minimum support threshold ms as a k_frequent itemset Lk, and add it to FI.

Copulas_S(Ck) is calculated as shown in equation (2):

Copulas_S(Ck) = C(Frequency[Ck]/SumCount, Weight[Ck]/SumWeight)    (2)

In equation (2), Frequency[Ck] represents the frequency with which Ck occurs in the pseudo-relevance feedback Chinese document library, Weight[Ck] represents the itemset weight of Ck in that library, and SumCount and SumWeight are defined as in equation (1).
(3.6) Increase k by 1 and return to step (3.3) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then ends, and the procedure moves to step (3.7).
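Steps (3.1)-(3.6) follow the Apriori level-wise scheme; the sketch below mirrors them in Python. It is illustrative only: `support_fn` stands in for Copulas_S, the join is a simplified self-join, and the query-term pruning of step (3.4) is applied at every level for brevity:

```python
def mine_frequent_itemsets(doc_itemsets, query_terms, support_fn, ms):
    """Apriori-style level-wise mining of steps (3.1)-(3.6).

    doc_itemsets: documents as frozensets of feature words.
    support_fn:   callable standing in for the Copulas-based support
                  Copulas_S of an itemset.
    ms:           minimum support threshold.
    """
    items = {w for doc in doc_itemsets for w in doc}
    level = [frozenset([w]) for w in sorted(items)
             if support_fn(frozenset([w])) >= ms]           # L1
    fi = list(level)                                        # FI
    k = 2
    while level:
        joined = {a | b for a in level for b in level
                  if len(a | b) == k}                       # self-join
        joined = {c for c in joined if c & query_terms}     # prune (3.4)
        level = [c for c in joined if support_fn(c) >= ms]  # Lk
        fi.extend(level)
        k += 1
    return fi

# Toy usage with a frequency-only support standing in for Copulas_S.
docs = [frozenset(["查询", "扩展", "检索"]), frozenset(["查询", "检索"])]
support = lambda s: sum(1 for d in docs if s <= d) / len(docs)
print(mine_frequent_itemsets(docs, frozenset(["查询"]), support, ms=0.5))
```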
(3.7) Take any k_frequent itemset Lk, k ≥ 2, out of FI.
(3.8) Extract proper subset itemsets Let and Lq of Lk such that

Lq ∩ Let = ∅, Lq ∪ Let = Lk, Let ∩ Q = ∅,

where Let is a proper subset itemset containing no query terms, Lq is a proper subset itemset containing query terms, and Q is the original query term set.
(3.9) Mine the association rules Lq → Let: calculate the Copulas-function-based confidence Copulas_C(Lq → Let) of each rule, and add each association rule Lq → Let whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR (Association Rule).

Copulas_C (Copulas-based confidence) denotes the confidence of an association rule based on the Copulas function. Copulas_C(Lq → Let) is calculated as shown in equation (3):

Copulas_C(Lq → Let) = Copulas_S(Lk) / Copulas_S(Lq) = C(Frequency[Lk]/SumCount, Weight[Lk]/SumWeight) / C(Frequency[Lq]/SumCount, Weight[Lq]/SumWeight)    (3)

In equation (3), Frequency[Lq] represents the frequency with which the proper subset itemset Lq occurs in the pseudo-relevance feedback Chinese document library, Weight[Lq] represents the itemset weight of Lq in that library, Frequency[Lk] represents the frequency with which the itemset Lk occurs in that library, and Weight[Lk] represents the itemset weight of Lk in that library. SumCount and SumWeight are defined as in equation (1).
(3.10) Add each rule whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR, then return to step (3.8) to extract other proper subset itemsets Let and Lq from Lk and carry out the subsequent steps in turn, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (3.7) for a new round of association rule pattern mining, taking any other Lk out of FI and carrying out the subsequent steps in turn, looping until every k_frequent itemset Lk in FI has been taken out exactly once; association rule pattern mining then ends, and the procedure moves to step (3.11).
(3.11) Extract the rule consequent itemsets Let from the association rule set AR, where Let = (Ret1, Ret2, …, Rets), s ≥ 1; extract the rule expansion words from the itemsets Let and remove duplicates to obtain the rule expansion word set ARET (Expansion Terms from Association Rules); calculate the rule expansion word weights wRet, and then move to step 4.

ARET is shown in equation (4):

ARET = {Ret1, Ret2, …, Reti, …}    (4)

In equation (4), Reti denotes the ith rule expansion word after duplicate removal, i ≥ 1.
The rule expansion word weight wRet is calculated as shown in equation (5):

wRet = max(Copulas_C(Lq → Let)), taken over the association rules whose consequent Let contains the expansion word Ret    (5)

In equation (5), max() takes the maximum confidence among the association rules: when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as its weight.
Step 4, perform word vector semantic learning training on the initially retrieved document set with a deep learning tool to obtain the word vector expansion word set. The specific steps are as follows:

(4.1) Perform word vector semantic learning training on the initial pseudo-relevance feedback document set with a deep learning tool to obtain the feature word vector set of the initially retrieved documents.

The deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec (see https://code.google.com/p/word2vec/).
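As an illustration of step (4.1), the sketch below trains Skip-gram vectors with gensim's word2vec implementation; the use of gensim in place of the original Google C tool is an assumption for illustration, and the corpus and hyperparameter values are placeholders:

```python
from gensim.models import Word2Vec

# Each "sentence" is one segmented, stopword-free feedback document.
feedback_docs = [["查询", "扩展", "检索", "方法"],
                 ["词", "向量", "语义", "学习"],
                 ["查询", "词", "向量", "检索"]]

model = Word2Vec(sentences=feedback_docs,
                 vector_size=300,  # word vector dimension
                 window=2,         # context window width
                 sg=1,             # 1 selects the Skip-gram model
                 negative=64,      # negative-sampling count
                 min_count=1)      # keep every feature word

# Feature word vector set: word -> 300-dimensional vector.
vectors = {w: model.wv[w] for w in model.wv.index_to_key}
```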
(4.2) In the feature word vector set of the initially retrieved documents, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and every word vector candidate expansion word (cet1, cet2, …, cetm), as shown in equation (6), where 1 ≤ j ≤ m. The word vector candidate expansion words are the non-query terms in the word vector set.

CosSim(qi, cetj) = (vqi · vcetj) / (|vqi| × |vcetj|)    (6)

In equation (6), vcetj denotes the word vector of the jth word vector candidate expansion word cetj, and vqi denotes the word vector of the ith query term qi.

(4.3) Given a minimum vector cosine similarity threshold minSim, extract the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietpi) of query term qi; pool the expansion words of q1, q2, …, qn and remove duplicates to obtain the final word vector expansion word set WEET (Expansion Terms from Word Embedding) of the original query term set Q; calculate the word vector expansion word weights wweet, and then move to step 5.
WEET is shown in equation (7):

WEET = (q1et1, …, q1etp1) ∪ (q2et1, …, q2etp2) ∪ … ∪ (qnet1, …, qnetpn), with duplicates removed    (7)

The word vector expansion word weight wweet is the vector cosine similarity between a query term and the word vector expansion word, as shown in equation (8); when a word is selected repeatedly, its weight equals the accumulated sum of the corresponding vector similarities:

wweet(et) = Σ CosSim(qi, et), summed over the query terms qi for which et is selected    (8)
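Steps (4.2)-(4.3) reduce to a cosine-similarity filter over the learned vectors; a minimal numpy-based sketch (with `vectors` as produced by the training sketch above, and the repeated-word accumulation of equation (8)):

```python
import numpy as np

def word_vector_expansion(vectors, query_terms, min_sim):
    """Sketch of steps (4.2)-(4.3): build WEET with weights w_weet.

    vectors:     {word: np.ndarray} feature word vector set.
    query_terms: set of original query terms present in `vectors`.
    min_sim:     minimum vector cosine similarity threshold minSim.
    A word selected by several query terms accumulates the
    similarities, per equation (8).
    """
    w_weet = {}
    for q in query_terms:
        vq = vectors[q]
        for w, vw in vectors.items():
            if w in query_terms:      # candidates are non-query terms
                continue
            sim = float(np.dot(vq, vw) /
                        (np.linalg.norm(vq) * np.linalg.norm(vw)))
            if sim >= min_sim:
                w_weet[w] = w_weet.get(w, 0.0) + sim
    return w_weet

# Usage with the gensim vectors trained above, e.g.:
# weet = word_vector_expansion(vectors, {"查询", "检索"}, min_sim=0.3)
```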
Step 5, intersect the rule expansion word set with the word vector expansion word set to obtain the final expansion words, realizing the intersection expansion of word vectors and association modes. The specific steps are as follows:

(5.1) Intersect the rule expansion word set ARET with the word vector expansion word set WEET to obtain the final expansion word set FETS (Final Expansion Term Set) of the original query term set Q, realizing the intersection expansion of word vectors and association modes; calculate the final expansion word weights wFet, and then move to step 6.
FETS is shown in equation (9):

FETS = ARET ∩ WEET = {Fet1, Fet2, …, Fetn, …}    (9)

In equation (9), Fetn denotes the nth final expansion word.

The final expansion word weight wFet is the sum of the rule expansion word weight wRet and the word vector expansion word weight wweet, as shown in equation (10):

wFet = wRet + wweet    (10)
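Step 5 and equation (10) amount to intersecting the two weighted dictionaries built above; a one-function sketch:

```python
def intersect_expansions(w_ret, w_weet):
    """Step 5 / equation (10): FETS = ARET ∩ WEET, with
    w_Fet = w_Ret + w_weet for every word found by both routes."""
    common = w_ret.keys() & w_weet.keys()
    return {w: w_ret[w] + w_weet[w] for w in common}

# e.g. intersect_expansions({"检索": 0.8}, {"检索": 0.4, "语义": 0.3})
# -> {"检索": 1.2}
```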
Step 6, combine the final expansion words with the original query terms into a new query, retrieve the document set again, obtain the final retrieval result, and return it to the user.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a text retrieval method based on the intersection expansion of word vectors and association modes. First, a user query retrieves the Chinese document set to obtain an initially retrieved document set; rule expansion word mining and word vector semantic learning training are then performed on this set, yielding a rule expansion word set and a word vector expansion word set respectively. The rule expansion words carry association information between feature words derived from statistical analysis, while the word vector expansion words carry rich contextual semantic information; fusing the two sets by intersection gives the final expansion word set, improving expansion word quality and realizing query expansion. Experimental results show that the method can improve information retrieval performance, outperforms similar comparison methods of recent years, and has good application value and promotion prospects.
(2) Four similar query expansion methods from recent years are selected as comparison methods, with the Chinese corpus of the standard data set NTCIR-5 CLIR as experimental data. The experimental results show that, compared with the baseline retrieval, the MAP of the method of the invention achieves a maximum average amplification of 27.87%, and its average amplification over the comparison methods reaches 18.21%; the experimental effect is significant. The retrieval performance of the method is thus better than that of the baseline retrieval and the comparison methods; it can improve information retrieval performance, reduce query topic drift and word mismatch in information retrieval, and has high application value and broad promotion prospects.
Drawings
Fig. 1 is a general flow diagram of a text retrieval method based on word vector and association pattern intersection expansion according to the present invention.
Detailed Description
First, to better explain the technical scheme of the invention, the related concepts involved are introduced as follows:
1. Itemset

In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an itemset, and the number of items in an itemset is called the itemset length. A k_itemset is an itemset containing k items, where k is the itemset length.
2. Association rule antecedent and consequent

Let x and y be arbitrary feature word itemsets; an implication of the form x → y is called an association rule, where x is called the rule antecedent and y the rule consequent.
3. The Copulas function and the support and confidence of feature word association patterns

Copulas theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8(1): 229-231.) describes the correlation between variables, and can combine and couple distributions of arbitrary form into an effective multivariate distribution function.

The invention uses the Copulas function to integrate the frequency and the weight of a feature word itemset into the support and confidence of feature word association patterns, and gives the calculation formulas of the support Copulas_S (Copulas-based support) and the confidence Copulas_C (Copulas-based confidence) as follows.
The support Copulas_S(T1 ∪ T2) of a feature word itemset (T1 ∪ T2) is calculated as shown in equation (11):

Copulas_S(T1 ∪ T2) = C(Frequency[T1 ∪ T2]/SumCount, Weight[T1 ∪ T2]/SumWeight)    (11)

where C(u, v) denotes the Copulas coupling function. In equation (11), Frequency[T1 ∪ T2] represents the frequency with which the itemset (T1 ∪ T2) occurs in the pseudo-relevance feedback Chinese document library, Weight[T1 ∪ T2] represents the itemset weight of (T1 ∪ T2) in that library, SumCount represents the total number of documents in that library, and SumWeight represents the accumulated weight of all Chinese feature words in that library.
The confidence Copulas_C of an association rule (T1 → T2) is calculated as shown in equation (12):

Copulas_C(T1 → T2) = Copulas_S(T1 ∪ T2) / Copulas_S(T1)    (12)

In equation (12), Frequency[T1] represents the frequency with which the itemset T1 occurs in the pseudo-relevance feedback Chinese document library, Weight[T1] represents the itemset weight of T1 in that library, and Frequency[T1 ∪ T2], Weight[T1 ∪ T2], SumCount and SumWeight are defined as in equation (11).
4. Rule expansion word
A rule expansion word is an expansion word drawn from the consequent itemset of an association rule whose antecedent itemset is made up of original query terms.
5. Rule expansion word weight calculation
The invention takes the confidence of the association rule as the rule expansion word weight wRet.

The rule expansion word weight wRet is calculated as shown in equation (13):

wRet = max(Copulas_C(Lq → Let)), taken over the association rules whose consequent Let contains the expansion word Ret    (13)

In equation (13), Let is the proper subset itemset that contains the expansion word Ret and no query terms, Lq is the proper subset itemset containing query terms, Q is the original query term set, and Lq ∩ Let = ∅, Lq ∪ Let = Lk, Let ∩ Q = ∅. Frequency[Lq] represents the frequency with which the itemset Lq occurs in the pseudo-relevance feedback Chinese document library, Weight[Lq] represents the itemset weight of Lq in that library, Frequency[Lk] represents the frequency with which the itemset Lk occurs in that library, and Weight[Lk] represents the itemset weight of Lk in that library; SumCount and SumWeight are defined as in equation (11); max() takes the maximum confidence among the association rules: when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as its weight.
6. Word vector expansion word and weight thereof
The meaning of the word vector expansion word is described as follows:
In the feature word vector set of the initially retrieved documents, the word vector cosine similarity CosSim(qi, cetj) is calculated between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and every word vector candidate expansion word (cet1, cet2, …, cetm), 1 ≤ j ≤ m. Given a minimum vector cosine similarity threshold minSim, the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim are taken as the word vector expansion words (qiet1, qiet2, …, qietpi) of query term qi; pooling these over q1, q2, …, qn and removing duplicates yields the final word vector expansion word set WEET (Expansion Terms from Word Embedding) of the original query term set Q. The word vector candidate expansion words are the non-query terms in the word vector set.

The word vector cosine similarity CosSim(qi, cetj) is shown in equation (14):

CosSim(qi, cetj) = (vqi · vcetj) / (|vqi| × |vcetj|)    (14)

In equation (14), vcetj denotes the word vector of the jth word vector candidate expansion word cetj, and vqi denotes the word vector of the ith query term qi.

The word vector expansion word set WEET is shown in equation (15):

WEET = (q1et1, …, q1etp1) ∪ (q2et1, …, q2etp2) ∪ … ∪ (qnet1, …, qnetpn), with duplicates removed    (15)

The invention takes the vector cosine similarity between a query term and a word vector expansion word as the weight of that expansion word.

The word vector expansion word weight wweet is shown in equation (16); when a word is selected repeatedly, its weight equals the accumulated sum of the corresponding vector similarities:

wweet(et) = Σ CosSim(qi, et), summed over the query terms qi for which et is selected    (16)
The invention is further explained below with reference to the accompanying drawing and specific comparative experiments.
As shown in fig. 1, the text retrieval method based on the intersection expansion of the word vector and the association mode of the present invention includes the following steps:
Step 1, a Chinese user query retrieves the Chinese document set to obtain the initially retrieved documents, from which an initially retrieved document set is constructed.

Step 2, extract m documents from the initially retrieved document set to construct an initial pseudo-relevance feedback document set, perform Chinese word segmentation and stopword removal on it, extract the Chinese feature words and calculate the feature word weights with the TF-IDF weighting technique, and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
Step 3, mine rule expansion words in the initial pseudo-relevance feedback document set using the Copulas-theory-based support and confidence, and construct the rule expansion word set. The specific steps are as follows:

(3.1) Mine the 1_candidate itemsets C1: extract feature words directly from the Chinese feature word library as the 1_candidate itemsets C1.

(3.2) Mine the 1_frequent itemsets L1: calculate the Copulas-function-based support Copulas_S(C1) of each C1, take each C1 whose Copulas_S(C1) is not lower than the minimum support threshold ms as a 1_frequent itemset L1, and add it to the frequent itemset set FI (Frequent Itemset).

Copulas_S (Copulas-based support) denotes the support based on the Copulas function. Copulas_S(C1) is calculated as shown in equation (1):

Copulas_S(C1) = C(Frequency[C1]/SumCount, Weight[C1]/SumWeight)    (1)

In equation (1), Frequency[C1] represents the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, SumCount represents the total number of documents in that library, Weight[C1] represents the itemset weight of C1 in that library, and SumWeight represents the accumulated weight of all Chinese feature words in that library.
(3.3) Generate the k_candidate itemsets Ck, k ≥ 2: derive the k_candidate itemsets Ck by self-joining the (k-1)_frequent itemsets Lk-1; the self-join uses the candidate itemset join method given in the Apriori algorithm.

(3.4) Prune the 2_candidate itemsets C2: if a C2 does not contain any original query term, delete it; if it contains an original query term, keep it and proceed with it to step (3.5).

(3.5) Mine the k_frequent itemsets Lk, k ≥ 2: calculate the Copulas-function-based support Copulas_S(Ck) of each Ck, take each Ck whose Copulas_S(Ck) is not lower than the minimum support threshold ms as a k_frequent itemset Lk, and add it to FI.

Copulas_S(Ck) is calculated as shown in equation (2):

Copulas_S(Ck) = C(Frequency[Ck]/SumCount, Weight[Ck]/SumWeight)    (2)

In equation (2), Frequency[Ck] represents the frequency with which Ck occurs in the pseudo-relevance feedback Chinese document library, Weight[Ck] represents the itemset weight of Ck in that library, and SumCount and SumWeight are defined as in equation (1).

(3.6) Increase k by 1 and return to step (3.3) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then ends, and the procedure moves to step (3.7).
(3.7) Take any k_frequent itemset Lk, k ≥ 2, out of FI.

(3.8) Extract proper subset itemsets Let and Lq of Lk such that

Lq ∩ Let = ∅, Lq ∪ Let = Lk, Let ∩ Q = ∅,

where Let is a proper subset itemset containing no query terms, Lq is a proper subset itemset containing query terms, and Q is the original query term set.
(3.9) Mine the association rules Lq → Let: calculate the Copulas-function-based confidence Copulas_C(Lq → Let) of each rule, and add each association rule Lq → Let whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR (Association Rule).

Copulas_C (Copulas-based confidence) denotes the confidence of an association rule based on the Copulas function. Copulas_C(Lq → Let) is calculated as shown in equation (3):

Copulas_C(Lq → Let) = Copulas_S(Lk) / Copulas_S(Lq)    (3)

In equation (3), Frequency[Lq] represents the frequency with which the proper subset itemset Lq occurs in the pseudo-relevance feedback Chinese document library, Weight[Lq] represents the itemset weight of Lq in that library, Frequency[Lk] represents the frequency with which the itemset Lk occurs in that library, and Weight[Lk] represents the itemset weight of Lk in that library. SumCount and SumWeight are defined as in equation (1).

(3.10) Add each rule whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR, then return to step (3.8) to extract other proper subset itemsets Let and Lq from Lk and carry out the subsequent steps in turn, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (3.7) for a new round of association rule pattern mining, taking any other Lk out of FI and carrying out the subsequent steps in turn, looping until every k_frequent itemset Lk in FI has been taken out exactly once; association rule pattern mining then ends, and the procedure moves to step (3.11).
(3.11) Extract the rule consequent itemsets Let from the association rule set AR, where Let = (Ret1, Ret2, …, Rets), s ≥ 1; extract the rule expansion words from the itemsets Let and remove duplicates to obtain the rule expansion word set ARET (Expansion Terms from Association Rules); calculate the rule expansion word weights wRet, and then move to step 4.

ARET is shown in equation (4):

ARET = {Ret1, Ret2, …, Reti, …}    (4)

In equation (4), Reti denotes the ith rule expansion word after duplicate removal, i ≥ 1.

The rule expansion word weight wRet is calculated as shown in equation (5):

wRet = max(Copulas_C(Lq → Let)), taken over the association rules whose consequent Let contains the expansion word Ret    (5)

In equation (5), max() takes the maximum confidence among the association rules: when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as its weight.
Step 4, perform word vector semantic learning training on the initially retrieved document set with a deep learning tool to obtain the word vector expansion word set. The specific steps are as follows:

(4.1) Perform word vector semantic learning training on the initial pseudo-relevance feedback document set with a deep learning tool to obtain the feature word vector set of the initially retrieved documents.

The deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec.
(4.2) In the feature word vector set of the initially retrieved documents, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and every word vector candidate expansion word (cet1, cet2, …, cetm), as shown in equation (6), where 1 ≤ j ≤ m; the word vector candidate expansion words are the non-query terms in the word vector set.

CosSim(qi, cetj) = (vqi · vcetj) / (|vqi| × |vcetj|)    (6)

In equation (6), vcetj denotes the word vector of the jth word vector candidate expansion word cetj, and vqi denotes the word vector of the ith query term qi.

(4.3) Given a minimum vector cosine similarity threshold minSim, extract the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietpi) of query term qi; pool the expansion words of q1, q2, …, qn and remove duplicates to obtain the final word vector expansion word set WEET of the original query term set Q; calculate the word vector expansion word weights wweet, and then move to step 5.

WEET is shown in equation (7):

WEET = (q1et1, …, q1etp1) ∪ (q2et1, …, q2etp2) ∪ … ∪ (qnet1, …, qnetpn), with duplicates removed    (7)

The word vector expansion word weight wweet is the vector cosine similarity between a query term and the word vector expansion word, as shown in equation (8); when a word is selected repeatedly, its weight equals the accumulated sum of the corresponding vector similarities:

wweet(et) = Σ CosSim(qi, et), summed over the query terms qi for which et is selected    (8)
Step 5, intersect the rule expansion word set with the word vector expansion word set to obtain the final expansion words, realizing the intersection expansion of word vectors and association modes. The specific steps are as follows:

(5.1) Intersect the rule expansion word set ARET with the word vector expansion word set WEET to obtain the final expansion word set FETS of the original query term set Q, realizing the intersection expansion of word vectors and association modes; calculate the final expansion word weights wFet, and then move to step 6.

FETS is shown in equation (9):

FETS = ARET ∩ WEET = {Fet1, Fet2, …, Fetn, …}    (9)

In equation (9), Fetn denotes the nth final expansion word.

The final expansion word weight wFet is the sum of the rule expansion word weight wRet and the word vector expansion word weight wweet, as shown in equation (10):

wFet = wRet + wweet    (10)
Step 6, combine the final expansion words with the original query terms into a new query, retrieve the document set again, obtain the final retrieval result, and return it to the user.
Experimental design and results:
We compare the method of the invention with existing similar methods to illustrate its effectiveness.
1. Experimental environment and experimental data:
The experimental corpus is the Chinese text standard corpus of NTCIR-5 CLIR (see http://research.nii.ac.jp/ntcir/data/data-en.html), comprising 901,446 traditional Chinese documents (converted to simplified Chinese for the experiments) distributed over 8 data sets, as shown in Table 1. The NTCIR-5 CLIR corpus provides 50 Chinese queries, 4 types of query topics, and result sets under 2 evaluation criteria, namely Rigid (highly relevant and relevant documents count as relevant) and Relax (highly relevant, relevant and partially relevant documents count as relevant). The retrieval experiments use the Description (Desc, a long query) and Title (a short query) topic types. The evaluation metric for the retrieval results is P@5.
Table 1. Original corpora and their document counts
[Table 1 appears as an image in the original publication.]
2. Baseline retrieval and comparison methods:

The basic retrieval environment for the experiments is built with Lucene.

The baseline retrieval is the result obtained by submitting the original queries to Lucene.

The comparison methods are described as follows:
Comparison method 1: query expansion that mines rule expansion words with a weighted frequent pattern mining technique using multiple minimum support thresholds (see: Zhang H R, Zhang J W, Wei X Y, et al. A new frequent pattern mining with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-.). Parameters: the experimental results are the averages over ms = 0.1, 0.15, 0.2 and 0.25, with mc and the remaining parameters (LMS, HMS, WT) set to 0.1.
Comparison method 2: query expansion that mines rule expansion words with a fully weighted positive and negative association pattern mining technique (see: 黄名选, 蒋曹清. Vietnamese-English cross-language query post-translation expansion based on fully weighted positive and negative association pattern mining [J]. Acta Electronica Sinica, 2018, 46(12): 3029-3036.). Parameters: the experimental results are the averages over ms = 0.10, 0.11, 0.12 and 0.13, with mc = 0.1, α = 0.3, minPR = 0.1 and minNR = 0.01.
The Skip-gram word embedding semantic learning training parameters used by the invention are: batch_size = 128, embedding_size = 300, skip_window = 2, num_skips = 4, num_sampled = 64.
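These parameter names follow the TensorFlow word2vec tutorial implementation of Skip-gram; collected as a config for reference (the dict itself is only an illustrative container, with the values from the text):

```python
skipgram_config = {
    "batch_size": 128,      # training mini-batch size
    "embedding_size": 300,  # word vector dimension
    "skip_window": 2,       # context words considered on each side
    "num_skips": 4,         # (target, context) pairs drawn per window
    "num_sampled": 64,      # negative samples per batch
}
```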
3. The experimental methods and results are as follows:
The average values of the P@5 retrieval results obtained by the 50 Chinese queries on the 8 data sets of the NTCIR-5 CLIR corpus are shown in Tables 2 and 3, where the average amplification (%) is the overall average increase of the method's retrieval results over the baseline retrieval or a comparison expansion method across the 8 data sets. The average amplification is computed as follows: first, the increase of the method's retrieval result over the baseline retrieval or comparison expansion method is computed on each data set; the per-dataset increases are then summed and divided by 8, giving the overall average amplification of the method's retrieval results relative to the other method.
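As arithmetic, the average amplification is a mean of per-dataset relative improvements; a sketch with placeholder P@5 values:

```python
def average_amplification(ours, baseline):
    """Total average amplification over the 8 NTCIR-5 CLIR data sets:
    per-dataset percentage gain of our P@5 over the other method's,
    summed and divided by the number of data sets."""
    gains = [(a - b) / b * 100.0 for a, b in zip(ours, baseline)]
    return sum(gains) / len(gains)

# Placeholder P@5 values (not the paper's numbers), for 8 data sets:
print(average_amplification([0.44] * 8, [0.36] * 8))   # ~22.22
```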
Table 2. P@5 retrieval performance of the inventive method versus the baseline retrieval and comparison methods (Title queries)
[Table 2 appears as an image in the original publication.]

Table 3. P@5 retrieval performance of the inventive method versus the baseline retrieval and comparison methods (Desc queries)
[Table 3 appears as an image in the original publication.]
Tables 2 and 3 show that, compared with the baseline retrieval, the P@5 retrieval results of the method of the invention improve markedly, with a maximum average amplification of 20.44%, and most of its P@5 values exceed those of the comparison methods; the expansion retrieval performance of the method is therefore higher than that of the baseline retrieval and the similar comparison methods. The experimental results show that the method is effective, can genuinely improve information retrieval performance, and has high application value and broad promotion prospects.

Claims (1)

1. A text retrieval method based on intersection expansion of word vectors and association modes is characterized by comprising the following steps:
step 1, a Chinese user query retrieves the Chinese document set to obtain the initially retrieved documents, from which an initially retrieved document set is constructed;

step 2, extract the top m documents from the initially retrieved document set to construct an initial pseudo-relevance feedback document set, perform Chinese word segmentation and stopword removal on it, extract the Chinese feature words and calculate the feature word weights with the TF-IDF weighting technique, and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library;
step 3, mine rule expansion words in the initial pseudo-relevance feedback document set using the Copulas-theory-based support and confidence, and construct the rule expansion word set, the specific steps being as follows:

(3.1) mine the 1_candidate itemsets C1: extract feature words directly from the Chinese feature word library as the 1_candidate itemsets C1;

(3.2) mine the 1_frequent itemsets L1: calculate the Copulas-function-based support Copulas_S(C1) of each C1, take each C1 whose Copulas_S(C1) is not lower than the minimum support threshold ms as a 1_frequent itemset L1, and add it to the frequent itemset set FI;

Copulas_S(C1) is calculated as shown in equation (1):

Copulas_S(C1) = C(Frequency[C1]/SumCount, Weight[C1]/SumWeight)    (1)

where C(u, v) denotes the Copulas coupling function; in equation (1), Frequency[C1] represents the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, SumCount represents the total number of documents in that library, Weight[C1] represents the itemset weight of C1 in that library, and SumWeight represents the accumulated weight of all Chinese feature words in that library;
(3.3) generate the k_candidate itemsets Ck, k ≥ 2: derive the k_candidate itemsets Ck by self-joining the (k-1)_frequent itemsets Lk-1; the self-join adopts the candidate itemset join method given in the Apriori algorithm;

(3.4) prune the 2_candidate itemsets C2: if a C2 does not contain any original query term, delete it; if it contains an original query term, keep it and proceed with it to step (3.5);

(3.5) mine the k_frequent itemsets Lk, k ≥ 2: calculate the Copulas-function-based support Copulas_S(Ck) of each Ck, take each Ck whose Copulas_S(Ck) is not lower than the minimum support threshold ms as a k_frequent itemset Lk, and add it to FI;

Copulas_S(Ck) is calculated as shown in equation (2):

Copulas_S(Ck) = C(Frequency[Ck]/SumCount, Weight[Ck]/SumWeight)    (2)

in equation (2), Frequency[Ck] represents the frequency with which Ck occurs in the pseudo-relevance feedback Chinese document library, Weight[Ck] represents the itemset weight of Ck in that library, and SumCount and SumWeight are defined as in equation (1);
(3.6) increase k by 1 and return to step (3.3) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then ends, and the procedure moves to step (3.7);

(3.7) take any k_frequent itemset Lk, k ≥ 2, out of FI;

(3.8) extract proper subset itemsets Let and Lq of Lk such that Lq ∩ Let = ∅, Lq ∪ Let = Lk, and Let ∩ Q = ∅, where Let is a proper subset itemset containing no query terms, Lq is a proper subset itemset containing query terms, and Q is the original query term set;
(3.9) mine the association rules Lq → Let: calculate the Copulas-function-based confidence Copulas_C(Lq → Let) of each rule, and add each association rule Lq → Let whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR;

Copulas_C(Lq → Let) is calculated as shown in equation (3):

Copulas_C(Lq → Let) = Copulas_S(Lk) / Copulas_S(Lq)    (3)

in equation (3), Frequency[Lq] represents the frequency with which the proper subset itemset Lq occurs in the pseudo-relevance feedback Chinese document library, Weight[Lq] represents the itemset weight of Lq in that library, Frequency[Lk] represents the frequency with which the itemset Lk occurs in that library, and Weight[Lk] represents the itemset weight of Lk in that library; SumCount and SumWeight are defined as in equation (1);

(3.10) add each rule whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR, then return to step (3.8) to extract other proper subset itemsets Let and Lq from Lk and carry out the subsequent steps in turn, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (3.7) for a new round of association rule pattern mining, taking any other Lk out of FI and carrying out the subsequent steps in turn, looping until every k_frequent itemset Lk in FI has been taken out exactly once; association rule pattern mining then ends, and the procedure moves to step (3.11);
(3.11) extract the rule consequent itemsets Let from the association rule set AR, where Let = (Ret1, Ret2, …, Rets), s ≥ 1; extract the rule expansion words from the itemsets Let and remove duplicates to obtain the rule expansion word set ARET; calculate the rule expansion word weights wRet, and then move to step 4;

ARET is shown in equation (4):

ARET = {Ret1, Ret2, …, Reti, …}    (4)

in equation (4), Reti denotes the ith rule expansion word after duplicate removal, i ≥ 1;

the rule expansion word weight wRet is calculated as shown in equation (5):

wRet = max(Copulas_C(Lq → Let)), taken over the association rules whose consequent Let contains the expansion word Ret    (5)

in equation (5), max() takes the maximum confidence among the association rules: when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as its weight;
step 4, perform word vector semantic learning training on the initially retrieved document set with a deep learning tool to obtain the word vector expansion word set, the specific steps being as follows:

(4.1) perform word vector semantic learning training on the initial pseudo-relevance feedback document set with a deep learning tool to obtain the feature word vector set of the initially retrieved documents;

the deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec;
(4.2) in the feature word vector set of the initially retrieved documents, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and every word vector candidate expansion word (cet1, cet2, …, cetm), as shown in equation (6), where 1 ≤ j ≤ m; the word vector candidate expansion words are the non-query terms in the word vector set;

CosSim(qi, cetj) = (vqi · vcetj) / (|vqi| × |vcetj|)    (6)

in equation (6), vcetj denotes the word vector of the jth word vector candidate expansion word cetj, and vqi denotes the word vector of the ith query term qi;

(4.3) given a minimum vector cosine similarity threshold minSim, extract the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietpi) of query term qi; pool the expansion words of q1, q2, …, qn and remove duplicates to obtain the final word vector expansion word set WEET of the original query term set Q; calculate the word vector expansion word weights wweet, and then move to step 5;

WEET is shown in equation (7):

WEET = (q1et1, …, q1etp1) ∪ (q2et1, …, q2etp2) ∪ … ∪ (qnet1, …, qnetpn), with duplicates removed    (7)

the word vector expansion word weight wweet is the vector cosine similarity between a query term and the word vector expansion word, as shown in equation (8); when a word is selected repeatedly, its weight equals the accumulated sum of the corresponding vector similarities:

wweet(et) = Σ CosSim(qi, et), summed over the query terms qi for which et is selected    (8)
step 5, intersect the rule expansion word set with the word vector expansion word set to obtain the final expansion words, realizing the intersection expansion of word vectors and association modes, the specific steps being as follows:

(5.1) intersect the rule expansion word set ARET with the word vector expansion word set WEET to obtain the final expansion word set FETS of the original query term set Q, realizing the intersection expansion of word vectors and association modes; calculate the final expansion word weights wFet, and then move to step 6;

FETS is shown in equation (9):

FETS = ARET ∩ WEET = {Fet1, Fet2, …, Fetn, …}    (9)

in equation (9), Fetn denotes the nth final expansion word;

the final expansion word weight wFet is the sum of the rule expansion word weight wRet and the word vector expansion word weight wweet, as shown in equation (10):

wFet = wRet + wweet    (10)
step 6, combine the final expansion words with the original query terms into a new query, retrieve the document set again, obtain the final retrieval result, and return it to the user.
CN202010774137.8A 2020-08-04 2020-08-04 Text retrieval method based on intersection expansion of word vector and association mode Withdrawn CN111897923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010774137.8A CN111897923A (en) 2020-08-04 2020-08-04 Text retrieval method based on intersection expansion of word vector and association mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010774137.8A CN111897923A (en) 2020-08-04 2020-08-04 Text retrieval method based on intersection expansion of word vector and association mode

Publications (1)

Publication Number Publication Date
CN111897923A 2020-11-06

Family

ID=73245599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010774137.8A Withdrawn CN111897923A (en) 2020-08-04 2020-08-04 Text retrieval method based on intersection expansion of word vector and association mode

Country Status (1)

Country Link
CN (1) CN111897923A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication (application publication date: 20201106)