CN108062355A - Query word expansion method based on pseudo-feedback and TF-IDF - Google Patents

Info

Publication number
CN108062355A
Authority
CN
China
Prior art keywords
word
inquiry
constraint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711179719.6A
Other languages
Chinese (zh)
Other versions
CN108062355B (en)
Inventor
徐志文
田绪红
古万荣
毛宜军
王国华
李吉平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Agricultural University
Priority to CN201711179719.6A
Publication of CN108062355A
Application granted
Publication of CN108062355B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a query word expansion method based on pseudo-feedback and TF-IDF. The method first selects query constraint words in a principled way, then applies the two rounds of screening proposed by the invention to obtain the words finally used for query word expansion, and finally scores and ranks the documents with the scoring formula proposed by the invention. The invention is characterized by a new way of selecting query constraint words and candidate words, with two screening passes that remove irrelevant words. It also combines the traditional BM25 scoring formula with a new scoring formula designed specifically for query word expansion, so that the documents returned by the expanded query are scored more soundly and a more reasonable ranking of search results is obtained.

Description

Query word expansion method based on pseudo-feedback and TF-IDF
Technical field
The present invention relates to the fields of vertical search and search term expansion, and in particular to a query word expansion method based on pseudo-feedback and TF-IDF.
Background
Search engines have become an important tool for obtaining information, but users sometimes cannot provide accurate search terms, which leads to unsatisfactory search results. Query word expansion techniques emerged in order to return more useful information to the user.
When the content a user is looking for can be expressed in many different ways, retrieving only by the user's own search terms easily leads to poor term matching and unsatisfactory results. The system should therefore return as many plausible results as possible, rank them with a weight formula, and let the user choose, thereby optimizing the search. Traditional search term expansion has the following problems:
1) too many expansion words produce too many returned results, and users often cannot read through them all;
2) weights are computed for the many candidate words of many expansion words, so the computation time is excessive;
3) many irrelevant words may be treated as high-weight words during ranking, so that irrelevant data appears among the top-ranked results;
4) expansion words taken entirely from a synonym table give poor expansion results.
Summary of the invention
The object of the invention is to overcome the shortcomings of existing query word expansion by proposing a query word expansion method based on pseudo-feedback and TF-IDF. The method defines a preliminary selection of candidate words and a secondary screening of those candidates, giving a relatively reasonable query word expansion method.
To achieve this object, the technical solution provided by the invention is a query word expansion method based on pseudo-feedback and TF-IDF, comprising the following steps:
1) select the query constraint words;
2) select the initial expansion candidate words;
3) screen them to obtain the secondary expansion candidate words;
4) obtain the final expansion words by computing scores;
5) build the query from the expansion words obtained in step 4);
6) rank the documents by weight.
In step 1), stop words are removed from the query statement and the statement is segmented into words. Each resulting word is first queried on its own. The words are denoted $n_1, n_2, n_3, \ldots$ and the number of feedback documents for each word is recorded as $Nn_1, Nn_2, Nn_3, \ldots$. All query words are then combined in one AND query, and the total number of feedback documents is recorded as $FA$; all query words are also combined in one OR query, and the total number of feedback documents is recorded as $FO$. To find the query constraint words of the query statement, take the first word $n_1$ as an example. One query is run for $n_1$: after removing $n_1$, all remaining query words are combined in an AND query, and the total number of feedback documents of this query is denoted $Fn_1$. The proportion of $FA$ to $Fn_1$ is then computed and recorded as $D_1$:

$$D_1 = \frac{FA}{Fn_1}$$

Next, the proportion of documents containing $n_1$ among the result documents of the OR query is computed. This represents the degree of freedom of $n_1$ (the more often a word appears in the OR results, the freer it is and the smaller its constraining effect) and is denoted $V_1$:

$$V_1 = \frac{Nn_1}{FO}$$

If $V_1$ is greater than $D_1$, the word $n_1$ is defined as a query constraint word. The same operation is repeated for the other words $n_2, n_3, \ldots$ until every word has been judged. The reasoning is as follows. $V_1$ represents the degree of freedom of $n_1$ in the OR query results. If the presence of $n_1$ had no constraining effect on the query, the proportion $D_1$ of the AND query containing $n_1$ to the AND query not containing $n_1$ should be greater than or equal to its degree of freedom $V_1$. If $V_1$ is greater than $D_1$, the presence of $n_1$ reduces the number of feedback documents of the AND query by more than the threshold. (If $V_1$ equals $D_1$, the reduction in the AND query's feedback documents, and hence the constraining effect on the query statement, is exactly at the threshold. If $V_1$ is less than $D_1$, the constraining effect of $n_1$ on the query statement is below the threshold.) In other words, the degree of freedom of $n_1$ can be regarded as the size of the beneficial effect it could contribute to the query statement; if its presence does not deliver even this minimum benefit, it is considered to constrain the query statement and is therefore defined as a query constraint word. Query word expansion is performed on the query constraint words. If the query contains only one word, that word is directly taken as the query constraint word.
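The selection rule of step 1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout, the sample counts, and the guard against empty result sets are hypothetical and do not come from the patent text; in practice the counts would come from the search engine.

```python
from typing import Dict, List


def select_constraint_words(
    words: List[str],
    nn: Dict[str, int],  # Nn_x: documents returned when word x is queried alone
    fn: Dict[str, int],  # Fn_x: documents returned by the AND query without word x
    fa: int,             # FA: documents returned by the AND query over all words
    fo: int,             # FO: documents returned by the OR query over all words
) -> List[str]:
    """Return the words whose degree of freedom V_x exceeds the threshold D_x."""
    # A single-word query is taken directly as the constraint word.
    if len(words) == 1:
        return list(words)

    constraint_words = []
    for w in words:
        # Assumption: skip words when a denominator is zero; the text does not cover this case.
        if fn[w] == 0 or fo == 0:
            continue
        d = fa / fn[w]  # D_x = FA / Fn_x
        v = nn[w] / fo  # V_x = Nn_x / FO
        if v > d:
            constraint_words.append(w)
    return constraint_words


# Made-up counts for a three-word query.
counts_nn = {"tomato": 300, "disease": 450, "treatment": 120}
counts_fn = {"tomato": 50, "disease": 25, "treatment": 40}
selected = select_constraint_words(
    ["tomato", "disease", "treatment"], counts_nn, counts_fn, fa=20, fo=500
)
print(selected)
```

With these counts, "tomato" (V = 0.6, D = 0.4) and "disease" (V = 0.9, D = 0.8) are selected, while "treatment" (V = 0.24, D = 0.5) is not.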
In step 2), each query constraint word is queried on its own to obtain its feedback documents. Stop words are removed from the content of these documents and the content is segmented into words. A score is then computed for each word as TFnew*IDF, and the 50 highest-ranked words are taken as the initial candidate expansion words of that constraint word. Let $w_i d_j$ denote the word frequency of candidate word $w_i$ in feedback document $d_j$ of the constraint word, and let $k$ be the number of feedback documents of the constraint word. TFnew is computed as follows, where max() returns the maximum and min() the minimum of its arguments:

$$TFnew(w_i) = \frac{w_i d_1 + w_i d_2 + \cdots + w_i d_k - \max(w_i d_1, w_i d_2, \ldots, w_i d_k) - \min(w_i d_1, w_i d_2, \ldots, w_i d_k)}{k - 2} \times k$$

Let $N$ be the total number of documents in the corpus and $w_i N$ the number of documents containing $w_i$. The IDF of $w_i$ is:

$$IDF(w_i) = \log\left(\frac{N}{w_i N + 1}\right)$$
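A minimal sketch of the TFnew*IDF score, assuming the formulas TFnew = (sum of frequencies minus the maximum and the minimum) / (k - 2) * k and IDF = log(N / (w_iN + 1)) as given in the claims. The function names, the choice of the natural logarithm, and the error raised for k <= 2 are assumptions made for illustration.

```python
import math
from typing import Dict, List


def tf_new(freqs: List[int]) -> float:
    """Trimmed term frequency: drop the largest and smallest count, rescale to k documents."""
    k = len(freqs)
    if k <= 2:
        # The formula divides by (k - 2); the text does not define this case.
        raise ValueError("TFnew needs more than two feedback documents")
    trimmed_sum = sum(freqs) - max(freqs) - min(freqs)
    return trimmed_sum / (k - 2) * k


def idf(total_docs: int, docs_with_word: int) -> float:
    """IDF(w_i) = log(N / (w_iN + 1)); the logarithm base is an assumption."""
    return math.log(total_docs / (docs_with_word + 1))


def top_candidates(scores: Dict[str, float], limit: int = 50) -> List[str]:
    """Return the highest-scoring words, at most `limit` of them."""
    return sorted(scores, key=scores.get, reverse=True)[:limit]


# Frequencies of one candidate word in five feedback documents (made-up values).
freqs = [0, 2, 3, 3, 12]
score = tf_new(freqs) * idf(total_docs=1000, docs_with_word=9)
print(round(score, 3))
```

Dropping the extreme counts (0 and 12 here) keeps a single unusually long or short document from dominating the candidate's score.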
In step 3), for each initial candidate word, its average word frequency in the feedback documents of its corresponding query constraint word is compared with its average word frequency in its own feedback documents. If the latter is larger, the word is removed from the initial candidate set; what remains is the secondary candidate set. This removes words that are related to the constraint word but are not related only to it. Let $X_i$ be the number of times a word occurs in the $i$-th of $n$ documents. The average word frequency TFavg is:

$$TFavg(w_i) = \sum_{i=1}^{n} \frac{X_i}{n}$$
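The screening rule of step 3) can be sketched as follows. The data layout (one list of per-document counts per candidate word), the function names, and the treatment of an empty document list as an average of 0 are assumptions for illustration only.

```python
from typing import Dict, List


def tf_avg(counts: List[int]) -> float:
    """Average term frequency over n documents: sum(X_i) / n."""
    # Assumption: an empty document list yields 0.0; the text does not cover this case.
    return sum(counts) / len(counts) if counts else 0.0


def screen_candidates(
    in_constraint_docs: Dict[str, List[int]],  # counts in the constraint word's feedback documents
    in_own_docs: Dict[str, List[int]],         # counts in the candidate's own feedback documents
) -> List[str]:
    """Keep candidates that are not more frequent in their own results than in the constraint word's."""
    kept = []
    for word, counts in in_constraint_docs.items():
        if tf_avg(in_own_docs[word]) > tf_avg(counts):
            continue  # rejected: the latter average is larger
        kept.append(word)
    return kept


# Made-up counts for two candidate words of one constraint word.
kept = screen_candidates(
    {"blight": [2, 3, 1], "plant": [1, 1, 1]},
    {"blight": [2, 2, 2], "plant": [4, 5, 3]},
)
print(kept)
```

Here "plant" is rejected because it averages 4 occurrences in its own results but only 1 in the constraint word's results, while "blight" (2 in both) is kept.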
In step 4), the candidate word weight formula is used:

$$S(w, q) = (d + 1)^{-1}$$

Here $w$ is the candidate word, $q$ is the constraint word, and $d$ is the absolute value of the distance between the word vectors of the two words, representing the distance between the candidate word and the constraint word. $(d + 1)^{-1}$ is used to score each candidate word for two reasons. First, it guarantees that the score $S$ decreases as the distance increases. Second, the score changes sharply when the distance is small and only slightly when the distance is large, which better matches reality and separates the candidates more clearly. The results are sorted and the three highest-ranked words are taken as the expansion words of the constraint word. The operation is repeated until expansion words have been selected for every query constraint word.
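The scoring and top-three selection of step 4) might look like the sketch below. The text does not name the distance metric or the word-vector model, so Euclidean distance over plain tuples is used here purely as an assumption; the function names and sample vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two word vectors (the metric is an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """S(w, q) = (d + 1)^-1, where d is the distance between the two word vectors."""
    return 1.0 / (distance(u, v) + 1.0)


def pick_expansion_words(
    constraint_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_n: int = 3,
) -> List[str]:
    """Rank candidates by S(w, q) and keep the top_n as expansion words."""
    ranked = sorted(
        candidates,
        key=lambda w: similarity(candidates[w], constraint_vec),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up two-dimensional vectors; real word vectors would have many more dimensions.
candidates = {"a": (3.0, 4.0), "b": (0.0, 1.0), "c": (0.0, 0.0), "d": (6.0, 8.0)}
expansion = pick_expansion_words((0.0, 0.0), candidates)
print(expansion)
```

The candidates lie at distances 5, 1, 0 and 10 from the constraint word, so the three closest ("c", "b", "a") are kept and "d" is dropped.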
In step 5), each constraint word and its 3 expansion words are joined with the logical operator OR to form a group. The groups formed for all constraint words are then joined with the logical operator AND, and finally joined by AND with the rest of the query statement. The query is run in this form to obtain the feedback documents.
In step 6), the original BM25 scoring formula is introduced:

$$Score(Q, d) = \sum_{i}^{n} W_i \cdot R(q_i, d)$$

In the formula, $W_i$ is the weight of term $q_i$ and $R(q_i, d)$ is the relevance of term $q_i$ to the document. The score formula $S(w, q)$ from step 4) is added to the original BM25 scoring formula, and the documents are ranked. Let $d$ be the document to be scored. Taking one query constraint word as an example, its expansion words are denoted $q_1, q_2, q_3$ and the constraint word itself is denoted $q_4$. Then $S(q_i, q_4)$ is the value computed for expansion word $q_i$ of constraint word $q_4$ with the scoring formula of step 4), based on their distance. Their weights in BM25 are $W_1, W_2, W_3, W_4$; together they form a query set $Q_A$; and their relevance to each document in BM25 is $R(q_i, d)$. First the score $S_A$ of each constraint word together with its expansion words is computed, and the $S_A$ values of all query constraint words and their expansion words are summed. Finally the score $S_B$ of the rest of the query statement is added, where $S_B$ is computed with the original BM25 scoring formula, that is, the BM25 formula without the step 4) score formula $S(w, q)$.
The calculation formula of $S_A$ is:
The sum of all $S_A$ values plus $S_B$ is the final score of each document. The results are then returned in descending order of score.
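The final aggregation can be sketched as below. The formula for $S_A$ is not reproduced in this text, so the sketch treats each $S_A$ value as an input computed elsewhere; only the generic form Score(Q, d) = sum of W_i * R(q_i, d), the sum of all $S_A$ plus $S_B$, and the descending sort are taken from the description. All names and numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_relevance_score(weights: Sequence[float], relevances: Sequence[float]) -> float:
    """Score(Q, d) = sum_i W_i * R(q_i, d); W_i and R(q_i, d) are supplied by the caller."""
    return sum(w * r for w, r in zip(weights, relevances))


def final_score(s_a_parts: Sequence[float], s_b: float) -> float:
    """Final document score: the sum of all S_A values plus S_B."""
    return sum(s_a_parts) + s_b


def rank_documents(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (document, score) pairs in descending order of score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up S_A values (one per constraint word) and S_B values for three documents.
scores = {
    "doc1": final_score([1.5, 0.5], 2.0),
    "doc2": final_score([0.25], 1.0),
    "doc3": final_score([3.0, 2.0], 0.5),
}
ranking = rank_documents(scores)
print(ranking)
```

Keeping $S_A$ as an opaque input avoids guessing at how $S(q_i, q_4)$ enters the BM25 terms, which the surviving text does not show.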
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention does not expand every word; it selects, in a principled way, the words that constrain the query result and expands only those.
2. When selecting candidate words, the invention discards the maximum and minimum word frequencies before computing, which makes the result fairer.
3. The invention screens the candidate words twice, which removes irrelevant candidates so that the final result contains no unnecessary expansion.
4. The invention uses $x^{-1}$ to compute the similarity between words, which better matches reality: the smaller the distance, the greater the similarity; when two words are very close the similarity changes sharply, and when they are far apart it changes very little.
5. When finally scoring each document, only the 3 final expansion words of each constraint word take part rather than all candidate words, which saves computation time. The similarity between each expansion word and its constraint word is also included in the score as an auxiliary bonus term.
Brief description of the drawings
Fig. 1 is a flowchart of the processing steps of the method of the invention.
Detailed description of the embodiments
The invention is further described below with reference to a specific embodiment.
As shown in Fig. 1, the query word expansion method based on pseudo-feedback and TF-IDF provided by this embodiment comprises the following steps:
1) Select the query constraint words
First, stop words (such as modal particles and prepositions) are removed from the query statement and the statement is segmented; an existing segmenter such as the IK segmenter can be used for these operations. We first check whether more than one word results. If so, each word is first queried on its own in the search engine we have set up (for example Solr). For each word ($n_1, n_2, n_3, \ldots$), the number of feedback documents returned by its own query ($Nn_1, Nn_2, Nn_3, \ldots$) and the number of feedback documents of the AND query over all remaining query words after removing that word ($Fn_1, Fn_2, Fn_3, \ldots$) are stored in one-to-one correspondence in a list ListArrayA. All query words are then combined in one AND query and the total number of feedback documents is recorded in the integer variable FA; all query words are combined in one OR query and the total number of feedback documents is recorded in the integer variable FO. A floating-point array D records the threshold proportion of each word. Each element $D_x$ (where $x$ is the ordinal of the element rather than the array index, and runs from 1 until all words are covered) is computed as:

$$D_x = \frac{FA}{Fn_x}$$

A floating-point array V with the same number of elements as the list defined above records the degree of freedom of each word (the more often a word appears in the OR results, the freer it is and the smaller its constraining effect). Each element $V_x$ (with $x$ as above) is computed as:

$$V_x = \frac{Nn_x}{FO}$$

The elements of V are compared one by one with the corresponding elements of D. If $V_x$ is greater than $D_x$, the index of that element is used to look up the corresponding element in ListArrayA, and the word for that element is stored in a new list ListArrayB. In the end, all elements of ListArrayB are the query constraint words, and these are the words on which query word expansion is performed. A word with $V_x$ greater than $D_x$ is defined as a query constraint word for the following reason. $V_x$ represents the degree of freedom of $n_x$ in the OR query results. If the presence of $n_x$ had no constraining effect on the query, the proportion $D_x$ of the AND query containing $n_x$ to the AND query not containing $n_x$ should be greater than or equal to its degree of freedom $V_x$. If $V_x$ is greater than $D_x$, the presence of $n_x$ reduces the number of feedback documents of the AND query by more than the threshold. (If $V_x$ equals $D_x$, the reduction in the AND query's feedback documents, and hence the constraining effect on the query statement, is exactly at the threshold. If $V_x$ is less than $D_x$, the constraining effect of $n_x$ on the query statement is below the threshold.) In other words, the degree of freedom of $n_x$ can be regarded as the size of the beneficial effect it could contribute to the query statement; if its presence does not deliver even this minimum benefit, it is considered to constrain the query statement and is therefore defined as a query constraint word. If there is only one word, it is put directly into ListArrayB as the query constraint word.
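As a worked illustration of the comparison, using $D_x = FA / Fn_x$ and $V_x = Nn_x / FO$ as given in the claims, consider made-up counts (not taken from the patent) with $FA = 30$ and $FO = 1000$. Word $x$ has $Fn_x = 40$ and $Nn_x = 200$; word $y$ has $Fn_y = 300$ and $Nn_y = 400$.

```latex
\begin{aligned}
D_x &= \frac{FA}{Fn_x} = \frac{30}{40} = 0.75, & V_x &= \frac{Nn_x}{FO} = \frac{200}{1000} = 0.2, & V_x &< D_x \\
D_y &= \frac{FA}{Fn_y} = \frac{30}{300} = 0.1, & V_y &= \frac{Nn_y}{FO} = \frac{400}{1000} = 0.4, & V_y &> D_y
\end{aligned}
```

Under the stated rule, word $y$ would be stored in ListArrayB as a query constraint word and word $x$ would not.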
2) Select the initial expansion candidate words
We first define ListArrayC, a list of string arrays (each of size 50). The following operation is performed on each element of ListArrayB in turn, and the results are stored in ListArrayC in corresponding order. Take the first constraint word: it is queried on its own to obtain its feedback documents; stop words are removed from the document text and the text is segmented; a score is computed for each word as TFnew*IDF; and the 50 highest-ranked words other than the constraint word itself are taken as its initial candidate expansion words, that is, stored in the first string array of ListArrayC. The operation is repeated for the remaining constraint words. Let $w_i d_j$ denote the word frequency of candidate word $w_i$ in feedback document $d_j$ of the constraint word, and let $k$ be the number of feedback documents of the constraint word. TFnew is computed as follows, where max() returns the maximum and min() the minimum of its arguments:

$$TFnew(w_i) = \frac{w_i d_1 + w_i d_2 + \cdots + w_i d_k - \max(w_i d_1, w_i d_2, \ldots, w_i d_k) - \min(w_i d_1, w_i d_2, \ldots, w_i d_k)}{k - 2} \times k$$

Let $N$ be the total number of documents in the corpus and $w_i N$ the number of documents containing $w_i$. The IDF is:

$$IDF(w_i) = \log\left(\frac{N}{w_i N + 1}\right)$$
3) Screen to obtain the secondary expansion candidate words
For each initial candidate word, its average word frequency in the feedback documents of its corresponding query constraint word is compared with its average word frequency in its own feedback documents. If the latter is larger, the word is removed from the initial candidate set. The remaining words form the secondary candidate set, which is stored in ListArrayD. This removes words that are related to the constraint word but are not related only to it. Let $X_i$ be the number of times a word occurs in the $i$-th of $n$ documents. The average word frequency TFavg is:

$$TFavg(w_i) = \sum_{i=1}^{n} \frac{X_i}{n}$$
4) Obtain the final expansion words by computing scores
The candidate word weight formula is used:

$$S(w, q) = (d + 1)^{-1}$$

Here $w$ is the candidate word, $q$ is the constraint word, and $d$ is the absolute value of the distance between the word vectors of the two words, representing the distance between the candidate word and the constraint word. $(d + 1)^{-1}$ is used to score each candidate word for two reasons. First, it guarantees that the score $S$ decreases as the distance increases. Second, the score changes sharply when the distance is small and only slightly when the distance is large, which better matches reality and separates the candidates more clearly. The results are sorted, and the three highest-ranked words are taken as the expansion words of the constraint word and stored in ListArrayE. The operation is repeated until expansion words have been selected for every query constraint word.
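A few sample values of $S = (d + 1)^{-1}$ illustrate the shape described above. The distances are arbitrary illustrative choices, not values from the patent.

```latex
\begin{aligned}
S\big|_{d=0} &= \frac{1}{0 + 1} = 1, & S\big|_{d=1} &= \frac{1}{1 + 1} = 0.5, \\
S\big|_{d=9} &= \frac{1}{9 + 1} = 0.1, & S\big|_{d=10} &= \frac{1}{10 + 1} \approx 0.0909
\end{aligned}
```

Moving from $d = 0$ to $d = 1$ lowers the score by 0.5, whereas moving from $d = 9$ to $d = 10$ lowers it by only about 0.009, so nearby candidates are separated much more strongly than distant ones.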
5) Build the query from the expansion words obtained in step 4)
All words in ListArrayE and their corresponding constraint words are combined into a query of the following form: (expansion word 1 of constraint word 1 OR expansion word 2 of constraint word 1 OR expansion word 3 of constraint word 1 OR constraint word 1) AND (expansion word 1 of constraint word 2 OR expansion word 2 of constraint word 2 OR expansion word 3 of constraint word 2 OR constraint word 2) ... AND the rest of the query statement. The query is run in this form.
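The query form above can be assembled with a small helper. This is a sketch only: the function name, the input layout, and the sample words are hypothetical, and the output is a generic boolean query string without any escaping or field names that a particular search engine might require.

```python
from typing import Dict, List


def build_expanded_query(expansions: Dict[str, List[str]], other_terms: List[str]) -> str:
    """Join each constraint word with its expansion words by OR, then join all groups by AND."""
    groups = []
    for constraint_word, expansion_words in expansions.items():
        # Expansion words first, then the constraint word itself, as in the form above.
        alternatives = expansion_words + [constraint_word]
        groups.append("(" + " OR ".join(alternatives) + ")")
    # The remaining, non-constraint query words are appended with AND.
    return " AND ".join(groups + other_terms)


# Made-up constraint words and expansion words.
query = build_expanded_query(
    {
        "tomato": ["lycopersicon", "solanum", "nightshade"],
        "blight": ["rot", "mildew", "fungus"],
    },
    other_terms=["treatment"],
)
print(query)
```

Each OR group widens the match for one constraint word, while the AND between groups keeps every original concept required in the result.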
6) Rank the documents by weight
The feedback documents from step 5) are scored and then sorted. The scoring principle is as follows. The original BM25 scoring formula is introduced:

$$Score(Q, d) = \sum_{i}^{n} W_i \cdot R(q_i, d)$$

In the formula, $W_i$ is the weight of term $q_i$ and $R(q_i, d)$ is the relevance of term $q_i$ to the document. The score formula $S(w, q)$ of step 4) is added to the original BM25 scoring formula so that each expansion word also takes part in the scoring. Let $d$ be the document to be scored. Taking one query constraint word as an example, its expansion words are denoted $q_1, q_2, q_3$ and the constraint word itself is denoted $q_4$. Then $S(q_i, q_4)$ is the value computed for expansion word $q_i$ of constraint word $q_4$ with the scoring formula of step 4), based on their distance. Their weights in BM25 are $W_1, W_2, W_3, W_4$; together they form a query set $Q_A$; and their relevance to each document in BM25 is $R(q_i, d)$. First the score $S_A$ of each constraint word together with its expansion words is computed, and the $S_A$ values of all query constraint words and their expansion words are summed. Finally the score $S_B$ of the rest of the query statement is added, where $S_B$ is computed with the original BM25 scoring formula, that is, the BM25 formula without the step 4) score formula $S(w, q)$.
The calculation formula of $S_A$ is:
The sum of all $S_A$ values plus $S_B$ is the final score of each document. The results are then returned in descending order of score.
The embodiment described above is only a preferred embodiment of the invention and is not intended to limit its scope of implementation; all changes made according to the shape and principle of the invention shall fall within the scope of protection of the invention.

Claims (6)

1. A query word expansion method based on pseudo-feedback and TF-IDF, characterized by comprising the following steps:
1) select the query constraint words;
2) select the initial expansion candidate words;
3) screen them to obtain the secondary expansion candidate words;
4) obtain the final expansion words by computing scores;
5) build the query from the expansion words obtained in step 4);
6) rank the documents by weight.
2. The query word expansion method based on pseudo-feedback and TF-IDF according to claim 1, characterized in that: in step 1), stop words are removed from the query statement and the statement is segmented into words; each resulting word is first queried on its own; the words are denoted $n_1, n_2, n_3, \ldots$ and the number of feedback documents for each word is recorded as $Nn_1, Nn_2, Nn_3, \ldots$; all query words are combined in one AND query and the total number of feedback documents is recorded as $FA$; all query words are combined in one OR query and the total number of feedback documents is recorded as $FO$; to find the query constraint words of the query statement, taking the first word $n_1$ as an example, one query is run for $n_1$, in which, after removing $n_1$, all remaining query words are combined in an AND query, and the total number of feedback documents of this query is denoted $Fn_1$; the proportion of $FA$ to $Fn_1$ is then computed and recorded as $D_1$, where $D_1$ is computed as:

$$D_1 = \frac{FA}{Fn_1}$$

Next, the proportion of documents containing $n_1$ among the result documents of the OR query is computed; this represents the degree of freedom of $n_1$ and is denoted $V_1$, where $V_1$ is computed as:

$$V_1 = \frac{Nn_1}{FO}$$

If $V_1$ is greater than $D_1$, the word $n_1$ is defined as a query constraint word; the same operation is repeated for the other words $n_2, n_3, \ldots$ until every word has been judged. A word with $V_1$ greater than $D_1$ is defined as a query constraint word because $V_1$ represents the degree of freedom of $n_1$ in the OR query results: if the presence of $n_1$ had no constraining effect on the query, the proportion $D_1$ of the AND query containing $n_1$ to the AND query not containing $n_1$ should be greater than or equal to its degree of freedom $V_1$; if $V_1$ is greater than $D_1$, the presence of $n_1$ reduces the number of feedback documents of the AND query by more than the threshold, that is, its constraining effect on the query statement is greater than the threshold; if $V_1$ equals $D_1$, the reduction in the AND query's feedback documents, and hence the constraining effect on the query statement, is exactly at the threshold; if $V_1$ is less than $D_1$, the constraining effect of $n_1$ on the query statement is below the threshold; in other words, the degree of freedom of $n_1$ can be regarded as the size of the beneficial effect it could contribute to the query statement, and if its presence does not deliver even this minimum benefit, it is considered to constrain the query statement and is therefore defined as a query constraint word; in addition, query word expansion is performed on the query constraint words, and if the query contains only one word, that word is directly taken as the query constraint word.
3. The query word expansion method based on pseudo-feedback and TF-IDF according to claim 1, characterized in that: in step 2), each query constraint word is queried on its own to obtain its feedback documents; stop words are removed from the content of these documents and the content is segmented into words; a score is then computed for each word as TFnew*IDF, and the 50 highest-ranked words are taken as the initial candidate expansion words of that constraint word; let $w_i d_j$ denote the word frequency of candidate word $w_i$ in feedback document $d_j$ of the constraint word, and let $k$ be the number of feedback documents of the constraint word; TFnew is computed as:

$$TFnew(w_i) = \frac{w_i d_1 + w_i d_2 + \cdots + w_i d_k - \max(w_i d_1, w_i d_2, \ldots, w_i d_k) - \min(w_i d_1, w_i d_2, \ldots, w_i d_k)}{k - 2} \times k$$

where max() returns the maximum and min() the minimum of its arguments;
let $N$ be the total number of documents in the corpus and $w_i N$ the number of documents containing $w_i$; the IDF of $w_i$ is computed as:

$$IDF(w_i) = \log\left(\frac{N}{w_i N + 1}\right).$$
4. The query word expansion method based on pseudo-feedback and TF-IDF according to claim 1, characterized in that: in step 3), for each initial candidate word, its average word frequency in the feedback documents of its corresponding query constraint word is compared with its average word frequency in its own feedback documents; if the latter is larger, the word is removed from the initial candidate set, giving the secondary candidate set, which ensures that words that are related to the constraint word but are not related only to it are removed; let $X_i$ be the number of times a word occurs in the $i$-th of $n$ documents; the average word frequency TFavg is computed as:

$$TFavg(w_i) = \sum_{i=1}^{n} \frac{X_i}{n}.$$
5. The query term expansion method based on pseudo feedback and TF-IDF according to claim 1, characterized in that: in step 4), the candidate word weight is calculated with the formula:
$$S(w, q) = (d + 1)^{-1}$$
In the formula, w denotes the candidate word, q denotes the constraint word, and d is the absolute value of the distance between the word vectors of the two words, i.e. the distance between the candidate word and the constraint word. The score of each candidate word is calculated with (d+1)^{-1} for two reasons. First, it guarantees that the score S decreases gradually as the distance increases. Second, it keeps the score within a preset range: the score increases within that range when the distance is small and decreases within that range when the distance is large, which better matches reality and makes it easier to separate the candidates. The results are sorted and the top three words are taken as the expansion words of the constraint word; these operations are repeated until expansion words have been selected for every query constraint word;
In step 6), the original BM25 scoring formula is introduced:
$$\mathrm{Score}(Q, d) = \sum_{i}^{n} W_i \cdot R(q_i, d)$$
In the formula, W_i is the weight of the current query term and R(q_i, d) is the relevance between the current query term and the document. The score formula S(w, q) from step 4) is added to the original BM25 scoring formula, and the documents are ranked. Denote the document to be scored as d. Taking one query constraint word as an example, denote its expansion words as q_1, q_2, q_3 and the query constraint word itself as q_4; then S(q_i, q_4) is the score of expansion word q_i with respect to the query constraint word q_4, calculated from their distance using the scoring formula in step 4). Their weights in BM25 are W_1, W_2, W_3, W_4 respectively; together they are denoted as a query set Q_A, and their relevance to each document in BM25 is denoted R(q_i, d). First, the score S_A of each constraint word together with its expansion words is calculated; then the S_A values calculated for all query constraint words and their expansion words are added together; finally, the score S_B of the other query statements is added, where S_B is calculated for them with the original BM25 scoring formula, i.e. the BM25 formula without the score formula S(w, q) from step 4);
S_A is calculated as:
$$S_A(Q_A, d) = \sum_{i=1}^{4} S(q_i, q_4)\, W_i \cdot R(q_i, d)$$
The sum of all S_A values plus S_B is the final score of each document; the results are then returned in descending order of score.
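The scoring steps of this claim can be sketched as follows. This is an illustration under stated assumptions: the distance d between two word vectors is taken as an input (the claim does not say which distance measure is used), the BM25 quantities W_i and R(q_i, d) are assumed to come from an existing BM25 implementation, and all function names are hypothetical. Since the distance of the constraint word to itself is 0, its own factor S(q_4, q_4) evaluates to 1.

```python
def candidate_score(d):
    """S(w, q) = (d + 1)^-1, where d is the word-vector distance."""
    return 1.0 / (abs(d) + 1.0)


def top_expansion_words(distances, top_n=3):
    """Rank candidates by S(w, q) and keep the top three by default.

    distances: dict mapping candidate word -> distance to the constraint word.
    """
    ranked = sorted(
        distances,
        key=lambda w: candidate_score(distances[w]),
        reverse=True,
    )
    return ranked[:top_n]


def s_a(group):
    """S_A for one constraint word and its expansion words.

    group: list of (distance_to_constraint, W_i, R_i) tuples, one per term;
    the constraint word itself is included with distance 0.
    """
    return sum(candidate_score(d) * w * r for d, w, r in group)


def final_score(groups, s_b):
    """Final document score: the sum of all S_A values plus S_B."""
    return sum(s_a(g) for g in groups) + s_b
```

Because S(w, q) is at most 1, an expansion word never contributes more than the original BM25 term would, and more distant expansion words are weighted down.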
6. The query term expansion method based on pseudo feedback and TF-IDF according to claim 1, characterized in that: in step 5), each constraint word and its 3 expansion words are connected with the logical relation OR to form a set; the sets formed by all constraint words and their 3 expansion words are then connected with the logical relation AND; finally, these are connected with the other query statements by the logical relation AND, and the query is run with this relation to obtain the feedback documents.
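The OR/AND combination described in this claim can be sketched as a string-building helper. The query syntax shown (parentheses, `OR`, `AND`) is a generic Boolean form chosen for illustration and is an assumption; the claim does not name a search engine or query language. The function name `build_query` is hypothetical.

```python
def build_query(constraint_groups, other_clauses):
    """Build a Boolean query string.

    constraint_groups: dict mapping each constraint word to the list of its
        expansion words; each group is joined with OR.
    other_clauses: the remaining query statements, joined with AND.
    """
    parts = []
    for constraint, expansions in constraint_groups.items():
        terms = [constraint] + list(expansions)
        parts.append("(" + " OR ".join(terms) + ")")
    parts.extend(other_clauses)
    return " AND ".join(parts)
```

Joining each group with OR lets a document match through any of the four words, while AND between groups keeps every original constraint represented in the result.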
CN201711179719.6A 2017-11-23 2017-11-23 Query term expansion method based on pseudo feedback and TF-IDF Active CN108062355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711179719.6A CN108062355B (en) 2017-11-23 2017-11-23 Query term expansion method based on pseudo feedback and TF-IDF

Publications (2)

Publication Number Publication Date
CN108062355A true CN108062355A (en) 2018-05-22
CN108062355B CN108062355B (en) 2020-07-31

Family

ID=62135023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711179719.6A Active CN108062355B (en) 2017-11-23 2017-11-23 Query term expansion method based on pseudo feedback and TF-IDF

Country Status (1)

Country Link
CN (1) CN108062355B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2104044A1 (en) * 2008-03-18 2009-09-23 Korea Advanced Institute Of Science And Technology Query expansion method using augmented terms for improving precision without degrading recall
CN101876979A (en) * 2009-04-28 2010-11-03 株式会社理光 Query expansion method and equipment
US8280900B2 (en) * 2010-08-19 2012-10-02 Fuji Xerox Co., Ltd. Speculative query expansion for relevance feedback
CN103678412A (en) * 2012-09-21 2014-03-26 北京大学 Document retrieval method and device
CN107247745A (en) * 2017-05-23 2017-10-13 华中师范大学 A kind of information retrieval method and system based on pseudo-linear filter model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
VAIDYANATHAN, REKHA, SUJOY DAS, AND NAMITA SRIVASTAVA: "Query expansion strategy based on pseudo relevance feedback and term weight scheme for monolingual retrieval", 《INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS》 *
巩玉玺,王大玲: "一种改进的基于伪相关反馈的查询扩展", 《微计算机信息》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110047A (en) * 2019-04-30 2019-08-09 中国农业科学院农业信息研究所 Subject content polymerization analysis method based on TF-IDF and domain lexicon
CN110110047B (en) * 2019-04-30 2021-03-19 中国农业科学院农业信息研究所 Topic content aggregation analysis method based on TF-IDF and domain dictionary
CN110442777A (en) * 2019-06-24 2019-11-12 华中师范大学 Pseudo-linear filter model information search method and system based on BERT
CN110442777B (en) * 2019-06-24 2022-11-18 华中师范大学 BERT-based pseudo-correlation feedback model information retrieval method and system
CN111897928A (en) * 2020-08-04 2020-11-06 广西财经学院 Chinese query expansion method for embedding expansion words into query words and counting expansion word union
CN112307182A (en) * 2020-10-29 2021-02-02 上海交通大学 Question-answering system-based pseudo-correlation feedback extended query method
CN112307182B (en) * 2020-10-29 2022-11-04 上海交通大学 Question-answering system-based pseudo-correlation feedback extended query method

Also Published As

Publication number Publication date
CN108062355B (en) 2020-07-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant