CN108062355A - Query word expansion method based on pseudo-relevance feedback and TF-IDF - Google Patents
- Publication number: CN108062355A (application CN201711179719.6A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G—PHYSICS / G06—COMPUTING; CALCULATING OR COUNTING / G06F—ELECTRIC DIGITAL DATA PROCESSING / G06F16/00—Information retrieval; Database structures therefor; File system structures therefor / G06F16/30—Information retrieval of unstructured textual data / G06F16/33—Querying
- G06F16/334—Query execution
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
- G06F16/3338—Query expansion
- G06F16/338—Presentation of query results
Abstract
The invention discloses a query word expansion method based on pseudo-relevance feedback and TF-IDF. The method selects query constraint words in a principled way, applies the two rounds of screening proposed by the invention to obtain the words usable for query expansion, and then scores and ranks the documents with a scoring formula proposed by the invention. The distinguishing features of the invention are a new way of selecting query constraint words and candidate words, and two screening operations that remove unrelated words. It also builds on the traditional BM25 scoring formula to create a new scoring formula designed specifically for query word expansion, so that the result documents of the expanded query can be scored more soundly and a better-founded ranking of the search results can be obtained.
Description
Technical field
The present invention relates to the fields of vertical search and search term expansion, and in particular to a query word expansion method based on pseudo-relevance feedback and TF-IDF.
Background technology
Search engines have become an important tool for people to obtain the information they need, but because users cannot always formulate precise search terms, search results are often unsatisfactory. Query word expansion technology came into being in order to provide more useful information to the user.
When the content a user wants to find can be expressed in many different ways, retrieving only with the user's literal search terms easily produces poor term matching and unsatisfactory results. We should therefore feed back as many plausible results as possible, then rank and select them for the user according to a weight calculation formula, so as to achieve the goal of optimizing the search. Traditional search term expansion has the following problems:
1) Too many expansion words produce too many feedback results, and users tend not to read all of them;
2) The weights of many candidate expansion words must be computed, which takes too much time;
3) Many unrelated words may still be counted as high-weight words during ranking, so irrelevant data appears among the top results;
4) Expanding entirely according to a synonym table makes the expansion ineffective.
Summary of the invention
It is an object of the invention to overcome the shortcomings of existing query word expansion. A query word expansion method based on pseudo-relevance feedback and TF-IDF is proposed; the method defines a preliminary selection of query candidate words and a secondary screening of those candidates, giving a comparatively well-founded query word expansion method.
To achieve the above object, the technical solution provided by the present invention is a query word expansion method based on pseudo-relevance feedback and TF-IDF, comprising the following steps:
1) select the query constraint words;
2) select the initial expansion candidate words;
3) screen to obtain the secondary expansion candidate words;
4) obtain the final expansion words by computing scores;
5) create a query from the expansion words of step 4);
6) rank the documents by weight.
In step 1), remove stop words from the query sentence and segment it. First run each separated word once as an individual query; denote the words n1, n2, n3, ..., and record the feedback document count of each, Nn1, Nn2, Nn3, .... Then run one AND query over all query words and record its feedback document total FA, and one OR query over all query words and record its feedback document total FO. To find the constraint words of the query sentence, take the first word n1 as an example: remove n1 and run an AND query over all remaining query words, denoting the feedback document total of that query Fn1. Then compute the proportion FA takes of Fn1, recorded as D1, whose calculation formula is:
D1 = FA / Fn1
Afterwards, compute the proportion of documents containing n1 among the OR query's result documents, which represents the degree of freedom of n1 (the more often it appears in the OR results, the freer it is and the smaller its shrinking effect), denoted V1, whose calculation formula is:
V1 = Nn1 / FO
If V1 > D1, define the word n1 as a query constraint word, and repeat the above operations for the other words n2, n3, ... until every word has been judged. The reason V1 > D1 defines a constraint word is that V1 represents the degree of freedom of n1 in the OR query results: if the presence of n1 had no shrinking effect on the query, then D1, the proportion that the AND query including n1 takes of the AND query excluding n1, should be greater than or equal to its degree of freedom V1. If V1 > D1, the presence of n1 reduces the AND query's feedback document count by more than the threshold (if V1 = D1, the reduction of the AND query's feedback document count, i.e. the shrinking effect on the query sentence, is exactly at the threshold; if V1 < D1, the shrinking effect of the word n1 on the query sentence is below the threshold). In other words, the degree of freedom of n1 can be viewed as the size of the beneficial effect it could have on the query sentence; if its presence does not even reach that minimum, it is considered to have a constraining effect on the query and is defined as a query constraint word. We then expand the query constraint words; if there is only one word, that word is directly set as the query constraint word.
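The constraint-word selection above can be sketched in Python. The helper `count_hits(terms, op)` is an assumption of this sketch, not part of the patent: it stands for running a boolean query against the search engine and returning the feedback document count.

```python
# Minimal sketch of step 1 (constraint-word selection).
# `count_hits(terms, op)` is an assumed helper that runs a boolean
# query ("AND"/"OR") and returns the feedback document count.

def select_constraint_words(terms, count_hits):
    """Return the subset of query terms judged to be constraint words."""
    if len(terms) == 1:
        return list(terms)            # a single word is set as the constraint word
    FA = count_hits(terms, "AND")     # AND query over all query words
    FO = count_hits(terms, "OR")      # OR query over all query words
    constraints = []
    for t in terms:
        rest = [w for w in terms if w != t]
        Fn = count_hits(rest, "AND")  # AND query with t removed
        Nn = count_hits([t], "OR")    # documents containing t alone
        D = FA / Fn                   # proportion FA takes of Fn
        V = Nn / FO                   # degree of freedom of t
        if V > D:                     # shrinking effect above the threshold
            constraints.append(t)
    return constraints
```

A word is kept exactly when its degree of freedom V exceeds its shrinkage proportion D, matching the V1 > D1 test described above.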
In step 2), we run each query constraint word as an individual query to obtain its feedback documents, remove stop words from the content of those documents and segment it, then compute a score according to the TFnew*IDF of each word and take the top 50 ranked words as the initial candidate expansion words of that word. Define widj as the word frequency of candidate word wi in the constraint word's feedback document dj, and k as the constraint word's feedback document count. The calculation formula of TFnew (max() takes the maximum of all its arguments, min() the minimum) is:
TFnew(wi) = [(wid1 + wid2 + ... + widk) − max(wid1, wid2, ..., widk) − min(wid1, wid2, ..., widk)] / (k − 2) * k
Define N as the total number of documents in the corpus and wiN as the number of documents containing wi; the calculation formula of IDF(wi) is:
IDF(wi) = log(N / (wiN + 1))
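The TFnew*IDF score above can be sketched directly from the two formulas (function names are illustrative):

```python
import math

def tf_new(freqs):
    """TFnew(wi): the per-document frequencies of wi over the k feedback
    documents, with the maximum and minimum rejected, averaged over the
    remaining k - 2 documents and scaled by k (requires k > 2)."""
    k = len(freqs)
    return (sum(freqs) - max(freqs) - min(freqs)) / (k - 2) * k

def idf(N, doc_freq):
    """IDF(wi) = log(N / (wiN + 1)) over a corpus of N documents,
    doc_freq of which contain wi."""
    return math.log(N / (doc_freq + 1))

def candidate_weight(freqs, N, doc_freq):
    """Score used to rank candidate expansion words: TFnew * IDF."""
    return tf_new(freqs) * idf(N, doc_freq)
```

Dropping the single largest and smallest frequency before averaging is what makes one outlier document unable to dominate the candidate ranking.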
In step 3), we compare each initial candidate word's average word frequency in the feedback documents of its corresponding query constraint word with its average word frequency in its own feedback documents. If the latter is larger, the word is rejected from the initial candidate set, giving the secondary candidate set (this ensures that words which, although related to the constraint word, are not related only to the constraint word are removed). Denote by Xi the number of occurrences of a word in the i-th of n documents; the calculation formula of the average word frequency TFavg is:
TFavg = (X1 + X2 + ... + Xn) / n = Σ(i=1..n) Xi / n
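The averaging and screening rule above can be sketched as follows (helper names are illustrative):

```python
def tf_avg(counts):
    """Average word frequency TFavg: counts[i] is X_i, the number of
    occurrences of the word in the i-th of n documents."""
    return sum(counts) / len(counts)

def keep_candidate(avg_in_constraint_docs, avg_in_own_docs):
    """Secondary screening rule: reject the candidate when its average
    frequency in its own feedback documents exceeds its average
    frequency in the constraint word's feedback documents."""
    return avg_in_own_docs <= avg_in_constraint_docs
```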
In step 4), use the candidate word weight calculation formula:
S(w, q) = (d + 1)^(-1)
where w is the candidate word, q the constraint word, and d the absolute value of the distance between the word vectors of the two words, representing the distance between candidate and constraint word. The reason (d + 1)^(-1) is used to compute each candidate's score is, first, that it guarantees the score S shrinks gradually as the distance grows, and second, that it makes the score grow sharply when the distance is small and shrink only slightly when the distance is large, which better matches reality and more easily opens up gaps between candidates. Rank the results and take the top three words as the expansion words of that constraint word; repeat these operations until every query constraint word has selected its own expansion words.
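The scoring and top-3 selection above can be sketched in Python. The patent only says d is the distance between the two word vectors; using Euclidean distance here is an assumption of this sketch.

```python
import math

def candidate_score(vec_w, vec_q):
    """S(w, q) = (d + 1)^-1, where d is the distance between the word
    vectors of candidate w and constraint word q (Euclidean distance
    is an assumption; the patent does not name the metric)."""
    d = math.dist(vec_w, vec_q)
    return 1.0 / (d + 1.0)

def top_expansions(candidates, vectors, q, n=3):
    """Rank the secondary candidates by S(w, q) and keep the top n
    (n = 3 in the patent) as expansion words of constraint word q."""
    ranked = sorted(candidates,
                    key=lambda w: candidate_score(vectors[w], vectors[q]),
                    reverse=True)
    return ranked[:n]
```

Because S is the reciprocal of d + 1, small distance differences near the constraint word change the score far more than the same differences far away, which is exactly the gap-opening behavior described above.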
In step 5), connect each constraint word and its 3 expansion words with the logical relation OR to form a group; then connect all such groups of constraint words and their 3 expansion words with the logical relation AND; finally connect them with the other query sentences, again with the logical relation AND. Query with this relation to obtain the feedback documents.
In step 6), introduce the original BM25 scoring formula:
Score(Q, d) = Σ(i=1..n) Wi * R(qi, d)
where Wi is the weight of the i-th query term and R(qi, d) is the relevance of the current term to the document. The score formula of step 4), i.e. S(w, q), is added into the original BM25 scoring, and the documents are ranked. Denote the document to be scored d. Taking one query constraint word as an example, denote its expansion words q1, q2, q3 and the query constraint word itself q4; then S(qi, q4) represents the score of expansion word qi of constraint word q4, i.e. their distance computed with the scoring formula of step 4). Their weights in BM25 are W1, W2, W3, W4 respectively; denote them together as a query group QA, and their relevance to each document in BM25 as R(qi, d). First compute the score SA of each constraint word and its expansion-word part, then add up the SA computed for all query constraint words and their expansion words, and finally add the score SB computed for the other query sentences by the original BM25 scoring formula, i.e. the score SB computed for them with the BM25 formula without the step 4) score formula S(w, q);
the calculation formula of SA is:
SA(QA, d) = Σ(i=1..4) S(qi, q4) * Wi * R(qi, d)
The sum of all SA plus SB is the final score of each document; afterwards the results are returned in order from largest to smallest.
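The final scoring step above can be sketched in Python. Note that the source text does not spell out in full how S(w, q) enters the BM25 sum; scaling each group term by its similarity to the constraint word, as below, is one consistent reading rather than the patent verbatim.

```python
def bm25_score(weights, relevances):
    """Original BM25 form used in the patent:
    Score(Q, d) = sum_i W_i * R(q_i, d)."""
    return sum(w * r for w, r in zip(weights, relevances))

def group_score(sims, weights, relevances):
    """S_A for one constraint-word group of 4 terms (3 expansion words
    plus the constraint word itself): each BM25 term W_i * R(q_i, d) is
    scaled by its similarity S(q_i, q_4) to the constraint word.
    The exact combination is an assumed reading of the source."""
    return sum(s * w * r for s, w, r in zip(sims, weights, relevances))

def final_score(groups, other_weights, other_relevances):
    """Final score of a document: the sum of all S_A group scores plus
    the plain BM25 score S_B of the remaining query terms."""
    sa = sum(group_score(*g) for g in groups)
    sb = bm25_score(other_weights, other_relevances)
    return sa + sb
```

Because the constraint word's similarity to itself is S(q4, q4) = 1, it contributes its full BM25 term, while each expansion word contributes a contribution discounted by its distance to the constraint word.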
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The invention does not expand every word; it selects in a principled way and expands only the words that have a constraining effect on the query result, which is more scientific.
2. When selecting candidate words, the invention computes the word frequency after rejecting its maximum and minimum values, which makes the calculation result fairer.
3. The invention screens the candidate words twice and can reject unrelated candidates, so the final result contains no unnecessary expansion.
4. The invention uses the reciprocal form (d + 1)^(-1) to compute the similarity between words, which better matches reality: the smaller the distance, the larger the similarity; when two words are very close the similarity grows sharply, and when they are far apart it varies little.
5. When finally scoring each document, the invention needs only the 3 final candidate words to participate rather than all candidates, saving computation time. The similarity between expansion word and constraint word is also added into the scoring as an auxiliary bonus term, which is more scientific.
Description of the drawings
Fig. 1 is the processing flow chart of the method of the present invention.
Specific embodiment
The invention will be further described below with reference to a specific embodiment.
As shown in Figure 1, the query word expansion method based on pseudo-relevance feedback and TF-IDF provided by this embodiment comprises the following steps:
1) Select the query constraint words
First, remove stop words (including modal particles, prepositions, etc.) from the query sentence and segment it; an existing segmenter such as the IK segmenter can be used for these operations. We first judge whether more than 1 word was separated out. If so, run each separated word once as an individual query in the search engine we have set up (such as Solr). Pair each word (n1, n2, n3, ...) one-to-one with the feedback document count its query obtains (Nn1, Nn2, Nn3, ...) and with the feedback document count of the AND query over all remaining query words after that word is removed (Fn1, Fn2, Fn3, ...), and save the pairs in a linked list ListArrayA. Then run one AND query over all query words, recording the feedback document total in an integer variable FA, and one OR query over all of them, recording the feedback document total in an integer variable FO. Record the threshold proportion of each word in a floating-point array D, where each element Dx (x says which element it is, not the array index; x runs from 1 until all words are taken) has the calculation formula:
Dx = FA / Fnx
Then, according to the element count of the linked list defined before, define a floating-point array V of the same size to record the degree of freedom of each word (the more often it appears in the OR query's result documents, the freer it is and the smaller its shrinking effect), where each element Vx (x says which element it is, not the array index; x runs from 1 until all words are taken) has the calculation formula:
Vx = Nnx / FO
Compare the elements of array V one by one with the corresponding elements of D. Whenever Vx > Dx, take out that element's index, find the element with the corresponding index in ListArrayA, and store its word into a new linked list ListArrayB; all elements of the final ListArrayB are the query constraint words. Our purpose is to expand the query constraint words. The reason a word with Vx > Dx is defined as a query constraint word is that Vx represents the degree of freedom of nx in the OR results: if the presence of nx had no shrinking effect on the query, then Dx, the proportion that the AND query including nx takes of the AND query excluding nx, should be greater than or equal to its degree of freedom Vx. If Vx > Dx, the presence of nx reduces the AND query's feedback document count by more than the threshold (if Vx = Dx, the reduction of the AND query's feedback document count, i.e. the shrinking effect on the query sentence, is exactly at the threshold; if Vx < Dx, the shrinking effect of the word nx on the query sentence is below the threshold). In other words, the degree of freedom of nx can be viewed as the size of the beneficial effect it could have on the query sentence; if its presence does not even reach that minimum, it is considered to have a constraining effect on the query sentence and is defined as a query constraint word. We then expand the query constraint words; if there is only one word, that word is directly put into ListArrayB and set as the query constraint word.
2) Select the initial expansion candidate words
We first define a linked list ListArrayC of character arrays (size 50). Repeat the following operation one by one for the elements of ListArrayB, storing the results in the corresponding slots of ListArrayC. Take the first constraint word, run it as an individual query to obtain its feedback documents, remove stop words from those documents and segment them, then compute the score according to the TFnew*IDF of each word and take the top 50 ranked words, excluding the constraint word itself, as the initial candidate expansion words of that word, i.e. store them into the first character array in ListArrayC; repeat the above operation in turn. Define widj as the word frequency of candidate word wi in the constraint word's feedback document dj, and k as the constraint word's feedback document count. The calculation formula of TFnew (max() takes the maximum of all its arguments, min() the minimum) is:
TFnew(wi) = [(wid1 + wid2 + ... + widk) − max(wid1, wid2, ..., widk) − min(wid1, wid2, ..., widk)] / (k − 2) * k
Define N as the total number of documents in the corpus and wiN as the number of documents containing wi; the calculation formula of IDF(wi) is:
IDF(wi) = log(N / (wiN + 1))
3) Screen to obtain the secondary expansion candidate words
We compare each initial candidate word's average word frequency in the feedback documents of its corresponding query constraint word with its average word frequency in its own feedback documents. If the latter is larger, the word is rejected from the initial candidate set, giving the secondary candidate set, stored in ListArrayD (this ensures that words which, although related to the constraint word, are not related only to the constraint word are removed). Denote by Xi the number of occurrences of a word in the i-th of n documents; the calculation formula of the average word frequency TFavg is:
TFavg = (X1 + X2 + ... + Xn) / n = Σ(i=1..n) Xi / n
4) Obtain the final expansion words by computing scores
Use the candidate word weight calculation formula:
S(w, q) = (d + 1)^(-1)
where w is the candidate word, q the constraint word, and d the absolute value of the distance between the word vectors of the two words, representing the distance between candidate and constraint word. The reason (d + 1)^(-1) is used to compute each candidate's score is, first, that it guarantees the score S shrinks gradually as the distance grows, and second, that it makes the score grow sharply when the distance is small and shrink only slightly when the distance is large, which better matches reality and more easily opens up gaps. Rank the results and take the top three words as the expansion words of that constraint word, stored in ListArrayE; repeat these operations until every query constraint word has selected its own expansion words.
5) Create a query from the expansion words of step 4)
Run all words in ListArrayE together with their corresponding constraint words as a query of the following form: (expansion word 1 of constraint word 1 OR expansion word 2 of constraint word 1 OR expansion word 3 of constraint word 1 OR constraint word 1) AND (expansion word 1 of constraint word 2 OR expansion word 2 of constraint word 2 OR expansion word 3 of constraint word 2 OR constraint word 2) AND ... AND the other query sentences.
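The query form above can be assembled mechanically. A minimal sketch, where `groups` maps each constraint word to its expansion-word list (the patent stores these in ListArrayE; the function and parameter names here are illustrative):

```python
def build_expanded_query(groups, other_clauses=()):
    """Assemble the step-5 boolean query: each constraint word is ORed
    with its expansion words inside parentheses, the parenthesized
    groups are ANDed together, and the remaining query sentences are
    ANDed on at the end."""
    parts = []
    for constraint, expansions in groups.items():
        parts.append("(" + " OR ".join(list(expansions) + [constraint]) + ")")
    parts.extend(other_clauses)
    return " AND ".join(parts)
```

For example, one constraint word "c1" with expansions "e1", "e2", "e3" and one remaining clause "other" yields "(e1 OR e2 OR e3 OR c1) AND other".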
6) Rank the documents by weight
Score the feedback documents of step 5) and then rank them. The scoring principle is: introduce the original BM25 scoring formula:
Score(Q, d) = Σ(i=1..n) Wi * R(qi, d)
where Wi is the weight of the i-th query term and R(qi, d) is the relevance of the current term to the document. The score formula S(w, q) of step 4) is added into the original BM25 scoring, letting each expansion word also participate in the scoring. Denote the document to be scored d. Taking one query constraint word as an example, denote its expansion words q1, q2, q3 and the query constraint word itself q4; then S(qi, q4) represents the score of expansion word qi of constraint word q4, i.e. their distance computed with the scoring formula of step 4). Their weights in BM25 are W1, W2, W3, W4 respectively; denote them together as a query group QA, and their relevance to each document in BM25 as R(qi, d). First compute the score SA of each constraint word and its expansion-word part, then add up the SA computed for all query constraint words and their expansion words, and finally add the score SB computed for the other query sentences by the original BM25 scoring formula, i.e. the score SB computed for them with the BM25 formula without the step 4) score formula S(w, q);
the calculation formula of SA is:
SA(QA, d) = Σ(i=1..4) S(qi, q4) * Wi * R(qi, d)
The sum of all SA plus SB is the final score of each document; afterwards the results are returned in order from largest to smallest.
The embodiment described above is only a preferred embodiment of the invention and is not intended to limit the scope of the present invention; all variations made according to the shape and principles of the present invention should be covered within the protection scope of the present invention.
Claims (6)
1. A query word expansion method based on pseudo-relevance feedback and TF-IDF, characterized by comprising the following steps:
1) select the query constraint words;
2) select the initial expansion candidate words;
3) screen to obtain the secondary expansion candidate words;
4) obtain the final expansion words by computing scores;
5) create a query from the expansion words of step 4);
6) rank the documents by weight.
2. The query word expansion method based on pseudo-relevance feedback and TF-IDF according to claim 1, characterized in that: in step 1), remove stop words from the query sentence and segment it; first run each separated word once as an individual query, denoting the words n1, n2, n3, ..., and record the feedback document count of each, Nn1, Nn2, Nn3, ...; run one AND query over all query words, recording the feedback document total FA, then one OR query over all of them, recording the feedback document total FO; to find the constraint words of the query sentence, take the first word n1 as an example: remove n1 and run an AND query over all remaining query words, denoting its feedback document total Fn1, then compute the proportion FA takes of Fn1, recorded as D1, whose calculation formula is:
D1 = FA / Fn1
afterwards, compute the proportion of documents containing the word n1 among the OR query's result documents, representing the degree of freedom of n1, denoted V1, whose calculation formula is:
V1 = Nn1 / FO
If V1 > D1, define the word n1 as a query constraint word, and repeat the above operations for the other words n2, n3, ... until every word has been judged. The reason V1 > D1 defines a query constraint word is that V1 represents the degree of freedom of n1 in the OR query results: if the presence of n1 had no shrinking effect on the query, then D1, the proportion that the AND query including n1 takes of the AND query excluding n1, should be greater than or equal to its degree of freedom V1; if V1 > D1, the presence of n1 reduces the AND query's feedback document count by more than the threshold, that is, its shrinking effect on the query sentence is larger than the threshold; if V1 = D1, the reduction of the AND query's feedback document count, i.e. the shrinking effect on the query sentence, is exactly at the threshold; if V1 < D1, the shrinking effect of the word n1 on the query sentence is below the threshold. In other words, the degree of freedom of n1 can be viewed as the size of the beneficial effect it could have on the query sentence; if its presence does not even reach that minimum, it is considered to have a constraining effect on the query sentence and is defined as a query constraint word;
in addition, the query constraint words are used for query word expansion; if there is only one word, that word is directly set as the query constraint word.
3. The query word expansion method based on pseudo-relevance feedback and TF-IDF according to claim 1, characterized in that: in step 2), run each query constraint word as an individual query to obtain its feedback documents, remove stop words from the content of those documents and segment it, then compute a score according to the TFnew*IDF of each word and take the top 50 ranked words as the initial candidate expansion words of that word; define widj as the word frequency of candidate word wi in the constraint word's feedback document dj, and k as the constraint word's feedback document count; the calculation formula of TFnew is:
TFnew(wi) = [(wid1 + wid2 + ... + widk) − max(wid1, wid2, ..., widk) − min(wid1, wid2, ..., widk)] / (k − 2) * k
where the max() function takes the maximum of all its arguments and the min() function takes the minimum of all its arguments;
define N as the total number of documents in the corpus and wiN as the number of documents containing wi; the calculation formula of IDF(wi) is:
IDF(wi) = log(N / (wiN + 1)).
4. The query word expansion method based on pseudo-relevance feedback and TF-IDF according to claim 1, characterized in that: in step 3), compare each initial candidate word's average word frequency in the feedback documents of its corresponding query constraint word with its average word frequency in its own feedback documents; if the latter is larger, reject the word from the initial candidate set, giving the secondary candidate set (ensuring that words which, although related to the constraint word, are not related only to the constraint word are removed); denote by Xi the number of occurrences of a word in the i-th of n documents; the calculation formula of the average word frequency TFavg is:
TFavg(wi) = Σ(i=1..n) Xi / n.
5. a kind of query word extended method based on pseudo- feedback with TF-IDF according to claim 1, it is characterised in that:
In step 4), candidate word weight calculation formula is utilized:
S (w, q)=(d+1)-1
In formula, w represents candidate word, and q represents constraint word, and d is the absolute value of the distance of the term vector of two words, represent candidate word with
The distance of word is constrained, why uses (d+1)-1The score of each candidate word is calculated, because first can ensure with distance
Increase, score S is can be gradually smaller, and second when can allow the distance to be close to preset range, score increase within a preset range,
Apart from it is remote when score reduce within a preset range, more meet reality, also more Easy open gap, by sort result, by ranking first three
Expansion word of the word as the constraint word, aforesaid operations are repeated, until going out the expansion word of oneself for each inquiry constraint selected ci poem;
In step 6), the original BM25 scoring formula is introduced:
Score(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)
where W_i is the weight of the i-th query term and R(q_i, d) is the relevance between that query term and the document. The score formula S(w, q) from step 4) is incorporated into the original BM25 scoring formula and the results are sorted. Denote the document to be scored by d. Taking one query constraint word as an example, its expansion words are denoted q_1, q_2, q_3 and the constraint word itself is denoted q_4; then S(q_i, q_4) denotes the distance-based score between the constraint word q_4 and its expansion word q_i, computed with the scoring formula from step 4). Their weights in BM25 are W_1, W_2, W_3, W_4, they are collected into a query set Q_A, and their relevance to each document under BM25 is denoted R(q_i, d). First, the score S_A of each constraint word together with its expansion words is computed; then the S_A values of all query constraint words and their expansion words are summed; finally the score S_B of the other query sentences, obtained with the original BM25 scoring formula, is added, i.e. S_B is computed for them with the BM25 formula that does not incorporate the score formula S(w, q) of step 4);
The calculation formula of S_A is:
S_A(Q_A, d) = \sum_{i=1}^{4} S(q_i, q_4) \cdot W_i \cdot R(q_i, d)
The sum of all S_A values plus S_B is the final score of each document; the results are then returned in descending order of score.
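For illustration only, the step-6 scoring can be sketched end to end in Python. The claims do not spell out the W_i and R(q_i, d) definitions, so the common Okapi BM25 choices (IDF weight, k1 = 1.2, b = 0.75) are assumed here, and the corpus, query terms, and step-4 scores are all made up:

```python
import math

def idf(term, corpus):
    """W_i: the common BM25 IDF weight of a term (one standard choice)."""
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)           # document frequency
    return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

def bm25_r(term, doc, corpus, k1=1.2, b=0.75):
    """R(q_i, d): the common Okapi BM25 term/document relevance."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    f = doc.count(term)                                # term frequency in d
    return f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))

def s_a(group, doc, corpus, s):
    """S_A = sum over q_i in the group of S(q_i, q_4) * W_i * R(q_i, d).
    group holds q_1..q_3 and q_4 last; s(q_4, q_4) = 1 because the
    distance of a word to itself is 0 and S = (d + 1)^(-1)."""
    q4 = group[-1]
    return sum(s(qi, q4) * idf(qi, corpus) * bm25_r(qi, doc, corpus)
               for qi in group)

def final_score(doc, corpus, groups, other_terms, s):
    """Sum of S_A over all constraint-word groups, plus the plain
    BM25 score S_B of the remaining query terms."""
    sa = sum(s_a(g, doc, corpus, s) for g in groups)
    sb = sum(idf(t, corpus) * bm25_r(t, doc, corpus) for t in other_terms)
    return sa + sb

# Hypothetical step-4 scores S(q_i, q_4), stored in a lookup table.
pair_scores = {("search", "retrieval"): 0.8, ("lookup", "retrieval"): 0.5,
               ("ranking", "retrieval"): 0.4, ("retrieval", "retrieval"): 1.0}
s = lambda qi, q4: pair_scores[(qi, q4)]

corpus = [["retrieval", "search", "feedback"],
          ["pseudo", "feedback", "model"],
          ["term", "weighting", "scheme"]]
group = ["search", "lookup", "ranking", "retrieval"]   # q_1..q_3, then q_4
scores = {i: final_score(d, corpus, [group], ["pseudo"], s)
          for i, d in enumerate(corpus)}
ranking = sorted(scores, key=scores.get, reverse=True)  # descending order
print(ranking)  # [0, 1, 2]: doc 0 matches the constraint-word group best
```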
6. The query term expansion method based on pseudo feedback and TF-IDF according to claim 1, characterized in that: in step 5), each constraint word and its 3 expansion words are connected with the logical relation OR to form a set; then all the sets formed by the constraint words and their 3 expansion words are connected with the logical relation AND; finally these are in turn connected with the other query sentences by the logical relation AND. The query is issued with this relation to obtain the feedback documents.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711179719.6A CN108062355B (en) | 2017-11-23 | 2017-11-23 | Query term expansion method based on pseudo feedback and TF-IDF |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108062355A true CN108062355A (en) | 2018-05-22 |
CN108062355B CN108062355B (en) | 2020-07-31 |
Family
ID=62135023
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711179719.6A Active CN108062355B (en) | 2017-11-23 | 2017-11-23 | Query term expansion method based on pseudo feedback and TF-IDF |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108062355B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2104044A1 (en) * | 2008-03-18 | 2009-09-23 | Korea Advanced Institute Of Science And Technology | Query expansion method using augmented terms for improving precision without degrading recall |
CN101876979A (en) * | 2009-04-28 | 2010-11-03 | 株式会社理光 | Query expansion method and equipment |
US8280900B2 (en) * | 2010-08-19 | 2012-10-02 | Fuji Xerox Co., Ltd. | Speculative query expansion for relevance feedback |
CN103678412A (en) * | 2012-09-21 | 2014-03-26 | 北京大学 | Document retrieval method and device |
CN107247745A (en) * | 2017-05-23 | 2017-10-13 | 华中师范大学 | A kind of information retrieval method and system based on pseudo-linear filter model |
Non-Patent Citations (2)
Title |
---|
VAIDYANATHAN, REKHA, SUJOY DAS, AND NAMITA SRIVASTAVA: "Query expansion strategy based on pseudo relevance feedback and term weight scheme for monolingual retrieval", 《INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS》 * |
GONG YUXI, WANG DALING: "An improved query expansion based on pseudo relevance feedback", 《MICROCOMPUTER INFORMATION》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110047A (en) * | 2019-04-30 | 2019-08-09 | 中国农业科学院农业信息研究所 | Subject content polymerization analysis method based on TF-IDF and domain lexicon |
CN110110047B (en) * | 2019-04-30 | 2021-03-19 | 中国农业科学院农业信息研究所 | Topic content aggregation analysis method based on TF-IDF and domain dictionary |
CN110442777A (en) * | 2019-06-24 | 2019-11-12 | 华中师范大学 | Pseudo-linear filter model information search method and system based on BERT |
CN110442777B (en) * | 2019-06-24 | 2022-11-18 | 华中师范大学 | BERT-based pseudo-correlation feedback model information retrieval method and system |
CN111897928A (en) * | 2020-08-04 | 2020-11-06 | 广西财经学院 | Chinese query expansion method for embedding expansion words into query words and counting expansion word union |
CN112307182A (en) * | 2020-10-29 | 2021-02-02 | 上海交通大学 | Question-answering system-based pseudo-correlation feedback extended query method |
CN112307182B (en) * | 2020-10-29 | 2022-11-04 | 上海交通大学 | Question-answering system-based pseudo-correlation feedback extended query method |
Also Published As
Publication number | Publication date |
---|---|
CN108062355B (en) | 2020-07-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103678576B (en) | The text retrieval system analyzed based on dynamic semantics | |
CN109800284B (en) | Task-oriented unstructured information intelligent question-answering system construction method | |
CN104063523B (en) | E-commerce search scoring and ranking method and system | |
CN108062355A (en) | Query word extended method based on pseudo- feedback with TF-IDF | |
CN103020164B (en) | Semantic search method based on multi-semantic analysis and personalized sequencing | |
CN103218719B (en) | A kind of e-commerce website navigation method and system | |
JP3607462B2 (en) | Related keyword automatic extraction device and document search system using the same | |
CN101223525B (en) | Relationship networks | |
CN106649272B (en) | A kind of name entity recognition method based on mixed model | |
CN105320772B (en) | A kind of association paper querying method of patent duplicate checking | |
CN111143479A (en) | Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm | |
CN105117487B (en) | A kind of books semantic retrieving method based on content structure | |
CN105045875B (en) | Personalized search and device | |
Singh et al. | Vector space model: an information retrieval system | |
CN105975596A (en) | Query expansion method and system of search engine | |
CN103455487B (en) | The extracting method and device of a kind of search term | |
CN104008171A (en) | Legal database establishing method and legal retrieving service method | |
CN104111933A (en) | Method and device for acquiring business object label and building training model | |
CN101344890A (en) | Grading method for information retrieval document based on viewpoint searching | |
JPH09223161A (en) | Method and device for generating query response in computer-based document retrieval system | |
CN102708100A (en) | Method and device for digging relation keyword of relevant entity word and application thereof | |
CN105653562A (en) | Calculation method and apparatus for correlation between text content and query request | |
CN103593474A (en) | Image retrieval ranking method based on deep learning | |
CN107247743A (en) | A kind of judicial class case search method and system | |
CN112818661B (en) | Patent technology keyword unsupervised extraction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||