Summary of the invention
In view of the above-mentioned deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a method and a device for web search based on sentence serial numbers, which raise the ranking weight of web pages in which the sentence distance between the key characters, keywords or punctuation marks is zero or small, so that such pages rank higher and user satisfaction with the search improves.
The invention discloses a method for web search based on sentence serial numbers, comprising the following steps:
A) obtaining a plurality of web pages and downloading them to a web page database;
B) segmenting the web pages into sentences and assigning sentence serial numbers within each web page;
C) building a forward index table that includes the sentence serial numbers;
D) building an inverted index table that includes the sentence serial numbers;
E) receiving an input search term and decomposing it into at least one key character, keyword or punctuation mark;
F) calculating, from the inverted index table, the ranking weight of each web page that contains the key characters, keywords or punctuation marks, and outputting the search results.
Further, step B) comprises the following steps:
B1) an indexer scans each web page, segments it into words, and records the position of each word, character or punctuation mark in the web page;
B2) sentences are segmented according to the position of each word, character or punctuation mark in the web page and the position of the next following punctuation mark in the web page;
B3) a sentence serial number is assigned to each sentence, which determines the sentence serial number of each word, character or punctuation mark.
Preferably, the sentence segmentation rule is: if a full stop, question mark, ellipsis or exclamation mark lies inside quotation marks and at the end of a paragraph, the sentence ends with that mark together with the closing quotation mark; if the full stop, question mark, ellipsis or exclamation mark lies outside quotation marks, the sentence ends with that mark.
Preferably, the forward index table comprises, for each word, character or punctuation mark: its page serial number, the word, character or punctuation mark itself, its serial number, and its sentence serial number.
Preferably, the inverted index table comprises, for each word, character or punctuation mark: the word, character or punctuation mark itself, its serial number, the number of web pages that contain it, its page serial number, and its sentence serial number.
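As an illustration only, the two tables can be sketched in Python as follows. The record layout, the function names `build_forward_index` and `build_inverted_index`, the shared `vocab` dictionary and the toy tokens are all assumptions made for this sketch; the source does not prescribe a concrete data structure.

```python
from collections import defaultdict

def build_forward_index(docid, sentences, vocab):
    """Return forward-index rows (docid, word, word_id, sentence_ids) for one page.

    `sentences` is a list of token lists, one per sentence, in page order.
    `vocab` maps token -> word id and is shared by all pages (hypothetical helper).
    """
    sentence_ids = defaultdict(set)
    for sentence_id, tokens in enumerate(sentences):
        for token in tokens:
            vocab.setdefault(token, len(vocab))  # unseen tokens get the next free id
            sentence_ids[token].add(sentence_id)
    return [(docid, tok, vocab[tok], sorted(ids)) for tok, ids in sentence_ids.items()]

def build_inverted_index(forward_rows):
    """Regroup forward rows by word: word -> word id, ndocs, {docid: sentence ids}."""
    inverted = {}
    for docid, word, word_id, ids in forward_rows:
        entry = inverted.setdefault(word, {"word_id": word_id, "ndocs": 0, "postings": {}})
        entry["postings"][docid] = ids
        entry["ndocs"] = len(entry["postings"])  # number of pages containing the word
    return inverted

# Toy input: page 0 has two sentences, page 1 has one.
vocab = {}
rows = build_forward_index(0, [["a", "b", ","], ["b", "c", "."]], vocab)
rows += build_forward_index(1, [["c", "d", "."]], vocab)
index = build_inverted_index(rows)
```

In this sketch the sentence numbers restart at zero for every page, while the word ids come from one dictionary shared by all pages, so a token that appears on two pages keeps a single id and gets one posting per page.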
Further, in step F), for each web page that contains the key characters, keywords or punctuation marks, it is determined from the inverted index table whether the key characters, keywords or punctuation marks belong to the same sentence. If they belong to the same sentence, the ranking weight of the web page is raised. If they do not belong to the same sentence, the sentence distance between them is calculated: if the sentence distance is large, the ranking weight of the web page is lowered; if the sentence distance is small, the ranking weight of the web page is raised.
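The adjustment described in step F) can be sketched as follows. The source does not say how large a "large" distance is or by how much the weight changes, so the threshold `near` and the factors `boost` and `penalty` are hypothetical tuning values introduced purely for illustration.

```python
def adjust_weight(base_weight, sentence_distance, boost=0.5, near=1, penalty=0.2):
    """Raise or lower a page's ranking weight from the sentence distance of the query terms.

    boost, near and penalty are hypothetical values; the source gives no numbers.
    """
    if sentence_distance == 0:
        # All terms occur in the same sentence: largest increase.
        return base_weight * (1 + boost)
    if sentence_distance <= near:
        # Terms occur in nearby sentences: smaller increase.
        return base_weight * (1 + boost / (1 + sentence_distance))
    # Terms are far apart: decrease.
    return base_weight * (1 - penalty)

weights = [adjust_weight(1.0, d) for d in (0, 1, 5)]
```

With these assumed values, a page whose terms share a sentence ends up above a page whose terms are in adjacent sentences, which in turn ends up above a page whose terms are five sentences apart.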
Preferably, the ranking weight of a web page is determined jointly by: the sentence distance between the key characters, keywords or punctuation marks; the authority of the domain name that hosts the web page; the popularity of the web page; whether the key characters, keywords or punctuation marks appear in the URL, title, anchor text or meta tags; the traffic and click-through rate of the web page; and the registration data and public station data of the website that hosts the web page.
Preferably, if the key characters, keywords or punctuation marks belong to the same sentence, natural language processing is further performed on that sentence.
The invention also discloses a device for web search based on sentence serial numbers, comprising:
a web page fetcher, used to obtain and download a plurality of web pages;
a web page database, used to store the downloaded web pages;
an indexer, used to segment the web pages into sentences, assign sentence serial numbers within each web page, and build the forward index table and the inverted index table, both of which include the sentence serial numbers;
an index database, used to store the forward index table and the inverted index table;
a searcher, used to decompose a search term into at least one key character, keyword or punctuation mark, calculate from the inverted index table the ranking weight of each web page that contains the key characters, keywords or punctuation marks, and output the search results;
the web page fetcher, the web page database, the indexer, the index database and the searcher being connected in sequence.
The beneficial effects of the invention are as follows:
In the method and device for web search based on sentence serial numbers of the invention, both the forward index table and the inverted index table include the sentence serial numbers of the web pages. By querying the sentence serial number information, the search engine can raise the ranking weight of web pages in which the sentence distance between the key characters, keywords or punctuation marks is zero or small, so that such pages rank higher and user satisfaction with the search improves.
The method and device of the invention can determine quickly, directly from the sentence serial number of each character, word or punctuation mark in each web page, whether two or more queried key characters, keywords or punctuation marks belong to the same sentence or to nearby sentences, without a large number of comparison operations. The method and device therefore have lower time complexity, which improves the response speed of the search and gives the user a more efficient search experience.
The method and device of the invention can also provide a precondition for subsequent natural language processing. If two or more queried key characters, keywords or punctuation marks belong to the same sentence, the search engine can perform further, deeper natural language processing on that sentence. For example, various syntactic analyses, such as dependency parsing, can be applied to the sentence to obtain the dependency relations between its words and its head word; or sentiment classification (positive/negative polarity analysis) can be applied to learn the orientation of the sentence.
Embodiment
The concept, specific structure and technical effects of the invention are further described below with reference to the accompanying drawings, so that the objects, features and effects of the invention can be fully understood.
As shown in Fig. 1, the invention discloses a method for web search based on sentence serial numbers, comprising the following steps:
Step 101: obtain a plurality of web pages and download them to a web page database.
A search engine company obtains a plurality of web pages from the Internet through the web page fetcher and downloads them to the company's computers, that is, to the web page database.
Step 102: segment the web pages into sentences and assign sentence serial numbers within each web page.
First, the indexer scans each web page, segments it into words, and records the position of each word, character or punctuation mark in the web page.
Second, sentences are segmented according to the position of each word, character or punctuation mark in the web page and the position of the next following punctuation mark in the web page.
Third, a sentence serial number is assigned to each sentence, which determines the sentence serial number of each word, character or punctuation mark. The sentences of each web page are numbered independently.
Step 103: build the forward index table, which includes the sentence serial numbers.
The forward index table comprises, for each word, character or punctuation mark: its page serial number, the word, character or punctuation mark itself, its serial number, and its sentence serial number. The forward index table can also include the position of each word, character or punctuation mark in the web page, that is, information such as the offset.
Step 104: build the inverted index table, which includes the sentence serial numbers.
The inverted index table comprises, for each word, character or punctuation mark: the word, character or punctuation mark itself, its serial number, the number of web pages that contain it, its page serial number, and its sentence serial number. The inverted index table can also include the position of each word, character or punctuation mark in the web page, that is, information such as the offset.
Step 105: receive an input search term and decompose it into at least one key character, keyword or punctuation mark.
The user enters a search term, and the searcher decomposes it into several key characters, keywords or punctuation marks. Of course, the search term entered by the user may itself be a single key character, keyword or punctuation mark, in which case the searcher does not need to decompose it.
Step 106: calculate, from the inverted index table, the ranking weight of each web page that contains the key characters, keywords or punctuation marks, and output the search results.
For each web page that contains the key characters, keywords or punctuation marks, it is determined from the inverted index table whether they belong to the same sentence. If they belong to the same sentence, the ranking weight of the web page is raised. If they do not, the sentence distance between them is calculated: if the sentence distance is large, the ranking weight of the web page is lowered; if the sentence distance is small, the ranking weight of the web page is raised.
Referring to Fig. 2, the forward index table comprises the page serial number docid of each word, character or punctuation mark; the words, characters or punctuation marks word1, word2, word3, ...; their serial numbers word id1, word id2, word id3, ...; and their sentence serial numbers sentence id1, sentence id2, sentence id3, .... For each entry, the page serial number, the word, character or punctuation mark itself and its serial number are unique, whereas its sentence serial numbers may be one or more, because a word, character or punctuation mark can occur in several sentences of a web page.
Of course, the forward index table can also include the position of each word, character or punctuation mark in the web page, that is, information such as the offset. Because such offset information is already widely used in existing search engines, it is not described further here.
Referring to Fig. 3, the inverted index table comprises the words, characters or punctuation marks word1, word2, word3, ...; their serial numbers word id1, word id2, word id3, ...; the numbers of web pages that contain them, ndocs1, ndocs2, ndocs3, ...; their page serial numbers docid1, docid2, docid3, docid4, docid5, docid6, ...; and their sentence serial numbers sentence id1, sentence id2, sentence id3, sentence id4, sentence id5, sentence id6, .... For each entry, the word, character or punctuation mark itself, its serial number, the number of web pages that contain it and its page serial number are unique, whereas its sentence serial numbers may be one or more, because a word, character or punctuation mark can occur in several sentences of a web page.
Of course, the inverted index table can also include the position of each word, character or punctuation mark in the web page, that is, information such as the offset. Because such offset information is already widely used in existing search engines, it is not described further here.
In the first embodiment of the invention, the full content of the first web page is as follows (excerpted from Yu Dafu's "Evenings Intoxicated by the Spring Breeze"):
Since last year I had simply been growing more dispirited day by day, and ideas such as “Who am I?”, “What kind of circumstances am I in now?” and “Is my heart sad or happy?” had almost all been forgotten. At her question, I thought back, layer by layer, over the privation of the past half year. So after hearing her question I only stared at her dully, unable to say a word for quite a while. Seeing me like this, she took me for a homeless outcast too. A lonely expression rose on her face at once, and with a slight sigh she said:
“Alas! Are you the same as me too?”
After the slight sigh, she fell silent.
Referring to Fig. 4, the device for web search based on sentence serial numbers of the invention, that is, search engine 40, uses web page fetcher 401 to download the first web page to the computers of the search engine company, that is, to web page database 402.
Indexer 403 scans the first web page, segments it into words, and records the position of each word, character or punctuation mark in the web page. Indexer 403 then segments the page into sentences according to the position of each word, character or punctuation mark in the web page and the position of the next following punctuation mark in the web page.
A sentence is a grammatical unit, made up of words and phrases, that expresses an independent meaning. In Chinese, a sentence should end with a full stop, question mark, ellipsis or exclamation mark. If one of these marks appears inside quotation marks and at the end of a paragraph, the mark together with the closing quotation mark is taken as the end of the sentence. Of course, the segmentation rule of the invention is not limited to this; the rule can be configured in indexer 403. For example, when a full stop, question mark, ellipsis or exclamation mark appears inside quotation marks, the mark and the closing quotation mark can also be taken together as the end of a sentence even if they are at the beginning or in the middle of a paragraph.
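The segmentation rule, including the configurable variant, can be sketched as below. This is a minimal sketch that works on one paragraph of plain text: the function name `split_sentences`, the flag `split_inside_quotes` and the choice to treat the ASCII marks `.`, `?` and `!` as terminators are assumptions of the sketch, and it makes no attempt to handle abbreviations or decimal numbers.

```python
TERMINATORS = set("。？！…?!.")
OPEN_QUOTES = set("“")
CLOSE_QUOTES = set("”")

def split_sentences(paragraph, split_inside_quotes=False):
    """Split one paragraph into sentences using the quotation-mark rule."""
    sentences, start, depth, i = [], 0, 0, 0
    n = len(paragraph)
    while i < n:
        ch = paragraph[i]
        if ch in OPEN_QUOTES:
            depth += 1
        elif ch in CLOSE_QUOTES:
            depth = max(0, depth - 1)
        elif ch in TERMINATORS:
            # Treat a run such as "……" or "?!" as a single terminator.
            while i + 1 < n and paragraph[i + 1] in TERMINATORS:
                i += 1
            end = None
            if depth == 0:
                end = i + 1  # terminator outside quotation marks
            elif i + 1 < n and paragraph[i + 1] in CLOSE_QUOTES:
                at_paragraph_end = not paragraph[i + 2:].strip()
                if at_paragraph_end or split_inside_quotes:
                    end = i + 2  # terminator plus closing quotation mark
                    depth -= 1
                    i += 1       # skip the closing quotation mark just consumed
            if end is not None:
                sentences.append(paragraph[start:end].strip())
                start = end
        i += 1
    if paragraph[start:].strip():
        sentences.append(paragraph[start:].strip())
    return sentences

text = "I had forgotten “Who am I?” and such ideas. She said: “Alas! Are you like me?”"
default_rule = split_sentences(text)
variant_rule = split_sentences(text, split_inside_quotes=True)
```

Under the default rule the quoted question in the middle of the paragraph does not end a sentence, so the sample text yields two sentences; with `split_inside_quotes=True` the same text yields three.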
Once sentence segmentation is complete, a sentence serial number is assigned to each sentence, which determines the sentence serial number of each word, character or punctuation mark. Preferably, the sentence serial numbers are 0, 1, 2, 3, 4, .... The invention is not limited to this, however: the sentence serial numbers can also be 1, 2, 3, 4, ..., or 2, 3, 4, ..., and so on. The starting number can be any integer.
In another embodiment of the invention, the sentence serial numbers can also be 1, 3, 5, 7, ..., or 2, 6, 10, 14, ..., and so on. The difference between consecutive sentence serial numbers can be any natural number.
In another embodiment of the invention, the sentence serial numbers can also be ..., 4, 3, 2, 1; that is, they can decrease successively.
The sentence serial numbers only need to be assigned uniformly under a fixed rule to be usable with the invention.
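The numbering variants above differ only in their starting value and step, which the following sketch makes explicit. The helper names are hypothetical, and dividing the difference by the step when computing a sentence distance is an assumption of this sketch; the source only requires that the numbers be assigned under one fixed rule.

```python
from itertools import count

def sentence_numbers(n_sentences, start=0, step=1):
    """Return n_sentences serial numbers generated by one fixed rule (start, step)."""
    if step == 0:
        raise ValueError("step must be non-zero so that every sentence gets a distinct number")
    numbers = count(start, step)
    return [next(numbers) for _ in range(n_sentences)]

def sentence_distance(a, b, step=1):
    """Number of sentences between two serial numbers produced with the given step."""
    return abs(a - b) // abs(step)

default = sentence_numbers(5)                        # preferred scheme, starting at zero
odd = sentence_numbers(4, start=1, step=2)           # 1, 3, 5, 7
stride_four = sentence_numbers(4, start=2, step=4)   # 2, 6, 10, 14
descending = sentence_numbers(4, start=4, step=-1)   # 4, 3, 2, 1
```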
The first web page can be split into the following six sentences:
[0] Since last year I had simply been growing more dispirited day by day, and ideas such as “Who am I?”, “What kind of circumstances am I in now?” and “Is my heart sad or happy?” had almost all been forgotten.
[1] At her question, I thought back, layer by layer, over the privation of the past half year.
[2] So after hearing her question I only stared at her dully, unable to say a word for quite a while.
[3] Seeing me like this, she took me for a homeless outcast too.
[4] A lonely expression rose on her face at once, and with a slight sigh she said: Alas! Are you the same as me too?”
[5] After the slight sigh, she fell silent.
Of course, depending on the segmentation rule configured in indexer 403, the first web page can be divided into fewer or more than six sentences. For example, the sentence with serial number zero could itself be divided into four sentences.
Indexer 403 builds the forward index table and stores it in index database 404. The forward index table of the first web page is shown in Table 1, where docid is the page serial number of each word, character or punctuation mark, word is the word, character or punctuation mark itself, word id is its serial number, and sentence id is its sentence serial number.
Table 1. Forward index table of the first web page
docid | word | word id | sentence id
0 | Because | 0 | 0
0 | From | 1 | 0
0 | Last year | 2 | 0
0 | Since | 3 | 0
0 | I | 4 | 0,1,2,3,4
0 | Just | 5 | 0,2
0 | One day | 6 | 0
0 | {。##.##1}, | 7 | 0,1,2,3,4,5
0 | Dispirited | 8 | 0
0 | Go down | 9 | 0
0 | , | 10 | 0,1,2,3
0 | Almost | 11 | 0
0 | {。##.##1}, | 12 | 0
0 | “ | 13 | 0
0 | Be | 14 | 0
0 | What | 15 | 0
0 | The people | 16 | 0
0 | ? | 17 | 0
0 | ” | 18 | 0,4
0 | Now | 19 | 0
0 | The institute | 20 | 0
0 | The place | 21 | 0
0 | How | 22 | 0
0 | A kind of | 23 | 0,4
0 | Circumstances | 24 | 0
0 | At heart | 25 | 0
0 | Still | 26 | 0
0 | Sad | 27 | 0
0 | Happiness | 28 | 0
0 | These | 29 | 0
0 | Idea | 30 | 0
0 | All | 31 | 0
0 | Forget about | 32 | 0
0 |  | 33 | 0,1,3,4,5
0 | . | 34 | 0,1,2,3,5
0 | Warp | 35 | 1
0 | She | 36 | 1,2,3,5
0 | This | 37 | 1
0 | One asks | 38 | 1
0 | Again | 39 | 1
0 |  | 40 | 1
0 | Over the past half year | 41 | 1
0 | In privation | 42 | 1
0 | Situation | 43 | 1
0 | One deck | 44 | 1
0 | Think | 45 | 1
0 | Come out | 46 | 1
0 | So | 47 | 2
0 | Listen | 48 | 2
0 | Question | 49 | 2
0 | After | 50 | 2
0 | Dull | 51 | 2
0 | See | 52 | 2,3
0 | Quite a while | 53 | 2
0 | Say | 54 | 2,4
0 | No | 55 | 2
0 | Go out | 56 | 2
0 | Words | 57 | 2
0 | Come | 58 | 1,2
0 | This | 59 | 3
0 | Appearance | 60 | 3
0 | Think | 61 | 3
0 | Also be | 62 | 3,4
0 | One | 63 | 3
0 | Homeless | 64 | 3
0 | The outcast | 65 | 3
0 | On the face | 66 | 4
0 | Just | 67 | 4,5
0 | Immediately | 68 | 4
0 | Rise | 69 | 4
0 | Lonely | 70 | 4
0 | Expression | 71 | 4
0 | Slightly | 72 | 4,5
0 | Sighing | 73 | 4
0 | : | 74 | 4
0 | Sound of sighing | 75 | 4
0 | ! | 76 | 4
0 | You | 77 | 4
0 | With | 78 | 4
0 | The same | 79 | 4
0 | {。##.##1}, | 80 | 4
0 | Sigh | 81 | 5
0 | One | 82 | 5
0 | Afterwards | 83 | 5
0 | No | 84 | 5
0 | Speak | 85 | 5
In the second embodiment of the invention, the full content of the second web page is as follows (from Wang Zhihuan's "Climbing Stork Tower"):
The white sun sets behind the mountains, the Yellow River flows into the sea. To see a thousand li further, climb one more storey.
Likewise, the second web page can be downloaded through web page fetcher 401 to the computers of the search engine company, that is, to web page database 402. Indexer 403 segments the second web page into sentences and assigns sentence serial numbers.
The second web page can be split into the following two sentences:
[0] The white sun sets behind the mountains, the Yellow River flows into the sea.
[1] To see a thousand li further, climb one more storey.
Indexer 403 builds the forward index table and stores it in index database 404. The forward index table of the second web page is shown in Table 2.
Table 2. Forward index table of the second web page
docid | word | word id | sentence id
1 | Daytime | 86 | 0
1 | Comply with | 87 | 0
1 | The mountain | 88 | 0
1 | To the greatest extent | 89 | 0
1 | , | 10 | 0,1
1 | The Yellow River | 90 | 0
1 | Go into | 91 | 0
1 | The sea | 92 | 0
1 | Stream | 93 | 0
1 | . | 34 | 0,1
1 | Desire poor | 94 | 1
1 | A thousand li | 95 | 1
1 | Order | 96 | 1
1 | More go up | 97 | 1
1 | One deck | 44 | 1
1 | The building | 98 | 1
As Table 2 shows, the sentences of each web page are numbered independently: in the second embodiment, the sentence serial numbers start again from zero. The page serial number docid and the serial number word id of each word, character or punctuation mark, however, continue the numbering of Table 1. Note that ",", "." and "one deck" were already assigned the word ids 10, 34 and 44 in Table 1, so Table 2 keeps those word ids. It follows that, across the whole of search engine 40, the serial number word id of each word, character or punctuation mark is unique.
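The way word ids continue across pages while already-known tokens keep their ids can be sketched as follows. The function `assign_word_ids` is a hypothetical helper; the token list and the ids 10, 34, 44 and 86 are taken from the forward index tables of the two web pages.

```python
def assign_word_ids(tokens, vocab, next_id):
    """Give each token its global word id, reusing ids already present in vocab.

    Returns the ids in token order together with the next unused id.
    """
    ids = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = next_id  # first occurrence anywhere: allocate a new id
            next_id += 1
        ids.append(vocab[token])
    return ids, next_id

# Ids assigned while indexing the first page (only the tokens shared with page two are listed).
vocab = {",": 10, ".": 34, "One deck": 44}
page_two = ["Daytime", "Comply with", "The mountain", "To the greatest extent", ",",
            "The Yellow River", "Go into", "The sea", "Stream", ".",
            "Desire poor", "A thousand li", "Order", ",", "More go up",
            "One deck", "The building", "."]
# The first page used word ids 0 to 85, so numbering continues at 86.
ids, next_id = assign_word_ids(page_two, vocab, next_id=86)
```

New tokens on the second page receive the ids 86 to 98 in order of first appearance, while the comma, the full stop and "One deck" keep the ids they received on the first page.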
After Table 1 and Table 2 have been built, indexer 403 merges them into one overall forward index table. In general, indexer 403 builds a separate forward index table for each web page and then merges the individual forward index tables into one overall forward index table. Merging forward index tables is prior art and is not described further here.
From Table 1 and Table 2, indexer 403 builds the inverted index table and stores it in index database 404. The inverted index table of the first and second web pages is shown in Table 3, where word is each word, character or punctuation mark, word id is its serial number, ndocs is the number of web pages that contain it, docid is its page serial number, and sentence id is its sentence serial number.
Table 3. Inverted index table of the first and second web pages
word | word id | ndocs | docid | sentence id
Come out | 46 | 1 | 0 | 1
So | 47 | 1 | 0 | 2
Listen | 48 | 1 | 0 | 2
Question | 49 | 1 | 0 | 2
After | 50 | 1 | 0 | 2
Dull | 51 | 1 | 0 | 2
See | 52 | 1 | 0 | 2,3
Quite a while | 53 | 1 | 0 | 2
Say | 54 | 1 | 0 | 2,4
No | 55 | 1 | 0 | 2
Go out | 56 | 1 | 0 | 2
Words | 57 | 1 | 0 | 2
Come | 58 | 1 | 0 | 1,2
This | 59 | 1 | 0 | 3
Appearance | 60 | 1 | 0 | 3
Think | 61 | 1 | 0 | 3
Also be | 62 | 1 | 0 | 3,4
One | 63 | 1 | 0 | 3
Homeless | 64 | 1 | 0 | 3
The outcast | 65 | 1 | 0 | 3
On the face | 66 | 1 | 0 | 4
Just | 67 | 1 | 0 | 4,5
Immediately | 68 | 1 | 0 | 4
Rise | 69 | 1 | 0 | 4
Lonely | 70 | 1 | 0 | 4
Expression | 71 | 1 | 0 | 4
Slightly | 72 | 1 | 0 | 4,5
Sighing | 73 | 1 | 0 | 4
: | 74 | 1 | 0 | 4
Sound of sighing | 75 | 1 | 0 | 4
! | 76 | 1 | 0 | 4
You | 77 | 1 | 0 | 4
With | 78 | 1 | 0 | 4
The same | 79 | 1 | 0 | 4
{。##.##1}, | 80 | 1 | 0 | 4
Sigh | 81 | 1 | 0 | 5
One | 82 | 1 | 0 | 5
Afterwards | 83 | 1 | 0 | 5
No | 84 | 1 | 0 | 5
Speak | 85 | 1 | 0 | 5
Daytime | 86 | 1 | 1 | 0
Comply with | 87 | 1 | 1 | 0
The mountain | 88 | 1 | 1 | 0
To the greatest extent | 89 | 1 | 1 | 0
The Yellow River | 90 | 1 | 1 | 0
Go into | 91 | 1 | 1 | 0
The sea | 92 | 1 | 1 | 0
Stream | 93 | 1 | 1 | 0
Desire poor | 94 | 1 | 1 | 1
A thousand li | 95 | 1 | 1 | 1
Order | 96 | 1 | 1 | 1
More go up | 97 | 1 | 1 | 1
The building | 98 | 1 | 1 | 1
Note that ",", "." and "one deck" occur in both the first web page and the second web page. For each of them, the corresponding number of web pages ndocs is therefore 2.
After search user 406 enters a search term, searcher 405 decomposes it into several key characters, keywords or punctuation marks. Of course, the search term entered by search user 406 may itself be a single key character, keyword or punctuation mark, in which case searcher 405 does not need to decompose it.
From the sentence serial number information in Table 3, searcher 405 determines whether the key characters, keywords or punctuation marks obtained from the search term belong, within a web page, to the same sentence or to sentences a small distance apart (for example, a sentence distance of 1, that is, adjacent sentences).
For example, suppose the search term entered by search user 406 is "one deck over the past half year", which is decomposed into the keywords "over the past half year" and "one deck". Searcher 405 looks up Table 3: the page serial number docid of both keywords is 0 and the sentence serial number sentence id of both is 1, so the two keywords "over the past half year" and "one deck" can be judged to belong to the same sentence. As a further example, suppose the search term is "a lonely expression", which is decomposed into the keywords "lonely" and "expression". Searcher 405 looks up Table 3: the page serial number docid of both keywords is 0 and the sentence serial number sentence id of both is 4, so the two keywords "lonely" and "expression" can be judged to belong to the same sentence.
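The two lookups just described amount to intersecting page serial numbers and then sentence serial numbers. A minimal sketch follows; the dictionary layout and the function name `pages_with_same_sentence` are assumptions of the sketch, while the postings are the values given for these keywords in the forward index tables of the two web pages.

```python
def pages_with_same_sentence(inverted, terms):
    """Return {docid: shared sentence ids} for pages where all terms share a sentence."""
    postings = [inverted[t] for t in terms]  # each posting: {docid: set of sentence ids}
    common_docs = set.intersection(*(set(p) for p in postings))
    result = {}
    for docid in common_docs:
        shared = set.intersection(*(p[docid] for p in postings))
        if shared:  # at least one sentence contains every term
            result[docid] = shared
    return result

inverted = {
    "Over the past half year": {0: {1}},
    "One deck": {0: {1}, 1: {1}},
    "Lonely": {0: {4}},
    "Expression": {0: {4}},
}
first_query = pages_with_same_sentence(inverted, ["Over the past half year", "One deck"])
second_query = pages_with_same_sentence(inverted, ["Lonely", "Expression"])
mixed_query = pages_with_same_sentence(inverted, ["Lonely", "One deck"])
```

Only set intersections over the stored sentence serial numbers are needed here; the page text itself is never rescanned.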
Clearly, under otherwise equal ranking conditions, key characters, keywords or punctuation marks that belong to the same sentence are more closely related, and the ranking weight of the web page containing them should be raised (that is, under otherwise equal ranking conditions, that web page should rank higher).
For web pages in which the key characters, keywords or punctuation marks do not belong to the same sentence, the sentence distance between them (the absolute value of the difference between their sentence serial numbers) can be calculated. The ranking weight of a web page with a small sentence distance should be raised, and the ranking weight of a web page with a large sentence distance should be lowered.
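The sentence distance can be computed directly from the stored sentence serial numbers, as sketched below. Because one term may occur in several sentences of a page, this sketch takes the smallest difference over all pairs of occurrences; that choice, the two-term restriction and the function name are assumptions, since the source only defines the distance as the absolute difference of sentence serial numbers. The example values come from the forward index table of the first web page.

```python
from itertools import product

def sentence_distance(ids_a, ids_b):
    """Smallest absolute difference between any sentence id of term A and any of term B."""
    return min(abs(a - b) for a, b in product(ids_a, ids_b))

see_vs_lonely = sentence_distance({2, 3}, {4})       # "See" and "Lonely"
because_vs_speak = sentence_distance({0}, {5})       # "Because" and "Speak"
i_vs_sigh = sentence_distance({0, 1, 2, 3, 4}, {5})  # "I" and "Sigh"
lonely_vs_expression = sentence_distance({4}, {4})   # same sentence
```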
Of course, the ranking weight of a web page is determined jointly by many factors. Besides the sentence distance between the key characters, keywords or punctuation marks, these include the authority of the domain name that hosts the web page; the popularity of the web page; whether the key characters, keywords or punctuation marks appear in the URL, title, anchor text or meta tags; the traffic and click-through rate of the web page; and the registration data and public station data of the website that hosts the web page.
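One simple way to combine such factors is a weighted sum, sketched below. The source gives no formula, so the linear form, the factor names, the coefficient values and the assumption that each factor is normalized to the range 0 to 1 are all hypothetical and serve only to illustrate how sentence proximity could enter alongside the other factors.

```python
def ranking_weight(factors, coefficients):
    """Weighted sum of normalized ranking factors; missing factors count as 0."""
    return sum(coefficients[name] * factors.get(name, 0.0) for name in coefficients)

def sentence_proximity(distance):
    """Map a sentence distance to a score in (0, 1]; distance 0 gives 1.0."""
    return 1.0 / (1.0 + distance)

# Hypothetical coefficients; the source does not specify any values.
COEFFICIENTS = {
    "sentence_proximity": 0.3,
    "domain_authority": 0.2,
    "popularity": 0.15,
    "term_in_url_title_anchor_meta": 0.15,
    "traffic_and_click_through": 0.1,
    "site_registration_data": 0.1,
}

page_a = {"sentence_proximity": sentence_proximity(0), "domain_authority": 0.5}
page_b = {"sentence_proximity": sentence_proximity(4), "domain_authority": 0.5}
weight_a = ranking_weight(page_a, COEFFICIENTS)
weight_b = ranking_weight(page_b, COEFFICIENTS)
```

With all other factors equal, the page whose query terms share a sentence receives the larger weight.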
In addition, if the key characters, keywords or punctuation marks belong to the same sentence, natural language processing can further be performed on that sentence. For example, various syntactic analyses, such as dependency parsing, can be applied to the sentence to obtain the dependency relations between its words and its head word. As another example, sentiment classification (positive/negative polarity analysis) can be applied to the sentence to learn its orientation. The results of such analyses can be displayed together with the search results, providing search user 406 with a richer value-added service.
As shown in Fig. 4, the invention also provides a device for web search based on sentence serial numbers, that is, search engine 40, comprising: web page fetcher 401, used to obtain and download a plurality of web pages; web page database 402, used to store the downloaded web pages; indexer 403, used to segment the web pages into sentences, assign sentence serial numbers within each web page, and build the forward index table and the inverted index table, both of which include the sentence serial numbers; index database 404, used to store the forward index table and the inverted index table; and searcher 405, used to decompose a search term into at least one key character, keyword or punctuation mark, calculate from the inverted index table the ranking weight of each web page that contains the key characters, keywords or punctuation marks, and output the search results. Web page fetcher 401, web page database 402, indexer 403, index database 404 and searcher 405 are connected in sequence. Search engine 40 returns the final search results to search user 406.
The first and second embodiments use Chinese web pages as examples to explain the method and device for web search based on sentence serial numbers of the invention. The invention is not limited to this, however: the method and device can also be applied to information retrieval in any natural language that uses punctuation marks, such as English, German, Russian, Japanese or Spanish. The invention can be applied to searching web pages, e-books, structured text and the like.
In the method and device for web search based on sentence serial numbers of the invention, the inverted index table includes the sentence serial numbers of the web pages. By querying the sentence serial number information, the search engine can raise the ranking weight of web pages in which the sentence distance between the key characters, keywords or punctuation marks is zero or small, so that such pages rank higher and user satisfaction with the search improves.
The method and device of the invention can determine quickly, directly from the sentence serial number of each character, word or punctuation mark in each web page, whether two or more queried key characters, keywords or punctuation marks belong to the same sentence or to nearby sentences, without a large number of comparison operations. The method and device therefore have lower time complexity, which improves the response speed of the search and gives the user a more efficient search experience.
The method and device of the invention can also provide a precondition for subsequent natural language processing. If two or more queried key characters, keywords or punctuation marks belong to the same sentence, the search engine can perform further, deeper natural language processing on that sentence.
The preferred embodiments of the invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation under the concept of the invention shall fall within the scope of protection defined by the claims.