CN102880721B - The implementation method of vertical search engine - Google Patents

The implementation method of vertical search engine

Info

Publication number
CN102880721B
Authority
CN
China
Prior art keywords
coordinate
index
search
keyword
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210390588.7A
Other languages
Chinese (zh)
Other versions
CN102880721A (en
Inventor
黄水清
张尔宁
梁山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NETWORK TECHNOLOGY (SHANGHAI) Co Ltd
Original Assignee
NETWORK TECHNOLOGY (SHANGHAI) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NETWORK TECHNOLOGY (SHANGHAI) Co Ltd filed Critical NETWORK TECHNOLOGY (SHANGHAI) Co Ltd
Priority to CN201210390588.7A priority Critical patent/CN102880721B/en
Publication of CN102880721A publication Critical patent/CN102880721A/en
Application granted granted Critical
Publication of CN102880721B publication Critical patent/CN102880721B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses an implementation method for a vertical search engine. First, a geographic word database, a feature word database, and an address-search training database are set up. Web page information is then collected by a web crawler. Next, an indexing program builds one or more of a coordinate value index, a feature code index, and a keyword index for each collected page. Finally, a search program responds to the user's query. The search program determines whether the query is a search by address. If it is, retrieval uses coordinate values, feature codes, and keywords, individually or in combination. If it is not, retrieval uses feature codes and keywords, individually or in combination. The results are presented to the user. The application uses a naive Bayes classification algorithm to determine retrieval intent and builds three kinds of index databases for web pages; retrieving across these three index databases in combination yields results that better match the user's needs and are more accurate.

Description

The implementation method of vertical search engine
Technical field
This application relates to a vertical search engine in the Internet field.
Background technology
A search engine is a computer system that collects a large amount of web page information, organizes it, and provides a retrieval service to users. By working method, search engines fall mainly into three kinds: full-text search engines (Full Text Search Engine), vertical search engines (Vertical Search Engine), and meta search engines (Meta Search Engine).
A full-text search engine crawls a wide variety of web pages from the Internet and builds an index for each page. Given a user's query, it looks up the matching records in the index database and returns the results to the user in a certain order. Typical full-text search engines are Google and Baidu. Their retrieval scope is broad, but their results tend to be imprecise and to lack depth.
A vertical search engine is a specialized search engine for a particular industry. It provides a retrieval service for a specific field, a specific group of users, or a specific need. Applications include job hunting, housing and real estate, transport and travel, shopping and price comparison, and software and audio-visual resources. Its retrieval scope is confined to the specific application, but within that application its results are specialized, accurate, and in-depth.
A meta search engine submits the user's query to several other search engines at the same time and returns all the results to the user.
A search engine usually comprises a web crawler, an indexing program, a search program, and so on. Existing vertical search engines are implemented as follows:
First, web page content is collected by the web crawler. Because each vertical search engine has a specific application, the crawler concentrates on page information within that application while also covering other page information on the Internet.
Second, the indexing program builds an index for the collected page content. It extracts the text in each collected page that is relevant to the application as that page's keyword index, and the keyword indexes of all pages together form the keyword index database.
Finally, the search program responds to the user's query. It retrieves from the keyword index database the records that match the query, sorts the results (usually by matching degree, importance, and similar criteria), and displays them.
Existing vertical search engines implemented in this way have the following shortcomings:
First, the search program performs only text matching between the user's query and the keyword index database, so the results are not accurate enough.
Second, the user's query sometimes contains a piece of exact (or likely) address information, which indicates an intent to search by that address. The search program nevertheless still matches only on keywords and therefore cannot provide reasonable results.
Summary of the invention
The technical problem to be solved by this application is to provide an implementation method for a vertical search engine.
To solve the above technical problem, the implementation method of the vertical search engine of this application is as follows:
Step 1: set up a geographic word database, a feature word database, and an address-search training database;
The geographic word database contains multiple geographic words;
The feature word database contains multiple feature words and their corresponding, mutually distinct feature codes;
The address-search training database contains multiple passages of text, each manually classified into one of two categories: "has the intent to search by address" or "does not have the intent to search by address". The probability of occurrence of each of the two categories, and the conditional probability of each word in each passage with respect to the two categories, are also computed statistically;
Step 2: web page information is collected by a web crawler;
Step 3: an indexing program builds one or more of a coordinate value index, a feature code index, and a keyword index for each collected page;
Step 4: a search program responds to the user's query. The search program determines whether the query is a search by address. If it is, the coordinate values of geographic words and the feature codes of feature words are extracted from the query, and the remainder is treated as keywords; these serve as conditions for retrieval, individually or in combination, in the coordinate value index database, the feature code index database, and the keyword index database. If it is not, the feature codes of feature words are extracted from the query, and the remainder is treated as keywords; these serve as conditions for retrieval, individually or in combination, in the feature code index database and the keyword index database. The results are presented to the user.
Compared with existing vertical search engines, the implementation method of this application uses a naive Bayes classification algorithm to determine the retrieval intent of the user's query, with a training database set up in advance for this purpose, which can significantly improve the accuracy of the results. The application also builds three kinds of index databases for web pages: the coordinate value index database supports search by address, the feature code index database describes pages precisely, and the keyword index database supports keyword retrieval. Retrieving across these three index databases in combination yields more accurate results.
Brief description of the drawings
Fig. 1 is the overall flowchart of the implementation method of the vertical search engine of this application;
Fig. 2 is the flowchart of the naive Bayes classification algorithm;
Fig. 3 is the flowchart for building the geographic word index in the implementation method;
Fig. 4 is the flowchart for building the feature word index in the implementation method;
Fig. 5 is the flowchart for responding to a user query in the implementation method.
Embodiment
The technical scheme of this application is described in detail below using a vertical search engine for the housing and real estate field, used mainly to find listings of properties for rent or sale.
Referring to Fig. 1, the implementation method of the vertical search engine of this application comprises the following steps:
Step 1: set up a geographic word database, a feature word database, and an address-search training database.
The geographic word database contains multiple geographic words. A geographic word is a word or phrase from which a concrete coordinate can be determined, including place names, addresses, and landmark names (names of buildings, enterprises, businesses, factories, transport facilities, and so on). Preferably, the geographic word database also contains the coordinate values for some or all of the geographic words. The coordinate value is preferably a longitude and latitude, but may also be a postal code or the like.
The feature word database contains multiple feature words, each corresponding to a mutually distinct feature code. In this embodiment the feature words are divided into categories, and each category contains multiple feature words. The categories include, for example, region, sector, community name, type (residential, commercial, and so on), house type, floor area, price, and surrounding resources (educational, medical, transport, commercial, and so on). The feature words of the "house type" category include one-bedroom, two-bedroom, three-bedroom, ...; one bedroom with one living room, two bedrooms with one living room, three bedrooms with one living room, ...; one bedroom with two living rooms, two bedrooms with two living rooms, three bedrooms with two living rooms, and so on. Each feature word has its own unique feature code. Feature codes may be assigned arbitrarily; to save storage space and make retrieval easier, a feature code is preferably a string of digits or a combination of English letters and digits.
The naive Bayes classification algorithm (Naive Bayes Classifier) is introduced briefly here. Let x = {a1, a2, ..., am} be an item to be classified, where each ai (i = 1, 2, ..., m) is a feature attribute of x. Let y1, y2, ..., yn be the possible categories. The goal is to decide which yj (j = 1, 2, ..., n) the item x belongs to. The core idea of the algorithm is to compute, for each category yj, the probability P(yj|x) that yj occurs given that the item x occurs. If P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)} for some k in 1, 2, ..., n, that is, if category yk has the highest probability given x, then x is judged to belong to category yk.
P(A|B) denotes the probability that event A occurs given that event B has occurred, and is called the conditional probability of A given B. P(A|B) = P(AB)/P(B), where P(AB) is the probability that A and B occur together and P(B) is the probability that B occurs. Sometimes P(A|B) is easy to obtain directly while P(B|A) is not. Bayes' theorem resolves this: P(B|A) = P(A|B)P(B)/P(A).
Based on Bayes' theorem, the probability P(yj|x) of each category yj given the item x is calculated as follows:
(1) Find a set of items whose categories are known and whose feature attributes have been identified. This set is called the training sample set. a1, a2, ..., am is the set of all feature attributes, and y1, y2, ..., yn is the set of all categories.
(2) Obtain by counting the probability P(ai|yj) of each feature attribute under each category. This can be computed as P(ai yj)/P(yj), where both P(ai yj) and P(yj) are obtained by counting.
(3) By Bayes' theorem, P(yj|x) = P(x|yj)P(yj)/P(x). The denominator P(x) is the same for every category, so it suffices to find the category with the largest numerator. The naive Bayes classification algorithm assumes that the feature attributes ai are conditionally independent, so P(x|yj)P(yj) = P(a1|yj)P(a2|yj)...P(am|yj)P(yj).
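Written out as a single decision rule with the same symbols, steps (1) to (3) amount to the following. The normalized posterior in the second expression is an added restatement of Bayes' theorem over the n categories under the independence assumption, not a formula quoted from the application.

```latex
\hat{y} = \arg\max_{j \in \{1,\dots,n\}} \; P(y_j)\prod_{i=1}^{m} P(a_i \mid y_j),
\qquad
P(y_j \mid x) = \frac{P(y_j)\prod_{i=1}^{m} P(a_i \mid y_j)}
                     {\sum_{k=1}^{n} P(y_k)\prod_{i=1}^{m} P(a_i \mid y_k)}
```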
Referring to Fig. 2 and summarizing the explanation above, classification with the naive Bayes classification algorithm comprises the following steps:
Step 1a: set up the training sample set. Specifically, form multiple training samples, each with one or more feature attributes, and classify each training sample manually. This is the only stage of the naive Bayes classification algorithm that requires manual work.
Step 1b: train the classifier. Specifically, count from the training sample set the frequency of each category and the conditional probability of each feature attribute with respect to each category. This stage can be completed automatically by a program.
Step 1c: apply the classifier. Specifically, using the training sample set and the probabilities computed during training, classify items outside the training sample set with the naive Bayes algorithm and decide which category each belongs to. This stage can also be completed automatically by a program.
The address-search training database is built by applying the naive Bayes classification algorithm. It contains a training sample set made up of multiple passages of text, each passage being one training sample. Each passage consists of one or more words, and each word is a feature attribute. Each passage has been determined to belong either to the "has the intent to search by address" category or to the "does not have the intent to search by address" category; that is, the category of every training sample is known.
Once the address-search training database has been built, the probability of occurrence of each of the two categories is counted from it, together with the probability that each word occurs together with each category. From these two statistics, the probability of each word under the condition of each category (that is, the conditional probability of each word with respect to the two categories) can be calculated.
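The counting described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `train`, the labels "address" and "other", and the toy passages are all hypothetical, and probabilities are estimated as simple fractions of passages.

```python
from collections import Counter

def train(samples):
    """samples: list of (words, label) pairs, one pair per manually labeled passage.

    Returns three dicts (hypothetical layout):
      prior[label]         = P(label)
      joint[(word, label)] = P(word and label occur together)
      cond[(word, label)]  = P(word | label) = joint / prior
    """
    total = len(samples)
    class_count = Counter(label for _, label in samples)
    joint_count = Counter()
    for words, label in samples:
        for word in set(words):          # count each word once per passage
            joint_count[(word, label)] += 1
    prior = {c: n / total for c, n in class_count.items()}
    joint = {key: n / total for key, n in joint_count.items()}
    cond = {(w, c): p / prior[c] for (w, c), p in joint.items()}
    return prior, joint, cond

# Toy training passages, already segmented into words (illustrative only).
samples = [
    (["pucheng", "road", "366"], "address"),
    (["zhangyang", "road"], "address"),
    (["cheap", "two", "bedroom"], "other"),
]
prior, joint, cond = train(samples)
```

Word and category pairs that never occur in the training passages are simply absent from `cond` here; a practical system would need some way to handle such unseen words, which the text above does not specify.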
Step 2: web page information is collected by a web crawler. For example, page content can be crawled continuously by following the hyperlinks between pages and the page hierarchy of each website. A vertical search engine is highly specialized, and each industry or specialty has a limited number of key websites that gather a large amount of valuable information for that industry or specialty. The crawler of this application collects page information from these key websites especially frequently and thoroughly.
Step 3: an indexing program builds one or more of a coordinate value index, a feature code index, and a keyword index for each collected page.
Websites that publish housing sale and rental listings usually use a standardized page structure: the page content is roughly tabular, and each cell of the table is a field whose name, meaning, and position are relatively fixed, for example "sector", "address", and "house type".
Referring to Fig. 3, the indexing program of the vertical search engine builds the coordinate value index for a collected page through the following steps:
Step 3a: the indexing program uses the page structure to find the field that describes address information, for example a field named "address"; different pages may use other field names.
If the page structure has no "address" field, or has one whose content is empty, no coordinate value index is built for the page.
If the page structure has an "address" field whose content is not empty, proceed to step 3b.
Step 3b: the indexing program determines whether the content of the "address" field contains any geographic word in the geographic word database.
If it contains exactly one geographic word, the coordinate value for that word is looked up in the geographic word database and used as the coordinate value index of the page.
If it contains multiple geographic words, the coordinate value for the one that appears first is looked up in the geographic word database and used as the coordinate value index of the page.
If it contains no geographic word, proceed to step 3c.
Step 3c: the content of the "address" field is looked up on a third-party website (for example a map website or a specialized lookup site; any site that can return coordinates for an address will do).
If the third-party website still cannot return a coordinate value, no coordinate value index is built for the page.
If the third-party website returns a coordinate value, that value is used as the coordinate value index of the page, and the content of the "address" field together with its coordinate value is added to the coordinate value database.
The word "contains" in step 3b should not be read simply as an exact match. It should be understood as the kind of text matching that search engines normally use, which has some fault tolerance. For example, "Pudong" and "Pu Dong" still count as "contains", only with a matching degree below 100%.
Preferably, in step 3b, when the indexing program determines that the content of the "address" field contains multiple geographic words in the geographic word database, the coordinate value of the geographic word with the highest matching degree is used as the coordinate value index of the page. If several geographic words share the highest matching degree, the coordinate value of the one that appears first is used.
Preferably, in step 3c, if the third-party website returns a coordinate value for the content of the "address" field and can also provide better address information corresponding to that coordinate value, the indexing program determines whether the matching degree between the content of the "address" field and the better address information exceeds a threshold. If it does, the part common to both, together with the coordinate value, is added to the coordinate value database. If the third-party website does not provide better address information for the coordinate value, the coordinate value (the first one, if there are several) and the content of the "address" field are added to the coordinate value database.
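Steps 3a to 3c can be sketched as follows. This is a simplified illustration, not the application's implementation: it uses exact substring matching where the text calls for fault-tolerant matching, it stores newly resolved addresses in the same dictionary as the geographic words, and the names `build_coordinate_index`, `geo_words`, and `geocode` (a stand-in for the third-party lookup) are hypothetical.

```python
def build_coordinate_index(page_fields, geo_words, geocode=None,
                           address_keys=("address",)):
    """Return a (latitude, longitude) tuple for the page, or None.

    page_fields: dict of field name -> text extracted from the page structure
    geo_words:   dict of geographic word -> (latitude, longitude)
    geocode:     optional callable(address) -> coordinate or None
                 (hypothetical stand-in for a third-party website)
    """
    # Step 3a: find a non-empty address field.
    address = next((page_fields[k] for k in address_keys if page_fields.get(k)), None)
    if not address:
        return None
    # Step 3b: geographic words contained in the address, ordered by first appearance.
    hits = sorted((address.find(w), w) for w in geo_words if w in address)
    if hits:
        return geo_words[hits[0][1]]
    # Step 3c: fall back to the third-party lookup and remember the answer.
    if geocode is not None:
        coord = geocode(address)
        if coord is not None:
            geo_words[address] = coord
            return coord
    return None

geo_words = {
    "Pucheng Road": (31.227622974921, 121.5126108750701),  # value from the example page
    "Lujiazui": (31.24, 121.50),                            # illustrative value only
}
page = {"sector": "Lujiazui", "address": "Lane 366, Pucheng Road, near Lujiazui"}
coordinate_index = build_coordinate_index(page, geo_words)
```

Sorting by position of first appearance implements the "first geographic word wins" rule of step 3b; the preferred variant that picks the highest matching degree would need a similarity score, which is left out here.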
Referring to Fig. 4, the indexing program of the vertical search engine builds the feature code index for a collected page through the following steps:
Step 4a: the indexing program uses the page structure to find every non-empty field and excludes the field that describes address information (for example a field named "address"; different pages may use other field names).
If the page structure has no fields at all, has only an "address" field, or has fields other than "address" whose content is empty, no feature code index is built for the page.
If the page structure has fields other than "address" whose content is not empty, proceed to step 4b.
Step 4b: the indexing program determines whether the content of these non-empty fields other than "address" contains any feature word in the feature word database.
If it contains one or more feature words, the feature codes for those words are looked up in the feature word database and used as the feature code index of the page.
If it contains no feature word, no feature code index is built for the page.
The word "contains" in step 4b likewise should not be read simply as an exact match. It should be understood as the kind of text matching that search engines normally use, which has some fault tolerance.
The indexing program builds the keyword index for a collected page as follows: all content of the page other than geographic words and feature words, including text content and the titles, descriptions, and comments of multimedia content, is used as the keyword index of the page.
Preferably, the crawler of this application collects only pages with a standardized page structure, so the indexing program likewise builds indexes only for page content that has this tabular character. Alternatively, regardless of how the crawler collects pages, the indexing program builds indexes only for pages with a standardized page structure.
Compared with the full-text indexing of existing indexing programs, extracting geographic words and feature words from the individual fields of the page structure picks out valuable information more directly, and therefore describes and summarizes the characteristics of a page more accurately.
For example, suppose a page has "Lujiazui" in the "sector" field, "Lane 366, Pucheng Road" in the "address" field, and a long passage in the "details" field that includes "small apartments in areas such as People's Square and Lujiazui are in short supply". An existing indexing program would also take "People's Square" as a keyword index entry and could perform retrieval only on the keyword index. This application instead takes "Lujiazui" from the "sector" field as a feature word and stores its feature code as the feature code index; it takes "Lane 366, Pucheng Road" from the "address" field as a geographic word and stores the corresponding "(latitude 31.227622974921, longitude 121.5126108750701)" as the coordinate value index; and it uses the remaining content outside these fields as the keyword index. At retrieval time, this application searches the feature code index, the coordinate value index, and the keyword index simultaneously, with the feature code index and the coordinate value index taking priority over the keyword index.
The coordinate value indexes, feature code indexes, and keyword indexes of all collected pages form the coordinate value index database, the feature code index database, and the keyword index database, respectively.
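Steps 4a and 4b, applied to a page like the one in the example, can be sketched as follows. This is a hedged illustration: the feature codes "F0001" and "H0021", the field names, and the function name `build_feature_code_index` are hypothetical, and exact substring matching stands in for the fault-tolerant matching the text describes.

```python
def build_feature_code_index(page_fields, feature_codes, address_keys=("address",)):
    """Return the list of feature codes for a page (empty list: no index is built).

    page_fields:   dict of field name -> text extracted from the page structure
    feature_codes: dict of feature word -> feature code
    """
    codes = []
    for name, content in page_fields.items():
        # Step 4a: skip the address field and any empty field.
        if name in address_keys or not content:
            continue
        # Step 4b: collect the code of every feature word found in the field.
        for word, code in feature_codes.items():
            if word in content and code not in codes:
                codes.append(code)
    return codes

feature_codes = {"Lujiazui": "F0001", "two bedrooms": "H0021"}  # hypothetical codes
page = {
    "sector": "Lujiazui",
    "address": "Lane 366, Pucheng Road",
    "house type": "two bedrooms with one living room",
    "details": "small apartments in areas such as People's Square and Lujiazui "
               "are in short supply",
}
feature_index = build_feature_code_index(page, feature_codes)
```

Each code is stored once even when the same feature word appears in several fields, which keeps the per-page index compact.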
Step 4: the search program responds to the user's query. The retrieval flow is shown in Fig. 5.
Step 5a: the search program determines whether the user's query is a search by address, that is, it classifies the query with the naive Bayes classification algorithm based on the address-search training database.
In a concrete implementation, the probability that the query belongs to the "has the intent to search by address" category is calculated first. If the calculated probability is greater than or equal to a threshold, the query is judged to be a search by address; otherwise it is judged not to be. The threshold is, for example, 80%.
If the query is judged to be a search by address, proceed to step 5b.
If the query is judged not to be a search by address, proceed to step 5d.
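The threshold test of step 5a can be sketched as follows, assuming the category priors and word conditional probabilities are already available. The function names, the dictionary layout, the probability values, and the small `floor` used for words missing from the training data are all assumptions made for the illustration; only the 80% threshold comes from the text.

```python
import math

def address_intent_probability(words, prior, cond, floor=1e-6):
    """P("address" | words) under the naive Bayes independence assumption.

    prior: dict label -> P(label), with labels "address" and "other"
    cond:  dict (word, label) -> P(word | label)
    floor: probability used for unseen (word, label) pairs (an assumption)
    """
    log_score = {}
    for label, p in prior.items():
        score = math.log(p)
        for word in words:
            score += math.log(cond.get((word, label), floor))
        log_score[label] = score
    # Normalize in log space so that long queries do not underflow.
    top = max(log_score.values())
    weight = {label: math.exp(s - top) for label, s in log_score.items()}
    return weight["address"] / sum(weight.values())

def is_address_search(words, prior, cond, threshold=0.8):
    """Step 5a decision: True means proceed to step 5b, False to step 5d."""
    return address_intent_probability(words, prior, cond) >= threshold

# Illustrative probabilities only.
prior = {"address": 0.5, "other": 0.5}
cond = {
    ("road", "address"): 0.4, ("lane", "address"): 0.3, ("cheap", "address"): 0.05,
    ("road", "other"): 0.05, ("lane", "other"): 0.05, ("cheap", "other"): 0.4,
}
by_address = is_address_search(["road", "lane"], prior, cond)
```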
Step 5b: the search program determines whether the query contains any geographic word in the geographic word database.
If it contains one or more geographic words, the coordinate values for those words are looked up in the geographic word database and recorded; then proceed to step 5d.
If it contains no geographic word, proceed to step 5c.
Step 5c: the search program submits the query to a third-party website for a coordinate lookup.
If the third-party website returns a coordinate value, the query and its coordinate value are added to the coordinate value database; then proceed to step 5d.
If the third-party website still cannot return a coordinate value, proceed to step 5d.
Step 5d: the search program determines whether the query contains any feature word in the feature word database.
If it contains one or more feature words, the feature codes for those words are looked up in the feature word database and recorded; then proceed to step 5e.
If it contains no feature word, proceed to step 5e.
Step 5e: if any content remains in the query after the geographic words and feature words have been removed, the remaining content is treated as keywords; then proceed to step 5f.
If no content remains in the query after the geographic words and feature words have been removed, proceed to step 5f.
Step 5f: when the query contains geographic words, a certain distance range around the coordinate values obtained from the query is used as the search condition in the coordinate value index database;
When the query contains feature words, the feature codes obtained from the query are used for retrieval in the feature code index database;
When the query contains keywords, the keywords are used for retrieval in the keyword index database;
The result of one of these three kinds of retrieval, or the intersection of the results of several of them, is presented to the user.
In step 5a or step 5b, the search program usually also performs word segmentation, symbol removal, stop-word removal, and similar operations on the user's query. Word segmentation breaks the query into multiple words. Symbol removal strips non-Chinese symbols from the query. Stop-word removal strips words that carry no meaning, such as prepositions and interjections.
The word "contains" in steps 5b and 5d likewise should not be read simply as an exact match. It should be understood as the kind of text matching that search engines normally use, which has some fault tolerance.
Preferably, in step 5c, if the third-party website returns a coordinate value for the query and can also provide better address information corresponding to that coordinate value, the search program determines whether the matching degree between the query and the better address information exceeds a threshold. If it does, the part common to both, together with the coordinate value, is added to the coordinate value database. If the third-party website does not provide better address information for the coordinate value, the coordinate value (the first one, if there are several) and the query are added to the coordinate value database.
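The query preprocessing mentioned for steps 5a and 5b (word segmentation, symbol removal, stop-word removal) can be sketched as follows. This is only a stand-in: real Chinese word segmentation needs a dedicated segmenter, whereas this sketch splits on whitespace, and the stop-word list and the function name `preprocess` are hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would load a language-specific one.
STOP_WORDS = {"the", "of", "a", "in", "oh", "ah"}

def preprocess(query, stop_words=STOP_WORDS):
    """Return the query as a list of lowercase words without symbols or stop words."""
    # Symbol removal: replace everything that is not a word character or space.
    cleaned = re.sub(r"[^\w\s]", " ", query)
    # Word segmentation: whitespace splitting stands in for a real segmenter.
    words = cleaned.lower().split()
    # Stop-word removal.
    return [w for w in words if w not in stop_words]

words = preprocess("Two-bedroom flat near the Pucheng Road, Lujiazui!")
```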
The seven cases that result from step 5e are as follows:
If the query contains no geographic word and no feature word but contains keywords, the search program uses the keywords for retrieval in the keyword index database;
If the query contains no geographic word, contains feature words, and contains no keywords, the search program uses the feature codes obtained from the query for retrieval in the feature code index database;
If the query contains no geographic word but contains both feature words and keywords, the search program uses the feature codes obtained from the query for retrieval in the feature code index database and, optionally, also uses the keywords for retrieval in the keyword index database;
If the query contains geographic words but no feature words and no keywords, the search program uses a certain distance range around the coordinate values obtained from the query as the search condition in the coordinate value index database;
If the query contains geographic words and keywords but no feature words, the search program uses a certain distance range around the coordinate values obtained from the query as the search condition in the coordinate value index database and, optionally, also uses the keywords for retrieval in the keyword index database;
If the query contains geographic words and feature words but no keywords, the search program uses a certain distance range around the coordinate values obtained from the query as the search condition in the coordinate value index database and also uses the feature codes obtained from the query for retrieval in the feature code index database;
If the query contains geographic words, feature words, and keywords, the search program uses a certain distance range around the coordinate values obtained from the query as the search condition in the coordinate value index database, also uses the feature codes obtained from the query for retrieval in the feature code index database, and, optionally, also uses the keywords for retrieval in the keyword index database;
If retrieval is performed in several index databases at the same time, the intersection of their respective results is presented to the user.
The "certain range" around a coordinate value is, for example, within 500 meters, 1000 meters or 2000 meters of a given latitude/longitude coordinate, or the same postal code area, an adjacent postal code area, and so on. If the query content contains several geographical words, the union of the ranges around their coordinate values is used as the coordinate value search condition.
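As an illustration only, the distance-range condition could be evaluated as in the following Python sketch. The haversine great-circle formula, the in-memory `coord_index` dictionary and the names `haversine_m` and `pages_within_range` are assumptions made for this example; the application does not say how the distance is computed. A page matches if it lies within the radius of any queried coordinate, which yields the union of ranges when several geographical words are present.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def pages_within_range(coord_index, centers, radius_m=1000):
    """Return the IDs of pages whose indexed coordinate lies within radius_m of ANY center.

    coord_index: {page_id: (lat, lon)}  (hypothetical in-memory coordinate index)
    centers:     [(lat, lon), ...]      (one entry per geographical word in the query)
    """
    hits = set()
    for page_id, (lat, lon) in coord_index.items():
        if any(haversine_m(lat, lon, clat, clon) <= radius_m for clat, clon in centers):
            hits.add(page_id)
    return hits
```

A production index would normally use a spatial structure rather than a linear scan; the scan is kept here only to make the condition explicit.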
In the three cases above where the keyword is optionally retrieved in the keyword index database, if adding the keyword condition to the combined search conditions yields zero or very few retrieval results, the keyword condition is ignored.
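The combination rule, including the optional keyword condition, can be sketched as follows. This is a minimal sketch under stated assumptions: result sets are plain Python sets of page IDs, `None` means the corresponding index was not queried, and the `min_results` threshold is a hypothetical stand-in for "very few results", which the text does not quantify.

```python
def combine_results(coord_hits=None, code_hits=None, keyword_hits=None, min_results=1):
    """Intersect the result sets of the indexes that were queried.

    The keyword condition is optional: if applying it leaves fewer than
    min_results pages, it is dropped and the other conditions decide alone.
    """
    mandatory = [set(h) for h in (coord_hits, code_hits) if h is not None]
    if not mandatory:
        # Keyword-only query: nothing to fall back to.
        return set(keyword_hits or ())
    base = set.intersection(*mandatory)
    if keyword_hits is None:
        return base
    narrowed = base & set(keyword_hits)
    return narrowed if len(narrowed) >= min_results else base
```

For example, `combine_results({1, 2, 3}, {2, 3, 4}, {9})` falls back to the intersection of the first two sets because the keyword would leave no results.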
Step 5e has a special case. When the query content entered by the user contains both a geographical word and a Feature Word that denotes a location, the coordinate value search condition is ignored. This is because, compared with a search condition based on the geographical word, a search condition based on a location-denoting Feature Word of a category such as "region", "plate" or "cell name" better matches the user's retrieval intent and locates the target more precisely.
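A minimal sketch of this special case is given below. It assumes that each Feature Word found in the query is available together with its category; the function name `effective_conditions` and the tuple layout are hypothetical. The category names are the three examples given in the text.

```python
# Feature Word categories that denote a location (examples named in the text).
LOCATION_CATEGORIES = {"region", "plate", "cell name"}


def effective_conditions(coords, feature_words):
    """Drop the coordinate condition when a location-denoting Feature Word is present.

    coords:        [(lat, lon), ...] obtained from geographical words in the query
    feature_words: [(word, category), ...] found in the query
    """
    has_location_feature = any(cat in LOCATION_CATEGORIES for _, cat in feature_words)
    if coords and has_location_feature:
        coords = []  # the Feature Word is assumed to locate the target more precisely
    return coords, feature_words
```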
The example above is a vertical search engine for housing and real estate. To turn it into a vertical search engine for shopping, only the condition code database needs to be modified. The Feature Word categories then become, for example: brand, type (catering, cinema, karaoke, ...), per-capita spending, user ratings, and so on. Feature Words in the "brand" category include, for example, Quanjude and KFC. The rest of the scheme is unchanged.
Compared with existing vertical search engines, the implementation method of the vertical search engine of the present application has the following advantages:
First, when indexing webpages, it introduces a coordinate value index and a condition code index, which greatly improves the accuracy with which webpage features are captured.
Second, when retrieving webpages, the original one-dimensional retrieval (retrieval in the keyword index database only) is extended to retrieval in up to three dimensions (combined retrieval in the coordinate value index database, the condition code index database and the keyword index database), which makes the retrieval results more accurate and better matched to the user's search needs.
Third, a Naive Bayes classification algorithm is used to judge whether the user's query content carries the intent of searching by address, so that the coordinate value search condition is enabled only when it is appropriate.
The above are only preferred embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall fall within its scope of protection.

Claims (9)

1. An implementation method of a vertical search engine, characterized in that the method comprises:
In the first step, a geographical word database, a Feature Word database and an address search training database are established;
The geographical word database contains a plurality of geographical words and the coordinate values corresponding to some or all of these geographical words;
The Feature Word database contains a plurality of Feature Words and their corresponding, mutually distinct condition codes;
The address search training database contains multiple passages of text, each of which is classified, for the Naive Bayes classification algorithm, into one of the two classes "has the intent of searching by address" and "does not have the intent of searching by address"; the probability of occurrence of each of the two classes, and the conditional probability of each word of each passage with respect to the two classes, are also computed statistically;
In the second step, webpage information is collected by a webpage capture program;
In the third step, an indexing program builds, for each collected webpage, one or more of a coordinate value index, a condition code index and a keyword index;
In the fourth step, a search program responds to the query content of the user; using the Naive Bayes classification algorithm, the search program judges whether the query content is a search by address; if it is, the coordinate values of geographical words and the condition codes of Feature Words are extracted from the query content, the remainder is taken as keywords, and these are used, alone or in combination, as search conditions in the coordinate value index database, the condition code index database and the keyword index database; if it is not, the condition codes of Feature Words are extracted from the query content, the remainder is taken as keywords, and these are used, alone or in combination, as search conditions in the condition code index database and the keyword index database; the retrieval results are displayed to the user.
2. The implementation method of a vertical search engine according to claim 1, characterized in that, in the first step of the method, the address search training database contains a training sample set made up of multiple passages of text, each passage being one training sample; each passage consists of one or more words, each word being one feature attribute; each passage has been determined to belong either to the class "has the intent of searching by address" or to the class "does not have the intent of searching by address", that is, the class of every training sample is known;
After the address search training database has been established, the probability of occurrence of each of the two classes, and the probability of each word occurring jointly with each of the two classes, are counted on its basis; from these two statistics the probability of each word given each of the two classes, that is, the conditional probability of each word with respect to the two classes, can be calculated.
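The statistics described in this claim correspond to a standard Naive Bayes text classifier, which can be sketched as below. This is a minimal sketch, not the applicant's implementation: add-one (Laplace) smoothing and log-probabilities are assumptions introduced so that unseen words do not produce zero probabilities, and the names `train` and `classify` are hypothetical. The label `True` stands for "has the intent of searching by address".

```python
import math
from collections import Counter


def train(samples):
    """samples: list of (words, label) pairs; each passage is already split into words."""
    class_counts = Counter(label for _, label in samples)    # basis for class priors
    word_counts = {label: Counter() for label in class_counts}
    for words, label in samples:
        word_counts[label].update(words)                     # word/class co-occurrence counts
    vocab = {w for words, _ in samples for w in words}
    return class_counts, word_counts, vocab


def classify(words, model):
    """Return the label with the highest posterior score for the given query words."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)                          # log prior of the class
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            # smoothed conditional probability of the word given the class
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Chinese query text would first need word segmentation, which is outside the scope of this sketch.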
3. The implementation method of a vertical search engine according to claim 1, characterized in that, in the third step of the method, building the coordinate value index comprises the following steps:
Step 3a: the indexing program finds the field describing address information according to the webpage structure;
If the webpage structure has no "address" field, or has an "address" field whose content is empty, no coordinate value index is built for the webpage;
If the webpage structure has an "address" field whose content is not empty, proceed to step 3b;
Step 3b: the indexing program judges whether the content of the "address" field contains any geographical word in the geographical word database;
If it contains only one geographical word, the coordinate value corresponding to that geographical word is looked up in the geographical word database and used as the coordinate value index of the webpage;
If it contains several geographical words, the coordinate value corresponding to the geographical word that occurs first is looked up in the geographical word database and used as the coordinate value index of the webpage;
If it contains no geographical word, proceed to step 3c;
Step 3c: the content of the "address" field is looked up on a third-party website;
If the third-party website still cannot provide a coordinate value, no coordinate value index is built for the webpage;
If the third-party website provides a coordinate value, that coordinate value is used as the coordinate value index of the webpage, and the content of the "address" field together with its coordinate value is added to the geographical word database.
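Steps 3a to 3c can be sketched as follows. The page is modeled as a dictionary of field names to text, the geographical word database as a dictionary of word to coordinate, and the third-party lookup as an optional callable `geocode`; all of these, and the function name `build_coordinate_index`, are assumptions made for illustration.

```python
def build_coordinate_index(page, geo_db, geocode=None):
    """Return the coordinate to index for a page, or None if no index is built.

    page:    {field_name: text}, e.g. {"address": "...", "title": "..."}
    geo_db:  {geographical_word: (lat, lon)}; extended in place by step 3c
    geocode: optional callable(address) -> (lat, lon) or None (third-party lookup)
    """
    # Step 3a: a missing or empty "address" field means no coordinate index.
    address = (page.get("address") or "").strip()
    if not address:
        return None
    # Step 3b: find known geographical words; the one occurring first wins.
    found = [(address.find(word), word) for word in geo_db if word in address]
    if found:
        _, first_word = min(found)
        return geo_db[first_word]
    # Step 3c: fall back to the third-party lookup and remember its answer.
    coord = geocode(address) if geocode is not None else None
    if coord is not None:
        geo_db[address] = coord
    return coord
```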
4. The implementation method of a vertical search engine according to claim 1, characterized in that, in the third step of the method, building the condition code index comprises the following steps:
Step 4a: the indexing program finds each non-empty field according to the webpage structure and excludes from them the field describing address information;
If the webpage structure has no fields at all, or has only an "address" field, or has fields other than the "address" field whose contents are all empty, no condition code index is built for the webpage;
If the webpage structure has fields other than the "address" field whose content is not empty, proceed to step 4b;
Step 4b: the indexing program judges whether the contents of these non-empty fields other than the "address" field contain any Feature Word in the Feature Word database;
If they contain one or more Feature Words, the condition codes corresponding to these Feature Words are looked up in the Feature Word database and used as the condition code index of the webpage;
If they contain no Feature Word, no condition code index is built for the webpage.
5. The implementation method of a vertical search engine according to claim 1, characterized in that, in the third step of the method, the keyword index is built as follows: for each collected webpage, all content remaining after geographical words and Feature Words have been excluded is used as the keyword index of the webpage.
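The condition code indexing of claim 4 and the keyword indexing of claim 5 can be sketched together as below. The dictionary-based page and database layouts, the function names, and the whitespace tokenization of the remaining text are assumptions for this example; the claims do not prescribe a tokenizer.

```python
def build_condition_code_index(page, feature_db):
    """Steps 4a/4b: condition codes of Feature Words found in non-empty, non-address fields.

    page:       {field_name: text}
    feature_db: {feature_word: condition_code}
    Returns a set of condition codes, or None if no index is built.
    """
    texts = [v for k, v in page.items() if k != "address" and v and v.strip()]
    codes = {code for word, code in feature_db.items() if any(word in t for t in texts)}
    return codes or None


def build_keyword_index(page, geo_db, feature_db):
    """Claim 5: everything left after removing geographical words and Feature Words."""
    text = " ".join(v for v in page.values() if v)
    # Remove longer words first so that a word containing a shorter one is removed whole.
    for word in sorted(list(geo_db) + list(feature_db), key=len, reverse=True):
        text = text.replace(word, " ")
    return set(text.lower().split())
```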
6. The implementation method of a vertical search engine according to claim 1, characterized in that, in the third step of the method, the indexing program builds indexes only for webpages that have a standardized webpage structure.
7. The implementation method of a vertical search engine according to claim 1, characterized in that the fourth step of the method specifically comprises:
Step 5a: the search program judges whether the query content entered by the user is a search by address, that is, based on the address search training database, it uses the Naive Bayes classification algorithm to classify the query content into one of the two classes "has the intent of searching by address" and "does not have the intent of searching by address";
If the query content entered by the user is judged to be a search by address, proceed to step 5b;
If the query content entered by the user is judged not to be a search by address, proceed to step 5d;
Step 5b: the search program judges whether the query content contains any geographical word in the geographical word database;
If it contains one or more geographical words, the coordinate values corresponding to these geographical words are looked up in the geographical word database and recorded, then proceed to step 5d;
If it contains no geographical word, proceed to step 5c;
Step 5c: the search program submits the query content to a third-party website for a coordinate lookup;
If the third-party website provides a coordinate value, the query content and its coordinate value are added to the geographical word database, then proceed to step 5d;
If the third-party website still cannot provide a coordinate value, proceed to step 5d;
Step 5d: the search program judges whether the query content contains any Feature Word in the Feature Word database;
If it contains one or more Feature Words, the condition codes corresponding to these Feature Words are looked up in the Feature Word database and recorded, then proceed to step 5e;
If it contains no Feature Word, proceed to step 5e;
Step 5e: if content remains in the query content after geographical words and Feature Words have been excluded, the remaining content is taken as keywords, then proceed to step 5f;
If no content remains in the query content after geographical words and Feature Words have been excluded, proceed to step 5f;
Step 5f: when the query content has a geographical word, a certain distance range around the coordinate value obtained from the query content is used as the search condition in the coordinate value index database;
When the query content has a Feature Word, the condition code obtained from the query content is retrieved in the condition code index database;
When the query content has a keyword, the keyword is retrieved in the keyword index database;
The retrieval results of one of the above three retrieval modes, or the intersection of the retrieval results of several of them combined, are presented to the user.
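The decomposition of a query in steps 5b, 5d and 5e can be sketched as follows. This is a simplified sketch under stated assumptions: the outcome of the step 5a classification is passed in as the boolean `by_address`, the third-party lookup of step 5c is omitted, matching is done by plain substring search, and the remainder is split on whitespace. The function name `parse_query` is hypothetical.

```python
def parse_query(query, geo_db, feature_db, by_address):
    """Split a query into coordinates, condition codes and leftover keywords.

    geo_db:     {geographical_word: (lat, lon)}
    feature_db: {feature_word: condition_code}
    by_address: result of the step 5a classification
    """
    coords, to_remove = [], []
    if by_address:
        # Step 5b: geographical words are only looked for in address-intent queries.
        geo_words = [w for w in geo_db if w in query]
        coords = [geo_db[w] for w in geo_words]
        to_remove += geo_words
    # Step 5d: Feature Words are looked for in every query.
    feature_words = [w for w in feature_db if w in query]
    codes = [feature_db[w] for w in feature_words]
    to_remove += feature_words
    # Step 5e: whatever remains after removing both kinds of words is the keyword part.
    rest = query
    for w in sorted(to_remove, key=len, reverse=True):
        rest = rest.replace(w, " ")
    return coords, codes, rest.split()
```

The three returned parts then drive step 5f: each non-empty part selects the index database to query, and the results are intersected.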
8. The implementation method of a vertical search engine according to claim 7, characterized in that, in step 5f,
If the query content contains no geographical word and no Feature Word but contains a keyword, the search program retrieves the keyword in the keyword index database;
If the query content contains no geographical word and no keyword but contains a Feature Word, the search program retrieves the condition code in the condition code index database;
If the query content contains no geographical word but contains a Feature Word and a keyword, the search program retrieves the condition code in the condition code index database; or the search program retrieves the condition code in the condition code index database and, at the same time, retrieves the keyword in the keyword index database;
If the query content contains a geographical word but no Feature Word and no keyword, the search program uses a certain distance range around the coordinate value as the search condition in the coordinate value index database;
If the query content contains a geographical word and a keyword but no Feature Word, the search program uses a certain distance range around the coordinate value as the search condition in the coordinate value index database; or the search program uses a certain distance range around the coordinate value as the search condition in the coordinate value index database and, at the same time, retrieves the keyword in the keyword index database;
If the query content contains a geographical word and a Feature Word but no keyword, the search program uses a certain distance range around the coordinate value as the search condition in the coordinate value index database and, at the same time, retrieves the condition code in the condition code index database;
If the query content contains a geographical word, a Feature Word and a keyword, the search program uses a certain distance range around the coordinate value as the search condition in the coordinate value index database and, at the same time, retrieves the condition code in the condition code index database; or the search program uses a certain distance range around the coordinate value as the search condition in the coordinate value index database, retrieves the condition code in the condition code index database and, at the same time, retrieves the keyword in the keyword index database;
When retrieval is performed in several index databases at the same time, the intersection of their respective retrieval results is presented to the user.
9. The implementation method of a vertical search engine according to claim 7, characterized in that, in step 5f, when the query content entered by the user contains both a geographical word and a Feature Word that denotes a location, the coordinate value search condition is ignored.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210390588.7A CN102880721B (en) 2012-10-15 2012-10-15 The implementation method of vertical search engine

Publications (2)

Publication Number    Publication Date
CN102880721A (en)    2013-01-16
CN102880721B (en)    2015-10-28

Family

ID=47482047

Country Status (1)

Country Link
CN (1) CN102880721B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant