CN101551813A - Network connection apparatus, search equipment and method for collecting search engine data source - Google Patents

Info

Publication number
CN101551813A
Authority
CN
China
Prior art keywords
return data
search
data
reproducing unit
access device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2009100394591A
Other languages
Chinese (zh)
Inventor
张程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CNA2009100394591A
Publication of CN101551813A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

A method for collecting search engine data sources includes the following steps: copying, from a network connection apparatus, the returned data that a page server generates in response to a client request; parsing the returned data to obtain web page information; and storing the web page information in a search engine web page database and building an index. Because the method copies this returned data at the network connection apparatus, it can obtain web page information that a web spider cannot obtain, which enlarges the search data sources of the search engine. A copy apparatus, a network connection apparatus, and search equipment are also provided.

Description

Network access device, search equipment, and method for collecting search engine data sources
[technical field]
The present invention relates to computer networking technology, and in particular to a reproducing unit in a computer network, a network access device, search equipment, and a method for collecting search engine data sources.
[background technology]
The development of computer networking technology has made it far more convenient for people to obtain information. Computer networks store massive amounts of information, and search engines are widely used to help people find what they need. By entering a keyword, a user can find the web pages that contain it.
The working process of a search engine can be roughly divided into the following three steps:
Crawling web pages: each independent search engine has its own web page crawling program (spider, also called a Web Spider). The crawling program follows the hyperlinks in web pages and fetches pages continuously. The fetched pages are called web page snapshots. Because hyperlinks are ubiquitous on the Internet, in theory most web pages can be collected by starting from a limited set of pages, provided the pages carry suitable hyperlinks.
Organizing information: after fetching web pages, the search engine must do a large amount of preprocessing before it can provide a retrieval service. This process of organizing information is called "building an index". The search engine not only stores the collected information but also arranges it according to certain rules. In this way, the search engine can quickly find the required data without going through all of its stored information again.
Providing a retrieval service: the user enters a keyword, and the search engine finds the web pages matching that keyword in the index database. The search engine mainly returns its results in the form of web page links, through which the user can reach the pages containing the required information. Under each link, the search engine usually provides a short summary of the page to help the user judge whether the page contains the desired content.
As can be seen from the working process described above, the web page crawling program of a search engine can obtain web page information only by following existing links, and cannot obtain the web page information of the following kinds of pages:
1. Pages that have no direct or indirect URL (Uniform Resource Locator, i.e. web page address) link relationship with other pages. Such a page cannot be reached through links provided by other pages and can only be visited by entering its URL manually.
2. Pages that require user identity authentication. Although some web pages provide a link to such a page, it can only be visited after logging in with a username and password, so the crawling program cannot obtain its web page information.
3. Pages that use dynamic data technology (AJAX and the like). The data on the page is generated by back-end queries based on user input and cannot be obtained directly from the page HTML.
Because the web page information of these three kinds of pages cannot be obtained by a web page crawling program, the scope of search engine data sources is limited to some extent.
[summary of the invention]
In view of this, it is necessary to provide a method for collecting search engine data sources that enlarges the data sources of a search engine.
A method for collecting search engine data sources comprises the following steps: copying, from a network access device, the return data that a page server produces in response to a client request; resolving the return data to obtain web page information; and storing the web page information in a search engine web page database and building an index.
In a preferred embodiment, the method further comprises the step of recording the request information of the client.
In a preferred embodiment, the method further comprises the step of judging whether return data with the same URL has already been received within a predetermined time and, if so, not processing the later identical return data.
In a preferred embodiment, the method further comprises the step of storing the return data.
In addition, it is also necessary to provide a reproducing unit that enlarges search engine data sources.
A reproducing unit is configured to be connected to a network access device, to copy from the network access device the return data that a page server produces in response to a client request, and to send the copied return data to search equipment as a search data source of the search equipment.
In a preferred embodiment, the reproducing unit is further configured to record the request information of the client, or to store the return data, or to judge whether return data with the same URL has already been received within a predetermined time and, if so, not to send the copied return data with the same URL to the search equipment.
In addition, it is also necessary to provide a network access device that enlarges search engine data sources.
A network access device comprises a coupling arrangement and a reproducing unit. The coupling arrangement connects a client and a page server, and sends to the client the return data that the page server produces in response to a client request. The reproducing unit is connected to the coupling arrangement, copies from the coupling arrangement the return data that the page server produces in response to the client request, and sends the copied return data to search equipment as a search data source of the search equipment.
In a preferred embodiment, the reproducing unit is further configured to record the request information of the client, or to store the return data, or to judge whether return data with the same URL has already been received within a predetermined time and, if so, not to send the copied return data with the same URL to the search equipment.
In addition, it is also necessary to provide search equipment that enlarges search engine data sources.
Search equipment comprises: a reproducing unit, configured to be connected to a network access device and to copy from the network access device the return data that a page server produces in response to a client request; a resolver, connected to the reproducing unit, which receives and resolves the return data to obtain web page information; an indexing unit, connected to the resolver, which stores the web page information in a search engine web page database and builds an index; and a searcher, which searches the index and produces search results.
In a preferred embodiment, the reproducing unit is further configured to record the request information of the client, or to store the return data, or to judge whether return data with the same URL has already been received within a predetermined time and, if so, not to send the copied return data with the same URL to the search equipment.
By copying from the network access device the return data that the page server produces in response to client requests, the above reproducing unit, network access device, search equipment, and method for collecting search engine data sources can obtain web page information that a Web Spider cannot obtain, which enlarges the search data sources of the search engine.
[description of drawings]
Fig. 1 is a flowchart of the method for collecting search engine data sources of an embodiment;
Fig. 2 is a schematic diagram of a traditional page browsing system;
Fig. 3 is a schematic diagram of the page browsing system of the first embodiment;
Fig. 4 is a schematic diagram of the page browsing system of the second embodiment;
Fig. 5 is a schematic diagram of the page browsing system of the third embodiment.
[embodiment]
When a user browses the network, the request submitted by the user through the client and the data returned by the page server both travel over the computer network. When a reproducing unit is added to a network access device (switch, router), the reproducing unit makes a copy of the return data while the network access device forwards that data to the client, and provides the copy to the search engine as a data source. In this way, data that is difficult or impossible to obtain with traditional methods can be acquired. In other words, the above method does not need the web crawler (Web Spider) program used in current mainstream search engine data acquisition technology.
Fig. 1 is a flowchart of the method for collecting search engine data sources of an embodiment.
Step S110: record the request information of the client. The user sends request information for accessing the page server through the client. This request information may include a web address entered directly by the user, where the page at that address may itself be reachable through links provided by other pages; it may also be request information containing authentication information such as a username and password; and it may also be request information containing user input data. The recorded request information may be one or more of the following: the time, the source IP (Internet Protocol) address, the target IP address, and the web address that the user entered directly or clicked. Recording this request information facilitates later analysis of users' browsing habits and interest preferences, and provides basic data support for delivering search results better suited to individual users.
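The request record of step S110 can be pictured as a small data structure. The following is only a minimal sketch: the class name `RequestRecord`, its field names, and the helper `record_request` are hypothetical and do not come from the patent, which lists only the kinds of information that may be recorded.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class RequestRecord:
    """One logged client request (hypothetical field names)."""
    timestamp: datetime   # time of the request
    source_ip: str        # client address
    target_ip: str        # page server address
    url: str              # address the user entered directly or clicked


def record_request(log: List[RequestRecord], source_ip: str, target_ip: str,
                   url: str, now: Optional[datetime] = None) -> RequestRecord:
    """Append a record of the request to the log and return it."""
    record = RequestRecord(
        timestamp=now or datetime.now(timezone.utc),
        source_ip=source_ip,
        target_ip=target_ip,
        url=url,
    )
    log.append(record)
    return record
```

A log built this way could later be grouped by `source_ip` to study browsing habits, which is the use the description suggests for the recorded information.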
Step S120: copy the returned data. After receiving the request information from the client, the page server produces the corresponding return data (for example static data or dynamic data) and sends it to the client through the network access device, and the browser on the client displays the web page content. While the data returned by the page server is in transit through the network access device, the return data that the page server produced in response to the client request is copied from the network access device, and the copied data may be saved in a memory. The volume of data passing through a network access device is usually large, and the amount stored accumulates over time, so the following two approaches can be used to reduce the required storage capacity. First, the stored data is purged periodically, for example by deleting every day the data stored more than a month earlier. Second, when the network access device receives data returned by the page server, it is judged whether return data with the same URL has already been received within a predetermined time (for example one week); if so, the later identical return data is not processed (that is, steps S130, S140 and so on below are not carried out for it).
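The two storage-saving approaches in step S120 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the class `ReturnDataFilter`, its method names, and the in-memory dictionaries are hypothetical, and the one-week and one-month values are simply the examples given in the description.

```python
import time


class ReturnDataFilter:
    """Skips return data whose URL was already seen inside a time window,
    and purges stored copies that are older than a retention period."""

    def __init__(self, window_seconds=7 * 24 * 3600,
                 retention_seconds=30 * 24 * 3600):
        self.window = window_seconds        # "predetermined time", e.g. one week
        self.retention = retention_seconds  # purge age, e.g. one month
        self.last_seen = {}                 # url -> time of last accepted copy
        self.store = {}                     # url -> (time stored, payload)

    def accept(self, url, payload, now=None):
        """Return True if the copy should go on to parsing and indexing."""
        now = time.time() if now is None else now
        seen = self.last_seen.get(url)
        if seen is not None and now - seen < self.window:
            return False                    # same URL within the window: skip
        self.last_seen[url] = now
        self.store[url] = (now, payload)
        return True

    def purge(self, now=None):
        """Delete stored copies older than the retention period."""
        now = time.time() if now is None else now
        cutoff = now - self.retention
        stale = [u for u, (ts, _) in self.store.items() if ts < cutoff]
        for u in stale:
            del self.store[u]
        return len(stale)
```

Calling `purge()` once a day would correspond to the first approach, and checking `accept()` before steps S130 and S140 to the second.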
Step S130: resolve the return data to obtain web page information. The packets of the return data usually also contain the source IP address (the page server address), the target IP address (the client address), the web page information, and so on. The web page information may include text, pictures, HTML tags, and the like. Resolving the return data yields the web page information it contains.
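As a rough illustration of step S130, the sketch below extracts a title, visible text, and image addresses from an HTML body using Python's standard `html.parser` module. It assumes the HTTP payload has already been reassembled and decoded into a string, which the patent does not detail; the names `PageInfoParser` and `parse_return_data` are hypothetical.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collects the title, visible text and image sources of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self.images = []
        self._in_title = False
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return              # ignore script and style bodies
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def parse_return_data(html):
    """Return a dict of web page information extracted from an HTML string."""
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "text": " ".join(parser.text_parts),
        "images": parser.images,
    }
```

A production parser would also need to handle character encodings, compressed responses, and non-HTML payloads such as the JSON returned by AJAX requests.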
Step S140: store the web page information in the search engine web page database and build an index. As with web page information obtained by a web page crawling program (spider, also called a Web Spider), the search engine not only stores the collected web page information but also arranges it according to certain rules to build the index. Most web pages on the network are still static pages that require no authentication, so much of the web page information in the data returned through the network access device could also be obtained by a Web Spider, and may already be stored and indexed in the search engine web page database. Accordingly, when building the index in step S140, web page information with an identical URL, or with a different URL but identical page content, need not be stored.
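The storing and de-duplication described for step S140 can be sketched with a simple inverted index. The patent does not specify how identical content is detected or how the index is organized; the SHA-256 content hash, the word-level tokenization, and the class name `PageIndex` below are all assumptions made for illustration.

```python
import hashlib
import re
from collections import defaultdict


class PageIndex:
    """Stores page text and a term -> URLs inverted index, skipping pages
    whose URL or whose exact content is already stored."""

    def __init__(self):
        self.pages = {}                   # url -> page text
        self.content_hashes = set()       # digests of stored page texts
        self.postings = defaultdict(set)  # term -> set of urls

    def add(self, url, text):
        """Store and index a page; return False if it is a duplicate."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if url in self.pages or digest in self.content_hashes:
            return False                  # same URL, or same content elsewhere
        self.pages[url] = text
        self.content_hashes.add(digest)
        for term in set(re.findall(r"\w+", text.lower())):
            self.postings[term].add(url)
        return True
```

An exact hash only catches byte-identical content; pages that differ in a timestamp or an advertisement would need a near-duplicate technique, which is outside what the description covers.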
Step S150: when a search request is received, search the search engine web page database. When the search engine receives a search request from a client, it accepts the query and returns data to the client. A search engine receives queries from a large number of clients almost simultaneously at all times; for each client's request it checks the index of the search engine web page database, finds the data the user needs in a very short time, and returns it to the client. At present, search engines mainly return their results in the form of web page links, through which the user can reach the pages containing the required information. Under each link, the search engine usually provides a short summary of the page to help the user judge whether the page contains the desired content.
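The lookup in step S150 can be sketched as an intersection over an inverted index, returning each matching link with a short snippet. This is a simplified illustration: the function `search`, its parameters, and the plain dictionaries passed in are hypothetical, and ranking is omitted because the patent does not discuss it.

```python
import re


def search(postings, pages, query, snippet_len=60):
    """Return (url, snippet) pairs for pages containing every query term.

    postings: dict mapping a lowercase term to a set of URLs
    pages:    dict mapping a URL to its stored page text
    """
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    urls = set(postings.get(terms[0], set()))
    for term in terms[1:]:
        urls &= postings.get(term, set())   # keep pages matching all terms
    results = []
    for url in sorted(urls):
        text = pages[url]
        pos = text.lower().find(terms[0])   # centre the snippet on the first term
        start = max(pos - snippet_len // 2, 0)
        results.append((url, text[start:start + snippet_len]))
    return results
```

The snippet plays the role of the short summary shown under each link in the description.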
The method is described in detail below with a concrete example. Forum A is a forum whose content can be viewed only after authentication; a web crawler cannot obtain its content because it lacks access rights. User B is an authorized user of forum A, and logs in through a browser and visits content C of forum A. Because content C is transmitted over the Internet, it must pass through router device D of the network operator. Content C can therefore be copied and saved on router device D and used as a data source for the search engine. The search engine thereby obtains content C, which it has no permission to access directly, and its data sources are enlarged. Similarly, isolated pages that no other link points to, and pages that use dynamic page technology, can also be obtained with the above method for collecting search engine data sources.
Fig. 2 is a schematic diagram of a traditional page browsing system. The user accesses page server 300 from client 100 through network access device 200, and page server 300 returns data to client 100 through network access device 200. Note that network access device 200 may be a single router or several routers; an end user's client 100 usually reaches the page server through multiple routers.
Fig. 3 is a schematic diagram of the page browsing system of the first embodiment. The page browsing system further includes reproducing unit 400 and search equipment 500. In this embodiment, reproducing unit 400 is an independent hardware device connected between network access device 200 and search equipment 500. It copies from network access device 200 the return data that page server 300 produces in response to requests from client 100, and sends the copied return data to search equipment 500 as a search data source of search equipment 500. Reproducing unit 400 is also used to record the request information of the client. The request information may be one or more of the following: the time, the source IP (Internet Protocol) address, the target IP address, and the web address that the user entered directly or clicked. Recording this request information facilitates later analysis of users' browsing habits and interest preferences, and provides basic data support for delivering search results better suited to individual users. Reproducing unit 400 can also store the return data. To avoid storing more return data or request information than its memory capacity allows, reproducing unit 400 can purge stored data periodically, for example by deleting every day the data stored more than a month earlier. In addition, when the network access device receives data returned by the page server, reproducing unit 400 can judge whether return data with the same URL has already been received within a predetermined time (for example one week); if so, the copied return data with the same URL is not sent to search equipment 500 and need not be stored either.
Search equipment 500 comprises resolver 510, indexing unit 520, and searcher 530. Resolver 510 receives the return data from reproducing unit 400 and resolves it to obtain web page information. Indexing unit 520 stores the web page information in the search engine web page database and builds an index. Searcher 530 searches the search engine web page database when a search request is received and returns the results to the client.
Fig. 4 is a schematic diagram of the page browsing system of the second embodiment. In this embodiment, network access device 200 comprises coupling arrangement 210 and reproducing unit 220. Reproducing unit 220 is part of the network access device and is connected between coupling arrangement 210 and search equipment 500. Coupling arrangement 210 connects client 100 and page server 300, and sends to client 100 the return data that page server 300 produces in response to requests from client 100. Reproducing unit 220 is connected to coupling arrangement 210, copies from coupling arrangement 210 the return data that page server 300 produces in response to requests from client 100, and sends the copied return data to search equipment 500 as a search data source of search equipment 500. Reproducing unit 220 is also used to record the request information of the client. The request information may be one or more of the following: the time, the source IP (Internet Protocol) address, the target IP address, and the web address that the user entered directly or clicked. Recording this request information facilitates later analysis of users' browsing habits and interest preferences, and provides basic data support for delivering search results better suited to individual users. Reproducing unit 220 can also store the return data. To avoid storing more return data or request information than its memory capacity allows, reproducing unit 220 can purge stored data periodically, for example by deleting every day the data stored more than a month earlier. In addition, when the network access device receives data returned by the page server, reproducing unit 220 can judge whether return data with the same URL has already been received within a predetermined time (for example one week); if so, the copied return data with the same URL is not sent to search equipment 500 and need not be stored either.
Fig. 5 is a schematic diagram of the page browsing system of the third embodiment. Search equipment 500 comprises resolver 510, indexing unit 520, searcher 530, and reproducing unit 540. Reproducing unit 540 is connected to network access device 200 and copies from network access device 200 the return data that page server 300 produces in response to requests from client 100. Resolver 510 is connected to reproducing unit 540, and receives and resolves the return data to obtain web page information. Indexing unit 520 is connected to resolver 510, stores the web page information in the search engine web page database, and builds an index. Searcher 530 searches the index and produces search results.
Most web pages on the network are still static pages that require no authentication, so much of the web page information in the data returned through the network access device could also be obtained by a Web Spider, and may already be stored and indexed in the search engine web page database. Accordingly, indexing unit 520 is also used to compare the web page information received from resolver 510 with the web page information already stored in the search engine web page database. If web page information with the same URL or with the same page content has already been stored, the web page information received from resolver 510 is not stored.
Reproducing unit 540 is also used to record the request information of the client. The request information may be one or more of the following: the time, the source IP (Internet Protocol) address, the target IP address, and the web address that the user entered directly or clicked. Recording this request information facilitates later analysis of users' browsing habits and interest preferences, and provides basic data support for delivering search results better suited to individual users. Reproducing unit 540 can also store the return data. To avoid storing more return data or request information than its memory capacity allows, reproducing unit 540 can purge stored data periodically, for example by deleting every day the data stored more than a month earlier. In addition, when the network access device receives data returned by the page server, reproducing unit 540 can judge whether return data with the same URL has already been received within a predetermined time (for example one week); if so, the copied return data with the same URL is not sent to search equipment 500 and need not be stored either.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the claims of the present invention. It should be noted that a person of ordinary skill in the art can make a number of variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. The protection scope of this patent is therefore defined by the appended claims.

Claims (10)

1. A method for collecting search engine data sources, characterized by comprising the following steps:
copying, from a network access device, the return data that a page server produces in response to a client request;
resolving the return data to obtain web page information;
storing the web page information in a search engine web page database and building an index.
2. The method for collecting search engine data sources according to claim 1, characterized by further comprising the step of recording the request information of the client.
3. The method for collecting search engine data sources according to claim 1, characterized by further comprising the step of judging whether return data with the same URL has already been received within a predetermined time and, if so, not processing the later identical return data.
4. The method for collecting search engine data sources according to claim 1, characterized by further comprising the step of storing the return data.
5. A reproducing unit, characterized in that the reproducing unit is configured to be connected to a network access device, to copy from the network access device the return data that a page server produces in response to a client request, and to send the copied return data to search equipment as a search data source of the search equipment.
6. The reproducing unit according to claim 5, characterized in that the reproducing unit is further configured to record the request information of the client, or to store the return data, or to judge whether return data with the same URL has already been received within a predetermined time and, if so, not to send the copied return data with the same URL to the search equipment.
7. A network access device, characterized in that the network access device comprises a coupling arrangement and a reproducing unit; the coupling arrangement connects a client and a page server and sends to the client the return data that the page server produces in response to a client request; the reproducing unit is connected to the coupling arrangement, copies from the coupling arrangement the return data that the page server produces in response to the client request, and sends the copied return data to search equipment as a search data source of the search equipment.
8. The network access device according to claim 7, characterized in that the reproducing unit is further configured to record the request information of the client, or to store the return data, or to judge whether return data with the same URL has already been received within a predetermined time and, if so, not to send the copied return data with the same URL to the search equipment.
9. Search equipment, characterized by comprising:
a reproducing unit, configured to be connected to a network access device and to copy from the network access device the return data that a page server produces in response to a client request;
a resolver, connected to the reproducing unit, which receives and resolves the return data to obtain web page information;
an indexing unit, connected to the resolver, which stores the web page information in a search engine web page database and builds an index;
a searcher, which searches the index and produces search results.
10. The search equipment according to claim 9, characterized in that the reproducing unit is further configured to record the request information of the client, or to store the return data, or to judge whether return data with the same URL has already been received within a predetermined time and, if so, not to send the copied return data with the same URL to the search equipment.
CNA2009100394591A 2009-05-13 2009-05-13 Network connection apparatus, search equipment and method for collecting search engine data source Pending CN101551813A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2009100394591A CN101551813A (en) 2009-05-13 2009-05-13 Network connection apparatus, search equipment and method for collecting search engine data source

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2009100394591A CN101551813A (en) 2009-05-13 2009-05-13 Network connection apparatus, search equipment and method for collecting search engine data source

Publications (1)

Publication Number Publication Date
CN101551813A true CN101551813A (en) 2009-10-07

Family

ID=41156060

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2009100394591A Pending CN101551813A (en) 2009-05-13 2009-05-13 Network connection apparatus, search equipment and method for collecting search engine data source

Country Status (1)

Country Link
CN (1) CN101551813A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629265A (en) * 2012-03-06 2012-08-08 奇智软件(北京)有限公司 Method and system for building up web page database
CN102629265B (en) * 2012-03-06 2016-01-13 北京奇虎科技有限公司 A kind of method and system setting up web database
CN103618649A (en) * 2013-12-03 2014-03-05 北京人民在线网络有限公司 Website data acquisition method and device
CN103793070A (en) * 2014-01-27 2014-05-14 百度在线网络技术(北京)有限公司 Input request message transmitting method and system
CN103793070B (en) * 2014-01-27 2017-01-18 百度在线网络技术(北京)有限公司 Input request message transmitting method and system
CN105938473A (en) * 2015-12-02 2016-09-14 杭州迪普科技有限公司 Method and device for saving website snapshots
CN107103079A (en) * 2017-04-25 2017-08-29 中科院微电子研究所昆山分所 The live broadcasting method and system of a kind of dynamic website
CN110555159A (en) * 2018-03-30 2019-12-10 北大方正集团有限公司 Webpage retrieval method, device, equipment and storage medium
CN110096666A (en) * 2019-05-08 2019-08-06 上海泰豪迈能能源科技有限公司 The method and device of data processing
CN110362730A (en) * 2019-07-15 2019-10-22 百度在线网络技术(北京)有限公司 A kind of index establishing method and device

Similar Documents

Publication Publication Date Title
US7797295B2 (en) User content feeds from user storage devices to a public search engine
US8572100B2 (en) Method and system for recording search trails across one or more search engines in a communications network
CA2420382C (en) A method for searching and analysing information in data networks
US20060206460A1 (en) Biasing search results
US7020082B2 (en) Network usage monitoring device and associated method
US6336117B1 (en) Content-indexing search system and method providing search results consistent with content filtering and blocking policies implemented in a blocking engine
US6006217A (en) Technique for providing enhanced relevance information for documents retrieved in a multi database search
US7949702B2 (en) Method and apparatus for synchronizing cookies across multiple client machines
US20010037325A1 (en) Method and system for locating internet users having similar navigation patterns
CN101551813A (en) Network connection apparatus, search equipment and method for collecting search engine data source
US20100125781A1 (en) Page generation by keyword
AU2001290363A1 (en) A method for searching and analysing information in data networks
RU2453916C1 (en) Information resource search method using readdressing
JP2008507057A (en) Improved user interface
CN110430188A (en) A kind of quick url filtering method and device
US20050125412A1 (en) Web crawling
JP2004110080A (en) Computer network connection method on internet by real name, and computer network system
WO2001075668A2 (en) Search systems
WO2015062652A1 (en) Technique for data traffic analysis
JP2002351913A (en) Method and device for creating portal site
JP2007087349A (en) Information sharing system
JP2013090277A (en) Method, apparatus and system for solving cache server
JP4259858B2 (en) WWW site history search device, method and program
US20050165745A1 (en) Method and apparatus for collecting user feedback based on search queries
WO2001075652A3 (en) Media exchange system and process

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20091007