WO2015074455A1 - Method and apparatus for calculating an associated web page URL pattern - Google Patents

Method and apparatus for calculating an associated web page URL pattern

Info

Publication number
WO2015074455A1
WO2015074455A1 PCT/CN2014/086522 CN2014086522W WO2015074455A1 WO 2015074455 A1 WO2015074455 A1 WO 2015074455A1 CN 2014086522 W CN2014086522 W CN 2014086522W WO 2015074455 A1 WO2015074455 A1 WO 2015074455A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
webpage
feature
page
pattern
Prior art date
Application number
PCT/CN2014/086522
Other languages
English (en)
French (fr)
Inventor
王智广
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201310603918.0A (CN103617225B)
Priority claimed from CN201310607854.1A (CN103617229A)
Priority claimed from CN201310606990.9A (CN103631906A)
Priority claimed from CN201310607851.8A (CN103617228A)
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2015074455A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Definitions

  • The present invention relates to the field of data processing technologies, and in particular to a method and an apparatus for calculating an associated web page URL pattern.
  • Search engines need to adopt different scheduling strategies for different types of web pages.
  • Identifying web page types is therefore a basic task.
  • Within it, identifying page turning pages is a relatively important task.
  • A page turning page is one from which the previous page, the next page, or any other non-current page of a paginated document can be viewed. Turning pages changes which part of a physical book or mobile web form is shown, so that different content can be viewed.
  • When used on the Internet, this mechanism also presents user interface elements that can be used to browse to other pages.
  • The existing method for identifying a page turning page decides whether a web page is an index page according to keywords contained in its URL (Uniform Resource Locator). For example, when the URL contains a keyword such as page, pn, or p, followed by a number, the web page corresponding to the URL is determined to be a page turning page.
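  • To make the keyword-based approach concrete, the following minimal Python sketch flags a URL when a paging keyword is followed by a number. The keyword list comes from the example above; the function name, the separators allowed around the keyword, and the example.com URLs are assumptions made for illustration.

```python
import re

# Keywords from the text's example (page, pn, p). The separators accepted
# before the keyword and between keyword and number are assumptions.
PAGE_KEYWORD = re.compile(r"(?:^|[/?&._\-])(?:page|pn|p)[=_\-/]?\d+", re.IGNORECASE)


def looks_like_page_turning_url(url: str) -> bool:
    """Return True if the URL contains a paging keyword followed by a number."""
    return PAGE_KEYWORD.search(url) is not None
```

  • Under these assumptions a URL such as http://example.com/list?page=2 is flagged, while a forum-style URL such as http://example.com/forum-99-2.html is not, because it carries no paging keyword.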
  • The present invention has been made to provide a method for calculating an associated web page URL pattern, and a corresponding apparatus, that overcome the above problems or at least partially solve them.
  • a method for calculating an associated web page URL pattern, including:
  • a method for identifying a page number identifier in a webpage URL including:
  • a method for establishing an associated web page database including:
  • the associated webpage database is established by using the associated webpage corresponding to the associated webpage URL pattern.
  • an associated web page search method including:
  • Receiving a search request, where the request includes a search keyword;
  • Determining whether the webpage is an associated webpage; if yes, returning the webpage and the homepage information associated with the webpage.
  • an apparatus for calculating an associated web page URL pattern, including:
  • a page turning feature anchor determining module, adapted to determine whether the page elements of the specified webpage include a page turning feature anchor and, if yes, to call the associated URL extracting module;
  • an associated URL extracting module, adapted to extract the associated URL to which the page turning feature anchor is linked;
  • an associated webpage URL pattern calculation module, adapted to calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.
  • a computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the method of calculating an associated web page URL pattern according to any of claims 1-8.
  • a computer readable medium storing the computer program according to claim 23.
  • The invention uses the page turning feature anchor to identify associated webpages, so the recognition accuracy is high.
  • The associated webpage URL pattern is calculated from the URL of the specified webpage and the associated URL, so the calculation efficiency is high.
  • The invention replaces the digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix; when the first feature URL prefix is the same as the second feature URL prefix, the first or second feature URL prefix is used as the associated webpage URL pattern.
  • Because matching uses the common part of the URLs, the recognition accuracy for associated webpages is further improved and the recall rate is greatly improved; more than 90% of associated webpages can be identified in practical applications.
  • The invention replaces the page turning block of the associated webpage URL pattern with the first page identifier to obtain the URL of the first associated page.
  • The page turning block can also be replaced with the identifiers of other associated webpages to obtain their URLs, which increases the coverage of associated webpages, yields a more complete set of associated webpages, and enables fine-grained operations.
  • The invention extracts the associated webpage URL pattern from the currently captured webpage and establishes the associated webpage database from the associated webpages corresponding to that pattern, which avoids repeated crawling of webpages, reduces the use of system resources, and greatly improves the efficiency of establishing the database.
  • When the webpage matched by the keyword is determined to be an associated webpage, the invention returns the webpage together with the homepage information associated with it, so that the user does not need to repeat the search or look for the homepage; this reduces system operations and resource usage and improves search efficiency.
  • FIG. 1 is a flow chart showing the steps of Embodiment 1 of a method for calculating an associated web page URL pattern according to an embodiment of the present invention;
  • FIG. 2 schematically shows an example of a web page structure according to an embodiment of the present invention;
  • FIG. 3 schematically shows an example of a page turning block according to an embodiment of the present invention;
  • FIG. 4 is a flow chart showing the steps of Embodiment 2 of a method for calculating an associated web page URL pattern according to an embodiment of the present invention;
  • FIG. 5 is a flow chart showing the steps of an embodiment of a method for identifying a page number identifier in a webpage URL according to an embodiment of the present invention;
  • FIG. 6 is a flow chart showing the steps of an embodiment of a method for establishing an associated webpage database according to an embodiment of the present invention;
  • FIG. 7 is a flow chart showing the steps of an embodiment of an associated webpage search method according to an embodiment of the present invention;
  • FIG. 8 is a structural block diagram of Embodiment 1 of an apparatus for calculating an associated web page URL pattern according to an embodiment of the present invention;
  • FIG. 9 is a structural block diagram of Embodiment 2 of an apparatus for calculating an associated web page URL pattern according to an embodiment of the present invention;
  • FIG. 10 schematically shows a block diagram of a computing device for performing the method according to the invention;
  • FIG. 11 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • Referring to FIG. 1, a flow chart of the steps of Embodiment 1 of the method for calculating an associated web page URL pattern is shown, which may include the following steps:
  • Step 101: determine whether the page elements of the specified web page include a page turning feature anchor; if yes, perform step 102;
  • The webpage can be divided into multiple areas according to function. Take a page of a forum (BBS) as an example. As shown in FIG. 2, the page can be divided into a navigation block (1), garbage blocks (2, 4), a page turning block (3), a title block (5), an author information block (6), a publication date block (7), and a text block (8). The navigation block can be located in the header at the top of the web page or below the banner, and points to the information sections of the website.
  • A garbage block can be an area containing page elements that have low relevance to the web page topic, such as "post" and "reply".
  • The page turning block can be an area that indicates page turning.
  • The title block can be the area in which the title of the web page (such as "Secure Browser Gather Black Thursday" shown in FIG. 2) is located.
  • The author information block is an area that records the author information of the web page.
  • The text block is the area that records the body text on the subject of the web page.
  • Referring to FIG. 3, an example of a page turning block according to one embodiment of the present invention is shown.
  • The page turning block may mainly be composed of page turning feature anchors. A page turning feature anchor is a page turning feature string, that is, a page element that identifies page turning.
  • the page turning feature anchor may include one or more of the following:
  • The page turning feature anchors above are only examples.
  • Other page turning feature anchors may be set according to actual conditions, which is not limited by the embodiment of the present invention.
  • Step 101 may specifically include the following sub-steps:
  • Sub-step S11: match page turning feature anchors against the DOM tree nodes of the current webpage;
  • Sub-step S12: when the matching succeeds, determine that the current webpage has a page turning feature anchor.
  • The DOM (Document Object Model) is a standard programming interface for handling extensible markup languages.
  • The DOM provides platform- and language-independent access to, and modification of, the content and structure of a document; it is a way of representing and processing an HTML (Hypertext Markup Language) or XML (eXtensible Markup Language) document.
  • The DOM is in effect a document model described in an object-oriented manner.
  • The DOM defines the objects needed to represent and modify documents, the behavior and properties of those objects, and the relationships between them.
  • The DOM can be thought of as a tree representation of the data and structure on the page, although the page may not actually be implemented in this way.
  • An HTML document can be restructured via JavaScript: items on the page can be added, removed, changed, or rearranged.
  • HTML documents can be thought of as a tree structure, and this structure is called a node tree (HTML DOM). With the HTML DOM, all nodes in the tree are accessible via JavaScript. All HTML elements (nodes) can be modified, and nodes can be created or deleted.
  • The nodes in the node tree have a hierarchical relationship with each other, described with terms such as parent, child, and sibling. A parent node has child nodes, and child nodes at the same level are called siblings. The top node of the tree is called the root. Every node except the root has a parent node. A node can have any number of children, and siblings are nodes that share the same parent.
  • getElementById() and getElementsByTagName() can find any HTML element in the entire HTML document. Both methods ignore the structure of the document. If you look up all the <p> elements in the document, getElementsByTagName() will find them all, no matter at which level of the document each <p> element sits. Likewise, getElementById() returns the correct element no matter where it is hidden in the document structure. These two methods provide whatever HTML elements are needed, regardless of where they are in the document.
  • getElementById() returns the page element with the specified ID.
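  • The node tree and the depth-independent lookup described above can be tried out with Python's standard xml.dom.minidom module, which mirrors the DOM interface. This is only an illustrative sketch: the tiny well-formed document is made up for the example, and the text itself describes the JavaScript API rather than this Python one.

```python
from xml.dom import minidom

# A tiny, well-formed document used purely for illustration.
doc = minidom.parseString(
    "<html><body><div><p>one</p><div><p>two</p></div></div><p>three</p></body></html>"
)

# getElementsByTagName finds every <p> element, whatever its depth in the tree.
paragraphs = doc.getElementsByTagName("p")
texts = [p.firstChild.data for p in paragraphs]

# Parent and child relationships are exposed on every node.
root = doc.documentElement   # the <html> root element
body = root.firstChild       # its only child, <body>
```

  • Here texts lists the three paragraphs in document order even though they sit at different depths, and body.parentNode leads back to the root element.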
  • For example, it may be identified whether a hyperlink <a> (anchor) in the DOM tree of the web page's HTML text includes one or more of [<<], [>>], [< <], [> >], [ "], ["], [>], [<], [Previous], [Previous], [Next], [next], [Last], [Last], [Previous Page], [Next Page], [<Previous Page], [<Previous], [Next], [Next Page], [1...]; if yes, it is determined that the current webpage has a page turning feature anchor.
  • <a> can be used to link the text or picture at the current position to other pages, text, or images.
  • the basic syntax structure of the <a> tag can be as follows:
  • the content of the <a> identifier in the following HTML text is:
  • Step 102: extract the associated URL (Uniform Resource Locator) to which the page turning feature anchor is linked;
  • The page turning feature anchor may be linked to one or more associated URLs.
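  • Steps 101 and 102 can be sketched in Python with the standard html.parser module. This is a minimal sketch, not the patented implementation: the anchor strings in PAGE_TURNING_ANCHORS, the class name PageTurningLinkFinder, and the function extract_associated_urls are assumptions chosen for illustration, and exact text matching is only one possible matching rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Illustrative anchor strings only; a real list would be set per deployment.
PAGE_TURNING_ANCHORS = {"<<", ">>", "<", ">", "Previous", "Next", "Last",
                        "Previous Page", "Next Page"}


class PageTurningLinkFinder(HTMLParser):
    """Collect the hrefs of <a> elements whose text is a page turning anchor."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.associated_urls = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() in PAGE_TURNING_ANCHORS:
                # Resolve relative links against the URL of the specified page.
                self.associated_urls.append(urljoin(self.base_url, self._href))
            self._href = None


def extract_associated_urls(html: str, base_url: str) -> list:
    """Return the associated URLs linked from page turning anchors (may be empty)."""
    finder = PageTurningLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.associated_urls
```

  • An empty result corresponds to the "no page turning feature anchor" branch of step 101; a non-empty result is the list of associated URLs passed on to step 103.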
  • Step 103: calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.
  • The associated web page URL pattern can be regarded as a set of URLs/web pages that are similar in form or function.
  • Step 103 may specifically include the following sub-steps:
  • Sub-step S21: replace the digital blocks in the URL of the specified webpage with a wildcard character to obtain a first feature URL prefix, where a digital block is a single digit or a run of digits delimited by interval identifiers;
  • Sub-step S31: replace the digital blocks in the associated URL with a wildcard character to obtain a second feature URL prefix;
  • The wildcard character may be any character, which is not limited in this embodiment of the present invention.
  • The interval identifier may be a symbol that separates parts of the URL, such as "/", ".", "-", "?", ":", and the like.
  • A digital block must consist entirely of consecutive digits between interval identifiers; for example, "123ABC" is not a digital block.
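  • The definition of a digital block can be expressed as a regular expression. The Python sketch below assumes that the interval identifiers are exactly the five listed above and that the start and end of the URL also count as boundaries; the names INTERVAL_IDENTIFIERS, DIGITAL_BLOCK, and find_digital_blocks are hypothetical.

```python
import re

# Interval identifiers listed in the text; "and the like" suggests the set is open.
INTERVAL_IDENTIFIERS = "/.-?:"
_SEP = re.escape(INTERVAL_IDENTIFIERS)

# A run of digits bounded on both sides by an interval identifier or the string edge.
DIGITAL_BLOCK = re.compile(rf"(?<![^{_SEP}])\d+(?![^{_SEP}])")


def find_digital_blocks(url: str) -> list:
    """Return the digital blocks of a URL in left-to-right order."""
    return DIGITAL_BLOCK.findall(url)
```

  • With this definition, http://bbs.XXX.com/forum-99-2.html yields the blocks "99" and "2", while a segment such as "123ABC" yields none because the digits are not delimited on both sides.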
  • Sub-step S21 may further include the following sub-step:
  • Sub-step S211: replace the digital blocks at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;
  • Sub-step S31 may further include the following sub-step:
  • Sub-step S311: replace the digital blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
  • The URL of the specified webpage and the associated URL may each have one or more digital blocks.
  • All of these digital blocks may be replaced with the same wildcard character.
  • For example, the URL of the specified web page is http://bbs.XXX.com/forum-99-2.html
  • and the associated URL is http://bbs.XXX.com/forum-99-3.html, where "99" and "2" are recognized as digital blocks and "(\d+)" is an example of a wildcard character.
  • The first feature URL prefix can then be http://bbs.XXX.com/forum-(\d+)-(\d+).html
  • and the second feature URL prefix can be http://bbs.XXX.com/forum-(\d+)-(\d+).html.
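  • Sub-steps S211 and S311 can be reproduced with a short Python sketch that replaces every digital block with the same wildcard string `(\d+)`. The function name feature_url_prefix and the regular expression for digital blocks (digits delimited by "/", ".", "-", "?", ":" or the string edge) are assumptions for illustration.

```python
import re

WILDCARD = r"(\d+)"  # the example wildcard from the text
# Digits delimited by an interval identifier (or the string edge) on both sides.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def feature_url_prefix(url: str) -> str:
    """Replace every digital block in the URL with the same wildcard."""
    # A function is used as the replacement so that the backslash in WILDCARD
    # is inserted literally rather than interpreted as a regex escape.
    return DIGITAL_BLOCK.sub(lambda _match: WILDCARD, url)


first_prefix = feature_url_prefix("http://bbs.XXX.com/forum-99-2.html")
second_prefix = feature_url_prefix("http://bbs.XXX.com/forum-99-3.html")
```

  • Both calls produce the same string, which is what later allows the two prefixes to be compared directly.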
  • Sub-step S21 may alternatively include the following sub-step:
  • Sub-step S212: replace the digital blocks at different positions in the URL of the specified webpage with different wildcard characters to obtain the first feature URL prefix;
  • Sub-step S31 may alternatively include the following sub-step:
  • Sub-step S312: replace each digital block in the associated URL with the same wildcard character that was used at the same position in the first feature URL prefix, to obtain the second feature URL prefix.
  • The URL of the specified webpage and the associated URL may each have one or more digital blocks. To make it more efficient to determine later whether the first feature URL prefix is the same as the second feature URL prefix, and to identify the digital blocks, different wildcard characters may be used to replace the digital blocks at different positions.
  • For example, the URL of the specified web page is http://bbs.XXX.com/forum-99-2.html
  • and the associated URL is http://bbs.XXX.com/forum-99-3.html,
  • where "99" and "2" are recognized as digital blocks, with "(\d+)" and "(\e+)" as examples of wildcard characters.
  • The first feature URL prefix can then be http://bbs.XXX.com/forum-(\d+)-(\e+).html
  • and the second feature URL prefix can be http://bbs.XXX.com/forum-(\d+)-(\e+).html.
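  • The position-specific variant (sub-steps S212 and S312) can be sketched the same way, choosing the wildcard by the position of the digital block. In this Python sketch `(\d+)` and `(\e+)` come from the example above and are used purely as placeholder labels, not as working regular expressions; the extra placeholders, the wrap-around when a URL has more blocks than placeholders, and the name positional_feature_prefix are assumptions.

```python
import re
from itertools import count

# Digits delimited by an interval identifier (or the string edge) on both sides.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")
# "(\d+)" and "(\e+)" are the text's examples; the remaining labels are assumed.
WILDCARDS = [r"(\d+)", r"(\e+)", r"(\f+)", r"(\g+)"]


def positional_feature_prefix(url: str) -> str:
    """Replace the n-th digital block with the n-th placeholder."""
    position = count()

    def replace(_match):
        # Wrap around if the URL has more digital blocks than placeholders.
        return WILDCARDS[next(position) % len(WILDCARDS)]

    return DIGITAL_BLOCK.sub(replace, url)
```

  • Because the placeholder depends only on position, the specified URL and the associated URL still map to the same prefix whenever they differ only in the values of their digital blocks.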
  • Sub-step S41: when the first feature URL prefix is the same as the second feature URL prefix, use the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern.
  • The webpages corresponding to the associated webpage URL pattern of the specified webpage are its associated page turning webpages.
  • Either the first feature URL prefix or the second feature URL prefix may be used as the associated webpage URL pattern.
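  • Putting sub-steps S21, S31, and S41 together, step 103 reduces to computing two prefixes and comparing them. The following Python sketch assumes the same-wildcard variant and returns None when the prefixes differ; the function names and the None convention are assumptions, since the text does not say what happens in that case.

```python
import re
from typing import Optional

# Digits delimited by an interval identifier (or the string edge) on both sides.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def feature_url_prefix(url: str) -> str:
    """Replace every digital block with the wildcard (\\d+)."""
    return DIGITAL_BLOCK.sub(lambda _match: r"(\d+)", url)


def associated_url_pattern(page_url: str, associated_url: str) -> Optional[str]:
    """Return the shared feature URL prefix, or None if the two prefixes differ."""
    first = feature_url_prefix(page_url)          # sub-step S21
    second = feature_url_prefix(associated_url)   # sub-step S31
    return first if first == second else None     # sub-step S41
```

  • For the forum example the two prefixes coincide and the shared prefix becomes the pattern; a link to a differently shaped URL produces no pattern.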
  • The invention uses the page turning feature anchor to identify associated webpages, so recognition accuracy is high, and it calculates the associated webpage URL pattern from the URL of the specified webpage and the associated URL, so calculation efficiency is high.
  • The invention replaces the digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix; when the two are the same, the first or second feature URL prefix is used as the associated webpage URL pattern.
  • Because matching uses the common part of the URLs, recognition accuracy for associated webpages is further improved and the recall rate is greatly improved; more than 90% of associated webpages can be identified in practical applications.
  • Referring to FIG. 4, a flow chart of the steps of Embodiment 2 of the method for calculating an associated web page URL pattern is shown, which may include the following steps:
  • Step 401: determine whether the page elements of the specified web page include a page turning feature anchor; if yes, perform step 402;
  • Step 402: extract the associated URL to which the page turning feature anchor is linked;
  • Step 403: calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked;
  • Step 404: perform structural analysis on the common part of the associated webpage URL pattern, extract the page turning block in the pattern, and replace the page turning block with the first page identifier to obtain the URL of the homepage associated webpage;
  • The page turning block is a digital block that has the same position but different numbers in a plurality of associated webpage URL patterns.
  • the URL may include one or more of the following structures:
  • protocol (scheme): specifies the transport protocol used; the most commonly used is HTTP, which is also the most widely used protocol on the WWW today.
  • The transport protocols include the file protocol (the resource is a file on the local computer; the format is file:///), the ftp protocol (the resource is accessed through FTP; the format is ftp://), and gopher (the resource is accessed through the Gopher protocol),
  • the http protocol (the resource is accessed via HTTP; the format is http://),
  • and the https protocol (the resource is accessed through secure HTTPS; the format is https://).
  • hostname: the domain name system (DNS) host name or IP address of the server hosting the resource. Sometimes the username and password required to connect to the server (in the format username:password) can also be included before the host name.
  • port (port number): when omitted, the default port of the scheme is used. Each transport protocol has a default port number; for example, the default port of http is 80. For security or other reasons the port can be redefined on the server, that is, a non-standard port number is used; in that case the port number cannot be omitted from the URL.
  • path: a string of zero or more segments separated by "/" symbols, generally used to represent a directory or file address on the host.
  • parameters: can be used to specify optional parameters.
  • dynamic web pages, such as web pages created using CGI, ISAPI, PHP/JSP/ASP/ASP.NET technologies
  • fragment: can be used to specify a fragment within the network resource. For example, if a web page contains explanations of several terms, the fragment can be used to jump directly to the explanation of one term.
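  • The URL components listed above map directly onto the fields returned by Python's standard urllib.parse.urlparse. The URL in this sketch is hypothetical and was chosen only so that every component is present.

```python
from urllib.parse import urlparse

# Hypothetical URL chosen to exercise every component named above.
parts = urlparse(
    "http://user:secret@www.example.com:8080/docs/index.html;v=1?id=7#intro"
)

components = {
    "protocol": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "hostname": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "parameters": parts.params,
    "query": parts.query,
    "fragment": parts.fragment,
}

# When the port is omitted, urlparse reports None and the client is expected
# to fall back to the scheme's default port (80 for http).
default_port = urlparse("http://www.example.com/").port
```

  • Splitting a URL this way is a convenient starting point for the structural analysis of the common part of the pattern, since the page turning block usually sits in the path or the query.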
  • The page turning block in the associated webpage URL pattern is extracted and then replaced with the homepage identifier to obtain the URL of the homepage associated webpage.
  • The homepage identifier may include 0, 1, and/or the maximum value among the current associated webpages.
  • The homepage associated webpage generally records important content, such as the text block shown in FIG. 3. Its importance is therefore relatively high, and knowing the homepage associated webpage is correspondingly valuable.
  • Different websites adopt different page turning structures, so the homepage associated webpage differs from site to site. For example, some websites use page 0 as the homepage associated page, some use page 1, and some use the largest page (such as 2100 shown in FIG. 3), and so on.
  • The foregoing homepage associated webpage is only an example.
  • In implementing the embodiment of the present invention, the digital block can be replaced with the identifier of any associated webpage to obtain the corresponding associated webpage according to the actual situation, which is not limited by the embodiment of the present invention.
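  • Step 404 can be sketched in Python as two small functions: one that finds which digital blocks differ between the specified URL and the associated URL (the page turning block), and one that rewrites those blocks with a chosen identifier. The sketch assumes both URLs already share the same feature URL prefix; the function names are hypothetical, and the choice of identifier (0, 1, or the largest page) remains site-specific as described above.

```python
import re

# Digits delimited by an interval identifier (or the string edge) on both sides.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def page_turning_positions(page_url: str, associated_url: str) -> list:
    """Indexes of the digital blocks whose values differ between the two URLs."""
    own = DIGITAL_BLOCK.findall(page_url)
    other = DIGITAL_BLOCK.findall(associated_url)
    if len(own) != len(other):
        return []
    return [i for i, (a, b) in enumerate(zip(own, other)) if a != b]


def replace_page_turning_block(url: str, positions: list, identifier: str) -> str:
    """Rewrite the digital blocks at the given positions with the identifier."""
    index = -1

    def replace(match):
        nonlocal index
        index += 1
        return identifier if index in positions else match.group(0)

    return DIGITAL_BLOCK.sub(replace, url)


positions = page_turning_positions(
    "http://bbs.XXX.com/forum-99-2.html", "http://bbs.XXX.com/forum-99-3.html"
)
homepage_url = replace_page_turning_block(
    "http://bbs.XXX.com/forum-99-2.html", positions, "1"
)
```

  • Passing a different identifier to replace_page_turning_block yields the URL of any other associated webpage in the same series.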
  • The invention replaces the page turning block of the associated webpage URL pattern with the first page identifier to obtain the URL of the first associated page.
  • The page turning block can also be replaced with the identifiers of other associated webpages to obtain their URLs, which increases the coverage of associated webpages, yields a more complete set of associated webpages, and enables fine-grained operations.
  • the method may include the following steps:
  • Step 501: acquire the associated URL to which the page turning feature anchor in the page elements of the specified webpage is linked;
  • Step 501 may specifically include the following sub-steps:
  • Sub-step S51: match page turning feature anchors against the DOM tree nodes of the specified webpage;
  • Sub-step S52: when the matching succeeds, obtain the associated URL from the matched page turning feature anchor.
  • Step 502: calculate the associated webpage URL pattern according to the URL of the specified webpage and the associated URL;
  • Step 502 may specifically include the following sub-steps:
  • Sub-step S61: replace the digital blocks in the URL of the specified webpage with a wildcard character to obtain the first feature URL prefix, where a digital block is a single digit or a run of digits delimited by interval identifiers;
  • Sub-step S71: replace the digital blocks in the associated URL with a wildcard character to obtain the second feature URL prefix;
  • Sub-step S61 may further include the following sub-step:
  • Sub-step S611: replace the digital blocks at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;
  • Sub-step S71 may further include the following sub-step:
  • Sub-step S711: replace the digital blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
  • Sub-step S61 may alternatively include the following sub-step:
  • Sub-step S612: replace the digital blocks at different positions in the URL of the specified webpage with different wildcard characters to obtain the first feature URL prefix;
  • Sub-step S71 may alternatively include the following sub-step:
  • Sub-step S712: replace each digital block in the associated URL with the same wildcard character that was used at the same position in the first feature URL prefix, to obtain the second feature URL prefix.
  • Sub-step S81: when the first feature URL prefix is the same as the second feature URL prefix, use the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern.
  • Step 503: determine, according to the associated webpage URL pattern corresponding to the specified webpage, the page number feature part of the specified webpage URL and the page number feature part of the associated URL, respectively;
  • The page turning block in the associated webpage URL pattern is extracted and then replaced with the homepage identifier to obtain the URL of the homepage associated webpage.
  • The page number feature part in the associated webpage URL pattern may be determined; it may be a digital block that has the same position but different numbers in multiple associated webpage URL patterns.
  • Step 504: compare the page number feature part of the specified webpage URL with that of the associated URL, and extract the digital block that differs as the page number identifier of the specified webpage URL.
  • The page number identifier may include a homepage identifier.
  • The homepage identifier may include 0, 1, and/or the maximum value among the current associated webpages.
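  • Steps 503 and 504 amount to lining up the digital blocks of the two URLs and picking the one that differs. The Python sketch below assumes the two URLs share the same feature URL prefix and that exactly one digital block differs; returning None otherwise is an assumption, as is the name page_number_identifier.

```python
import re
from typing import Optional

# Digits delimited by an interval identifier (or the string edge) on both sides.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def page_number_identifier(page_url: str, associated_url: str) -> Optional[str]:
    """Return the digital block of page_url that differs from associated_url."""
    own = DIGITAL_BLOCK.findall(page_url)
    other = DIGITAL_BLOCK.findall(associated_url)
    if len(own) != len(other):
        return None
    differing = [a for a, b in zip(own, other) if a != b]
    # Only an unambiguous single difference is treated as the page number.
    return differing[0] if len(differing) == 1 else None
```

  • For the forum example, the specified URL http://bbs.XXX.com/forum-99-2.html compared with the associated URL http://bbs.XXX.com/forum-99-3.html gives the page number identifier "2".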
  • The page turning block may be replaced with the first page identifier to obtain the URL of the first page associated webpage.
  • The homepage associated webpage generally records important content, such as the text block shown in FIG. 3. Its importance is therefore relatively high, and knowing the homepage associated webpage is correspondingly valuable.
  • Different websites adopt different page turning structures, so the homepage associated webpage differs from site to site. For example, some websites use page 0 as the homepage associated page, some use page 1, and some use the largest page (such as 2100 shown in FIG. 3), and so on.
  • The foregoing homepage associated webpage is only an example.
  • In implementing the embodiment of the present invention, the digital block can be replaced with the identifier of any associated webpage to obtain the corresponding associated webpage according to the actual situation, which is not limited by the embodiment of the present invention.
  • The invention uses the page turning feature anchor to identify associated webpages, so recognition accuracy is high; it calculates the associated webpage URL pattern from the URL of the specified webpage and the associated URL, so calculation efficiency is high; and it compares the common parts of the URLs, which greatly improves the recall rate, so that more than 90% of associated webpages can be identified in practical applications.
  • The invention replaces the digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix; when the two are the same, the first or second feature URL prefix is used as the associated webpage URL pattern. Matching on the common part of the URLs further improves the recognition accuracy for associated webpages.
  • The invention replaces the page turning block of the associated webpage URL pattern with the first page identifier to obtain the URL of the first associated page.
  • The page turning block can also be replaced with the identifiers of other associated webpages to obtain their URLs, which increases the coverage of associated webpages, yields a more complete set of associated webpages, and enables fine-grained operations.
  • Referring to FIG. 6, a flow chart of the steps of an embodiment of a method for establishing an associated webpage database according to an embodiment of the present invention is shown, which may specifically include the following steps:
  • Step 601: determine whether the captured web page includes the associated web page URL pattern; if yes, perform step 602;
  • Web crawlers are also known as web spiders.
  • A web spider finds web pages by following links. It starts from one page of a website (usually the home page), reads the content of that page, finds the other link addresses in it, and then uses those addresses to find the next pages, looping until all pages of the site have been crawled. If the entire Internet is treated as one website, a web spider can use this principle to capture all the web pages on the Internet.
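  • The crawl loop described above is essentially a breadth-first traversal of the link graph. The Python sketch below keeps everything in memory so that it runs without network access: the SITE dictionary stands in for a hypothetical website, and get_links stands in for whatever fetches a page and extracts its links.

```python
from collections import deque


def crawl(start_url, get_links):
    """Visit every page reachable from start_url once, in breadth-first order."""
    seen = {start_url}
    queue = deque([start_url])
    visited_order = []
    while queue:
        url = queue.popleft()
        visited_order.append(url)
        for link in get_links(url):
            if link not in seen:   # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited_order


# Hypothetical in-memory site: each page maps to the links found on it.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}
pages = crawl("/", lambda url: SITE.get(url, []))
```

  • The seen set is what stops the loop once every page of the site has been visited, even though pages link back to each other.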
  • the associated webpage URL pattern may be a common part of the page turning webpage, that is, a set formed by a long-term or functionally similar URL/webpage.
  • the step 601 may specifically include the following sub-steps:
  • Sub-step S91 determining whether there is a page turning feature string in the page element of the current webpage; if yes, extracting the URL of the page turning feature string link;
  • FIG. 3 there is shown an exemplary diagram showing a page turning block in accordance with one embodiment of the present invention.
  • the page turning block may be mainly composed of page turning feature strings (i.e., page turning feature anchors), and a page turning feature string may be a page element for identifying page turning.
  • the page turning feature string may include one or more of the following: [<<], [>>], [<  <], [>  >], [《], [》], [>], [<], [下一页], [上一页], [上一], [下一], [next], [末页], [尾页], [前页], [后页], [<上一页], [<上一], [下一>], [下一页>], [1...].
  • Of course, the above page turning feature strings are only examples; when implementing the embodiments of the present invention, other page turning feature strings may be set according to actual conditions, which is not limited by the embodiments of the present invention.
  • the current webpage may be the webpage that is captured.
  • the sub-step S91 may further include the following sub-steps:
  • Sub-step S911 using a page turning feature string to perform matching in the DOM tree node of the current webpage;
  • Sub-step S912 when the matching is successful, it is determined that the current webpage has a page turning feature string.
  • Sub-step S92 replacing the digital block in the URL of the current webpage with a preset replacement character to obtain a first feature URL prefix; wherein the digital block is a single digit or a plurality of digits separated by the interval identifier;
  • Sub-step S93 replacing the digital block in the URL of the page-turning feature string link with a preset replacement character to obtain a second feature URL prefix
  • the sub-step S92 may further include the following sub-steps:
  • Sub-step S921 replacing the digital block at different positions in the URL of the current webpage with the same replacement character, to obtain the first feature URL prefix
  • sub-step S93 may further comprise the following sub-steps:
  • Sub-step S931 replacing the digital blocks at different positions in the URL of the feature string link with the same replacement character to obtain a second feature URL prefix.
  • the sub-step S92 may further include the following sub-steps:
  • Sub-step S922: replacing the digital blocks at different positions in the URL of the current webpage with different replacement characters, respectively, to obtain the first feature URL prefix;
  • correspondingly, sub-step S93 may further comprise the following sub-step:
  • Sub-step S932: replacing the digital block at each position in the URL linked by the page turning feature string with the same replacement character that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
  • Sub-step S94: when the first feature URL prefix is the same as the second feature URL prefix, it is determined that the crawled webpage includes an associated webpage URL pattern.
  • Step 602 Acquire the associated webpage URL pattern.
  • the step 602 may specifically include the following sub-steps:
  • Sub-step S101 the first feature URL prefix or the second feature URL prefix is used as a corresponding associated webpage URL pattern of the current webpage.
  • the present invention replaces the digital blocks in the URL of the current webpage with a preset replacement character to obtain the first feature URL prefix, and replaces the digital blocks in the URL linked by the page turning feature string with the preset replacement character to obtain the second feature URL prefix; when the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as the associated webpage URL pattern corresponding to the current webpage.
  • the present invention uses the page turning feature string to identify associated webpages, which gives high recognition accuracy, and matches on the common part of the URLs, which further improves the recognition accuracy and greatly improves the recall rate; in practical applications, more than 90% of associated webpages can be identified.
  • Step 603 Acquire a corresponding associated webpage based on the associated webpage URL pattern.
  • the associated webpages may include a homepage associated webpage and other associated webpages; the homepage associated webpage generally records important content, such as the text block shown in FIG. 3, so its importance is relatively high and knowing the homepage associated webpage is of considerable significance.
  • the step 603 may specifically include the following sub-steps:
  • Sub-step S111: by performing structural analysis on the common part of the associated webpage URL pattern, extracting the page turning block in the associated webpage URL pattern, and replacing the page turning block with the homepage identifier to obtain the URL of the homepage associated webpage;
  • wherein the page turning block is a digital block that has the same position but different numbers in a plurality of associated webpage URL patterns;
  • Sub-step S112 accessing the URL of the homepage associated webpage to obtain the homepage associated webpage.
  • the homepage identifier may include 0, 1, and/or a maximum value in a current associated webpage.
  • the invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage; similarly, the page turning block can be replaced with the identifiers of other linked webpages to obtain the URLs of other associated webpages, thereby increasing the coverage of associated webpages, so that a more comprehensive set of associated webpages can be obtained and fine-grained operations can be achieved.
  • Step 604 Establish an associated webpage database by using an associated webpage corresponding to the associated webpage URL pattern.
  • the associated webpage corresponding to the webpage URL pattern may include a homepage associated webpage and other related webpages, which may be all of the associated webpages, or may be a part of all associated webpages, which is not limited by the embodiment of the present invention.
  • data processing may then be performed on the webpage files captured by the spider, which may specifically include:
  • Webpage structuring: the HTML code of the associated webpage is stripped out and the webpage content is extracted.
  • Link analysis: querying the backlinks of the page, its number of outbound links, and its internal links, and then deciding how much weight to give the page, and so on.
  • the processed data can be stored in the associated webpage database.
  • the invention extracts the associated webpage URL pattern based on the currently crawled webpage and establishes the associated webpage database using the associated webpages corresponding to that pattern, thereby avoiding repeated crawling of webpages, reducing the occupation of system resources, and greatly improving the efficiency of establishing the database.
  • Referring to FIG. 7, a flow chart of the steps of an embodiment of an associated webpage search method according to an embodiment of the present invention is shown, which may specifically include the following steps:
  • Step 701: receiving a search request, where the request includes a search keyword;
  • the search request may refer to a request by the user to perform an associated information search for a certain search keyword.
  • the user can input a search keyword in the browser address bar, the search bar, the search keyword input box in the search engine, and press the enter key or click the search button, which is equivalent to receiving the user's search request.
  • Step 702: performing a lookup in the preset associated webpage database according to the search keyword to obtain a webpage that matches the keyword;
  • the information collected generally consists of keywords or phrases that indicate the content of the associated webpage (including the webpage itself, the URL of the webpage, the code that makes up the webpage, and the links to and from the webpage).
  • the search terms q are segmented into keywords; the index library provides a ranking of the URLs corresponding to each keyword in q; the importance of each keyword is also calculated according to the user's query style and the keyword's part of speech; a comprehensive ranking algorithm is then all that is needed to obtain the search results.
  • the associated web page database can be established in the following manner:
  • Sub-step S101: determining whether the crawled webpage includes an associated webpage URL pattern; if so, performing sub-step S102;
  • the sub-step S101 may specifically include the following sub-steps:
  • Sub-step S121 determining whether the page element of the current webpage has a page turning feature string; if yes, extracting the URL of the page turning feature string link;
  • the sub-step S121 may further include the following sub-steps:
  • Sub-step S1211 using a page turning feature string to perform matching in a DOM tree node of the current webpage
  • Sub-step S1212 when the matching is successful, it is determined that the current webpage has a page turning feature string.
  • Sub-step S122 replacing the digital block in the URL of the current webpage with a preset replacement character to obtain a first feature URL prefix; wherein the digital block is a single number or multiple digits separated by the interval identifier;
  • Sub-step S123 replacing the digital block in the URL of the page-turning feature string link with a preset replacement character to obtain a second feature URL prefix
  • the sub-step S122 may further include the following sub-steps:
  • sub-step S123 may further comprise the following sub-steps:
  • Sub-step S1231 replacing the digital block at different positions in the URL of the feature string link with the same replacement character to obtain a second feature URL prefix.
  • the sub-step S122 may further include the following sub-steps:
  • Sub-step S1222: replacing the digital blocks at different positions in the URL of the current webpage with different replacement characters, respectively, to obtain the first feature URL prefix;
  • correspondingly, sub-step S123 may further comprise the following sub-step:
  • Sub-step S1232: replacing the digital block at each position in the URL linked by the page turning feature string with the same replacement character that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
  • Sub-step S124: when the first feature URL prefix is the same as the second feature URL prefix, it is determined that the crawled webpage includes an associated webpage URL pattern.
  • Sub-step S102 acquiring the associated webpage URL pattern
  • the sub-step S102 may specifically include the following sub-steps:
  • Sub-step S131 the first feature URL prefix or the second feature URL prefix is used as a corresponding associated webpage URL pattern of the current webpage.
  • the present invention replaces the digital blocks in the URL of the current webpage with a preset replacement character to obtain the first feature URL prefix, and replaces the digital blocks in the URL linked by the page turning feature string with the preset replacement character to obtain the second feature URL prefix; when the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as the associated webpage URL pattern corresponding to the current webpage.
  • the present invention uses the page turning feature string to identify associated webpages, which gives high recognition accuracy, and matches on the common part of the URLs, which further improves the recognition accuracy and greatly improves the recall rate; in practical applications, more than 90% of associated webpages can be identified.
  • Sub-step S103 acquiring the corresponding associated webpage by using the associated webpage URL pattern
  • the sub-step S103 may specifically include the following sub-steps:
  • Sub-step S141: by performing structural analysis on the common part of the associated webpage URL pattern, extracting the page turning block in the associated webpage URL pattern, and replacing the page turning block with the homepage identifier to obtain the URL of the homepage associated webpage; wherein the page turning block is a digital block that has the same position but different numbers in a plurality of associated webpage URL patterns;
  • Sub-step S142: accessing the URL of the homepage associated webpage to obtain the homepage associated webpage.
  • the invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage; similarly, the page turning block can be replaced with the identifiers of other linked webpages to obtain the URLs of other associated webpages, thereby increasing the coverage of associated webpages, so that a more comprehensive set of associated webpages can be obtained and fine-grained operations can be achieved.
  • Sub-step S104 the associated webpage database is established by using the associated webpage corresponding to the associated webpage URL pattern.
  • Step 703: determining whether the webpage is an associated webpage; if yes, performing step 704;
  • determining whether the webpage includes an associated webpage URL pattern can determine whether the webpage is an associated webpage. That is, when the webpage includes an associated webpage URL pattern, the webpage is determined to be an associated webpage.
  • Step 704 returning the webpage and the homepage information associated with the webpage.
  • the embodiment of the present invention may store the corresponding relationship between the URL pattern of the associated webpage and the corresponding webpage, and the homepage associated with the webpage may be obtained by querying the corresponding webpage URL pattern of the webpage and the corresponding relationship of the webpage.
  • the search engine can display the search results on the user's viewing interface for the user to use.
  • when the webpage matching the keyword is determined to be an associated webpage, the invention returns the webpage together with the homepage information associated with it, which spares the user from repeating the search or looking for the homepage, further reduces system operations and the occupation of system resources, and improves search efficiency.
  • Referring to FIG. 8, a structural block diagram of device embodiment 1 for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically include the following modules:
  • the page turning feature anchor determining module 801 is adapted to determine whether the page element of the specified web page has a page turning feature anchor; if so, the associated URL extracting module 802 is invoked;
  • the URL extraction module 802 is adapted to extract an associated URL to which the page turning feature anchor is linked;
  • the associated webpage URL pattern calculation module 803 is adapted to calculate an associated webpage URL pattern pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.
  • the page turning feature anchor determining module 801 is further adapted to:
  • Matching is performed in the DOM tree node of the current webpage by using a page turning feature anchor;
  • the page flip feature anchor may be linked to one or more associated URLs.
  • the associated webpage URL pattern calculation module 803 may specifically include the following modules:
  • a first feature URL prefix obtaining module, adapted to replace the digital blocks in the URL of the specified webpage with a wildcard character to obtain a first feature URL prefix; wherein a digital block is a single digit or a plurality of digits delimited by interval identifiers;
  • a second feature URL prefix obtaining module, adapted to replace the digital blocks in the associated URL with a wildcard character to obtain a second feature URL prefix;
  • the associated webpage URL pattern obtaining module is configured to use the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern pattern when the first feature URL prefix is the same as the second feature URL prefix.
  • the first feature URL prefix obtaining module may further be adapted to:
  • the second feature URL prefix obtaining module may further be adapted to:
  • the second feature URL prefix is obtained by replacing the digital blocks at different positions in the associated URL with the same wildcard characters.
  • the first feature URL prefix obtaining module may further be adapted to:
  • the first feature URL prefix is obtained by using different wildcard characters to replace the digital blocks in different positions in the URL of the specified webpage.
  • the second feature URL prefix obtaining module may also be adapted to:
  • Referring to FIG. 9, a structural block diagram of device embodiment 2 for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically include the following modules:
  • the page turning feature anchor determining module 901 is adapted to determine whether the page element of the specified web page has a page turning feature anchor; if so, the associated URL extracting module 902 is invoked;
  • the URL extraction module 902 is adapted to extract an associated URL to which the page turning feature anchor is linked;
  • the associated webpage URL pattern calculation module 903 is adapted to calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked;
  • the homepage associated webpage URL obtaining module 904 is adapted to extract the page turning block in the associated webpage URL pattern by performing structural analysis on the common part of the associated webpage URL pattern, and to replace the page turning block with the homepage identifier to obtain the URL of the homepage associated webpage; wherein the page turning block is a digital block that has the same position but different numbers in a plurality of associated webpage URL patterns.
  • the homepage identifier may include 0, 1, and/or a maximum value in a current associated webpage.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • a microprocessor or digital signal processor may be used in practice to implement some or all of the functions of some or all of the components of the device for calculating the associated webpage URL pattern in accordance with an embodiment of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • FIG. 10 illustrates a computing device, such as a user terminal device or an application server, that can implement the calculation of an associated web page URL pattern pattern in accordance with the present invention.
  • the computing device conventionally includes a processor 1010 and a computer program product or computer readable medium in the form of a memory 1020.
  • the memory 1020 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • the memory 1020 has a storage space 1030 for program code 1031 for performing any of the above method steps.
  • storage space 1030 for program code may include various program code 1031 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
  • the storage unit may have storage segments, storage spaces, and the like that are similarly arranged to memory 1020 in the computing device of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 1031', i.e., code that can be read by a processor such as 1010, which, when executed by a computing device, causes the computing device to perform each step of the methods described above.

Abstract

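A method and apparatus for calculating an associated webpage URL pattern, the method comprising: determining whether the page elements of a specified webpage contain a page turning feature anchor; if so, extracting the associated URL to which the page turning feature anchor links; and calculating, from the URL of the specified webpage and the associated URL to which the page turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage. Identifying associated webpages by page turning feature anchors gives high recognition accuracy, and calculating the associated webpage URL pattern from the URL of the specified webpage and the associated URL is computationally efficient.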

Description

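Method and apparatus for calculating an associated webpage URL pattern
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a method for calculating an associated webpage URL pattern and an apparatus for calculating an associated webpage URL pattern.
Background
With the development of the Internet, more and more information is presented on the Internet as webpages for users to query, and querying data on the Internet through search engines has likewise become the most commonly used way of searching for data.
When indexing webpages, a search engine needs to adopt different scheduling strategies for different kinds of webpages, so identifying the kind of a webpage is a basic task, and identifying page turning webpages is a particularly important part of it. Page turning means viewing the previous page, the next page, or any other existing page of a paginated document that is not the current page. Page turning changes the content shown in a physical book or a mobile Web form so that different content can be viewed. When used on the Internet, the mechanism also presents user interface elements for browsing to other pages.
The existing method of identifying page turning webpages determines whether a webpage is an index page according to the keywords contained in its URL (Uniform Resource Locator). For example, when a URL contains a keyword such as page, pn, or p followed by a number, the webpage corresponding to that URL is judged to be a page turning webpage.
However, this identification method has a low recall rate, because the page turning URLs of many websites contain none of these keywords, for example "http://cq.ABC.com/lvshi/o12/", "http://bbs.BCA.com/t661_10", and "http://china.BCD.com/product/20110617/2647", yet these webpages are still page turning webpages. Such identification methods are therefore prone to misoperation and have low practicality.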
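Summary of the Invention
In view of the above problems, the present invention is proposed to provide a method for calculating an associated webpage URL pattern, and a corresponding apparatus for calculating an associated webpage URL pattern, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for calculating an associated webpage URL pattern is provided, comprising:
determining whether the page elements of a specified webpage contain a page turning feature anchor; if so, extracting the associated URL to which the page turning feature anchor links;
calculating, according to the URL of the specified webpage and the associated URL to which the page turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage.
According to another aspect of the present invention, a method for identifying the page number identifier in a webpage URL is provided, comprising:
obtaining the associated URL to which a page turning feature anchor in the page elements of a specified webpage links;
calculating an associated webpage URL pattern according to the URL of the specified webpage and the associated URL;
determining, based on the associated webpage URL pattern corresponding to the specified webpage, the page number feature part of the URL of the specified webpage and the page number feature part of the associated URL, respectively;
comparing the page number feature parts of the URL of the specified webpage and of the associated URL, and extracting the differing numeric part as the page number identifier of the URL of the specified webpage.
According to another aspect of the present invention, a method for establishing an associated webpage database is provided, comprising:
determining whether a crawled webpage includes an associated webpage URL pattern; if so, obtaining the associated webpage URL pattern;
obtaining the corresponding associated webpages based on the associated webpage URL pattern;
establishing an associated webpage database using the associated webpages corresponding to the associated webpage URL pattern.
According to another aspect of the present invention, an associated webpage search method is provided, comprising:
receiving a search request, the request including a search keyword;
performing a lookup in a preset associated webpage database according to the search keyword to obtain a webpage matching the keyword;
determining whether the webpage is an associated webpage; if so, returning the webpage and the homepage information associated with the webpage.
According to another aspect of the present invention, an apparatus for calculating an associated webpage URL pattern is provided, comprising:
a page turning feature anchor determining module, adapted to determine whether the page elements of a specified webpage contain a page turning feature anchor and, if so, to invoke the associated URL extraction module;
a URL extraction module, adapted to extract the associated URL to which the page turning feature anchor links;
an associated webpage URL pattern calculation module, adapted to calculate, according to the URL of the specified webpage and the associated URL to which the page turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage.
According to yet another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the method for calculating an associated webpage URL pattern according to any one of claims 1-8.
According to still another aspect of the present invention, a computer readable medium is provided, in which the computer program according to claim 23 is stored.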
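The beneficial effects of the present invention are as follows:
The present invention identifies associated webpages by page turning feature anchors, which gives high recognition accuracy, and calculates the associated webpage URL pattern from the URL of the specified webpage and the associated URL, which is computationally efficient.
The present invention replaces digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix, and when the two are the same, uses the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern. Matching on the common part of the URLs further improves the recognition accuracy for associated webpages and greatly improves the recall rate; in practical applications, more than 90% of associated webpages can be identified.
The present invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage; similarly, the page turning block can be replaced with the identifiers of other linked webpages to obtain the URLs of other associated webpages, thereby increasing the coverage of associated webpages, so that a more comprehensive set of associated webpages can be obtained and fine-grained operations can be achieved.
The present invention extracts the associated webpage URL pattern based on the currently crawled webpage and establishes the associated webpage database using the associated webpages corresponding to that pattern, which avoids repeated crawling of webpages, reduces the occupation of system resources, and greatly improves the efficiency of establishing the database.
When the webpage matching the keyword is determined to be an associated webpage, the present invention returns the webpage together with the homepage information associated with it, which spares the user from repeating the search or looking for the homepage, further reduces system operations and the occupation of system resources, and improves search efficiency.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features, and advantages of the present invention are more readily apparent, specific embodiments of the present invention are set out below.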
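Brief Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 schematically shows a flow chart of the steps of embodiment 1 of a method for calculating an associated webpage URL pattern according to an embodiment of the present invention;
FIG. 2 schematically shows an example of a webpage structure according to an embodiment of the present invention;
FIG. 3 schematically shows an example of a page turning block according to an embodiment of the present invention;
FIG. 4 schematically shows a flow chart of the steps of embodiment 2 of a method for calculating an associated webpage URL pattern according to an embodiment of the present invention;
FIG. 5 schematically shows a flow chart of the steps of an embodiment of a method for identifying the page number identifier in a webpage URL according to an embodiment of the present invention;
FIG. 6 schematically shows a flow chart of the steps of an embodiment of a method for establishing an associated webpage database according to an embodiment of the present invention;
FIG. 7 schematically shows a flow chart of the steps of an embodiment of an associated webpage search method according to an embodiment of the present invention;
FIG. 8 schematically shows a structural block diagram of embodiment 1 of an apparatus for calculating an associated webpage URL pattern according to an embodiment of the present invention;
FIG. 9 schematically shows a structural block diagram of embodiment 2 of an apparatus for calculating an associated webpage URL pattern according to an embodiment of the present invention;
FIG. 10 schematically shows a block diagram of a computing device for performing the method according to the present invention; and
FIG. 11 schematically shows a storage unit for holding or carrying program code implementing the method according to the present invention.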
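Detailed Description
The present invention is further described below with reference to the drawings and specific embodiments.
Referring to FIG. 1, a flow chart of the steps of embodiment 1 of a method for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically include the following steps:
Step 101: determining whether the page elements of the specified webpage contain a page turning feature anchor; if so, performing step 102;
A webpage can be divided by function into multiple regions. Taking a page of a forum (Bulletin Board System, BBS) as an example, as shown in FIG. 2, the page can be divided into a navigation block (1), junk blocks (2, 4), a page turning block (3), a title block (5), an author information block (6), a publication date block (7), and a text block (8). The navigation block may be located at the top of the page header or below the banner (the banner advertisement of the webpage) and points to the information sections of the site. A junk block may be a region containing page elements with very low relevance to the topic of the webpage, such as function buttons like "post" and "reply". The page turning block may be the region that indicates page turning. The title block may be the region containing the title of the webpage topic (for example, "安全浏览器聚集黑色星期四" shown in FIG. 2). The author information block is the region recording information about the author of the topic. The text block is the region recording the body text of the topic.
Referring to FIG. 3, an example of a page turning block according to an embodiment of the present invention is shown.
As shown in FIG. 3, the page turning block may consist mainly of page turning feature anchors. A page turning feature anchor is a page turning feature string, which may be a page element used to identify page turning.
In a specific implementation, the page turning feature anchor may include one or more of the following:
[<<], [>>], [<  <], [>  >], [《], [》], [>], [<], [下一页], [上一页], [上一], [下一], [next], [末页], [尾页], [前页], [后页], [<上一页], [<上一], [下一>], [下一页>], [1...].
Of course, the above page turning feature anchors are only examples; when implementing the embodiments of the present invention, other page turning feature anchors may be set according to actual conditions, which is not limited by the embodiments of the present invention.
In a preferred embodiment of the present invention, step 101 may specifically include the following sub-steps:
Sub-step S11: matching page turning feature anchors against the DOM tree nodes of the current webpage;
Sub-step S12: when the matching succeeds, determining that the current webpage has a page turning feature anchor.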
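DOM (Document Object Model) is the standard programming interface for processing extensible markup languages. The DOM provides a platform- and language-independent way to access and modify the content and structure of a document, and is a common way of representing and processing an HTML (Hypertext Markup Language) or XML (eXtensible Markup Language) document.
The DOM is in fact a document model described in an object-oriented manner. It defines the objects needed to represent and modify a document, the behaviors and properties of those objects, and the relationships among them. The DOM can be regarded as a tree representation of the data and structure of a page, although the page is of course not necessarily implemented as such a tree.
With JavaScript, the entire HTML document can be restructured: items on the page can be added, removed, changed, or rearranged.
To change something on the page, JavaScript needs an entry point giving access to all elements in the HTML document. This entry point, together with the methods and properties for adding, moving, changing, or removing HTML elements, is obtained through the Document Object Model (DOM).
An HTML document can be viewed as a tree structure, called the node tree (HTML DOM). Through the HTML DOM, all nodes in the tree can be accessed by JavaScript. All HTML elements (nodes) can be modified, and nodes can be created or deleted.
The nodes in the node tree have hierarchical relationships with one another, described using terms such as parent, child, and sibling. A parent node has child nodes. Child nodes at the same level are called siblings. The top node of the node tree is called the root. Every node has a parent except the root, which has none. A node may have any number of children, and siblings are nodes that share the same parent.
Specifically, the webpage element to be operated on can be found in the node tree in several ways:
For example, by using the getElementById() and getElementsByTagName() methods.
As another example, by using the parentNode, firstChild, and lastChild properties of an element node.
The getElementById() and getElementsByTagName() methods can find any HTML element in the entire HTML document, and both ignore the structure of the document. If all <p> elements in the document are searched for, getElementsByTagName() finds them all, regardless of the level at which each <p> element sits in the document. Likewise, getElementById() returns the correct element wherever it is hidden in the document structure. These two methods provide any HTML element needed, regardless of its position in the document.
In addition, getElementById() returns the webpage element with the specified ID.
In a specific implementation, it is possible to check whether a hyperlink <a> (anchor) tag in the DOM tree of the webpage's HTML text includes one or more of [<<], [>>], [<  <], [>  >], [《], [》], [>], [<], [下一页], [上一页], [上一], [下一], [next], [末页], [尾页], [前页], [后页], [<上一页], [<上一], [下一>], [下一页>], [1...]; if so, the current webpage is determined to have a page turning feature anchor.
The <a> tag can be used to link the text or image at the current position to another page, text, image, and so on.
The basic syntax of the <a> tag may be as follows: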
<a
class=type
id=value
href=reference
name=value
rel=same|next|parent|previous
rev=value
target=window
style=value
title=title
onclick=function
onmouseout=function
onMouseOver=function>code for the displayed text or image</a>
For example, the content of the <a> tags in one piece of HTML text is as follows:
<div id="pgt" class="bm bw0 pgs cl">
<span id="fd_page_top">
<div class="pg">
<a href="forum-99-1.html" class="prev"></a>
<a href="forum-99-1.html">1</a><strong>2</strong>
<a href="forum-99-3.html">3</a>
<a href="forum-99-4.html">4</a>
<a href="forum-99-5.html">5</a>
<a href="forum-99-6.html">6</a>
<a href="forum-99-7.html">7</a>
<a href="forum-99-8.html">8</a>
<a href="forum-99-9.html">9</a>
<a href="forum-99-10.html">10</a>
<a href="forum-99-1000.html" class="last">...2107</a>
<label>
<input type="text" name="custompage" class="px" size="2" title="输入页码,按回车快速跳转" value="2" onkeydown="if(event.keyCode==13){window.location='forum.php?mod=forumdisplay&fid=99&page='+this.value;doane(event);}"/>
<span title="共1000页">/1000页</span>
</label>
<a href="forum-99-3.html" class="nxt">下一页</a>
</div>
</span>
</div>
By matching against the <a> tags in the HTML text, it can be determined that the webpage has one or more page turning feature anchors.
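The anchor matching described above can be sketched in Python with the standard library's `html.parser`. This is only an illustration under stated assumptions: the marker set is a small subset of the list in the description, the match is an exact comparison of the stripped link text (the description only says the tag "includes" a marker), and the names `PAGING_MARKERS`, `PagingAnchorFinder`, and `find_paging_links` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed subset of the page turning feature anchors listed in the description.
PAGING_MARKERS = {"<<", ">>", "<", ">", "《", "》", "下一页", "上一页",
                  "next", "末页", "尾页"}


class PagingAnchorFinder(HTMLParser):
    """Collects the href of every <a> whose visible text is a paging marker."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>
        self.links = []     # associated URLs found so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() in PAGING_MARKERS:
                self.links.append(self._href)
            self._href = None


def find_paging_links(html):
    """Return the URLs linked by page turning feature anchors in `html`."""
    finder = PagingAnchorFinder()
    finder.feed(html)
    finder.close()
    return finder.links
```

An empty result corresponds to the "no page turning feature anchor" branch of step 101; a non-empty result supplies the associated URLs extracted in step 102.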
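Step 102: extracting the associated URL (Uniform Resource Locator) to which the page turning feature anchor links;
In practical applications, a page turning feature anchor may link to one or more associated URLs.
Specifically, after the one or more page turning feature anchors are identified, the one or more associated URLs they link to are extracted; these associated URLs point to other page turning webpages associated with the current webpage.
Step 103: calculating, according to the URL of the specified webpage and the associated URL to which the page turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage.
The associated webpage URL pattern may be a set formed by grouping together URLs/webpages that look alike or function alike.
In a preferred embodiment of the present invention, step 103 may specifically include the following sub-steps:
Sub-step S21: replacing the digital blocks in the URL of the specified webpage with a wildcard character to obtain a first feature URL prefix; wherein a digital block is a single digit or a plurality of digits delimited by interval identifiers;
Sub-step S31: replacing the digital blocks in the associated URL with a wildcard character to obtain a second feature URL prefix;
It should be noted that the wildcard character may be any character, which is not limited by the embodiments of the present invention. An interval identifier may be a symbol used as a separator in a URL, such as "/", ".", "-", "?", ":", and so on. A digital block must consist of consecutive digits between interval identifiers; for example, "123ABC" is not a digital block.
In a preferred example of the embodiment of the present invention, sub-step S21 may further include the following sub-step:
Sub-step S211: replacing the digital blocks at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;
Corresponding to sub-step S211, sub-step S31 may further include the following sub-step:
Sub-step S311: replacing the digital blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
In a specific implementation, the URL of the specified webpage and the associated URL may each have one or more digital blocks; to reduce the number of replacement operations and the occupation of system resources, the digital blocks can all be replaced with the same wildcard character.
For example, the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html, in which "99" and "2" are identified as digital blocks. Taking "(\d+)" as an example of the wildcard character, the first feature URL prefix may be http://bbs.XXX.com/forum-(\d+)-(\d+).html and the second feature URL prefix may be http://bbs.XXX.com/forum-(\d+)-(\d+).html.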
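A minimal Python sketch of sub-steps S211/S311 follows. It assumes the interval identifiers are exactly the five symbols named above ("/", ".", "-", "?", ":"); the names `DIGIT_BLOCK` and `feature_prefix` are hypothetical and not taken from the patent.

```python
import re

# A digital block: a run of digits whose neighbours on both sides are
# interval identifiers (assumed here to be / . - ? :) or the string ends.
DIGIT_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def feature_prefix(url, wildcard=r"(\d+)"):
    """Replace every digital block in `url` with the same wildcard."""
    # A function is used as the replacement so that the backslash in the
    # wildcard is inserted literally instead of being parsed as an escape.
    return DIGIT_BLOCK.sub(lambda match: wildcard, url)
```

Applied to the two example URLs above, both calls yield the same string, which is what lets the later comparison in sub-step S41 succeed; a segment such as "123ABC" is left untouched because it is not delimited by interval identifiers on both sides.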
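In one embodiment of the present invention, sub-step S21 may further include the following sub-step:
Sub-step S212: replacing the digital blocks at different positions in the URL of the specified webpage with different replacement characters, respectively, to obtain the first feature URL prefix;
Corresponding to sub-step S212, sub-step S31 may further include the following sub-step:
Sub-step S312: replacing the digital block at each position in the associated URL with the same wildcard character that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
In a specific implementation, the URL of the specified webpage and the associated URL may each have one or more digital blocks; to make it more efficient to subsequently judge whether the first and second feature URL prefixes are the same and to identify the digital blocks, different wildcard characters can be used to replace the digital blocks.
For example, the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html, in which "99" and "2" are identified as digital blocks. Taking "(\d+)" and "(\e+)" as examples of wildcard characters, the first feature URL prefix may be http://bbs.XXX.com/forum-(\d+)-(\e+).html and the second feature URL prefix may be http://bbs.XXX.com/forum-(\d+)-(\e+).html.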
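The position-specific variant (sub-steps S212/S312) can be sketched the same way. The placeholder naming scheme below, which continues the "(\d+)", "(\e+)" example alphabetically, is an assumption made for illustration, as are the names `DIGIT_BLOCK` and `positional_prefix`; the scheme only works for URLs with at most 23 digital blocks.

```python
import re
from itertools import count

# Digits delimited by the assumed interval identifiers / . - ? :
DIGIT_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def positional_prefix(url):
    """Replace the n-th digital block with the n-th placeholder.

    Placeholders are (\\d+), (\\e+), (\\f+), ... in order of position.
    """
    index = count()

    def placeholder(match):
        letter = chr(ord("d") + next(index))
        return "(\\" + letter + "+)"

    return DIGIT_BLOCK.sub(placeholder, url)
```

Because each position keeps its own placeholder, a later step can refer to one particular block (for instance the page turning block) by its placeholder alone.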
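Sub-step S41: when the first feature URL prefix is the same as the second feature URL prefix, using the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern.
In practical applications, when the first feature URL prefix is the same as the second feature URL prefix, the specified webpage and the webpage corresponding to the associated URL can be determined to be associated page turning webpages.
Because the first feature URL prefix and the second feature URL prefix are the same, either of them may be used as the associated webpage URL pattern.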
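Sub-step S41 then reduces to a string comparison. The sketch below repeats the assumed digital block expression so that it runs on its own; `associated_url_pattern` is a hypothetical name, and returning `None` for non-matching URLs is a design choice of this sketch rather than something the description specifies.

```python
import re

# Digits delimited by the assumed interval identifiers / . - ? :
DIGIT_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def feature_prefix(url, wildcard=r"(\d+)"):
    """Replace every digital block in `url` with the wildcard."""
    return DIGIT_BLOCK.sub(lambda match: wildcard, url)


def associated_url_pattern(page_url, linked_url):
    """Return the shared pattern, or None if the two prefixes differ."""
    first = feature_prefix(page_url)     # first feature URL prefix
    second = feature_prefix(linked_url)  # second feature URL prefix
    return first if first == second else None
```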
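The present invention identifies associated webpages by page turning feature anchors, which gives high recognition accuracy, and calculates the associated webpage URL pattern from the URL of the specified webpage and the associated URL, which is computationally efficient.
The present invention replaces digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix, and when the two are the same, uses the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern. Matching on the common part of the URLs further improves the recognition accuracy for associated webpages and greatly improves the recall rate; in practical applications, more than 90% of associated webpages can be identified.
Referring to FIG. 4, a flow chart of the steps of embodiment 2 of a method for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically include the following steps:
Step 401: determining whether the page elements of the specified webpage contain a page turning feature anchor; if so, performing step 402;
Step 402: extracting the associated URL to which the page turning feature anchor links;
Step 403: calculating, according to the URL of the specified webpage and the associated URL to which the page turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage;
Step 404: by performing structural analysis on the common part of the associated webpage URL pattern, extracting the page turning block in the associated webpage URL pattern, and replacing the page turning block with the homepage identifier to obtain the URL of the homepage associated webpage;
wherein the page turning block is a digital block that has the same position but different numbers in a plurality of associated webpage URL patterns.
In practical applications, a URL may include one or more of the following components:
1. protocol: specifies the transfer protocol used. The most commonly used is HTTP, which is also the most widely used protocol on the WWW at present. Specifically, transfer protocols include the file protocol (the resource is a file on the local computer, format file:///), the ftp protocol (the resource is accessed via FTP, format FTP://), gopher (the resource is accessed via the Gopher protocol), the http protocol (the resource is accessed via HTTP, format HTTP://), the https protocol (the resource is accessed via secure HTTPS, format HTTPS://), and so on.
2. hostname: the Domain Name System (DNS) host name or IP address of the server on which the resource is stored. Sometimes the user name and password needed to connect to the server may also be included before the host name (format username:password).
3. port: each transfer protocol has a default port number, for example 80 for http, and the default port of the scheme is used when the port is omitted. Sometimes, for security or other reasons, the port is redefined on the server, that is, a non-standard port number is used; in that case the port number cannot be omitted from the URL.
4. path: a string of zero or more segments separated by "/", generally used to represent a directory or file address on the host.
5. parameters: an optional item that can be used to specify special parameters.
6. query: can be used to pass parameters to dynamic webpages (such as webpages built with CGI, ISAPI, PHP/JSP/ASP/ASP.NET, and similar technologies). There may be multiple parameters, separated by "&", with the name and value of each parameter separated by "=".
7. fragment: can be used to specify a fragment within a network resource. For example, if a webpage contains several glossary entries, a fragment can be used to jump directly to one of them.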
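The seven components listed above map directly onto what Python's standard `urllib.parse.urlparse` returns. The helper name `url_components` and the example URL used with it are assumptions made for illustration only.

```python
from urllib.parse import urlparse, parse_qs


def url_components(url):
    """Split a URL into the seven components named in the list above."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "port": parts.port,               # None when the URL omits the port
        "path": parts.path,
        "parameters": parts.params,       # the ";..." part of the last path segment
        "query": parse_qs(parts.query),   # name -> list of values
        "fragment": parts.fragment,
    }
```

For pattern calculation, the path and the query are the components most likely to contain the digital blocks, although the description itself does not restrict the analysis to them.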
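In a specific implementation, structural analysis is performed on the common part of a plurality of associated webpage URL patterns to extract the page turning block in the associated webpage URL pattern, and the page turning block is then replaced with the homepage identifier to obtain the URL of the homepage associated webpage.
For example, for the associated webpage URL pattern of the above example, http://bbs.XXX.com/forum-(\d+)-(\e+).html, once (\e+) is identified as the page turning block and the page turning block is replaced with the homepage identifier, the URL of the homepage associated webpage is obtained: http://bbs.XXX.com/forum-99-1.html.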
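One way to read step 404 is sketched below: the page turning block is taken to be the single digital block whose value differs between the specified URL and the associated URL, and it is overwritten with the homepage identifier. The function name `homepage_url`, the default identifier "1", and the requirement that exactly one block differs are assumptions of this sketch.

```python
import re

# Digits delimited by the assumed interval identifiers / . - ? :
DIGIT_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def homepage_url(page_url, linked_url, homepage_id="1"):
    """Replace the page turning block of `page_url` with `homepage_id`.

    Returns None when the two URLs do not share a pattern or when the
    page turning block cannot be identified unambiguously.
    """
    # The two URLs must be identical once digital blocks are masked out.
    if DIGIT_BLOCK.sub("\x00", page_url) != DIGIT_BLOCK.sub("\x00", linked_url):
        return None
    blocks = list(DIGIT_BLOCK.finditer(page_url))
    linked_values = DIGIT_BLOCK.findall(linked_url)
    # Same position, different number: the page turning block.
    differing = [b for b, v in zip(blocks, linked_values) if b.group() != v]
    if len(differing) != 1:
        return None
    block = differing[0]
    return page_url[:block.start()] + homepage_id + page_url[block.end():]
```

Passing `homepage_id="0"`, or the largest page number seen, covers the other homepage conventions mentioned in the description.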
在本发明实施例的一种优选示例中,所述首页标识可以包括0、1和/或当前关联网页中的最大数值。
在具体实现中,关联网页中的首页关联网页一般会记载有重要的内容,例如图3所示的正文块,因此首页关联网页的重要性比较高,因此获知首页关联网页具有比较重要的意义。而不同的网站会采用不同的翻页结构,造成了首页关联网页的不同。例如,某些网站会采用第0页作为首页关联网页,某些网站会采用第1页作为首页关联网页,某些网站会采用最大页(例如图3所示的2100)作为首页关联网页,等等。
当然,上述首页关联网页只是作为示例,在实施本发明实施例时,可以根据实际情况将数字快替换为任一关联网页的标识获取对应的关联网页,本发明实施例对此不一一加以详述。
本发明将关联网页URL模式pattern的翻页块替换为首页标识获得首页关联网页的URL,同理,也可以将翻页块替换为其他挂链网页标识获得其他关联网页的URL,从而增加了关联网页的覆盖率,使得能够获取更加全面的关联网页,进而实现了细颗粒度的操作。
参照图5,示出了本发明一个实施例的一种识别网页URL中页码标识的方法实施例的步骤流程图,具体可以包括如下步骤:
步骤501,获取指定网页的页面元素中翻页特征anchor对应链接到的关联URL;
在本发明的一种优选实施例中,所述步骤501具体可以包括如下子步骤:
子步骤S51,使用翻页特征anchor在指定网页的DOM树节点中进行匹配;
子步骤S52,当匹配成功时,则从匹配成功的翻页特征anchor中获取关联URL。
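Step 502: calculate the associated web page URL pattern from the URL of the specified web page and the associated URL;
In a preferred embodiment of the present invention, step 502 may specifically include the following sub-steps:
Sub-step S61: replace the digit blocks in the URL of the specified web page with wildcard characters to obtain a first feature URL prefix, where a digit block is a single digit or a run of digits delimited by separators;
Sub-step S71: replace the digit blocks in the associated URL with wildcard characters to obtain a second feature URL prefix;
In a preferred example of this embodiment of the present invention, sub-step S61 may further include the following sub-step:
Sub-step S611: replace the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
Corresponding to sub-step S611, sub-step S71 may further include the following sub-step:
Sub-step S711: replace the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
In one embodiment of the present invention, sub-step S61 may further include the following sub-step:
Sub-step S612: replace the digit blocks at different positions in the URL of the specified web page with different wildcard characters to obtain the first feature URL prefix;
Corresponding to sub-step S612, sub-step S71 may further include the following sub-step:
Sub-step S712: replace the digit block at each position in the associated URL with the same wildcard character used at that position in the first feature URL prefix to obtain the second feature URL prefix.
Sub-step S81: when the first feature URL prefix is identical to the second feature URL prefix, take the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern.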
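The sub-steps above can be sketched as follows. The sketch makes two assumptions that go beyond the text: a digit block is approximated by the regular expression `\d+` (so digits embedded in a word are also replaced), and the wildcards `(\d+)` and `(\e+)` are treated as opaque tokens in the notation of the examples rather than as regular expressions. The function names are hypothetical.

```python
import re
from itertools import count

DIGIT_BLOCK = re.compile(r"\d+")


def feature_prefix(url, wildcards=(r"(\d+)",)):
    """Replace the i-th digit block with the i-th wildcard.

    The last wildcard is reused when there are more digit blocks than
    wildcards, so a one-element tuple gives the "same wildcard" variant.
    """
    position = count()

    def pick(_match):
        return wildcards[min(next(position), len(wildcards) - 1)]

    return DIGIT_BLOCK.sub(pick, url)


def url_pattern(page_url, associated_url, wildcards=(r"(\d+)",)):
    """Return the shared feature URL prefix, or None when the prefixes differ."""
    first = feature_prefix(page_url, wildcards)
    second = feature_prefix(associated_url, wildcards)
    return first if first == second else None


page = "http://bbs.XXX.com/forum-99-2.html"
linked = "http://bbs.XXX.com/forum-99-3.html"
same = url_pattern(page, linked)
distinct = url_pattern(page, linked, wildcards=(r"(\d+)", r"(\e+)"))
print(same)
print(distinct)
```

A callable is passed to `re.sub` instead of a replacement string because a replacement string containing `\d` would be rejected as a bad escape.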
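Step 503: based on the associated web page URL pattern corresponding to the specified web page, determine the page number feature part of the URL of the specified web page and the page number feature part of the associated URL respectively;
In a specific implementation, the page-turning block is extracted from the associated web page URL pattern by structurally analysing the common parts of multiple associated web page URL patterns, and the page-turning block is then replaced with a first-page identifier to obtain the URL of the first-page associated web page.
Structural analysis of the common parts of the associated web page URL pattern determines the page number feature part of the pattern, i.e. the page-turning block, which may specifically be the digit block that occupies the same position but holds different digits across multiple associated web page URL patterns.
Step 504: compare the page number feature parts of the URL of the specified web page and of the associated URL, and extract the digit part that differs as the page number identifier of the URL of the specified web page.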
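Steps 503 and 504 can be illustrated with the short sketch below. It assumes that digit blocks are matched by `\d+`, that the two URLs differ in exactly one digit block, and that the hypothetical function `page_number_identifiers` is an acceptable stand-in for the comparison the text describes.

```python
import re

DIGIT_BLOCK = re.compile(r"\d+")


def page_number_identifiers(page_url, associated_url):
    """Return (page id of page_url, page id of associated_url), or None.

    None is returned when the URLs do not share one pattern or when they do
    not differ in exactly one digit block.
    """
    if DIGIT_BLOCK.sub("{n}", page_url) != DIGIT_BLOCK.sub("{n}", associated_url):
        return None
    pairs = zip(DIGIT_BLOCK.findall(page_url), DIGIT_BLOCK.findall(associated_url))
    differing = [(a, b) for a, b in pairs if a != b]
    if len(differing) != 1:
        return None
    own, other = differing[0]
    return int(own), int(other)


ids = page_number_identifiers(
    "http://bbs.XXX.com/forum-99-2.html",
    "http://bbs.XXX.com/forum-99-3.html",
)
print(ids)
```

The block that stays constant (99 in this example) is treated as part of the series identity, and only the varying block is reported as the page number.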
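In a specific implementation, the page number identifier may include a first-page identifier, and the first-page identifier may include 0, 1 and/or the largest value among the current associated web pages.
After the page-turning block has been extracted from the associated web page URL pattern, it can be replaced with the first-page identifier to obtain the URL of the first-page associated web page.
For example, for the associated web page URL pattern of the example above, http://bbs.XXX.com/forum-(\d+)-(\e+).html, once (\e+) has been identified as the page-turning block and replaced with the first-page identifier, the URL of the first-page associated web page is obtained: http://bbs.XXX.com/forum-99-1.html.
In a preferred example of this embodiment of the present invention, the first-page identifier may include 0, 1 and/or the largest value among the current associated web pages.
In a specific implementation, the first-page associated web page generally carries important content, such as the body block shown in Fig. 3, so it is relatively important and knowing which page it is has considerable value. Different websites use different page-turning structures, however, so the first-page associated web page differs from site to site. For example, some websites use page 0 as the first-page associated web page, some use page 1, some use the largest page (for example, 2100 as shown in Fig. 3), and so on.
Of course, the first-page associated web page above is only an example. When implementing this embodiment of the present invention, the digit block may be replaced with the identifier of any associated web page, as the situation requires, to obtain the corresponding associated web page; the embodiments of the present invention do not describe each case in detail.
The present invention uses page-turning feature anchors to identify associated web pages, which gives high identification accuracy; it calculates the associated web page URL pattern from the URL of the specified web page and the associated URL, which is computationally efficient; and it compares the common parts of the URLs, which greatly improves recall. In practical applications more than 90% of associated web pages can be identified.
The present invention replaces digit blocks with wildcard characters to obtain the first feature URL prefix and the second feature URL prefix, and when the two prefixes are identical, takes the first or the second feature URL prefix as the associated web page URL pattern. Matching on the common parts of the URLs further improves the accuracy with which associated web pages are identified.
The present invention replaces the page-turning block of the associated web page URL pattern with the first-page identifier to obtain the URL of the first-page associated web page. In the same way, the page-turning block can be replaced with the identifier of another associated web page to obtain that page's URL, which increases the coverage of associated web pages, allows a more complete set of associated web pages to be obtained, and thereby enables fine-grained operation.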
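Referring to Fig. 6, a flowchart of the steps of an embodiment of a method for establishing an associated web page database according to one embodiment of the present invention is shown. The method may specifically include the following steps:
Step 601: determine whether a crawled web page includes an associated web page URL pattern; if so, perform step 602;
It should be noted that the search engine's function of automatically extracting web pages from the World Wide Web may be implemented by a web crawler. A web crawler, also called a web spider, finds web pages by following link addresses: starting from one page of a website (usually the home page), it reads the page content, finds the other link addresses in the page, follows those addresses to the next pages, and keeps looping until every page of the website has been crawled. If the whole Internet is regarded as one website, a web spider can use this principle to crawl every web page on the Internet.
The associated web page URL pattern may be the common part (pattern) of paginated web pages, i.e. the set formed by grouping together URLs/web pages that look alike or serve a similar function.
In a preferred embodiment of the present invention, step 601 may specifically include the following sub-steps:
Sub-step S91: determine whether the page elements of the current web page contain a page-turning feature string; if so, extract the URL that the page-turning feature string links to;
Referring to Fig. 3, an example diagram of a page-turning block according to one embodiment of the present invention is shown.
As shown in Fig. 3, the page-turning block may consist mainly of page-turning feature strings (i.e. page-turning feature anchors), and a page-turning feature string may be a page element used to mark page turning.
In a specific implementation, the page-turning feature strings may include one or more of the following:
[<<]、[>>]、[<  <]、[>  >]、[《]、[》]、[>]、[<]、[下一页]、[上一页]、[上一]、[下一]、[next]、[末页]、[尾页]、[前页]、[后页]、[<上一页]、[<上一]、[下一>]、[下一页>]、[1...]。
Of course, the page-turning feature strings above are only examples. When implementing this embodiment of the present invention, other page-turning feature strings may be configured as the situation requires; the embodiments of the present invention place no limitation on this.
It should be noted that the current web page may be a web page that has been crawled.
In a preferred embodiment of the present invention, sub-step S91 may further include the following sub-steps:
Sub-step S911: match page-turning feature strings against the DOM tree nodes of the current web page;
Sub-step S912: when the match succeeds, determine that the current web page contains a page-turning feature string.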
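Sub-step S92: replace the digit blocks in the URL of the current web page with preset replacement characters to obtain a first feature URL prefix, where a digit block is a single digit or a run of digits delimited by separators;
Sub-step S93: replace the digit blocks in the URL linked by the page-turning feature string with preset replacement characters to obtain a second feature URL prefix;
In one embodiment of the present invention, sub-step S92 may further include the following sub-step:
Sub-step S921: replace the digit blocks at different positions in the URL of the current web page with the same replacement character to obtain the first feature URL prefix;
Corresponding to sub-step S921, sub-step S93 may further include the following sub-step:
Sub-step S931: replace the digit blocks at different positions in the URL linked by the feature string with the same replacement character to obtain the second feature URL prefix.
In one embodiment of the present invention, sub-step S92 may further include the following sub-step:
Sub-step S922: replace the digit blocks at different positions in the URL of the current web page with different replacement characters to obtain the first feature URL prefix;
Corresponding to sub-step S922, sub-step S93 may further include the following sub-step:
Sub-step S932: replace the digit block at each position in the URL linked by the feature string with the same replacement character used at that position in the first feature URL prefix to obtain the second feature URL prefix.
Sub-step S94: when the first feature URL prefix is identical to the second feature URL prefix, determine that the crawled web page includes an associated web page URL pattern.
Step 602: obtain the associated web page URL pattern;
In one embodiment of the present invention, step 602 may specifically include the following sub-step:
Sub-step S101: take the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern corresponding to the current web page.
When the page elements of the current web page contain a page-turning feature string, the present invention replaces the digit blocks in the URL of the current web page with preset replacement characters to obtain the first feature URL prefix, and replaces the digit blocks in the URL linked by the page-turning feature string with preset replacement characters to obtain the second feature URL prefix. When the two prefixes are identical, the first or the second feature URL prefix is taken as the associated web page URL pattern corresponding to the current web page. Using page-turning feature strings to identify associated web pages gives high identification accuracy, and matching on the common parts of the URLs improves that accuracy further, so recall rises substantially; in practical applications more than 90% of associated web pages can be identified.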
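Step 603: obtain the corresponding associated web pages based on the associated web page URL pattern;
In a specific implementation, the associated web pages may include a first-page associated web page and other associated web pages. The first-page associated web page generally carries important content, such as the body block shown in Fig. 3, so it is relatively important and knowing which page it is has considerable value.
In a preferred embodiment of the present invention, step 603 may specifically include the following sub-steps:
Sub-step S111: extract the page-turning block from the associated web page URL pattern by structurally analysing the common parts of the pattern, and replace the page-turning block with a first-page identifier to obtain the URL of the first-page associated web page, where the page-turning block is a digit block that occupies the same position but holds different digits across multiple associated web page URL patterns;
Sub-step S112: access the URL of the first-page associated web page to obtain the first-page associated web page.
In a preferred example of this embodiment of the present invention, the first-page identifier may include 0, 1 and/or the largest value among the current associated web pages.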
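Because the first-page identifier may be 0, 1 or the largest page number, one possible approach is to generate all three candidate URLs and let the fetch in sub-step S112 decide which one exists. The sketch below only builds the candidates; the function name, the `\d+` approximation of a digit block and the idea of trying the candidates in turn are assumptions, and no network access is performed.

```python
import re

DIGIT_BLOCK = re.compile(r"\d+")


def first_page_candidates(url, turning_index, max_page):
    """Build candidate first-page URLs using identifiers 0, 1 and max_page.

    turning_index is the position of the page-turning block among the digit
    blocks of url; duplicates are dropped while preserving order.
    """
    match = list(DIGIT_BLOCK.finditer(url))[turning_index]
    candidates = []
    for identifier in (0, 1, max_page):
        candidate = url[:match.start()] + str(identifier) + url[match.end():]
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates


candidates = first_page_candidates("http://bbs.XXX.com/forum-99-7.html", 1, 2100)
for candidate in candidates:
    print(candidate)
```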
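The present invention replaces the page-turning block of the associated web page URL pattern with the first-page identifier to obtain the URL of the first-page associated web page. In the same way, the page-turning block can be replaced with the identifier of another associated web page to obtain that page's URL, which increases the coverage of associated web pages, allows a more complete set of associated web pages to be obtained, and thereby enables fine-grained operation.
Step 604: establish the associated web page database from the associated web pages corresponding to the associated web page URL pattern.
In a specific implementation, the associated web pages corresponding to the associated web page URL pattern may include the first-page associated web page and other associated web pages, and may be all of the associated web pages or only some of them; the embodiments of the present invention place no limitation on this.
As a preferred example, the web page files crawled by the spider may undergo data processing, which may specifically include:
1. Web page structuring: the HTML code of the associated web page is removed and the page content is extracted.
2. Denoising: structuring has already removed the HTML code and left the page content, so denoising means keeping the main content of the page and discarding useless content such as copyright notices.
3. Duplicate checking: duplicate pages and content are looked up, and any duplicate page that is found is deleted.
4. Word segmentation: the page content is extracted, split into N words, listed and stored in the index library, and the number of times each word appears in the page is counted.
5. Link analysis: the page's backlinks, its number of outbound links and its internal links are examined, and the page is then assigned a weight accordingly.
After the data processing above has been carried out, the processed data can be stored in the associated web page database.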
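The first four processing steps can be sketched with the standard library as shown below. This is an illustrative simplification: the noise markers, the whitespace-style tokenisation (a real Chinese-language index would need a word segmenter), the hash-based duplicate check and all function names are assumptions, and link analysis is omitted.

```python
import hashlib
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def structure(html):
    """Step 1: drop the HTML markup and keep the text chunks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def denoise(chunks, noise_markers=("copyright", "©")):
    """Step 2: discard chunks that contain an assumed noise marker."""
    return [c for c in chunks if not any(m in c.lower() for m in noise_markers)]


def build_index(pages):
    """Steps 3 and 4: skip duplicate content, then count terms per page."""
    seen, index = set(), {}
    for url, html in pages.items():
        text = " ".join(denoise(structure(html)))
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate content, already indexed under another URL
        seen.add(digest)
        index[url] = Counter(re.findall(r"\w+", text.lower()))
    return index


pages = {
    "a": "<html><body><p>Page turning page</p><p>Copyright 2013</p></body></html>",
    "b": "<div>Page turning page</div>",
    "c": "<p>Other text</p>",
}
index = build_index(pages)
print(index)
```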
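The present invention extracts the associated web page URL pattern from the currently crawled web page and establishes the associated web page database from the associated web pages corresponding to that pattern, which avoids crawling web pages repeatedly, reduces the use of system resources, and greatly improves the efficiency with which the database is established.
Referring to Fig. 7, a flowchart of the steps of an embodiment of an associated web page search method according to one embodiment of the present invention is shown. The method may specifically include the following steps:
Step 701: receive a search request, the request including a search keyword;
A search request may be a request issued by a user to search for information related to a search keyword. For example, the user may type a search keyword into the browser address bar, a search bar, or the keyword input box of a search engine and press Enter or click the search button, which amounts to the user's search request being received.
Step 702: look up the search keyword in a preset associated web page database to obtain the web pages that match the keyword;
An associated web page database is preset in the back end of the search engine and stores the information collected about associated web pages. The collected information is generally keywords or phrases that indicate the content of an associated web page (including the page itself, the URL of the page, the code that makes up the page, and the links into and out of the page).
As a preferred example, the search keyword entered by the user may first be split into a keyword sequence, denoted q, so that the user's search keyword q is split into q={q1,q2,q3,......,qn}. Then the importance of each word of the query in the display of the results is determined from the user's query mode (for example, whether all the words are run together or separated by spaces) and from the parts of speech of the different keywords in q. Once the keyword set q has been obtained, the URL ranking corresponding to each keyword in q is available from the index library, and the importance of each keyword in the display of the results has been calculated from the query mode and parts of speech, so the search results can be obtained by applying a combined ranking algorithm to these.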
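The ranking idea in the preferred example can be sketched as a weighted term-frequency score. The text does not specify the ranking algorithm, so everything here is an assumption made for illustration: the query is split on whitespace, the index maps each URL to a term counter, each keyword qi has a weight that stands in for its "importance", and the function name `rank` is hypothetical.

```python
from collections import Counter


def rank(query, index, weights=None):
    """Order URLs by the weighted count of query keywords they contain.

    weights maps a keyword to its importance; keywords without an entry
    get a weight of 1.0. URLs that contain no keyword are left out.
    """
    keywords = query.lower().split()  # q = {q1, q2, ..., qn}
    weights = weights or {}
    scores = {}
    for url, counts in index.items():
        score = sum(weights.get(k, 1.0) * counts.get(k, 0) for k in keywords)
        if score > 0:
            scores[url] = score
    # Highest score first; ties are broken by URL for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))


index = {
    "u1": Counter({"url": 3, "pattern": 1}),
    "u2": Counter({"pattern": 2}),
    "u3": Counter({"other": 1}),
}
plain = rank("url pattern", index)
weighted = rank("url pattern", index, {"pattern": 5.0})
print(plain, weighted)
```

Raising the weight of one keyword changes the order, which mirrors the statement that the importance of each word influences how results are displayed.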
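In a preferred embodiment of the present invention, the associated web page database may be established as follows:
Sub-step S101: determine whether a crawled web page includes an associated web page URL pattern; if so, perform sub-step S102;
In a preferred embodiment of the present invention, sub-step S101 may specifically include the following sub-steps:
Sub-step S121: determine whether the page elements of the current web page contain a page-turning feature string; if so, extract the URL that the page-turning feature string links to;
In a preferred embodiment of the present invention, sub-step S121 may further include the following sub-steps:
Sub-step S1211: match page-turning feature strings against the DOM tree nodes of the current web page;
Sub-step S1212: when the match succeeds, determine that the current web page contains a page-turning feature string.
Sub-step S122: replace the digit blocks in the URL of the current web page with preset replacement characters to obtain a first feature URL prefix, where a digit block is a single digit or a run of digits delimited by separators;
Sub-step S123: replace the digit blocks in the URL linked by the page-turning feature string with preset replacement characters to obtain a second feature URL prefix;
In one embodiment of the present invention, sub-step S122 may further include the following sub-step:
Sub-step S1221: replace the digit blocks at different positions in the URL of the current web page with the same replacement character to obtain the first feature URL prefix;
Corresponding to sub-step S1221, sub-step S123 may further include the following sub-step:
Sub-step S1231: replace the digit blocks at different positions in the URL linked by the feature string with the same replacement character to obtain the second feature URL prefix.
In one embodiment of the present invention, sub-step S122 may further include the following sub-step:
Sub-step S1222: replace the digit blocks at different positions in the URL of the current web page with different replacement characters to obtain the first feature URL prefix;
Corresponding to sub-step S1222, sub-step S123 may further include the following sub-step:
Sub-step S1232: replace the digit block at each position in the URL linked by the feature string with the same replacement character used at that position in the first feature URL prefix to obtain the second feature URL prefix.
Sub-step S124: when the first feature URL prefix is identical to the second feature URL prefix, determine that the crawled web page includes an associated web page URL pattern.
Sub-step S102: obtain the associated web page URL pattern;
In one embodiment of the present invention, sub-step S102 may specifically include the following sub-step:
Sub-step S131: take the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern corresponding to the current web page.
When the page elements of the current web page contain a page-turning feature string, the present invention replaces the digit blocks in the URL of the current web page with preset replacement characters to obtain the first feature URL prefix, and replaces the digit blocks in the URL linked by the page-turning feature string with preset replacement characters to obtain the second feature URL prefix. When the two prefixes are identical, the first or the second feature URL prefix is taken as the associated web page URL pattern corresponding to the current web page. Using page-turning feature strings to identify associated web pages gives high identification accuracy, and matching on the common parts of the URLs improves that accuracy further, so recall rises substantially; in practical applications more than 90% of associated web pages can be identified.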
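Sub-step S103: obtain the corresponding associated web pages using the associated web page URL pattern;
In a preferred embodiment of the present invention, sub-step S103 may specifically include the following sub-steps:
Sub-step S141: extract the page-turning block from the associated web page URL pattern by structurally analysing the common parts of the pattern, and replace the page-turning block with a first-page identifier to obtain the URL of the first-page associated web page, where the page-turning block is a digit block that occupies the same position but holds different digits across multiple associated web page URL patterns;
Sub-step S142: access the URL of the first-page associated web page to obtain the first-page associated web page.
The present invention replaces the page-turning block of the associated web page URL pattern with the first-page identifier to obtain the URL of the first-page associated web page. In the same way, the page-turning block can be replaced with the identifier of another associated web page to obtain that page's URL, which increases the coverage of associated web pages, allows a more complete set of associated web pages to be obtained, and thereby enables fine-grained operation.
Sub-step S104: establish the associated web page database from the associated web pages corresponding to the associated web page URL pattern.
Step 703: determine whether the web page is an associated web page; if so, perform step 704;
In a specific implementation, whether the web page is an associated web page can be determined by checking whether it includes an associated web page URL pattern: when the web page includes an associated web page URL pattern, it is determined to be an associated web page.
Step 704: return the web page and the information of the first page associated with the web page.
The embodiments of the present invention may store the correspondence between associated web page URL patterns and their web pages, so the first page associated with a web page can be obtained simply by looking up that correspondence for the web page.
Once the search results have been obtained, the search engine can display them on the interface the user is viewing for the user to use.
When the present invention determines that a web page matching the keyword is an associated web page, it returns that web page together with the information of its associated first page. This spares the user from repeating the search or hunting for the first page, further reduces the operations performed by the system, reduces the use of system resources, and improves search efficiency.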
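Steps 703 and 704 can be sketched as a lookup against stored patterns. The storage format is not specified in the text, so this sketch assumes a hypothetical registry that maps each associated web page URL pattern (written with the `(\d+)` wildcard) to the index of its page-turning block; the function names and the default first-page identifier "1" are likewise assumptions.

```python
import re

WILDCARD = r"(\d+)"


def pattern_to_regex(pattern):
    """Turn a stored pattern into a regex, escaping its literal parts."""
    literal_parts = pattern.split(WILDCARD)
    return re.compile(WILDCARD.join(re.escape(p) for p in literal_parts))


def attach_first_page(url, registry, first_identifier="1"):
    """Return the page plus its first-page URL when url is an associated page."""
    for pattern, turning_index in registry.items():
        match = pattern_to_regex(pattern).fullmatch(url)
        if match:
            # Group numbers are 1-based, digit-block indexes are 0-based.
            start, end = match.span(turning_index + 1)
            first_page = url[:start] + first_identifier + url[end:]
            return {"url": url, "first_page": first_page}
    return {"url": url, "first_page": None}


# Hypothetical registry: pattern -> index of its page-turning block.
registry = {r"http://bbs.XXX.com/forum-(\d+)-(\d+).html": 1}

hit = attach_first_page("http://bbs.XXX.com/forum-99-7.html", registry)
miss = attach_first_page("http://bbs.XXX.com/about.html", registry)
print(hit)
print(miss)
```

A result whose `first_page` is `None` corresponds to the case where the matched web page is not an associated web page and is returned on its own.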
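For simplicity of description, the method embodiments are each presented as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the order of actions described, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
Referring to Fig. 8, a structural block diagram of embodiment 1 of an apparatus for calculating an associated web page URL pattern according to one embodiment of the present invention is shown. The apparatus may specifically include the following modules:
A page-turning feature anchor determination module 801, adapted to determine whether the page elements of a specified web page contain a page-turning feature anchor and, if so, to invoke the associated URL extraction module 802;
An associated URL extraction module 802, adapted to extract the associated URL to which the page-turning feature anchor links;
An associated web page URL pattern calculation module 803, adapted to calculate the associated web page URL pattern corresponding to the specified web page from the URL of the specified web page and the associated URL to which the page-turning feature anchor links.
In a preferred embodiment of the present invention, the page-turning feature anchor determination module 801 may be further adapted to:
match page-turning feature anchors against the DOM tree nodes of the current web page;
when the match succeeds, determine that the current web page contains a page-turning feature anchor.
In a preferred embodiment of the present invention, the page-turning feature anchor may link to one or more associated URLs.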
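In a preferred embodiment of the present invention, the associated web page URL pattern calculation module 803 may specifically include the following modules:
A first feature URL prefix obtaining module, adapted to replace the digit blocks in the URL of the specified web page with wildcard characters to obtain a first feature URL prefix, where a digit block is a single digit or a run of digits delimited by separators;
A second feature URL prefix obtaining module, adapted to replace the digit blocks in the associated URL with wildcard characters to obtain a second feature URL prefix;
An associated web page URL pattern obtaining module, adapted to take the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern when the first feature URL prefix is identical to the second feature URL prefix.
In a preferred embodiment of the present invention, the first feature URL prefix obtaining module may be further adapted to:
replace the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
The second feature URL prefix obtaining module may be further adapted to:
replace the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
In a preferred embodiment of the present invention, the first feature URL prefix obtaining module may be further adapted to:
replace the digit blocks at different positions in the URL of the specified web page with different wildcard characters to obtain the first feature URL prefix;
The second feature URL prefix obtaining module may be further adapted to:
replace the digit block at each position in the associated URL with the same wildcard character used at that position in the first feature URL prefix to obtain the second feature URL prefix.
Since the apparatus embodiment of Fig. 8 is substantially similar to the method embodiment of Fig. 1, it is described only briefly; for related details, refer to the corresponding description of the method embodiment.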
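Referring to Fig. 9, a structural block diagram of embodiment 2 of an apparatus for calculating an associated web page URL pattern according to one embodiment of the present invention is shown. The apparatus may specifically include the following modules:
A page-turning feature anchor determination module 901, adapted to determine whether the page elements of a specified web page contain a page-turning feature anchor and, if so, to invoke the associated URL extraction module 902;
An associated URL extraction module 902, adapted to extract the associated URL to which the page-turning feature anchor links;
An associated web page URL pattern calculation module 903, adapted to calculate the associated web page URL pattern corresponding to the specified web page from the URL of the specified web page and the associated URL to which the page-turning feature anchor links;
A first-page associated web page URL obtaining module 904, adapted to extract the page-turning block from the associated web page URL pattern by structurally analysing the common parts of the pattern, and to replace the page-turning block with a first-page identifier to obtain the URL of the first-page associated web page, where the page-turning block is a digit block that occupies the same position but holds different digits across multiple associated web page URL patterns.
In a preferred example of this embodiment of the present invention, the first-page identifier may include 0, 1 and/or the largest value among the current associated web pages.
Since the apparatus embodiment of Fig. 9 is substantially similar to the method embodiment of Fig. 4, it is described only briefly; for related details, refer to the corresponding description of the method embodiment.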
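The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will appreciate that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for calculating an associated web page URL pattern according to embodiments of the present invention. The present invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, Fig. 10 shows a computing device, such as a user terminal device or an application server, that can implement the calculation of an associated web page URL pattern according to the present invention. The computing device conventionally includes a processor 1010 and a computer program product or computer-readable medium in the form of a memory 1020. The memory 1020 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk or ROM. The memory 1020 has a storage space 1030 for program code 1031 for performing any of the method steps of the methods described above. For example, the storage space 1030 for program code may include individual program codes 1031 for implementing the various steps of the methods above. These program codes may be read from or written to one or more computer program products. Such computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to Fig. 11. The storage unit may have storage segments, storage space and the like arranged similarly to the memory 1020 in the computing device of Fig. 10. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer-readable code 1031', i.e. code that can be read by a processor such as 1010, which, when run by a computing device, causes the computing device to perform the steps of the methods described above.
References herein to "one embodiment", "an embodiment" or "one or more embodiments" mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Note also that instances of the phrase "in one embodiment" here do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the description provided here. It will be understood, however, that embodiments of the present invention may be practised without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
It should be noted that the embodiments above illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the present invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With regard to the scope of the present invention, the disclosure made of the present invention is illustrative and not restrictive, and the scope of the present invention is defined by the appended claims.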

Claims (24)

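  1. A method for calculating an associated web page URL pattern, comprising:
    determining whether the page elements of a specified web page contain a page-turning feature anchor; if so, extracting the associated URL to which the page-turning feature anchor links;
    calculating the associated web page URL pattern corresponding to the specified web page from the URL of the specified web page and the associated URL to which the page-turning feature anchor links.
  2. The method according to claim 1, wherein the step of determining whether the page elements of the specified web page contain a page-turning feature anchor comprises:
    matching page-turning feature anchors against the DOM tree nodes of the current web page;
    when the match succeeds, determining that the current web page contains a page-turning feature anchor.
  3. The method according to claim 1, wherein the page-turning feature anchor links to one or more associated URLs.
  4. The method according to claim 1, 2 or 3, wherein the step of calculating the associated web page URL pattern from the URL of the specified web page and the associated URL further comprises:
    replacing the digit blocks in the URL of the specified web page with wildcard characters to obtain a first feature URL prefix, wherein a digit block is a single digit or a run of digits delimited by separators;
    replacing the digit blocks in the associated URL with wildcard characters to obtain a second feature URL prefix;
    when the first feature URL prefix is identical to the second feature URL prefix, taking the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern.
  5. The method according to claim 4, wherein the step of replacing the digit blocks in the URL of the specified web page with wildcard characters to obtain the first feature URL prefix is:
    replacing the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
    and the step of replacing the digit blocks in the associated URL with wildcard characters to obtain the second feature URL prefix is:
    replacing the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
  6. The method according to claim 5, wherein the step of replacing the digit blocks in the URL of the specified web page with wildcard characters to obtain the first feature URL prefix is:
    replacing the digit blocks at different positions in the URL of the specified web page with different wildcard characters to obtain the first feature URL prefix;
    and the step of replacing the digit blocks in the associated URL with wildcard characters to obtain the second feature URL prefix is:
    replacing the digit block at each position in the associated URL with the same wildcard character used at that position in the first feature URL prefix to obtain the second feature URL prefix.
  7. The method according to claim 1, 2, 3, 5 or 6, further comprising:
    extracting the page-turning block from the associated web page URL pattern by structurally analysing the common parts of the pattern, and replacing the page-turning block with a first-page identifier to obtain the URL of the first-page associated web page, wherein the page-turning block is a digit block that occupies the same position but holds different digits across multiple associated web page URL patterns.
  8. The method according to claim 7, wherein the first-page identifier includes 0, 1 and/or the largest value among the current associated web pages.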
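  9. A method for identifying the page number identifier in a web page URL, comprising:
    obtaining the associated URL to which a page-turning feature anchor in the page elements of a specified web page links;
    calculating the associated web page URL pattern from the URL of the specified web page and the associated URL;
    based on the associated web page URL pattern corresponding to the specified web page, determining the page number feature part of the URL of the specified web page and the page number feature part of the associated URL respectively;
    comparing the page number feature parts of the URL of the specified web page and of the associated URL, and extracting the digit part that differs as the page number identifier of the URL of the specified web page.
  10. A method for establishing an associated web page database, comprising:
    determining whether a crawled web page includes an associated web page URL pattern; if so, obtaining the associated web page URL pattern;
    obtaining the corresponding associated web pages based on the associated web page URL pattern;
    establishing the associated web page database from the associated web pages corresponding to the associated web page URL pattern.
  11. An associated web page search method, comprising:
    receiving a search request, the request including a search keyword;
    looking up the search keyword in a preset associated web page database to obtain the web pages that match the keyword;
    determining whether the web page is an associated web page; if so, returning the web page and the information of the first page associated with the web page.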
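  12. An apparatus for calculating an associated web page URL pattern, comprising:
    a page-turning feature anchor determination module, adapted to determine whether the page elements of a specified web page contain a page-turning feature anchor and, if so, to invoke the associated URL extraction module;
    an associated URL extraction module, adapted to extract the associated URL to which the page-turning feature anchor links;
    an associated web page URL pattern calculation module, adapted to calculate the associated web page URL pattern corresponding to the specified web page from the URL of the specified web page and the associated URL to which the page-turning feature anchor links.
  13. The apparatus according to claim 12, wherein the page-turning feature anchor determination module is further adapted to:
    match page-turning feature anchors against the DOM tree nodes of the current web page;
    when the match succeeds, determine that the current web page contains a page-turning feature anchor.
  14. The apparatus according to claim 12, wherein the page-turning feature anchor links to one or more associated URLs.
  15. The apparatus according to claim 12, 13 or 14, wherein the associated web page URL pattern calculation module comprises:
    a first feature URL prefix obtaining module, adapted to replace the digit blocks in the URL of the specified web page with wildcard characters to obtain a first feature URL prefix, wherein a digit block is a single digit or a run of digits delimited by separators;
    a second feature URL prefix obtaining module, adapted to replace the digit blocks in the associated URL with wildcard characters to obtain a second feature URL prefix;
    an associated web page URL pattern obtaining module, adapted to take the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern when the first feature URL prefix is identical to the second feature URL prefix.
  16. The apparatus according to claim 15, wherein the first feature URL prefix obtaining module is further adapted to:
    replace the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
    and the second feature URL prefix obtaining module is further adapted to:
    replace the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
  17. The apparatus according to claim 16, wherein the first feature URL prefix obtaining module is further adapted to:
    replace the digit blocks at different positions in the URL of the specified web page with different wildcard characters to obtain the first feature URL prefix;
    and the second feature URL prefix obtaining module is further adapted to:
    replace the digit block at each position in the associated URL with the same wildcard character used at that position in the first feature URL prefix to obtain the second feature URL prefix.
  18. The apparatus according to claim 12, 13, 14, 16 or 17, further comprising:
    a first-page associated web page URL obtaining module, adapted to extract the page-turning block from the associated web page URL pattern by structurally analysing the common parts of the pattern, and to replace the page-turning block with a first-page identifier to obtain the URL of the first-page associated web page, wherein the page-turning block is a digit block that occupies the same position but holds different digits across multiple associated web page URL patterns.
  19. The apparatus according to claim 18, wherein the first-page identifier includes 0, 1 and/or the largest value among the current associated web pages.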
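  20. The apparatus according to claim 12, further comprising:
    a page number feature part determination module, adapted to determine, based on the associated web page URL pattern corresponding to the specified web page, the page number feature part of the URL of the specified web page and the page number feature part of the associated URL respectively;
    a page number identifier determination module, adapted to compare the page number feature parts of the URL of the specified web page and of the associated URL, and to extract the digit part that differs as the page number identifier of the URL of the specified web page.
  21. The apparatus according to claim 12, further comprising:
    an associated web page database establishment module, adapted to establish an associated web page database from the associated web pages corresponding to the associated web page URL pattern.
  22. The apparatus according to claim 21, further comprising:
    a search request receiving module, adapted to receive a search request, the request including a search keyword;
    a matching web page obtaining module, adapted to look up the search keyword in a preset associated web page database to obtain the web pages that match the keyword;
    a multi-page associated web page determination module, adapted to determine whether the web page is an associated web page and, if so, to invoke the information return module;
    an information return module, adapted to return the web page and the information of the first page associated with the web page.
  23. A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for calculating an associated web page URL pattern according to any one of claims 1-8.
  24. A computer-readable medium storing the computer program according to claim 23.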
PCT/CN2014/086522 2013-11-25 2014-09-15 Method and apparatus for calculating an associated web page URL pattern WO2015074455A1 (zh)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN201310603918.0 2013-11-25
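CN201310603918.0A CN103617225B (zh) 2013-11-25 2013-11-25 Associated web page search method and system
CN201310607851.8 2013-11-25
CN201310607854.1A CN103617229A (zh) 2013-11-25 2013-11-25 Method and apparatus for establishing an associated web page database
CN201310606990.9A CN103631906A (zh) 2013-11-25 2013-11-25 Method and apparatus for identifying the page number identifier in a web page URL
CN201310607851.8A CN103617228A (zh) 2013-11-25 2013-11-25 Method and apparatus for calculating an associated web page URL pattern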
CN201310607854.1 2013-11-25
CN201310606990.9 2013-11-25

Publications (1)

Publication Number Publication Date
WO2015074455A1 true WO2015074455A1 (zh) 2015-05-28

Family

ID=53178902

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/086522 WO2015074455A1 (zh) 2013-11-25 2014-09-15 Method and apparatus for calculating an associated web page URL pattern

Country Status (1)

Country Link
WO (1) WO2015074455A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
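CN101133415A (zh) * 2005-03-04 2008-02-27 Chutnoon公司 Server, method and system for providing an information search service using page sets
CN102053979A (zh) * 2009-10-27 2011-05-11 华为技术有限公司 Information collection method and system
CN103049557A (zh) * 2012-12-31 2013-04-17 百度在线网络技术(北京)有限公司 Site resource management method and apparatus
CN103258032A (zh) * 2013-05-10 2013-08-21 清华大学 Parallel web page acquisition method and apparatus
CN103617225A (zh) * 2013-11-25 2014-03-05 北京奇虎科技有限公司 Associated web page search method and system
CN103617229A (zh) * 2013-11-25 2014-03-05 北京奇虎科技有限公司 Method and apparatus for establishing an associated web page database
CN103617228A (zh) * 2013-11-25 2014-03-05 北京奇虎科技有限公司 Method and apparatus for calculating an associated web page URL pattern
CN103631906A (zh) * 2013-11-25 2014-03-12 北京奇虎科技有限公司 Method and apparatus for identifying the page number identifier in a web page URL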

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
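CN110874443A (zh) * 2018-08-31 2020-03-10 北京搜狗科技发展有限公司 URL pattern acquisition method and apparatus, electronic device, and readable storage medium
CN111177522A (zh) * 2018-11-09 2020-05-19 百度在线网络技术(北京)有限公司 Page aggregation method and apparatus, computer device, and storage medium
CN111177522B (zh) * 2018-11-09 2023-08-18 百度在线网络技术(北京)有限公司 Page aggregation method and apparatus, computer device, and storage medium
CN111723378A (zh) * 2020-06-17 2020-09-29 浙江网新恒天软件有限公司 Sitemap-based website directory *** method
CN114117181A (zh) * 2022-01-25 2022-03-01 北京金堤科技有限公司 Method and apparatus for acquiring website page-turning logic and for controlling website page turning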

Similar Documents

Publication Publication Date Title
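JP4857075B2 (ja) Method and computer program for efficiently retrieving dates in a collection of web documents
CN106095979B (zh) URL merging processing method and apparatus
US20090089278A1 (en) Techniques for keyword extraction from urls using statistical analysis
CN102436563B (zh) Method and apparatus for detecting page tampering
US20130074148A1 (en) Method and system for compiling a unique sample code for specific web content
CN102446255B (zh) Method and apparatus for detecting page tampering
WO2015196906A1 (zh) Method and apparatus for obtaining disease consultation information based on search
US7962523B2 (en) System and method for detecting templates of a website using hyperlink analysis
CN102314494B (zh) Method and device for processing web page content
WO2021068681A1 (zh) Tag analysis method and apparatus, and computer-readable storage medium
CN102591965A (zh) Method and apparatus for black-link detection
WO2015074455A1 (zh) Method and apparatus for calculating an associated web page URL pattern
CN104239582A (zh) Method and apparatus for identifying phishing web pages based on a feature vector model
CN104036190A (zh) Method and apparatus for detecting page tampering
CN103617225B (zh) Associated web page search method and system
CN102567521A (zh) Web page data crawling and filtering method
CN106446123A (zh) Method for identifying CAPTCHA elements in a web page
WO2017000659A1 (zh) Method and apparatus for identifying enriched URLs
CN110532784A (zh) Hidden-link detection method, apparatus, device, and computer-readable storage medium
CN103631906A (zh) Method and apparatus for identifying the page number identifier in a web page URL
CN104036189A (zh) Page tampering detection method and black-link database generation method
CN104778232B (zh) Method and apparatus for optimizing search results based on long queries
CN104881453A (zh) Method and apparatus for identifying web page types
CN103617229A (zh) Method and apparatus for establishing an associated web page database
CN115186240A (zh) Social network user alignment method, apparatus, and medium based on association information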

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14864611

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14864611

Country of ref document: EP

Kind code of ref document: A1