WO2020063448A1 - Information blocking method, device and terminal - Google Patents

Information blocking method, device and terminal

Publication number
WO2020063448A1
Authority
WO
WIPO (PCT)
Prior art keywords
category
information
level
character string
data
Application number
PCT/CN2019/106728
Other languages
French (fr)
Chinese (zh)
Inventor
付振中
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2020063448A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Definitions

  • Embodiments of the present invention relate to the field of webpage analysis and interception technologies, and in particular, to an information interception method, device, and terminal.
  • The server loads the EasyList rule list while caching the page content, hides the advertising elements according to the rule list, and then returns the page content with the advertisements hidden to the client for presentation.
  • The EasyList rule list contains multiple character strings. It is an ad-blocking rule set published by an open source organization, and it defines which elements in a web page are advertisements and should be blocked.
  • The embodiments of the present invention provide an information interception method, device, and terminal. Advertisement interception is implemented on the terminal, and the rule matching method is optimized to solve the problem that the number of matches increases because there are many advertisement interception rules and no reasonable matching method.
  • An embodiment of the present invention provides an information interception terminal.
  • The terminal may include one or more processors, a transceiver, a memory, a plurality of application programs, and one or more computer programs.
  • The one or more computer programs are stored in the memory and include instructions that, when executed by the terminal, cause the terminal to perform the following steps:
  • when the information for accessing the webpage includes the target information, the target information is intercepted.
  • The terminal intercepts the target information in the browser page by using the first data, which has a tree structure. The tree structure partitions the character strings in the first data in depth, which effectively reduces the number of times the information for accessing the webpage is matched against the first data. This avoids the problem that the number of matches increases because there are many character strings for intercepting target information and no reasonable matching method.
  • The foregoing "tree structure" may include:
  • multiple nodes, the multiple nodes including a root node and at least one level of child nodes, each level of the at least one level of child nodes including at least two child nodes;
  • the nodes of each level have a parent-child relationship with the associated next-level nodes, and the first data is distributed over the multiple nodes of the tree structure according to preset rules.
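  • The tree structure described above can be pictured with a small data structure. The following is a minimal sketch only, not the patented implementation: the class name `RuleNode`, its fields, and the example strings are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RuleNode:
    """One node of the rule tree (hypothetical name and fields)."""
    category: str                                  # category this node represents
    strings: list = field(default_factory=list)    # rule strings held at this node
    children: list = field(default_factory=list)   # next-level nodes

    def add_child(self, child):
        """Attach `child` as a next-level node and return it."""
        self.children.append(child)
        return child


def count_levels(node):
    """Number of child levels below `node` (0 for a leaf)."""
    if not node.children:
        return 0
    return 1 + max(count_levels(c) for c in node.children)


# Root node with one level of two child nodes, then a second level
# under the blacklist node. All strings are made-up examples.
root = RuleNode("root")
whitelist = root.add_child(RuleNode("whitelist", ["cdn.example.com/static/"]))
blacklist = root.add_child(RuleNode("blacklist"))
blacklist.add_child(RuleNode("positioning match", ["/ads/"]))
blacklist.add_child(RuleNode("preset match", ["adserver."]))
```

  • In this sketch each level has at least two child nodes, and every node keeps a reference to its next-level nodes, which is the parent-child relationship the text refers to.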
  • The terminal may specifically perform the following steps:
  • the information for accessing the webpage is matched level by level, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information for accessing the webpage includes the target information.
  • Longer information for accessing a webpage cannot be matched directly. Matching the information level by level therefore ensures that it can be matched completely and improves the accuracy of intercepting the target information.
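  • Level-by-level matching can be sketched as a recursive descent that enters a child node only when that node's test accepts the URL. This is an illustrative sketch under stated assumptions: the nested-dictionary layout, the `test` predicates, the `stats` counter, and all URLs are hypothetical and do not come from the application.

```python
def match_level_by_level(node, url, stats):
    """Descend from a parent node only into children whose test accepts `url`.

    Returns the first rule string found in `url`, or None.
    `stats["tests"]` counts every predicate and string comparison made.
    """
    for child in node.get("children", []):
        stats["tests"] += 1
        if not child["test"](url):
            continue  # prune this whole subtree
        for s in child.get("strings", []):
            stats["tests"] += 1
            if s in url:
                return s
        hit = match_level_by_level(child, url, stats)
        if hit is not None:
            return hit
    return None


# Hypothetical two-level tree: scheme at level 1, path segment at level 2.
tree = {"children": [
    {"test": lambda u: u.startswith("https://"), "children": [
        {"test": lambda u: "/banner/" in u,
         "strings": ["/banner/top", "/banner/side"]},
        {"test": lambda u: "/popup/" in u, "strings": ["/popup/sale"]},
    ]},
    {"test": lambda u: u.startswith("http://"), "strings": ["tracker.example"]},
]}

stats = {"tests": 0}
hit = match_level_by_level(tree, "https://news.example/banner/side.png", stats)
```

  • Because a failed test at a parent node skips every string stored beneath it, the number of comparisons depends on the depth of the matching branch rather than on the total number of rule strings.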
  • The above-mentioned "tree structure" may specifically include m levels of child nodes, and each level of the m levels of child nodes is divided according to a different preset rule among n preset rules,
  • where n and m are integers greater than or equal to 1, and n is greater than or equal to m;
  • the j-th level of child nodes is divided according to one preset rule selected from f preset rules, where the f preset rules are the preset rules remaining among the n preset rules after the first j-1 levels have made their selections;
  • the (j-1)-th level of child nodes is the level immediately above the j-th level of child nodes, the j-th level of child nodes is any level of the m levels of child nodes, and j and f are integers greater than or equal to 1.
  • Each of the n preset rules includes at least two categories of character strings.
  • The first data includes a plurality of character strings.
  • The character strings of the first data are distributed over the m levels of child nodes. Each child node in the m levels of child nodes corresponds to a different character string category in the n preset rules, and the child nodes include the character strings that belong to the different categories.
  • This application provides a variety of preset rules and categories, and preset rules can be selected as required, which improves the flexibility of the tree structure and makes it applicable to more scenarios.
  • The "n preset rules" may include at least one of the following rules:
  • black-and-white list rules, positioning and preset matching rules, tag attribute rules, or character rules.
  • The foregoing "black-and-white list rules" may include:
  • a whitelist category and a blacklist category. The first level of child nodes is divided according to the black-and-white list rules, and the character strings of the whitelist category and the character strings of the blacklist category each correspond to one child node in the first level.
  • The terminal may perform the following steps:
  • the information for accessing the webpage is matched against the character strings of the whitelist category;
  • when the information for accessing the webpage includes a character string of the whitelist category, the terminal determines that the information does not include the target information.
  • The terminal may further perform the following steps:
  • when the information for accessing the webpage does not include a character string of the whitelist category,
  • the information for accessing the webpage is matched against the character strings of the blacklist category;
  • when the information for accessing the webpage does not include a character string of the blacklist category, the terminal determines that the information does not include the target information;
  • when the information for accessing the webpage includes a character string of the blacklist category, the terminal matches the information level by level against the child nodes that are in a parent-child relationship with the child node holding the character strings of the blacklist category, until the matching is complete, and then intercepts the target information in the information for accessing the webpage.
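  • The whitelist-then-blacklist order of the first level can be sketched as a short decision function. This is a simplified illustration, assuming plain substring containment as the matching test; the function names, the `deeper_match` callback standing in for the lower levels, and the sample strings are all hypothetical.

```python
def contains_any(url, strings):
    """True if any rule string occurs in `url` (substring test, an assumption)."""
    return any(s in url for s in strings)


def should_intercept(url, whitelist, blacklist, deeper_match):
    """Whitelist check first, then blacklist, then the deeper levels."""
    if contains_any(url, whitelist):
        return False          # whitelisted: no target information
    if not contains_any(url, blacklist):
        return False          # not blacklisted either: nothing to intercept
    return deeper_match(url)  # continue with the blacklist node's children


def deeper(url):
    """Hypothetical stand-in for the second and later levels of the tree."""
    return url.endswith(".gif")


whitelist = ["trusted.example"]
blacklist = ["ads."]
```

  • Checking the whitelist first means that a whitelisted URL is released after a single pass, without touching the blacklist subtree at all.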
  • The above "positioning and preset matching rules" may specifically include:
  • a positioning-match category and a preset-match category. The second level of child nodes is divided according to the positioning and preset matching rules.
  • The character strings of each category correspond to one child node in the second level of child nodes, and every child node in the second level is in a parent-child relationship with the child node of the blacklist category in the first level.
  • The above "positioning-match category" may be used to filter, from the information for accessing a webpage, at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position;
  • the preset-match category is used to filter, from the information for accessing the webpage, at least one of information having a prefix or information having a suffix.
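  • The four tests named by these two categories can be written as small predicates. The sketch below is one possible reading, not the definition used in the application: the function names, the separator set, the index positions, and the sample URL are assumptions for illustration.

```python
def string_at(url, text, position):
    """Positioning match: `text` occurs in `url` starting at index `position`."""
    return url.startswith(text, position)


def separator_at(url, position, separators="/?&="):
    """Positioning match: a separator character occurs at index `position`."""
    return 0 <= position < len(url) and url[position] in separators


def has_prefix(url, prefix):
    """Preset match: `url` begins with `prefix`."""
    return url.startswith(prefix)


def has_suffix(url, suffix):
    """Preset match: `url` ends with `suffix`."""
    return url.endswith(suffix)


url = "https://ad.example/banner.js"
checks = (
    string_at(url, "ad.", 8),       # host begins right after "https://"
    separator_at(url, 18),          # "/" that ends the host part
    has_prefix(url, "https://ad."),
    has_suffix(url, ".js"),
)
```

  • Each predicate inspects a fixed location in the URL, so a rule in these categories can be accepted or rejected without scanning the whole string.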
  • The "tag attribute rule" may specifically include:
  • a category with tags and a category without tags. The third level of the m levels of child nodes is divided according to the tag attribute rule, and in the first data the character strings belonging to the category with tags and the character strings belonging to the category without tags each correspond to one child node in the third level.
  • The category with tags can be used to filter information for accessing a webpage that includes tag attribute information, and the category without tags is used to filter information for accessing a webpage that does not include tag attribute information, where
  • the category with tags includes at least one of: a category with only a host name, a category with only host information of an advertisement attribute, a category with two levels of classification (host and domain name), a category with the host and the uniform resource locator (URL) information of the advertisement, or a category in which only the domain name and the URL information of the advertisement differ.
  • This provides more possibilities and allows target information to be intercepted more accurately.
  • The "character rule" may include:
  • a category of the first character string and a category of the preset character string. The fourth level of the m levels of child nodes is divided according to the character rule.
  • In the first data, the character strings belonging to the category of the first character string and the character strings belonging to the category of the preset character string
  • each correspond to one child node in the fourth level of child nodes, and every child node in the fourth level is in a parent-child relationship with one child node in the third level.
  • The foregoing "category of the first character string" may be used to filter information in which the information for accessing the webpage and a character string of that category have the same first character;
  • the category of the preset character string is used to filter information in which the information for accessing the webpage and a character string of that category contain the same preset character string.
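  • One way to picture the character rule is an index keyed by first character, so that a URL fragment is compared only with rule strings that start with the same character. This is a sketch under assumptions: the function names, the use of `startswith` for the first-character category and of substring containment for the preset category, and the sample strings are hypothetical.

```python
from collections import defaultdict


def build_first_char_index(strings):
    """Group non-empty rule strings by their first character."""
    index = defaultdict(list)
    for s in strings:
        if s:
            index[s[0]].append(s)
    return index


def match_first_char(text, index):
    """Compare `text` only with rule strings that share its first character."""
    if not text:
        return None
    for s in index.get(text[0], []):
        if text.startswith(s):
            return s
    return None


def match_preset(text, preset_strings):
    """Return the first preset string contained anywhere in `text`, or None."""
    for s in preset_strings:
        if s in text:
            return s
    return None


index = build_first_char_index(["ads.example", "adserver.example", "track.example"])
first_hit = match_first_char("adserver.example/x.js", index)
preset_hit = match_preset("cdn.example/popup/1", ["/popup/"])
```

  • With this layout, a fragment beginning with "n" is never compared with the rule strings beginning with "a" or "t".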
  • The above-mentioned "information for accessing a webpage" may include the URL of the page that the user accesses or the URL of each element of the webpage, and the target information is advertisement information.
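  • To make "the URL of each element of the webpage" concrete, the sketch below collects the page URL plus the `src` and `href` attributes of its elements using Python's standard `html.parser` module. The choice of attributes, the class and function names, and the sample markup are assumptions for illustration; the application does not say how the URLs are gathered.

```python
from html.parser import HTMLParser


class ElementUrlCollector(HTMLParser):
    """Collect src/href attribute values from start tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)


def collect_urls(page_url, html):
    """Return the page URL followed by the URLs of the page's elements."""
    collector = ElementUrlCollector()
    collector.feed(html)
    collector.close()
    return [page_url] + collector.urls


html = (
    '<html><body><img src="https://ads.example/b.png">'
    '<a href="/about">About</a>'
    '<script src="https://cdn.example/app.js"></script></body></html>'
)
urls = collect_urls("https://news.example/", html)
```

  • Each collected URL is then a candidate input for matching against the first data.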
  • The above-mentioned "first data" is obtained after the server performs tree transformation processing on the second data, and the second data includes valid character strings and custom character strings of the browser.
  • A valid character string is a character string whose usage rate is greater than a preset threshold, determined by filtering the open source character strings from an open source website against the historical data reported by terminals within a preset period of time.
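  • The selection of valid character strings can be sketched as a usage-rate filter. The application does not define how the usage rate is computed, so this sketch assumes it is a string's share of all reported hits; the function name, the data layout, and the threshold value are hypothetical.

```python
from collections import Counter


def select_valid_strings(open_source_strings, reported_hits, threshold):
    """Keep open source strings whose share of reported hits exceeds `threshold`.

    `reported_hits` is a list with one entry per hit reported by terminals
    within the preset period (an assumed data layout).
    """
    counts = Counter(reported_hits)
    total = sum(counts.values())
    if total == 0:
        return []
    return [s for s in open_source_strings if counts[s] / total > threshold]


open_source = ["/ads/", "/banner/", "/legacy-ad/"]
reported = ["/ads/"] * 6 + ["/banner/"] * 3 + ["/legacy-ad/"] * 1
valid = select_valid_strings(open_source, reported, threshold=0.2)
```

  • Rarely used strings are dropped before the tree is built, which keeps the first data sent to the terminal small.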
  • The overall matching process described above is performed on the terminal. This greatly improves the speed of information matching on the terminal and avoids the problem that the server needs high performance in order to complete the processing of page content quickly.
  • An embodiment of the present invention provides a data processing server, comprising: one or more processors, a transceiver, a memory, and a plurality of application programs; and one or more computer programs, where the one or more computer programs are stored in the memory and include instructions that, when executed by the server, cause the server to perform the following steps:
  • the server sends the first data to the terminal, so that the terminal determines, according to the first data, whether the information for accessing the webpage includes the target information.
  • The tree structure partitions the character strings in the second data in depth and transforms them into a highly discriminative tree structure, which effectively reduces the number of times the information for accessing the webpage is matched against the first data.
  • The target information may be advertisement information, and
  • the information for accessing the webpage includes at least one of the URL of the page that the user accesses or the URL of each element of the webpage.
  • The above server may perform the following specific steps: periodically obtaining at least one open source character string from an open source website;
  • the second data is determined according to the valid character strings and the custom character strings, and each of the valid character strings and the custom character strings includes at least one character string.
  • Browser servers generally apply different standards; that is, the same information may be defined as advertisement information on site A but not on site B. Therefore, custom character strings of the browser server are added when the second data is generated, which makes matching with the second data flexible and widely applicable.
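  • Combining the valid character strings with the browser's custom character strings can be sketched as an order-preserving merge. The function name and the sample strings are hypothetical, and removing duplicates is an assumption that the application does not state.

```python
def build_second_data(valid_strings, custom_strings):
    """Merge valid and custom strings, dropping duplicates but keeping order."""
    return list(dict.fromkeys([*valid_strings, *custom_strings]))


second_data = build_second_data(
    ["/ads/", "/banner/"],
    ["/banner/", "promo.site-a.example"],  # made-up site-specific rules
)
```

  • A site-specific rule such as the made-up "promo.site-a.example" entry shows how one browser server could treat content as an advertisement that another does not.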
  • The foregoing server may perform the following specific steps:
  • multiple child nodes are divided into m levels according to n preset rules, with each level of the m levels of child nodes using a different preset rule; each of the n preset rules includes at least two categories of character strings, and each of the m levels is divided into at least two child nodes according to the categories of the character strings;
  • the second data includes multiple character strings belonging to different character string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
  • each child node in the k-th level is in a parent-child relationship with one child node in the (k-1)-th level,
  • where the k-th level of child nodes is any level of the m levels of child nodes, and k is an integer greater than or equal to 1.
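  • Dividing the second data into m levels with one rule per level can be sketched as a recursive partition. This is a simplified illustration, not the server's actual procedure: `build_tree`, the two classifier functions, and the "allow:" and "block:" prefixes are hypothetical, and unlike the description, the sketch applies every rule to every branch, including the whitelist branch.

```python
def build_tree(strings, rules):
    """Partition `strings` level by level.

    `rules[k]` is a function that maps a string to its category at
    level k + 1. Leaves are sorted lists of strings.
    """
    if not rules:
        return sorted(strings)
    first, rest = rules[0], rules[1:]
    groups = {}
    for s in strings:
        groups.setdefault(first(s), []).append(s)
    return {category: build_tree(members, rest)
            for category, members in groups.items()}


def list_rule(s):
    """Level 1: made-up convention, 'allow:' marks whitelist entries."""
    return "whitelist" if s.startswith("allow:") else "blacklist"


def match_rule(s):
    """Level 2: made-up convention, a trailing '*' marks a prefix rule."""
    return "preset match" if s.endswith("*") else "positioning match"


second_data = ["allow:cdn.example", "block:ads.*", "block:/banner/", "block:track.*"]
tree = build_tree(second_data, [list_rule, match_rule])
```

  • Here m is 2 and n is 2; each level uses a different rule, and every node at level k hangs under exactly one node at level k-1.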
  • The "n preset rules" may include at least one of the following rules:
  • the server performs the following steps:
  • the multiple child nodes are divided into m levels of child nodes according to the black-and-white list rules, the positioning and preset matching rules, the tag attribute rules, and the character rules.
  • The foregoing server may perform the following specific steps:
  • the first level of the m levels of child nodes is divided into two child nodes according to the whitelist category and the blacklist category;
  • one of the two child nodes includes the character strings in the second data that belong to the whitelist category, and
  • the other child node includes the character strings in the second data that belong to the blacklist category.
  • The foregoing server may perform the following specific steps:
  • the second level of the m levels of child nodes is divided into two child nodes. One of the two child nodes includes the character strings in the second data that belong to the positioning-match category, and the other child node includes the character strings in the second data that belong to the preset-match category.
  • The two child nodes in the second level are in a parent-child relationship with the node in the first level that holds the character strings belonging to the blacklist category.
  • The above server may perform the following specific steps:
  • the third level of the m levels of child nodes is divided into two child nodes according to the category with tags and the category without tags.
  • One of the two child nodes includes the character strings in the second data that belong to the category with tags, and the other child node includes the character strings in the second data that belong to the category without tags,
  • where every child node in the third level is in a parent-child relationship with one child node in the second level.
  • The above-mentioned "category with tags" may include at least one of: a category with only a host name, a category with only host information of an advertisement attribute, a category with two levels of classification (host and domain name), a category with the host and the uniform resource locator (URL) information of the advertisement, or a category in which only the domain name and the URL information of the advertisement differ.
  • The foregoing server may perform the following specific steps:
  • the fourth level of the m levels of child nodes is divided into two child nodes according to the category of the first character string and the category of the preset character string.
  • One of the two child nodes includes the character strings in the second data that belong to the category of the first character string, and
  • the other child node includes the character strings in the second data that belong to the category of the preset character string,
  • where every child node in the fourth level is in a parent-child relationship with one child node in the third level.
  • An embodiment of the present invention provides an information interception method.
  • The method may be executed by a terminal.
  • The method may include the following steps:
  • when the information for accessing the webpage includes the target information, the target information is intercepted.
  • The method intercepts the target information in the browser page by using the first data, which has a tree structure. The tree structure partitions the character strings in the first data in depth, which effectively reduces the number of times the information for accessing the webpage is matched against the first data. This avoids the problem that the number of matches increases because there are many character strings for intercepting target information and no reasonable matching method.
  • The overall matching speed can be increased by more than 40%.
  • The above-mentioned "tree structure" may include multiple nodes, the multiple nodes include a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes;
  • the nodes of each level have a parent-child relationship with the associated next-level nodes, and the first data is distributed over the multiple nodes of the tree structure according to preset rules.
  • The method may specifically include:
  • the information for accessing the webpage is matched level by level, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information for accessing the webpage includes the target information.
  • Longer information for accessing a webpage cannot be matched directly. Matching the information level by level therefore ensures that it can be matched completely and improves the accuracy of intercepting the target information.
  • The foregoing "tree structure" may include:
  • m levels of child nodes, where each level of the m levels of child nodes is divided according to a different preset rule among n preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
  • the j-th level of child nodes is divided according to one preset rule selected from f preset rules, where the f preset rules are the preset rules remaining among the n preset rules after the first j-1 levels have made their selections;
  • the (j-1)-th level of child nodes is the level immediately above the j-th level of child nodes, the j-th level of child nodes is any level of the m levels of child nodes, and j and f are integers greater than or equal to 1.
  • Each of the n preset rules includes at least two categories of character strings.
  • The first data includes a plurality of character strings.
  • The character strings of the first data are distributed over the m levels of child nodes. Each child node in the m levels of child nodes corresponds to a different character string category in the n preset rules, and the child nodes include the character strings that belong to the different categories.
  • This application provides a variety of preset rules and categories, and preset rules can be selected as required, which improves the flexibility of the tree structure and makes it applicable to more scenarios.
  • The "n preset rules" may include at least one of the following rules:
  • black-and-white list rules, positioning and preset matching rules, tag attribute rules, or character rules.
  • The foregoing "black-and-white list rules" may include a whitelist category and a blacklist category.
  • The first level of the m levels of child nodes is divided according to the black-and-white list rules.
  • The character strings belonging to the whitelist category and the character strings belonging to the blacklist category each correspond to one child node in the first level.
  • The method may specifically include:
  • the information for accessing the webpage is matched against the character strings of the whitelist category;
  • when the information for accessing the webpage includes a character string of the whitelist category, it is determined that the information does not include the target information.
  • The method may specifically include: when the information for accessing the webpage does not include a character string of the whitelist category, the information is matched against the character strings of the blacklist category;
  • when the information for accessing the webpage includes a character string of the blacklist category,
  • the information is matched level by level against the child nodes that are in a parent-child relationship with the child node holding the character strings of the blacklist category, until the matching is complete, and the target information in the information for accessing the webpage is then intercepted.
  • The foregoing "positioning and preset matching rules" may specifically include a positioning-match category and a preset-match category. The second level of the m levels of child nodes is divided according to the positioning and preset matching rules. In the first data, the character strings belonging to the positioning-match category and the character strings belonging to the preset-match category each correspond to one child node in the second level, and every child node in the second level is in a parent-child relationship with the child node in the first level that holds the character strings of the blacklist category.
  • The above "positioning-match category" may be used to filter, from the information for accessing a webpage, at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position;
  • the preset-match category is used to filter, from the information for accessing the webpage, at least one of information having a prefix or information having a suffix.
  • The above-mentioned "tag attribute rule" may include a category with tags and a category without tags.
  • The third level of the m levels of child nodes is divided according to the tag attribute rule.
  • In the first data, the character strings belonging to the category with tags and the character strings belonging to the category without tags each correspond to one child node in the third level, and every child node in the third level is in a parent-child relationship with one child node in the second level.
  • The above-mentioned "category with tags" can be used to filter information for accessing a webpage that includes tag attribute information, and the category without tags is used to filter information for accessing a webpage that does not include tag attribute information;
  • the category with tags includes at least one of: a category with only a host name, a category with only advertising information, a category with two levels of classification (host and domain name), a category with the host and the URL information of the advertisement, or a category in which only the domain name and the URL information of the advertisement differ.
  • This provides more possibilities and allows target information to be intercepted more accurately.
  • The foregoing "character rule" may include a category of the first character string and a category of the preset character string.
  • The fourth level of the m levels of child nodes is divided according to the character rule.
  • In the first data, the character strings belonging to the category of the first character string and the character strings belonging to the category of the preset character string each correspond to one child node in the fourth level, and every child node in the fourth level is in a parent-child relationship with one child node in the third level.
  • The foregoing "category of the first character string" may be used to filter information in which the information for accessing the webpage and a character string of that category have the same first character;
  • the category of the preset character string is used to filter information in which the information for accessing the webpage and a character string of that category contain the same preset character string.
  • The foregoing "information for accessing a webpage" may include the URL of the page that the user accesses or the URL of each element of the webpage, and the target information is advertisement information.
  • The foregoing "first data" is obtained after the server performs tree transformation processing on the second data, and the second data includes valid character strings and custom character strings of the browser.
  • A valid character string is a character string whose usage rate is greater than a preset threshold, determined by filtering the open source character strings from an open source website against the historical data reported within a preset period of time.
  • An embodiment of the present invention provides a data processing method.
  • The method may be executed by a server.
  • The method may specifically include the following steps:
  • The tree structure partitions the character strings in the second data in depth and transforms them into a highly discriminative tree structure, which effectively reduces the number of times the information for accessing the webpage is matched against the first data.
  • The target information may be advertisement information, and
  • the information for accessing the webpage includes at least one of the URL of the page that the user accesses or the URL of each element of the webpage.
  • The method may further include: periodically obtaining at least one open source character string from an open source website;
  • the second data is determined according to the valid character strings and the custom character strings, and each of the valid character strings and the custom character strings includes at least one character string.
  • Browser servers generally apply different standards; that is, the same information may be defined as advertisement information on site A but not on site B. Therefore, custom character strings of the browser server are added when the second data is generated, which makes matching with the second data flexible and widely applicable.
  • The method may specifically include: dividing a plurality of child nodes into m levels according to n preset rules, with each level of the m levels of child nodes using a different preset rule;
  • each of the n preset rules includes at least two categories of character strings, and each of the m levels is divided into at least two child nodes according to the categories of the character strings;
  • the second data includes multiple character strings belonging to different character string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
  • each child node in the k-th level is in a parent-child relationship with one child node in the (k-1)-th level,
  • where the k-th level of child nodes is any level of the m levels of child nodes, and k is an integer greater than or equal to 1.
  • The "n preset rules" may include at least one of the following rules:
  • the server performs the following steps:
  • the multiple child nodes are divided into m levels of child nodes according to the black-and-white list rules, the positioning and preset matching rules, the tag attribute rules, and the character rules.
  • The method may specifically include:
  • the first level of the m levels of child nodes is divided into two child nodes according to the whitelist category and the blacklist category;
  • one of the two child nodes includes the character strings in the second data that belong to the whitelist category, and
  • the other child node includes the character strings in the second data that belong to the blacklist category.
  • The method may specifically include:
  • the second level of the m levels of child nodes is divided into two child nodes. One of the two child nodes includes the character strings in the second data that belong to the positioning-match category, and the other child node includes the character strings in the second data that belong to the preset-match category.
  • The two child nodes in the second level are in a parent-child relationship with the node in the first level that holds the character strings belonging to the blacklist category.
  • The method may specifically include:
  • the third level of the m levels of child nodes is divided into two child nodes according to the category with tags and the category without tags.
  • One of the two child nodes includes the character strings in the second data that belong to the category with tags, and
  • the other child node includes the character strings in the second data that belong to the category without tags.
  • Every child node in the third level is in a parent-child relationship with one child node in the second level.
  • The above-mentioned "category with tags" may specifically include at least one of: a category with only a host name, a category with only host information of an advertisement attribute, a category with two levels of classification (host and domain name), a category with the host and the uniform resource locator (URL) information of the advertisement, or a category in which only the domain name and the URL information of the advertisement differ.
  • The method may specifically include:
  • the fourth level of the m levels of child nodes is divided into two child nodes according to the category of the first character string and the category of the preset character string.
  • One of the two child nodes includes the character strings in the second data that belong to the category of the first character string, and
  • the other child node includes the character strings in the second data that belong to the category of the preset character string,
  • where every child node in the fourth level is in a parent-child relationship with one child node in the third level.
  • An embodiment of the present invention provides a device, and the device may include:
  • a transceiver module, configured to acquire information for accessing a webpage; and
  • a processing module, configured to match the information for accessing the webpage against first data arranged in a tree structure, where the first data is used to determine whether the information for accessing the webpage includes target information, and, when the information includes the target information, to intercept the target information.
  • The device intercepts the target information in the browser page by using the first data, which has a tree structure. The tree structure partitions the character strings in the first data in depth, which effectively reduces the number of times the information for accessing the webpage is matched against the first data. This avoids the problem that the number of matches increases because there are many character strings for intercepting target information and no reasonable matching method.
  • The overall matching speed can be increased by more than 40%.
  • the above-mentioned "tree structure" may include multiple nodes, the multiple nodes include a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes;
  • the nodes of each level have a parent-child relationship with the associated next-level nodes, and the first data is distributed on multiple nodes in a tree structure according to a preset rule.
  • the processing module may be specifically used to match the information for accessing the webpage step by step, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information for accessing the webpage includes the target information.
  • longer information for accessing a webpage cannot be matched directly. Therefore, the information is matched step by step to ensure that it can be matched completely, which improves the accuracy of intercepting the target information.
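  • as a minimal sketch of the step-by-step matching described above, the following Python fragment walks a small rule tree from parent to child. The `Node` class, the function name `match_stepwise`, the sample host names, and the use of plain substring matching are all assumptions made for illustration; the embodiments do not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical rule tree."""
    name: str
    strings: list[str] = field(default_factory=list)    # strings held by this node
    children: list["Node"] = field(default_factory=list)


def match_stepwise(node: Node, url: str) -> list[str]:
    """Return the names of the nodes matched while descending parent -> child.

    A node with no strings (such as the root) always matches. An empty
    result means this branch does not apply to the URL.
    """
    if node.strings and not any(s in url for s in node.strings):
        return []
    path = [node.name]
    for child in node.children:
        sub_path = match_stepwise(child, url)
        if sub_path:
            return path + sub_path   # continue only along the matching child
    return path


# Illustrative tree: two host branches, one of them refined by path strings.
root = Node("root", [], [
    Node("ads-host", ["ads.example.com"], [
        Node("banner", ["/banner/"]),
        Node("popup", ["/popup/"]),
    ]),
    Node("tracker-host", ["track.example.net"]),
])

matched_path = match_stepwise(root, "http://ads.example.com/banner/1.png")
unmatched_path = match_stepwise(root, "http://example.org/index.html")
```

  • in this sketch a URL is compared only with the strings on the branch it descends into, rather than with every string in the first data, which is the source of the reduced match count.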
  • the above-mentioned "tree structure" may include m-level child nodes, and each level of the m-level child nodes is divided according to a different preset rule among n preset rules.
  • m is an integer greater than or equal to 1
  • n is greater than or equal to m;
  • the j-th level child nodes are divided according to one preset rule selected from f preset rules, where the f preset rules are the preset rules remaining among the n preset rules after the first j-1 levels have made their selections;
  • the (j-1)-th level child nodes are the level immediately above the j-th level child nodes, the j-th level child nodes are any level of the m-level child nodes, and j and f are integers greater than or equal to 1.
  • each of the n preset rules includes at least two categories of strings
  • the first data includes a plurality of character strings.
  • the character strings of the first data are divided among the m-level child nodes. Each child node in the m-level child nodes corresponds to a different character string category in the n preset rules, and each child node includes multiple character strings of its corresponding category.
  • this application provides a variety of preset rules and categories, and the preset rules can be selected as needed. This improves the flexibility of the tree structure and makes it applicable to more scenarios.
  • the "n preset rules” may include at least one of the following rules:
  • black and white list rules, positioning and preset matching rules, tag attribute rules, or character rules.
  • the foregoing "black and white list rules" may include categories of white lists and black lists.
  • the first-level child nodes of the m-level child nodes are divided according to the black and white list rules.
  • the character string belonging to the whitelisted category and the character string belonging to the blacklisted category respectively correspond to one child node of the first-level child node.
  • the processing module may be specifically configured to match the information for accessing a web page with the character strings of the whitelist category, and when the information for accessing the web page includes a character string of the whitelist category, determine that the information for accessing the web page does not include the target information.
  • the processing module may be specifically configured to: when the information for accessing the webpage does not include a character string of the whitelist category, match the information for accessing the webpage with the character strings of the blacklist category;
  • when the information for accessing the webpage includes a character string of the blacklist category,
  • the information for accessing the webpage is matched step by step with the child nodes that are in a parent-child relationship with the child node holding the character strings of the blacklist category, until the matching is complete, and the target information in the information for accessing the webpage is then intercepted.
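  • the whitelist-first order described above can be sketched as follows. The function name `should_intercept`, the sample strings, and the use of substring matching are assumptions for illustration only, and the deeper levels under the blacklist node are collapsed into a single flat list here.

```python
def should_intercept(url: str, whitelist: list[str], blacklist: list[str]) -> bool:
    """Decide whether a URL carries target information.

    The whitelist is checked first: a hit ends the matching immediately and
    the URL is treated as containing no target information. Only when the
    whitelist misses is the blacklist branch consulted.
    """
    if any(s in url for s in whitelist):
        return False
    return any(s in url for s in blacklist)


# Hypothetical sample data.
WHITELIST = ["cdn.example.com"]
BLACKLIST = ["ads.", "/banner/"]

allowed = should_intercept("http://cdn.example.com/banner/logo.png", WHITELIST, BLACKLIST)
blocked = should_intercept("http://ads.example.net/a.js", WHITELIST, BLACKLIST)
neutral = should_intercept("http://example.org/index.html", WHITELIST, BLACKLIST)
```

  • the first URL also contains a blacklisted fragment, but it is not intercepted because the whitelist hit is evaluated first.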
  • the foregoing "positioning and preset matching rules" may include a positioning matching category and a preset matching category, and the second-level child nodes of the m-level child nodes are divided according to the positioning and preset matching rules. In the first data, the character strings belonging to the positioning matching category and the character strings belonging to the preset matching category each correspond to one child node of the second-level child nodes, and any one of the second-level child nodes is in a parent-child relationship with the first-level child node that holds the character strings belonging to the blacklist category.
  • the above "positioning matching category" may be used to filter, from the information for accessing a webpage, at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position;
  • the preset matching category is used to filter at least one of information having a prefix or information having a suffix in the information for accessing the webpage.
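  • the two categories can be illustrated with a short Python sketch. The function names, the zero-based `position` argument, and the sample URL are hypothetical; the embodiments do not define how the preset positions are encoded.

```python
def positional_match(url: str, token: str, position: int) -> bool:
    """Positioning match: `token` must occur in `url` exactly at index `position`."""
    return url.startswith(token, position)


def preset_match(url: str, prefix: str = "", suffix: str = "") -> bool:
    """Preset match: the URL has the given prefix and/or the given suffix.

    An empty prefix or suffix is ignored rather than treated as a match.
    """
    has_prefix = bool(prefix) and url.startswith(prefix)
    has_suffix = bool(suffix) and url.endswith(suffix)
    return has_prefix or has_suffix


url = "http://ads.example.com/static/ad.js"

# "http://" is 7 characters long, so the host starts at index 7.
host_hit = positional_match(url, "ads.", 7)
separator_hit = positional_match(url, "/", 22)   # separator right after the host
suffix_hit = preset_match(url, suffix="/ad.js")
prefix_miss = preset_match(url, prefix="https://")
```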
  • the above-mentioned "label attribute rule” may include a category with a label and a category without a label.
  • the third-level child node of the m-level child node is divided according to the label attribute rule.
  • in the first data, the character strings belonging to the category with a label and the character strings belonging to the category without a label each correspond to one child node of the third-level child nodes, and any one of the third-level child nodes is in a parent-child relationship with one of the second-level child nodes.
  • the above-mentioned "category with a tag" can be used to filter, from the information for accessing a web page, information that includes tag attributes, and the category without a tag is used to filter information that does not include tag attributes;
  • the categories with tags include at least one of: a category with only a host name, a category with only host information of an advertisement attribute, a category with two-level classification of host and domain name, a category of uniform resource locator (URL) information of a host and an advertisement, or a category in which only the URL information of the domain name and the advertisement is different.
  • this method can provide more possibilities and more accurately intercept target information.
  • the foregoing "character rule" may include the category of the first character string and the category of the preset character string.
  • the fourth-level child node of the m-level child node is divided according to the character rule.
  • in the first data, the character strings belonging to the category of the first character string and the character strings belonging to the category of the preset character string each correspond to one child node of the fourth-level child nodes, and any one of the fourth-level child nodes is in a parent-child relationship with one of the third-level child nodes.
  • the foregoing "category of the first character string" may be used to filter, from the information for accessing the web page, information whose first character is the same as that of a character string in the category of the first character string;
  • the category of the preset character string is used to filter, from the information for accessing the web page, information that contains the same preset character string as a character string in the category of the preset character string.
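  • one way to picture the first-character category is as a bucket index keyed by the first character, so that a piece of information is compared only with strings that share its first character. The following sketch assumes this interpretation; the function names and sample strings are hypothetical.

```python
from collections import defaultdict


def bucket_by_first_char(strings: list[str]) -> dict[str, list[str]]:
    """Group rule strings by their first character (empty strings are skipped)."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for s in strings:
        if s:
            buckets[s[0]].append(s)
    return dict(buckets)


def candidates(buckets: dict[str, list[str]], token: str) -> list[str]:
    """Return only the rule strings whose first character equals the token's."""
    if not token:
        return []
    return buckets.get(token[0], [])


rule_strings = ["ads.example.com", "adserver.example.net", "track.example.org"]
buckets = bucket_by_first_char(rule_strings)

same_first_char = candidates(buckets, "ads.example.com/x.js")
no_candidates = candidates(buckets, "cdn.example.com/x.js")
```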
  • the foregoing "information for accessing a web page" may include a URL of a user accessing a page or a URL of each element of a web page, and the target information is advertisement information.
  • the first data may be obtained after the server performs tree transformation processing on the second data, and the second data includes valid character strings and custom character strings of the browser, where:
  • a valid string is a string whose usage rate is determined to be greater than a preset threshold by screening the open source strings from the open source website against the historical data reported within a preset period of time.
  • an embodiment of the present invention provides a data processing apparatus, which is characterized by including:
  • a processing module that performs tree transformation processing on the second data to determine the first data
  • the transceiver module sends the first data to the terminal, so that the terminal determines, according to the first data, whether the accessed webpage contains target information.
  • the tree structure distinguishes the character strings in the second data in depth and transforms them into a highly differentiated tree structure, which effectively reduces the number of matches between the information for accessing the webpage and the first data.
  • target information may be advertisement information
  • the information for accessing the webpage includes at least one of a URL of the user accessing the page or a URL of accessing each element of the webpage.
  • the foregoing “transceiving module” may also be used to periodically obtain at least one open source string from an open source website;
  • processing module may also be used to select, from at least one open source character string and historical data reported by the client within a preset period of time, a plurality of character strings with a visit amount greater than a first threshold as valid character strings;
  • the aforementioned “transceiving module” can also be used to obtain a custom string of the browser server;
  • processing module may also be used to determine the second data according to the valid character string and the custom character string, and each of the valid character string and the custom character string includes at least one character string.
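  • the selection of valid strings and their combination with custom strings can be sketched as below. The function names, the dictionary of visit counts, and the order-preserving removal of duplicates are assumptions for illustration; the embodiments only state that strings above the first threshold are kept and that both inputs contain at least one string.

```python
def select_valid(open_source: list[str], visits: dict[str, int], first_threshold: int) -> list[str]:
    """Keep the open source strings whose visit count exceeds the threshold."""
    return [s for s in open_source if visits.get(s, 0) > first_threshold]


def build_second_data(valid_strings: list[str], custom_strings: list[str]) -> list[str]:
    """Merge valid and custom strings, dropping duplicates but keeping order."""
    if not valid_strings or not custom_strings:
        raise ValueError("each input must contain at least one string")
    return list(dict.fromkeys([*valid_strings, *custom_strings]))


# Hypothetical inputs: visit counts aggregated from client reports.
open_source = ["ads.example.com", "old.example.org", "/banner/"]
visits = {"ads.example.com": 120, "old.example.org": 0, "/banner/": 45}

valid = select_valid(open_source, visits, first_threshold=10)
second_data = build_second_data(valid, ["promo.example.net", "ads.example.com"])
```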
  • browser servers generally have different standards; that is, the target information may be defined as advertisement information on site A but not on site B. Therefore, the custom strings of the browser server are added when generating the second data, so that matching with the second data is flexible and widely applicable.
  • processing module may be specifically used to divide multiple child nodes into m levels according to n preset rules, and the preset rules of each level of the m-level child nodes are different;
  • Each of the n preset rules includes at least two categories of character strings, and each layer in the m level is divided into at least two child nodes according to the categories of the character strings;
  • the second data includes multiple character strings, each of which includes multiple character strings belonging to different types of character strings, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
  • each child node in the k-th level has a parent-child relationship with one child node in the (k-1)-th level.
  • the k-level child node is any one-level child node in the m-level child node, and k is an integer greater than or equal to 1.
  • the "n preset rules" may include at least one of the following rules: a black and white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule;
  • processing module may also be used to divide multiple child nodes into m-level child nodes according to black and white list rules, positioning and preset matching rules, label attribute rules, and character rules.
  • the above-mentioned "processing module" may be specifically used to: when the blacklist and whitelist rules include a whitelist category and a blacklist category, divide, according to the whitelist category and the blacklist category,
  • the first level of the m-level child nodes into two child nodes. One of the two child nodes includes the character strings in the second data that belong to the whitelist category, and the other child node includes the character strings in the second data that belong to the blacklist category.
  • the foregoing "processing module" may be specifically used to: when the positioning and preset matching rules include the positioning matching category and the preset matching category, according to the positioning matching category and the preset matching Category, divide the second level of the m-level sub-nodes into two sub-nodes, one of the two sub-nodes includes a string belonging to the category of the positioning match in the second data, and the other sub-node includes the second data A character string belonging to a preset matching category, where the two child nodes in the second level are in a parent-child relationship with the node where the character string belonging to the blacklisted category in the first level is located.
  • the above-mentioned "processing module" may be specifically used to: when the tag attribute rule includes a category with a tag and a category without a tag, divide the third level of the m-level child nodes into two child nodes according to these two categories.
  • One of the two child nodes includes the character strings in the second data that belong to the category with a label, and the other child node includes the character strings in the second data that belong to the category without a label,
  • where any child node in the third level is in a parent-child relationship with a child node in the second level.
  • the above-mentioned "labeled category" may specifically include at least one of: a category with only a host name, a category with only host information of an advertisement attribute, a category with two-level classification of host and domain name, a category of uniform resource locator (URL) information of a host and an advertisement, or a category in which only the URL information of the domain name and the advertisement is different.
  • the foregoing "processing module" may be specifically used, when the character rule includes the category of the first character string and the category of the preset character string, according to the category of the first character string and the preset character string Category, divide the 4th level of the m-level child node into two child nodes, one of the two child nodes includes a character string belonging to the category of the first character string in the second data, and the other child node includes the second data A character string belonging to a category of a preset character string in which any child node in the fourth level has a parent-child relationship with a child node in the third level.
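  • the level-by-level division performed by the processing module can be sketched as a recursive partition, where each level applies one rule and each rule assigns a string to one of its categories. Everything in the sketch is hypothetical: the function `build_tree`, the nested-dictionary representation, the sample strings, and the markers (`@@`, `||`, `$`) used by the classifier functions were chosen only to make the example concrete. As a simplification, the same rules are applied under every node, including the whitelist node.

```python
from typing import Callable, Union

Rule = tuple[str, Callable[[str], str]]          # (rule name, string -> category)
Tree = Union[list[str], dict[str, "Tree"]]


def build_tree(strings: list[str], rules: list[Rule]) -> Tree:
    """Partition `strings` level by level, one preset rule per level."""
    if not rules:
        return list(strings)                      # leaf: the strings themselves
    (name, classify), remaining = rules[0], rules[1:]
    groups: dict[str, list[str]] = {}
    for s in strings:
        groups.setdefault(classify(s), []).append(s)
    return {f"{name}:{category}": build_tree(members, remaining)
            for category, members in groups.items()}


# Hypothetical classifiers for three of the four rules.
rules: list[Rule] = [
    ("list", lambda s: "white" if s.startswith("@@") else "black"),
    ("match", lambda s: "positioning" if "||" in s else "preset"),
    ("tag", lambda s: "with_tag" if "$" in s else "no_tag"),
]

second_data = [
    "@@||good.example.com",
    "||ads.example.com^$script",
    "/banner/",
    "||track.example.net^",
]

tree = build_tree(second_data, rules)
```

  • a fourth level based on the character rule could be added by appending one more `(name, classify)` pair to `rules`.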
  • an embodiment of the present invention provides a computer-readable storage medium, which may include instructions that, when run on a computer, cause the computer to perform the following steps:
  • the target webpage information includes the target information
  • the target information is intercepted.
  • an embodiment of the present invention provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform the following steps:
  • the server sends the first data to the terminal, so that the terminal determines, according to the first data, whether the accessed web page contains target information.
  • an embodiment of the present invention provides a computer program product containing instructions, which when run on a computer, causes the computer to perform the following steps:
  • the target webpage information includes the target information
  • the target information is intercepted.
  • an embodiment of the present invention provides a computer program product containing instructions, which when run on a computer, causes the computer to perform the following steps:
  • the server sends the first data to the terminal, so that the terminal determines, according to the first data, whether the accessed web page contains target information.
  • FIG. 1 is a schematic diagram of an application scenario of advertisement blocking
  • FIG. 2 is a schematic diagram of another application scenario of advertisement blocking
  • FIG. 3 is a schematic diagram of an application scenario of advertisement blocking according to an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of a data processing method according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of matching results of URLs of elements accessed by a browser client according to an embodiment of the present invention
  • FIG. 6 is a schematic diagram of a tree structure according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a tree structure based on a black-and-white list rule division according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a tree structure divided based on positioning and preset matching rules according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a statistical classification structure based on label attribute rules or character rules provided by an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of a tree structure based on rule division according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a tree structure based on sub-classification according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of a tree structure based on a black-and-white list rule, a positioning and preset matching rule, and a tag attribute rule according to an embodiment of the present invention
  • FIG. 13 is a schematic diagram of a tree structure based on character rule division according to an embodiment of the present invention.
  • FIG. 15 is a schematic structural diagram of an information interception terminal according to an embodiment of the present invention.
  • FIG. 16 is a schematic structural diagram of a data processing server according to an embodiment of the present invention.
  • FIG. 17 is a schematic structural diagram of an information interception device according to an embodiment of the present invention.
  • FIG. 18 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
  • one technology used for advertisement interception is blocking by means of Opera's server. As shown in FIG. 1, Opera's server can include: a browser server, a web cache library, and a page processing server.
  • a client for example, a mobile phone, a tablet computer, etc.
  • the client sends a webpage access request to the server
  • the browser server receives the webpage access request and sends webpage query information to the webpage cache library
  • the webpage cache library will find the corresponding data according to the webpage information and send it to the browser server, and the browser server will then return the webpage content.
  • the relevant data stored in the webpage cache library is produced by the page processing server, which periodically sends webpage access requests, receives the webpage content information, and processes it.
  • the processing may include at least one of image compression, text compression, or advertisement filtering. The processed content information is compressed and sent to the webpage cache library for storage, so that the browser server can query it.
  • this method hides the advertisement on the server side and then returns the webpage content, with the advertisement hidden, to the client for display. It requires a large number of pages to be cached on the server and the entire content of each web page to be parsed, so the server needs high performance to process page content quickly, and the storage requirements are very high.
  • the browser server needs to download the easylist rule list, and the browser client periodically downloads the ad blocking string to the browser server.
  • the above list of easylist rules includes advertisement blocking strings.
  • the easylist rule list downloaded through the browser server has an aging problem. For example, the easylist rule list currently contains about 45,000 URLs and keeps growing: volunteers are only willing to "contribute" by adding new rules and are unwilling to do work that adds no value for them, such as deleting old URLs from the list. Deleting old rules also carries risk, so the number of URLs in the easylist rule list continues to grow.
  • the string is determined by the URL in the easylist rule list.
  • many URLs in the easylist rule list were proposed very early, and the original website has since modified its page implementation.
  • such URLs in the easylist rule list are outdated, so they cannot provide the browser client
  • with valid advertisement-blocking strings for interception.
  • matching the URLs in the visited web pages against the URLs in the easylist rule list has low performance.
  • the easylist rule list contains about 45,000 URL rules.
  • embodiments of the present invention provide a method, device, and terminal for client-based information interception.
  • the terminal intercepts target information in a browser page through a first data having a tree structure.
  • the tree structure distinguishes the strings in the first data in depth, which effectively reduces the number of matches between the information for accessing the web page and the first data, thereby avoiding the increase in the number of matches caused by the large number of strings for intercepting target information and the lack of a reasonable matching method.
  • the embodiment of the present invention uses the target information in the visited webpage as an example of advertisement information.
  • the method provided by the embodiment of the present invention can also be used for information other than advertisement information, such as consultation and webpage address.
  • FIG. 3 is a schematic diagram of an application scenario of advertisement blocking according to an embodiment of the present invention.
  • the scenario may include a client and a server.
  • the client may be a browser client
  • the server may be a browser server.
  • the method may include two processes.
  • the first process may be determining the first data by the browser server.
  • the browser server obtains at least one URL of a page accessed by a user or at least one URL of a page element.
  • the page element may include at least one of a text, a link, or an image;
  • a browser server periodically obtains an open source list of an open source website (for example, an easylist rule list).
  • based on at least one of the URL of the page accessed by the user or the URL of a page element, together with the open source list (which can include strings for ad blocking), the browser server uses its learning
  • mechanism (for example, the cloud-side learning mechanism in FIG. 3) to determine the valid strings (for example, strings whose visit count exceeds a preset threshold within a predetermined number of days, represented in FIG. 3 as "Xdays visits top1w").
  • the purpose of this step is to remove invalid or rarely accessed strings and reduce the number of rules in order to effectively reduce the number of subsequent matches.
  • the browser server end combines the valid character string and the custom character string of the browser (for example, the self-operating interception rule representation in FIG. 3) to determine the second data.
  • each of the valid character string and the custom character string includes at least one character string.
  • the browser server converts the second data into a tree-like private format, generates the first data, stores the tree-like private format (the first data with a tree structure) to a private format optimization rule base, and synchronizes to the browser client end.
  • the browser client periodically downloads the first data with a tree structure from the browser server.
  • the web page information of the third web page is matched with the first data having the tree structure to determine a matching result. If a match is found in the first data having the tree structure, the browser client intercepts the matched target information in the web page information of the third web page, which is generally advertising information.
  • by collecting statistics on the data accessed by a large number of users, this method removes invalid or rarely accessed strings from the original open source list, which both guarantees the validity of the rules and reduces the number of strings to match against.
  • the strings are classified according to the corresponding rules to form a tree structure, which greatly reduces the number of matches for a single piece of information during matching (that is, the information of each element accessed in the third web page;
  • elements generally refer to text, pictures, videos, etc.).
  • FIG. 4 is a schematic flowchart of a data processing method according to an embodiment of the present invention. As shown in FIG. 4, steps S410-S470 may be included, as follows:
  • the browser server receives an instruction from the browser client to access a webpage.
  • the instruction for the browser client to access the webpage may be an instruction for a large number of users to access multiple webpages through the browser; or, an instruction for a large number of users to access the same webpage through the browser.
  • the browser client records at least one of the URL of the accessed page or the URL of the page element according to the instructions of a large number of users to access web pages, and compresses the URLs recorded within a preset time; the compressed file is the browser client's instruction to access the web page.
  • the second message does not have any user identification, and the purpose is to protect the privacy of the user.
  • the browser client will send the browser client's instructions for accessing the webpage to the browser server multiple times.
  • the browser server periodically obtains the latest open source list from the open source website (for example, the easylist string or a list containing the easylist string). For example, the server obtains the latest open source list from the open source website every day at 12 am.
  • the browser server determines the valid strings according to the open source list and the instruction of the browser client to access the webpage, that is, it selects the rules with a high hit rate.
  • the server extracts at least one of the URL of the user access page or the URL of the page element from the instruction for the browser client to access the webpage, and then matches them against the strings in the open source list. If a string is matched, the count of that string is incremented by 1, and this step is repeated for the URLs of the pages or page elements accessed by users.
  • at least one of the URL of the page accessed by the user or the URL of the page element can be placed in a backup directory. In a possible implementation, the files can be set to be deleted within a preset time period.
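  • the counting step described above can be sketched with a hit counter. The function names, the sample URLs, and the substring test used to decide whether a string "matches" a reported URL are assumptions for illustration.

```python
from collections import Counter


def count_hits(reported_urls: list[str], open_source_strings: list[str]) -> Counter:
    """Increment a string's count by 1 each time it matches a reported URL."""
    hits: Counter = Counter()
    for url in reported_urls:
        for s in open_source_strings:
            if s in url:
                hits[s] += 1
    return hits


def valid_strings(hits: Counter, threshold: int) -> list[str]:
    """Return the strings whose hit count exceeds the threshold, most hit first."""
    return [s for s, n in hits.most_common() if n > threshold]


# Hypothetical reports collected from browser clients (no user identifiers).
reported = [
    "http://ads.example.com/banner/1.png",
    "http://ads.example.com/popup.js",
    "http://news.example.org/banner/top.png",
    "http://news.example.org/index.html",
]
open_source = ["ads.example.com", "/banner/", "obsolete.example.net"]

hits = count_hits(reported, open_source)
kept = valid_strings(hits, threshold=0)
```

  • the string `obsolete.example.net` never matches a reported URL, so it is dropped, which mirrors the removal of invalid or rarely accessed rules.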
  • S440 The browser server merges the valid character string and the custom character string of the browser to determine the second data.
  • valid strings are filtered from an open source list (for example, an easylist rule list), so valid strings are open source.
  • custom rules that is, the custom string of the browser.
  • the method may further include: obtaining a custom character string of the browser server.
  • S450 The browser server performs tree transformation processing on the second data to determine the first data.
  • the first data may include a character string for matching target information.
  • the target information refers to advertisement information in this application.
  • the first data is used by a browser client to intercept advertisement information in a visited page according to the first data.
  • the browser server divides the second data into m levels according to n types of preset rules, and each level of the m-level child nodes has different preset rules; each of the n types of preset rules A category that includes at least two character strings, and each layer in the m level is divided into at least two child nodes according to the category of the character strings; the second data includes multiple character strings, and each child node Each of them includes a plurality of character strings belonging to different types of character strings.
  • the n and m are all integers greater than or equal to 1, and the n is greater than or equal to the m.
  • the browser server divides the second data into m levels (m is a positive integer, and n is greater than or equal to m) according to n preset rules (n is a positive integer);
  • each of the m levels includes at least two child nodes,
  • each of the n preset rules includes at least two categories,
  • the at least two child nodes in each level are divided according to the at least two categories (that is, each child node in each level corresponds to one category),
  • and the at least two child nodes in each level each include multiple character strings of one category.
  • the preset rules of each level in the m-level child nodes are different. One option is to arrange the levels in the order of the n preset rules; another is to randomly select any two or three of the n preset rules, but at least two.
  • the browser server divides the second data into 4 levels according to 4 preset rules (the fourth level is not shown), and each of the 4 levels includes at least two child nodes
  • the 4 preset rules can include: black and white list rules, positioning and preset matching rules, label attribute rules, or character rules. It should be noted that the preset rules can also include other possibilities (for example, logos, fixed sentences, etc.); the above rules are used as examples in the present application, which is not limited to these four possibilities.
  • when two or three of the division methods are selected for tree transformation processing, there are fewer types of division, so the division is coarser, but the matching speed is still improved compared to the prior art.
  • Each of the 4 preset rules may include at least two categories, and each child node in each level corresponds to a category, wherein at least two child nodes in each level are divided according to at least two categories .
  • the black-and-white list rule and positioning and character rule division in the division method are selected for tree transformation processing
  • the black-and-white list division and string division are first performed for tree transformation processing
  • when the positioning and preset matching rules, the label attribute rules, and the character rules among the division methods are selected, the data is divided first by the positioning and preset matching rules, then by the label attribute rules, and finally by the character rules during tree transformation processing.
  • when all 4 of the above rules are selected, they should be applied downward in the order given. If the selected rules include only some of the above rules, or none of them, the levels should be ordered according to actual conditions.
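The ordering constraint described above can be sketched in a few lines. This is only an illustration under stated assumptions: the rule identifiers and the function name `order_levels` are hypothetical, and rules outside the four examples are rejected here because the text leaves their order to "actual conditions".

```python
# Canonical top-down order of the four example rules named in the text.
# The identifiers are hypothetical labels, not names from the application.
CANONICAL_ORDER = [
    "black_white_list",
    "positioning_and_preset_matching",
    "label_attribute",
    "character",
]


def order_levels(selected):
    """Return the selected rules in canonical order, one rule per tree level."""
    selected = set(selected)
    unknown = selected - set(CANONICAL_ORDER)
    if unknown:
        # The text only says such rules are ordered "according to actual conditions".
        raise ValueError(f"no canonical position for: {sorted(unknown)}")
    if len(selected) < 2:
        raise ValueError("at least two rules must be selected")
    return [rule for rule in CANONICAL_ORDER if rule in selected]


levels = order_levels({"character", "label_attribute", "positioning_and_preset_matching"})
```

With this selection, level 1 is divided by the positioning and preset matching rules, level 2 by the label attribute rules, and level 3 by the character rules.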
  • the blacklist and whitelist rules include a whitelist category and a blacklist category
  • the first level of the m-level child nodes is divided into two child nodes according to the whitelist category and the blacklist category (1a in FIG. 6 corresponds to the BLACK child node in FIG. 7, and 1b in FIG. 6 corresponds to the WHITE child node in FIG. 7)
  • one of the two child nodes includes the character strings in the second data that belong to the whitelist category (the content in the box below the WHITE child node in FIG. 7)
  • the other child node includes the character strings in the second data that belong to the blacklist category (in FIG. 7)
  • the browser server divides the second data into a first child node and a second child node according to the whitelist category and the blacklist category in the black and white list rule, wherein the first child node (for example, the 1a child node) includes the character strings belonging to the blacklist category, and the second child node (for example, the 1b child node) includes the character strings belonging to the whitelist category.
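The first-level split can be sketched as follows. The sketch assumes, in the style of Adblock Plus filter lists, that a whitelist (exception) rule is marked by a leading `@@`; the function name and the sample rules are hypothetical and are not taken from the application.

```python
def split_first_level(rules):
    """Partition filter strings into the BLACK (1a) and WHITE (1b) child nodes.

    Assumption: a leading "@@" marks a whitelist rule; every other rule is
    treated as a blacklist rule.
    """
    tree = {"BLACK": [], "WHITE": []}
    for rule in rules:
        if rule.startswith("@@"):
            tree["WHITE"].append(rule[2:])  # store the rule without its marker
        else:
            tree["BLACK"].append(rule)
    return tree


# Hypothetical sample rules, for illustration only.
sample = ["ads.example.com/banner", "@@example.com/player.js", "tracker.example.net"]
first_level = split_first_level(sample)
```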
  • the second level is divided into two child nodes (for example, the 2a and 2b child nodes in FIG. 6); one of the two child nodes includes the character strings in the second data that belong to the positioning match category, and the other includes the character strings in the second data that belong to the preset matching category. The two child nodes in the second level are in a parent-child relationship with the first-level node that holds the character strings of the blacklist category.
  • the third child node in the second level (for example, the 2c child node in FIG. 6) is in a parent-child relationship with the first-level node that holds the character strings of the whitelist category.
  • the child nodes in the second level are all nodes having a parent-child relationship with the nodes where the character strings belonging to the category of the blacklist are located.
  • the child nodes with the category of the positioning match include at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position; the category of the preset matching is used to filter at least one of information having a prefix or information having a suffix in the information for accessing the webpage.
  • the child nodes of the category with the positioning match are mainly divided according to the characters of the fixed position.
  • the first preset position has the character *, where * indicates that an arbitrary character string appears at the first preset position; or there is a ^ at the second preset position, where ^ indicates that a separator appears at the second preset position (the separator can be any character except letters, numbers, _, -, ., or %).
  • the filter rules ^example.com^, ^%D1%82%D0%B5%D1%81%D1%82^, or ^foo.bar^ in the list of positioning matching rules can be matched with it.
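One way to evaluate such positioning rules is to translate them into regular expressions. The sketch below assumes the separator placeholder is written `^` (as in Adblock Plus-style filter lists) and that it may also match the end of the address; the sample URL and the helper names are illustrative assumptions, not part of the application.

```python
import re

# A separator is any character except letters, digits, "_", "-", ".", "%".
# Assumption: the end of the address also counts as a separator.
SEPARATOR = r"(?:[^A-Za-z0-9_\-.%]|$)"


def compile_positioning_rule(rule):
    """Translate a rule using '*' (any string) and '^' (separator) to a regex."""
    parts = []
    for ch in rule:
        if ch == "*":
            parts.append(".*")
        elif ch == "^":
            parts.append(SEPARATOR)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches(rule, url):
    return compile_positioning_rule(rule).search(url) is not None


# Hypothetical address used only to exercise the three rules quoted above.
url = "http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82"
results = [matches(r, url) for r in ("^example.com^", "^%D1%82%D0%B5%D1%81%D1%82^", "^foo.bar^")]
```

Compiling each rule once and caching the pattern would be the natural next step if the rules are reused across many addresses.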
  • the category of the preset match is divided according to a common pattern, and may include at least one of a prefix match or a suffix match. Both cases are described below.
  • white.plain and white.glob refer to the prefix matching category in the preset matching category
  • white.plain and black.plain refer to the suffix matching category in the preset matching category.
  • taking a Sina-related branch as an example, the result is the branch shown in FIG. 8. Because the number of rules is limited, only three child nodes (for example, white.plain, black.plain, and black.glob) appear.
  • the boxes below white.plain, black.plain, and black.glob are strings with corresponding categories contained in each child node.
  • the second-level child nodes below the first-level 1a node, which are in a parent-child relationship with it, may include 2a and 2b; likewise, the second-level child nodes below the first-level 1b node, which are in a parent-child relationship with it, may also include 2a and 2b, or 2c (this possibility is not shown in FIG. 6).
  • the category of the preset match can be divided into 2 branches (that is, prefix matching and suffix matching). In a possible embodiment, it can be merged with the above first level into a single layer; that is, the root node (ROOT) can directly have 4 child nodes: white.plain, white.glob, black.plain, and black.glob.
  • the above-mentioned second level may have four child nodes under a first-level node. The four child nodes may include a child node for information in which a character string exists at the first preset position, a child node for information in which a separator exists at the second preset position, a child node for information having a prefix, and a child node for information having a suffix.
  • another group of 4 child nodes is in a parent-child relationship with the first-level 1b node, and may likewise include a child node for information in which a character string exists at the first preset position, a child node for information in which a separator exists at the second preset position, a child node for information having a prefix, and a child node for information having a suffix.
  • the third level among the m-level child nodes is divided, according to the category with the tag and the category without the tag, into two child nodes (for example, 3a and 3b)
  • one of the two child nodes includes the character strings in the second data that belong to the category with the tag
  • the other child node includes the character strings in the second data that belong to the category without the tag
  • any child node in the third level is in a parent-child relationship with a child node in the second level child nodes.
  • the 3a and 3b child nodes have a parent-child relationship with the 2a child node
  • the 3c child node has a parent-child relationship with the 2b child node.
  • the tag attribute rule may include many types.
  • two types are provided: one is the category with a tag (such as Content), and the other is the category without tags.
  • the categories with tags can specifically include at least one of: a category with only a host name; a category with only the host information of the advertisement attribution; a category with a two-level classification of host and domain name; a category with the host and the uniform resource locator (URL) information of the advertisement; or a category in which only the domain name and the URL information of the advertisement differ. The categories with labels (such as those shown in FIG. 9) are therefore introduced first, for example as follows:
  • the left column represents the classification of the character string according to the label category by the category with the label
  • the right column represents the corresponding number (the number is set in the standard) corresponding to the label category after the classification.
  • the above string is further divided according to the label category.
  • the above 4 sub-categories (such as "script”, "image” and “document” in Figure 10) can be used to divide this label category, as shown in Figure 10.
  • the display is only divided for black.plain as an example.
  • there is also a category that does not include a label (for example, the box under the “*” node in FIG. 10 holds the character strings of the category without a label).
  • the categories with tags can be further divided into at least one of: a category with only a host name; a category with only advertising information; a category with a two-level classification of host and domain name; a category with the host and the uniform resource locator (URL) information of the advertisement; or a category in which only the domain name and the URL information of the advertisement differ.
  • the host may include the following four types:
  • the host name is included (that is, the description part of the serial number 2 in FIG. 9 only includes the host information), for example: (the example part in the serial number 2 in FIG. 9)
  • the second type Third classification
  • the host information of the advertising attribution is included (that is, the description part of the serial number 3 in FIG. 9 only includes the information of the third party website accessing the advertising attribution website), for example: (serial number in FIG. 9)
  • the party can be further divided according to the host name string under the classification.
  • character classification is subsequently performed according to the two-level classification of the host and the domain (that is, the description part of the serial number 4 in FIG. 9 includes the domain of the current web page and the information of the host of the advertising web page).
  • the fourth type Domain_Filter classification
  • the url information of the host and the advertisement is included (that is, the description part of the serial number 5 in FIG. 9 includes the information of the domain and the advertisement content), for example, the following five strings:
  • the string containing the hostname of the ad can be:
  • cdndm.com/12/2017/$domain 1kkk.com
  • the string without the file name of the host containing the ad can be:
  • the Domain_Filter can be further divided into 3 sub-nodes (such as shown in Figure 11), that is, the node containing the string of the advertising host name (111 in Figure 11), without the characters of the host name containing the advertising path Nodes (such as 113 in FIG. 11) and nodes (such as 112 in FIG. 11) that do not contain a character string containing the file name of the advertisement.
  • the above classification processing method can also do the same for the domain_filter under the home host attribute classification. For example, for ads containing images, you can use this classification method:
  • Type_filter can include two subclasses of Domain_filter and Third_filter.
  • the second data can be tree-transformed.
  • the transformed tree structure can be shown in FIG. 12, specifically, the black.plain node is
  • the example combines the matching pattern division and rule category division to form the tree structure of FIG. 12.
  • a character string can be divided by domain, by the host of the advertisement, by the path of the advertisement object, and by host name (name). Specifically, for at least one of the direct or third child nodes in FIG. 12, the classification can be divided by host name; for example, it can be further divided by first character into at least one of the categories 0-9, a-z, A-Z, or other (as shown in FIG. 13).
  • in that example the node is divided into three child nodes, and the original 4 character strings are distributed among the 3 child nodes according to their categories. Only this one example (division by host name) is given in the embodiment of the present invention; the rest (domain, host of the advertisement, and path of the advertisement object) can be divided in the same way, which is not described in detail here.
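The first-character division of host names can be illustrated as below. The bucket labels follow the categories named in the text (0-9, a-z, A-Z, other); the function names and the last two sample hosts are hypothetical.

```python
from collections import defaultdict


def first_char_bucket(host):
    """Return the first-character category of a host name string."""
    first = host[:1]
    if "0" <= first <= "9":
        return "0-9"
    if "a" <= first <= "z":
        return "a-z"
    if "A" <= first <= "Z":
        return "A-Z"
    return "other"  # also covers the empty string


def group_by_first_char(hosts):
    """Distribute host name strings over one child node per category."""
    buckets = defaultdict(list)
    for host in hosts:
        buckets[first_char_bucket(host)].append(host)
    return dict(buckets)


# "1kkk.com" and "cdndm.com" appear in the text; the other two are made up.
children = group_by_first_char(["1kkk.com", "cdndm.com", "Ads.example", "_track.example"])
```

At match time only the bucket that shares the first character of the requested host needs to be scanned, which is where the reduction in comparisons comes from.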
  • the browser server can also divide the third level, that is, the 120-129 child nodes in FIG. 12 to the fourth level again.
  • the preset division rule selected can be a character rule. Specifically, when the character rule includes the category of the first character string and the category of the preset character string, the fourth level of the m-level child nodes is divided into two child nodes according to these two categories. One of the two child nodes includes the character strings in the second data that belong to the category of the first character string, and the other includes the character strings in the second data that belong to the category of the preset character string. Any child node in the fourth level is in a parent-child relationship with one of the child nodes in the third level.
  • each child node in the k-th level has a parent-child relationship with one child node in the (k-1)-th level, where the k-th level is any level of the m-level child nodes and k is an integer greater than or equal to 1.
  • each child node may have a child node having a parent-child relationship with the node at the next level.
  • matching at each child node of the above tree structure includes: matching at least one of the URL of the page accessed by the user or the URL of each element of the web page against the character strings included in the child node; and, according to the character strings contained in the URL of the accessed page or the URL of each element of the accessed web page, selecting the child node at the next level to match against.
  • this step includes S1410-S1440, as follows:
  • the method may further include receiving the first data.
  • the first data may be downloaded from a browser server. Generally speaking, the download is performed periodically (for example, it is automatically downloaded when the network is connected at 12:00 every day).
  • the first data is obtained after the server performs tree transformation processing on the second data, and the second data includes valid character strings and custom character strings of a browser.
  • a valid character string is a character string whose usage rate is greater than a preset threshold, determined by filtering the open source character strings obtained through an open source website against the historical data reported by the terminal within a preset time period.
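The selection of valid character strings can be sketched as a simple threshold filter. The text does not define how the usage rate is computed, so this sketch assumes it is the number of reported hits for a string divided by the total number of reported requests in the time window; all names and figures are hypothetical.

```python
def select_valid_strings(open_source_strings, hit_counts, total_requests, threshold):
    """Keep the open source strings whose usage rate exceeds the threshold.

    Assumption: usage rate = reported hits / total reported requests.
    """
    if total_requests <= 0:
        return []
    return [
        s for s in open_source_strings
        if hit_counts.get(s, 0) / total_requests > threshold
    ]


def build_second_data(valid_strings, custom_strings):
    """The second data combines valid strings with the browser's custom strings."""
    return list(valid_strings) + list(custom_strings)


# Hypothetical history: 100 reported requests within the preset time period.
history = {"cdndm.com": 40, "1kkk.com": 1}
valid = select_valid_strings(["cdndm.com", "1kkk.com", "unused.example"], history, 100, 0.05)
second_data = build_second_data(valid, ["custom.example/ad"])
```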
  • S1420 The browser client obtains information about the accessed web page.
  • the visited webpage may also refer to URL information.
  • the information for accessing the webpage may include the URL of the user accessing the page or the URL of accessing each element of the webpage.
  • the information for accessing the webpage may or may not include target information.
  • the target information in the embodiments provided in this application generally refers to advertisement information.
  • the browser client matches the information of the visited web page with first data arranged in a tree structure, where the first data is used to determine whether the information of the visited web page includes target information.
  • the specific matching process can be as follows:
  • the tree structure includes a plurality of nodes, the plurality of nodes including a root node (ROOT) and at least one level of child nodes, each level of the at least one level of child nodes including at least two child nodes; the nodes of each level have a parent-child relationship with the associated nodes of the next level, and the first data is distributed on the plurality of nodes of the tree structure according to a preset rule. The information of the visited web page is matched step by step, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information of the visited web page includes the target information.
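The parent-to-child descent can be sketched with a small recursive search. This is a simplification under stated assumptions: each node carries a hypothetical `accepts` predicate that decides whether its category applies to the address, only leaf nodes hold character strings, and plain substring containment stands in for the real string matching.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional
from urllib.parse import urlsplit


@dataclass
class Node:
    name: str
    accepts: Callable[[str], bool]  # does this node's category apply to the URL?
    strings: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def find_match(node: Node, url: str) -> Optional[List[str]]:
    """Return the node names from this node down to the matching leaf, or None."""
    if not node.accepts(url):
        return None  # skip the whole subtree without comparing its strings
    if not node.children:
        return [node.name] if any(s in url for s in node.strings) else None
    for child in node.children:
        path = find_match(child, url)
        if path is not None:
            return [node.name] + path
    return None


def host_of(url: str) -> str:
    return urlsplit(url).hostname or ""


# A tiny illustrative tree: ROOT -> BLACK -> first-character buckets.
digits = Node("0-9", lambda u: host_of(u)[:1].isdigit(), strings=["1kkk.com"])
letters = Node("a-z", lambda u: host_of(u)[:1].islower(), strings=["cdndm.com"])
black = Node("BLACK", lambda u: True, children=[digits, letters])
root = Node("ROOT", lambda u: True, children=[black])

path = find_match(root, "http://cdndm.com/12/2017/ad.js")
```

Here the strings under the `0-9` node are never compared with the address, because that branch is rejected by its category check before any string matching takes place.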
  • the tree structure may include m-level child nodes, and each level of the m-level child nodes is divided according to a different preset rule among n preset rules, where n and m are integers greater than or equal to 1 and n is greater than or equal to m;
  • the j-th level child nodes are divided by one preset rule selected from f preset rules, the f preset rules being the preset rules that remain among the n preset rules after the first j-1 levels of child nodes have made their selections; the (j-1)-th level child nodes are the level above the j-th level child nodes, the j-th level child nodes are any level of the m-level child nodes, and j and f are integers greater than or equal to 1;
  • each of the n preset rules includes at least two categories of character strings;
  • the first data includes a plurality of character strings, and the character strings of the first data are divided according to the m-level child nodes. Each child node of the m-level child nodes corresponds to a different category of character string in the n preset rules, and each child node includes a plurality of character strings of its category.
  • the n preset rules may include at least one of the following rules: a black and white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
  • the embodiments provided in this application perform division and matching according to the four rules listed above.
  • the first level in the tree structure includes two child nodes, where the first child node of the first-level child nodes contains a plurality of character strings of the whitelist category, and the second child node of the first-level child nodes contains a plurality of character strings of the blacklist category. The two child nodes are divided according to the black and white list rules.
  • during the matching process, if the information for accessing the webpage matches in the first child node, the matching ends directly, and there is no need to match against the large number of strings in the second child node.
  • Sina related website is taken as an example.
  • (black) child nodes and (white) child nodes are determined.
  • (white) child nodes may include @@
  • black) child nodes can include:
  • the terminal determines that the information of the visited webpage does not include the target information and does not intercept it; this indicates that the information is not an advertisement, so the terminal leaves the tree structure without intercepting the information and terminates the matching process.
  • when the information of the visited webpage includes a character string of the blacklist category, the terminal matches the information of the visited webpage step by step against the child nodes that are in a parent-child relationship with the child node holding the character strings of the blacklist category, until it is determined that the information of the visited web page is completely matched, and the terminal intercepts the target information in the information of the visited web page.
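The whitelist-first decision flow can be summarized in a short sketch. It assumes substring containment as a stand-in for the real matching, a hypothetical `descend` callback that represents the step-by-step matching below the blacklist node, and that an incomplete descent leaves the information unblocked (the text does not state that case explicitly).

```python
def decide(url, white_strings, black_strings, descend):
    """Return "allow" or "intercept" for the information of a visited web page."""
    # 1. A whitelist hit ends matching at once; the blacklist is never scanned.
    if any(s in url for s in white_strings):
        return "allow"
    # 2. No blacklist hit either: not target information, leave the tree.
    if not any(s in url for s in black_strings):
        return "allow"
    # 3. Blacklist hit: keep matching down the child levels until complete.
    return "intercept" if descend(url) else "allow"


# Hypothetical strings and a trivial stand-in for the lower levels.
white = ["example.com/player.js"]
black = ["cdndm.com"]
lower_levels = lambda url: url.endswith(".js")

verdict = decide("http://cdndm.com/12/2017/ad.js", white, black, lower_levels)
```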
  • the second level in the tree structure includes two child nodes, where each child node of the second level is in a parent-child relationship with the first-level child node that holds the character strings of the blacklist category.
  • the first child node in level 2 includes the character strings of the positioning match category
  • the second child node in level 2 includes the character strings of the preset match category.
  • the category of the positioning match is used to filter, in the information for accessing the web page, at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position; the category of the preset match is used to filter at least one of information that has a prefix or information that has a suffix in the information for accessing the webpage.
  • a child node with a positioning-matching category is mainly divided according to characters in a fixed position. Specifically, a character * exists at the first preset position, where * indicates that any character string appears at the first preset position; or a ^ exists at the second preset position, where ^ indicates that a separator appears at the second preset position (the separator can be any character except letters, numbers, _, -, ., or %).
  • the filter rules ^example.com^, ^%D1%82%D0%B5%D1%81%D1%82^, or ^foo.bar^ in the list of positioning matching rules can be matched with it.
  • the same applies to prefix information or suffix information: if the corresponding string appears at the preset position, it is a match. For example, for white.plain, black.plain, and black.glob, if the information of the visited web page has the same prefix or suffix as a character string in white.plain, black.plain, or black.glob, it can be matched. After a match, it is necessary to determine whether the information of the visited web page has been completely matched. If not, matching continues at the third level. If the matching is complete, the information is an advertisement, and the terminal intercepts the target information corresponding to the URL, thereby terminating the matching process.
  • the first child node in level 3 includes a character string with a label category, and the second child node includes a character string without a label.
  • any child node of the third-level child node is in a parent-child relationship with one child node of the second-level child node.
  • the category with a tag is used to filter the information of the visited web page including the information of the tag attribute
  • the category without the tag is used to filter the information of the visited web page without the information of the tag attribute.
  • the categories with tags can be further divided into at least one of: a category with only a host name; a category with only the host information of the advertisement attribution; a category with a two-level classification of host and domain name; a category with the host and the uniform resource locator (URL) information of the advertisement; or a category in which only the domain name and the URL information of the advertisement differ.
  • the specific matching process can be based on the following classification methods:
  • the host name is included (that is, the description part of the serial number 2 in FIG. 9 only includes the host information), for example: (the example part in the serial number 2 in FIG. 9)
  • the name string is further matched.
  • the host information of the advertising attribution is included (that is, the description part of the serial number 3 in FIG. 9 only includes the information of the third-party website accessing the advertising attribution website), for example: (serial number in FIG. 9)
  • the party can further match under the classification based on the host name string.
  • subsequent character matching is performed according to the two-level classification of the host and the domain (that is, the description part of the serial number 4 in FIG. 9 includes the domain of the current web page and the information of the host of the advertising web page).
  • the url information of the host and the advertisement is included (that is, the description part of the serial number 5 in FIG. 9 includes the information of the domain and the advertisement content), for example, the following five strings:
  • the string containing the hostname of the ad can be:
  • cdndm.com/12/2017/$domain 1kkk.com
  • the string without the file name of the host containing the ad can be:
  • the above matching method can also do the same for the domain_filter under the home host attribute classification. For example, for ads containing images, you can use this matching method:
  • Type_filter can include two subclasses of Domain_filter and Third_filter.
  • the terminal intercepts the target information corresponding to the URL, thereby terminating the matching process.
  • the first child node in the fourth level includes a character string of the category of the first character string
  • the second child node includes the character string of the category of the preset character string.
  • any child node of the fourth-level child node is in a parent-child relationship with one child node of the third-level child node.
  • the category of the first character string is used to filter the information for accessing the webpage that has the same first character as a character string of that category; the category of the preset character string is used to filter the information for accessing the webpage that contains the same preset character string as a character string of that category.
  • the terminal intercepts the target information corresponding to the URL, thereby terminating the matching process.
  • matching at each child node of the above tree structure includes: matching at least one of the URL of the page accessed by the user or the URL of each element of the web page against the character strings included in the child node; and, according to the character strings contained in the URL of the accessed page or the URL of each element of the accessed web page, selecting the child node at the next level to match against.
  • the browser client can directly display it to the user.
  • the target information in the browser page is intercepted by means of the first data having a tree structure. The tree structure can finely distinguish the character strings in the first data, effectively reducing the number of matches between the information of the visited web page and the first data, and thereby avoiding the increase in the number of matches caused by a large number of strings for intercepting target information and the lack of a rational matching method.
  • the purpose is to remove invalid or rarely accessed strings, and reduce the number of rules in order to effectively reduce the number of subsequent matches.
  • FIG. 15 is a schematic structural diagram of an information interception terminal according to an embodiment of the present invention.
  • the terminal 15 may include: one or more processors 1502, a transceiver 1501, a plurality of application programs (not shown in the figure) in the memory 1503, and one or more computer programs.
  • One or more computer programs are stored in the memory, and the one or more computer programs include instructions that, when executed by the terminal, cause the terminal to perform the following steps:
  • the target information is intercepted.
  • the tree structure may include multiple nodes, the multiple nodes including a root node and at least one level of child nodes, each level of the at least one level of child nodes including at least two child nodes; the nodes of each level have a parent-child relationship with the associated nodes at the next level, and the first data is distributed on the multiple nodes of the tree structure according to a preset rule.
  • the terminal can perform the following steps:
  • the information of accessing the webpage is matched step by step from the first data of the parent node of the tree structure to the first data of the child nodes in a parent-child relationship with the parent node until it is determined whether the information of the accessing the webpage includes the target information.
  • the above tree structure may specifically include m-level child nodes, and each level of the m-level child nodes is divided according to a different preset rule among n preset rules, where n and m are integers greater than or equal to 1, and n is greater than or equal to m; the j-th level child nodes are divided by one preset rule selected from f preset rules, the f preset rules being the preset rules that remain among the n preset rules after the first j-1 levels of child nodes have made their selections.
  • the (j-1)-th level child nodes are the level above the j-th level child nodes, the j-th level child nodes are any level of the m-level child nodes, and j and f are integers greater than or equal to 1.
  • each of the n preset rules includes at least two categories of character strings; the first data includes multiple character strings, the character strings of the first data are divided according to the m-level child nodes, each child node of the m-level child nodes corresponds to a different category of character string in the n preset rules, and each child node includes multiple character strings of its category.
  • the n preset rules may include at least one of the following rules: a black and white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
  • the blacklist and whitelist rules may include the whitelist category and the blacklist category. The first-level child nodes of the m-level child nodes are divided according to the blacklist and whitelist rules, and the character strings in the first data that belong to the whitelist category and those that belong to the blacklist category each correspond to one child node among the first-level child nodes.
  • the terminal may perform the following steps: match the information of the visited web page with the character strings of the whitelist category; when the information of the visited web page includes a character string of the whitelist category, the terminal determines that the information of the visited web page does not include the target information, and the terminal does not intercept the target information.
  • the terminal may also perform the following steps: when the information of the accessed webpage does not include a character string of the whitelist category, match the information of the accessed webpage with the character strings of the blacklist category; when the information of the accessed webpage does not include a character string of the blacklist category, the terminal does not intercept the target information.
  • when the information of the visited webpage includes a character string of the blacklist category, the terminal matches the information of the visited webpage step by step against the child nodes that are in a parent-child relationship with the child node holding the character strings of the blacklist category, until it is determined that the information for accessing the webpage is completely matched, and the terminal intercepts the target information in the information for accessing the webpage.
  • the positioning and preset matching rules may specifically include the category of the positioning match and the category of the preset match. The second-level child nodes of the m-level child nodes are divided according to the positioning and preset matching rules, and the character strings in the first data that belong to the positioning match category and those that belong to the preset match category each correspond to one child node among the second-level child nodes, where each second-level child node is in a parent-child relationship with the first-level child node holding the character strings of the blacklist category.
  • the category of the positioning match can be used to filter, in the information for accessing the web page, at least one of information in which a character string exists at the first preset position or information in which a separator exists at the second preset position; the category of the preset match can be used to filter at least one of information that has a prefix or information that has a suffix in the information for accessing the webpage.
  • the above-mentioned label attribute rules may specifically include a category with a label and a category without a label. The third-level child nodes of the m-level child nodes are divided according to the label attribute rules, and the character strings in the first data that belong to the category with the label and those that belong to the category without the label each correspond to one child node among the third-level child nodes; any third-level child node is in a parent-child relationship with one of the second-level child nodes.
  • the category with tags can be used to filter the information of visited web pages that includes tag attribute information, and the category without tags can be used to filter the information of visited web pages that does not include tag attribute information.
  • the categories with tags include at least one of: a host-only category; a category with only the host information of the advertisement attribution; a category with a two-level classification of host and domain name; a category with the host and the uniform resource locator (URL) information of the advertisement; or a category in which only the domain and the URL information of the advertisement differ.
  • the above character rules may include a first-string category and a preset-string category. The fourth-level child nodes of the m-level child nodes are divided according to the character rule, and the strings in the first data that belong to the first-string category and those that belong to the preset-string category each correspond to one child node of the fourth level; every fourth-level child node is in a parent-child relationship with one child node of the third level.
  • the first-string category can be used to filter information of the accessed web page that has the same first character as a string of the first-string category; the preset-string category is used to filter information of the accessed web page that contains the same preset string as a string of the preset-string category.
  • the information for accessing the webpage may include the URL of the user accessing the page or the URL of each element of the webpage, and the target information is advertisement information.
  • the first data is obtained after the server performs tree transformation processing on the second data, and the second data includes valid strings and custom strings of the browser, where a valid string is a string whose usage rate is greater than a preset threshold, determined by filtering the open source strings against the historical data reported by the terminal within a preset time period.
  • the overall matching process described above is performed in the terminal, so this method greatly improves the speed of information matching by the terminal and avoids the problem that only a high-performance server can complete the processing of the page content quickly.
  • the terminal intercepts the target information in the browser page by means of the first data having a tree structure, and the tree structure can deeply distinguish the character strings in the first data, effectively reducing the number of matches between the information of the accessed web page and the first data, and thereby avoiding the problem that the number of matches increases because there are many strings for intercepting target information and no reasonable matching method.
  • FIG. 16 is a schematic structural diagram of a data processing server according to an embodiment of the present invention.
  • the server 16 may include: one or more processors 1601, a transceiver 1602, a memory 1603, a plurality of application programs, and one or more computer programs, where the one or more computer programs are stored in the memory and include instructions that, when executed by the server, cause the server to perform the following steps:
  • the server sends the first data to the terminal, so that the terminal determines, according to the first data, whether the accessed web page contains the target information.
  • the target information may be advertisement information; the information for accessing the webpage includes at least one of a URL of a user accessing the page or a URL of accessing each element of the webpage.
  • the above server may perform the following specific steps: periodically obtain at least one open source string from an open source website; select, from the at least one open source string and the historical data reported by the client within a preset period of time, a plurality of strings whose visit amount is greater than a first threshold as valid strings; obtain a custom string of the browser server; and determine the second data according to the valid strings and the custom strings, where each of the valid strings and the custom strings includes at least one string.
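The selection of valid strings can be pictured with a minimal Python sketch. The function names, the sample strings, and the idea of counting "visit amount" as the number of times a string appears in the reported history are all assumptions made for illustration; the text only says that strings above a threshold are kept and then combined with the browser's custom strings.

```python
from collections import Counter

def select_valid_strings(open_source_strings, reported_hits, threshold):
    """Keep the open source strings whose reported hit count exceeds the threshold."""
    usage = Counter(reported_hits)  # hypothetical measure of "visit amount"
    return [s for s in open_source_strings if usage[s] > threshold]

def build_second_data(valid_strings, custom_strings):
    """Second data: valid strings plus custom strings, de-duplicated, order kept."""
    return list(dict.fromkeys([*valid_strings, *custom_strings]))

# Hypothetical sample data.
open_source = ["||ads.example.com^", "/banner/", "||rare.example.net^"]
hits = ["/banner/", "||ads.example.com^", "/banner/",
        "||ads.example.com^", "||rare.example.net^"]

valid = select_valid_strings(open_source, hits, threshold=1)
second_data = build_second_data(valid, ["@@||cdn.example.org^"])
print(second_data)
```

Rarely used strings drop out before the tree is built, so the data sent to the terminal stays smaller than the full open source list.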
  • the above server may perform the following specific steps: divide multiple child nodes into m levels according to n preset rules, where the preset rule of each level of the m levels is different; each of the n preset rules includes at least two string categories, and each level of the m levels is divided into at least two child nodes according to the string categories; the second data includes a plurality of strings, and each child node includes a plurality of strings belonging to different string categories.
  • n and m are integers greater than or equal to 1, and n is greater than or equal to m; each child node in the k-th level has a parent-child relationship with one child node in the (k-1)-th level, where the k-th level is any level of the m levels and k is an integer greater than or equal to 1.
  • n preset rules may include at least one of the following rules: black and white list rules, positioning and preset matching rules, tag attribute rules or character rules; the server performs the following steps: according to the black and white list rules, positioning and preset matching rules, Label attribute rules and character rules divide multiple child nodes into m-level child nodes.
  • the above server may perform the following specific steps: When the whitelist and blacklist categories are included in the blacklist and whitelist rules, the first level of the m-level subnodes is divided into two subnodes according to the whitelist categories and blacklist categories. One of the two child nodes includes a character string of a category that belongs to the white list in the second data, and the other child node includes a character string of a category that belongs to the black list in the second data.
  • the above server may perform the following specific steps: when the positioning and preset matching rules include the positioning-match category and the preset-match category, the second level of the m-level child nodes is divided into two child nodes according to the positioning-match category and the preset-match category. One of the two child nodes includes the strings in the second data that belong to the positioning-match category, and the other child node includes the strings in the second data that belong to the preset-match category, where the two child nodes in the second level are in a parent-child relationship with the first-level node that holds the strings of the blacklist category.
  • the above server may perform the following specific steps: when the label attribute rule includes a category with a label and a category without a label, the third level of the m-level child nodes is divided into two child nodes according to the category with a label and the category without a label; one of the two child nodes includes the strings in the second data that belong to the category with a label, and the other child node includes the strings in the second data that belong to the category without a label. Any child node in the third level is in a parent-child relationship with a child node in the second level.
  • the categories with tags may include at least one of: a host-name-only category, an advertisement-information-only category, a two-level host and domain name category, a host and advertisement URL (uniform resource locator) information category, or a category in which only the domain name and the advertisement URL information differ.
  • the above server may perform the following specific steps: when the character rule includes the first-string category and the preset-string category, the fourth level of the m-level child nodes is divided into two child nodes according to the first-string category and the preset-string category; one of the two child nodes includes the strings in the second data that belong to the first-string category, and the other child node includes the strings in the second data that belong to the preset-string category. Any child node in the fourth level is in a parent-child relationship with a child node in the third level.
  • the tree structure can deeply distinguish the character strings in the second data and transform them into a highly differentiated tree structure, which effectively reduces the number of matches between the information of the accessed web page and the first data.
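The server-side tree transformation can be sketched in Python as follows. This is only an illustration under stated assumptions: the classifier borrows EasyList filter conventions (`@@` for exception rules, `|` anchors, `$` options) as stand-ins for the whitelist, positioning-match, and tag categories, which the text does not confirm, and the fourth (character rule) level is left out because its criterion is not clear from the description. All names are hypothetical.

```python
def classify(rule):
    """Return the category path of one rule string (hypothetical classifiers)."""
    if rule.startswith("@@"):
        return ("white",)  # the whitelist branch has no further levels
    level2 = "positioning" if rule.startswith("|") or rule.endswith("|") else "preset"
    level3 = "with_tag" if "$" in rule else "without_tag"
    return ("black", level2, level3)

def build_tree(second_data):
    """Turn the flat second data into nested dicts, one dict level per rule level."""
    tree = {}
    for rule in second_data:
        node = tree
        for category in classify(rule):
            node = node.setdefault(category, {})
        node.setdefault("_rules", []).append(rule)  # strings live on the last node
    return tree

second_data = [
    "@@||cdn.example.org^",
    "||ads.example.com^",
    "/banner/",
    "||track.example.net^$script",
]
tree = build_tree(second_data)
print(tree)
```

Only the blacklist branch is subdivided further, which mirrors the statement that the second-level child nodes hang under the node holding the blacklist-category strings.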
  • FIG. 17 is a schematic structural diagram of an information interception apparatus according to an embodiment of the present invention. As shown in FIG. 17, the device 17 may include:
  • a processing module 1702 configured to start a browser to access a web page
  • the transceiver module 1701 is configured to obtain information about accessing a web page
  • the processing module is further configured to match the information of the accessed web page with the first data arranged in a tree structure, where the first data is used to determine whether the information of the accessed web page includes the target information, and to intercept the target information when the information of the accessed web page includes it.
  • the tree structure may include multiple nodes, the multiple nodes include a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes; the nodes of each level have a parent-child relationship with the associated next-level nodes, and the first data is distributed over the multiple nodes of the tree structure according to a preset rule.
  • the above processing module may be specifically configured to match the information of the accessed web page level by level, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information of the accessed web page includes the target information.
  • the above tree structure may include m levels of child nodes, each level being divided according to a different one of n preset rules, where n and m are integers greater than or equal to 1 and n is greater than or equal to m; the j-th level of child nodes selects one preset rule from f preset rules for division, where the f preset rules are the rules remaining among the n preset rules after the first j-1 levels have made their selections, the (j-1)-th level is the level above the j-th level, the j-th level is any level of the m levels, and j and f are integers greater than or equal to 1;
  • each of the n preset rules includes at least two string categories; the first data includes multiple character strings, which are divided among the m levels of child nodes, and each child node of the m levels corresponds to a different string category of the n preset rules.
  • the n preset rules may include at least one of the following rules: a black and white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
  • the above black and white list rules may include a whitelist category and a blacklist category. The first-level child nodes of the m-level child nodes are divided according to the black and white list rules, and the strings in the first data that belong to the whitelist category and those that belong to the blacklist category each correspond to one child node of the first level.
  • the processing module can be specifically used to match the information of the accessed web page with the strings of the whitelist category; when the information of the accessed web page includes a string of the whitelist category, it is determined that the information does not include the target information, and nothing is intercepted.
  • the processing module may be specifically configured to match the information of the accessed web page with the strings of the blacklist category when the information does not include a string of the whitelist category; when the information does not include a string of the blacklist category, it is determined that the information does not include the target information, and nothing is intercepted; when the information includes a string of the blacklist category, the information is matched level by level against the child nodes that are in a parent-child relationship with the child node holding the blacklist-category strings, until the information is determined to be matched, and the target information in the information of the accessed web page is intercepted.
  • the above positioning and preset matching rules may include the positioning-match category and the preset-match category. The second-level child nodes of the m-level child nodes are divided according to the positioning and preset matching rules, and the strings in the first data that belong to the positioning-match category and those that belong to the preset-match category each correspond to one child node of the second level, where every second-level child node is in a parent-child relationship with the first-level child node that holds the strings of the blacklist category.
  • the positioning-match category can be used to filter, from the information of the accessed web page, information in which a character string exists at a first preset position and/or a separator exists at a second preset position; the preset-match category can be used to filter, from the information of the accessed web page, information that has a prefix and/or a suffix.
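One possible reading of the two second-level categories is sketched below in Python. The function names, the set of separator characters, and the interpretation of "preset position" as a character index are assumptions for illustration only; the description does not define them.

```python
def positioning_match(info, string, position):
    """True when `string` occurs in `info` starting exactly at index `position`."""
    return info.startswith(string, position)

def separator_at(info, position, separators="/?&=:"):
    """True when the character at `position` is one of the assumed separators."""
    return 0 <= position < len(info) and info[position] in separators

def preset_match(info, prefix=None, suffix=None):
    """True when `info` has the given prefix and/or suffix; at least one is required."""
    if prefix is None and suffix is None:
        raise ValueError("give a prefix, a suffix, or both")
    has_prefix = prefix is None or info.startswith(prefix)
    has_suffix = suffix is None or info.endswith(suffix)
    return has_prefix and has_suffix

info = "ads.example.com/banner.gif"  # hypothetical request information
print(positioning_match(info, "example", 4))
print(separator_at(info, 15))
print(preset_match(info, prefix="ads.", suffix=".gif"))
```

Both checks are cheap string operations, which fits the aim of narrowing the candidate strings before any finer-grained comparison.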
  • the above-mentioned label attribute rule may include a category with a label and a category without a label. The third-level child nodes of the m-level child nodes are divided according to the label attribute rule, and the strings in the first data that belong to the category with a label and those that belong to the category without a label each correspond to one child node of the third level; every third-level child node is in a parent-child relationship with one child node of the second level.
  • the category with tags can be used to filter information of the accessed web page that includes tag attributes, and the category without tags can be used to filter information of the accessed web page that does not include tag attributes.
  • the categories with tags include at least one of: a host-only category, an advertisement-information-only category, a two-level host and domain name category, a host and advertisement URL (uniform resource locator) information category, or a category in which only the domain name and the advertisement URL information differ.
  • the above character rules may include the first-string category and the preset-string category. The fourth-level child nodes of the m-level child nodes are divided according to the character rules, and the strings in the first data that belong to the first-string category and those that belong to the preset-string category each correspond to one child node of the fourth level, where every fourth-level child node is in a parent-child relationship with one child node of the third level.
  • the first-string category can be used to filter information of the accessed web page that has the same first character as a string of the first-string category; the preset-string category is used to filter information of the accessed web page that contains the same preset string as a string of the preset-string category.
  • the above-mentioned information for accessing the webpage may include the URL of the page accessed by the user or the URL of each element of the webpage, and the target information is advertisement information.
  • the first data may be obtained after the server performs tree transformation processing on the second data, and the second data includes valid strings and custom strings of the browser, where a valid string is a string whose usage rate is greater than a preset threshold, determined by filtering the open source strings of an open source website against the historical data reported within a preset time period.
  • the device intercepts the target information in the browser page by means of the first data having a tree structure, and the tree structure can deeply distinguish the character strings in the first data, effectively reducing the number of matches between the information of the accessed web page and the first data, and thereby avoiding the problem that the number of matches increases because there are many strings for intercepting target information and no reasonable matching method.
  • the overall matching speed can be increased by more than 40%.
  • FIG. 18 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. As shown in FIG. 18, the device 18 includes:
  • the processing module 1802 performs tree transformation processing on the second data to determine the first data
  • the transceiver module 1801 sends the first data to the terminal, so that the terminal determines, according to the first data, whether the accessed web page contains target information.
  • the target information may be advertisement information; the information for accessing the webpage includes at least one of a URL of a user accessing the page or a URL of accessing each element of the webpage.
  • the foregoing transceiver module may also be used to periodically obtain at least one open source string from an open source website; and obtain a custom string of a browser server.
  • the processing module may also be used to select, from the at least one open source string and the historical data reported by the client within a preset time period, a plurality of strings whose visit amount is greater than the first threshold as valid strings; the processing module may also be used to determine the second data according to the valid strings and the custom strings, where each of the valid strings and the custom strings includes at least one string, and to divide multiple child nodes into m levels according to n preset rules, where the preset rule of each level of the m levels is different.
  • each of the n preset rules includes at least two string categories, and each level of the m levels is divided into at least two child nodes according to the string categories; the second data includes multiple character strings, and each child node includes multiple character strings belonging to different string categories; n and m are integers greater than or equal to 1, and n is greater than or equal to m; each child node in the k-th level has a parent-child relationship with one child node in the (k-1)-th level, where the k-th level is any level of the m levels and k is an integer greater than or equal to 1.
  • the n preset rules may include at least one of the following rules: a black and white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule; the processing module may also be used to, according to the black and white list rule, positioning and preset Matching rules, label attribute rules, and character rules divide multiple child nodes into m-level child nodes.
  • the processing module may be specifically used to divide the first level of the m-level sub-nodes into two sub-nodes according to the types of the white list and the black list when the black and white list rules include the types of the white list and the types of the black list.
  • One of the two child nodes includes a character string belonging to a white list category in the second data, and the other child node includes a character string belonging to a black list category in the second data.
  • the processing module may be specifically configured to: when the positioning and preset matching rules include the positioning matching category and the preset matching category, according to the positioning matching category and the preset matching category, the second level of the m-level subnodes Divided into two sub-nodes, one of the two sub-nodes includes the character string belonging to the category of the positioning match in the second data, and the other sub-node includes the character string belonging to the category of the preset match in the second data, in which the second The two child nodes in the level are in a parent-child relationship with the nodes where the strings belonging to the blacklisted category in the first level are located.
  • the processing module can be specifically used to divide the third level of the m-level child nodes into two child nodes according to the category with a label and the category without a label when the label attribute rule includes these two categories; one of the two child nodes includes the strings in the second data that belong to the category with a label, and the other child node includes the strings in the second data that belong to the category without a label. Any child node in the third level is in a parent-child relationship with a child node in the second level.
  • the categories with labels may include at least one of: a host-name-only category, an advertisement-information-only category, a two-level host and domain name category, a host and advertisement URL (uniform resource locator) information category, or a category in which only the domain name and the advertisement URL information differ.
  • the processing module can be specifically used to: when the character rule includes the category of the first string and the category of the preset string, according to the category of the first string and the category of the preset string, the 4th level of the m-level child node Divided into two child nodes, one of the two child nodes includes a character string belonging to the category of the first character string in the second data, and the other child node includes the character string belonging to the category of the preset character string in the second data, where Any child node in the fourth level is in a parent-child relationship with a child node in the third level.
  • the tree structure can deeply distinguish the character strings in the second data and transform them into a highly differentiated tree structure, which effectively reduces the number of matches between the information of the accessed web page and the first data.
  • The embodiments of the present invention provide an information interception method, device, and terminal. The second data is converted into first data with a tree structure, which is used to intercept the target information in the browser page. The tree structure can deeply distinguish the character strings in the first data, effectively reducing the number of matches between the information of the accessed web page and the first data, and thereby avoiding the problem that the number of matches increases because there are many strings for intercepting target information and no reasonable matching method.
  • the second data is divided by tree analysis of the character strings, using black and white list rules, positioning and preset matching rules, label attribute rules, or character rules, so that the strings are deeply distinguished and converted into a highly differentiated tree structure, which greatly improves the speed at which the browser client intercepts advertisements and effectively improves the user experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Provided by the embodiments of the present invention is an information blocking terminal, wherein the terminal may comprise: a processor, a transceiver, a memory, and multiple application programs which allow the terminal to perform the following steps: starting up a browser so as to access a web page; acquiring information of the accessed web page; matching the information of the accessed web page with first data arranged in a tree structure, wherein the first data is used to determine whether the information of the accessed web page comprises target information; and when the information of the accessed web page comprises the target information, blocking the target information. In the present solution, the terminal blocks target information in a browser page by means of the first data having the tree structure, and the tree structure may deeply distinguish character strings in the first data, thus effectively reducing the instances of matching between the information of an accessed web page and the first data, and thereby avoiding the problem of the instances of matching increasing due to a large number of character strings in target information to be blocked and a lack of suitable matching mode.

Description

Method, device and terminal for information interception
Technical field
Embodiments of the present invention relate to the field of web page analysis and interception technologies, and in particular, to an information interception method, device, and terminal.
Background
With the vigorous development of the Internet, more and more web pages have all kinds of advertisements inserted into them. To spare users the inconvenience these advertisements cause while browsing web pages in a browser, it is necessary to intercept the advertisements in web pages.
At present, a user's web page access requests are generally sent to a server for processing. While caching the page content, the server loads the Easylist rule list, hides the advertisement elements according to that list, and then returns the page content with the advertisement elements hidden to the client for display. The Easylist rule list contains multiple character strings; it is an ad-blocking rule set published by an open source organization that defines which elements in a web page are advertisements and should be intercepted.
Summary of the Invention
The embodiments of the present invention provide a method, a device, and a terminal for information interception. Advertisement interception is implemented on the terminal, and the rule matching method is optimized to solve the problem that the number of matches increases when there are many advertisement interception rules and no reasonable matching method.
In a first aspect, an embodiment of the present invention provides an information interception terminal. The terminal may include one or more processors, a transceiver, a memory, a plurality of application programs, and one or more computer programs, where the one or more computer programs are stored in the memory and include instructions that, when executed by the terminal, cause the terminal to perform the following steps:
starting a browser to access a web page;
obtaining information of the accessed web page;
matching the information of the accessed web page with first data arranged in a tree structure, where the first data is used to determine whether the information of the accessed web page includes target information;
when the information of the accessed web page includes the target information, intercepting the target information.
In this solution, the terminal intercepts the target information in the browser page by means of the first data having a tree structure. The tree structure can deeply distinguish the character strings in the first data, effectively reducing the number of matches between the information of the accessed web page and the first data, and thereby avoiding the problem that the number of matches increases because there are many strings for intercepting target information and no reasonable matching method.
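Why a tree can reduce the number of matches can be illustrated with a small Python sketch that counts string comparisons. It is not the patented structure: the single level of buckets keyed by the first character of the host name, the rule list, and the function names are all assumptions chosen to keep the example short.

```python
from urllib.parse import urlsplit

RULE_HOSTS = ["ads.example.com", "adserver.example.net", "track.example.org"]

def flat_comparisons(host, rules):
    """Linear scan: count string comparisons until a hit or the end of the list."""
    count = 0
    for rule in rules:
        count += 1
        if rule == host:
            return True, count
    return False, count

def bucket_by_first_char(rules):
    """One tree level: group the rules by their first character."""
    buckets = {}
    for rule in rules:
        buckets.setdefault(rule[0], []).append(rule)
    return buckets

def tree_comparisons(host, buckets):
    """Descend to a single bucket, then scan only the rules stored there."""
    return flat_comparisons(host, buckets.get(host[0], []))

host = urlsplit("https://track.example.org/pixel.gif").hostname
flat_hit, flat_n = flat_comparisons(host, RULE_HOSTS)
tree_hit, tree_n = tree_comparisons(host, bucket_by_first_char(RULE_HOSTS))
print(flat_hit, flat_n, tree_hit, tree_n)
```

In this toy case the flat scan needs three comparisons and the bucketed lookup needs one, at the cost of one dictionary lookup to pick the bucket; the saving grows with the number of strings that fall outside the chosen branch.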
In an optional implementation manner, the foregoing "tree structure" may include:
multiple nodes, where the multiple nodes include a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes;
the nodes of each level have a parent-child relationship with the associated next-level nodes, and the first data is distributed over the multiple nodes of the tree structure according to a preset rule.
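The node layout described above might be represented as in the following Python sketch. The class name `Node`, its fields, and the sample strings are hypothetical; the text only requires a root node, levels of at least two child nodes, parent-child links, and strings distributed over the nodes.

```python
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class Node:
    category: str                                        # string category of this node
    strings: list[str] = field(default_factory=list)     # first-data strings held here
    children: list[Node] = field(default_factory=list)   # next-level nodes

    def add_child(self, child: Node) -> Node:
        """Attach `child` one level below this node and return it."""
        self.children.append(child)
        return child

def depth(node: Node) -> int:
    """Number of levels from `node` down to its deepest descendant."""
    return 1 + max((depth(c) for c in node.children), default=0)

root = Node("root")
white = root.add_child(Node("whitelist", ["cdn.example.org"]))
black = root.add_child(Node("blacklist", ["ads.example.com"]))
print(depth(root), [c.category for c in root.children])
```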
In another optional implementation manner, the terminal may specifically perform the following steps:
matching the information of the accessed web page level by level, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information of the accessed web page includes the target information.
Because the information of accessed web pages differs in length, longer information cannot be matched directly. The information is therefore matched level by level, which ensures that it can be matched completely and improves the accuracy of intercepting the target information.
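Level-by-level matching from a parent node to its child nodes can be sketched in Python as a depth-first descent. The dictionary layout, the node names, and the rule that a node "matches" when one of its strings occurs in the information are assumptions made for illustration.

```python
def match_down(info, node):
    """Return the path of matching node names down to a leaf, or None.

    A node matches when one of its strings occurs in `info`; reaching a
    matching leaf is taken to mean that `info` contains target information.
    """
    if not any(s in info for s in node["strings"]):
        return None
    if not node["children"]:
        return [node["name"]]
    for child in node["children"]:
        path = match_down(info, child)
        if path is not None:
            return [node["name"], *path]
    return None

# Hypothetical blacklist branch: each level refines the previous one.
tree = {"name": "blacklist", "strings": ["example.com"], "children": [
    {"name": "host", "strings": ["ads.example.com"], "children": [
        {"name": "path", "strings": ["/banner/"], "children": []},
    ]},
    {"name": "tracker", "strings": ["track.example.com"], "children": []},
]}

print(match_down("https://ads.example.com/banner/1.png", tree))
print(match_down("https://ads.example.com/article", tree))
```

A long URL is only compared with the strings along one path, and a partial hit at an upper level does not count as a match until a leaf is reached.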
In yet another optional implementation manner, the "tree structure" may specifically include m levels of child nodes, each level being divided according to a different one of n preset rules, where n and m are integers greater than or equal to 1 and n is greater than or equal to m;
the j-th level of child nodes selects one preset rule from f preset rules for division, where the f preset rules are the rules remaining among the n preset rules after the first j-1 levels have made their selections, the (j-1)-th level is the level above the j-th level, the j-th level is any level of the m levels, and j and f are integers greater than or equal to 1;
each of the n preset rules includes at least two string categories;
the first data includes multiple character strings, which are divided among the m levels of child nodes; each child node of the m levels corresponds to a different string category of the n preset rules, and each child node includes multiple character strings having different string categories.
Because each terminal or operator defines target information differently, this application provides a variety of preset rules and categories from which preset rules can be selected as needed. This makes the tree structure more flexible and applicable to more scenarios.
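The way each level picks one rule from those still unused can be sketched in Python. The rule identifiers and the function name are hypothetical, and the relation f = n - (j - 1) is a reading of the text (the rules left after the first j-1 levels have chosen), not a formula stated in it.

```python
RULES = ["black_white_list", "positioning_preset_match", "tag_attribute", "character"]

def assign_rules(chosen_order, rules=RULES):
    """Give each level one distinct rule; return (level j, rule, f) triples.

    f is the number of rules still available when level j makes its choice.
    """
    remaining = list(rules)
    plan = []
    for level, rule in enumerate(chosen_order, start=1):
        if rule not in remaining:
            raise ValueError(f"rule {rule!r} is unknown or already used")
        plan.append((level, rule, len(remaining)))
        remaining.remove(rule)
    return plan

# n = 4 rules, m = 3 levels: an operator that skips the positioning rule.
plan = assign_rules(["black_white_list", "tag_attribute", "character"])
print(plan)
```

Choosing fewer levels than rules (m less than n) or a different order gives a differently shaped tree from the same rule set, which is the flexibility the paragraph above refers to.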
In yet another optional implementation, the "n preset rules" may include at least one of the following rules:
a black and white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
In yet another optional implementation, the "black and white list rule" may include:
a whitelist category and a blacklist category. The first level of the m levels of child nodes is divided according to the black and white list rule, and the character strings in the first data that belong to the whitelist category and those that belong to the blacklist category each correspond to one child node in the first level.
In yet another optional implementation, the terminal may perform the following steps:
The webpage access information is matched against the character strings of the whitelist category. When the webpage access information includes a character string of the whitelist category, the terminal determines that the webpage access information does not include target information.
Some webpage access information may contain "ad" and yet not be target information for some operators. Character strings of the whitelist category are therefore set to exclude information that contains "ad" but is not target information (that is, not an advertisement), which improves the precision of interception.
In yet another optional implementation, the terminal may further perform the following steps:
When the webpage access information does not include a character string of the whitelist category, the webpage access information is matched against the character strings of the blacklist category;
When the webpage access information does not include a character string of the blacklist category, the terminal determines that the webpage access information does not include target information;
When the webpage access information includes a character string of the blacklist category, the terminal matches the webpage access information level by level against the child nodes that are in a parent-child relationship with the child node holding the character strings of the blacklist category, until the webpage access information has been completely matched, and the terminal then intercepts the target information in the webpage access information.
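As a non-limiting illustration, the first-level decision described above can be sketched as follows. The function name `first_level_match` and the use of substring containment for "includes a character string" are assumptions made for the example; the example strings are invented.

```python
from typing import Iterable


def first_level_match(info: str, whitelist: Iterable[str], blacklist: Iterable[str]) -> str:
    """Return "not_target" or "descend" for one piece of webpage access information."""
    # A whitelist hit ends matching at once: the information is not target information.
    if any(s in info for s in whitelist):
        return "not_target"
    # No blacklist hit either: nothing to intercept.
    if not any(s in info for s in blacklist):
        return "not_target"
    # Blacklist hit: continue level by level under the blacklist node.
    return "descend"
```

For example, with the whitelist string "upload" and the blacklist string "ad", the information "https://example.com/upload/img.png" contains "ad" but is excluded by the whitelist, whereas "https://example.com/ad/banner.js" proceeds to the lower levels.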
In yet another optional implementation, the "positioning and preset matching rule" may specifically include:
a positioning match category and a preset match category. The second level of the m levels of child nodes is divided according to the positioning and preset matching rule, and the character strings in the first data that belong to the positioning match category and those that belong to the preset match category each correspond to one child node in the second level, where any child node in the second level is in a parent-child relationship with the first-level child node that holds the character strings of the blacklist category.
In yet another optional implementation, the "positioning match category" may be used to filter, from the webpage access information, at least one of information in which a character string is present at a first preset position or information in which a separator is present at a second preset position;
The preset match category is used to filter, from the webpage access information, at least one of information that has a prefix or information that has a suffix.
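As a non-limiting illustration, the two second-level categories can be sketched as simple predicates. The function names and the interpretation of a "preset position" as a zero-based character index are assumptions made for the example.

```python
from typing import Optional


def positioning_match(info: str, pattern: str, position: int) -> bool:
    """True if `pattern` (a string or a separator) starts exactly at index `position`."""
    return info.startswith(pattern, position)


def preset_match(info: str, prefix: Optional[str] = None, suffix: Optional[str] = None) -> bool:
    """True if `info` begins with the given prefix or ends with the given suffix."""
    has_prefix = prefix is not None and info.startswith(prefix)
    has_suffix = suffix is not None and info.endswith(suffix)
    return has_prefix or has_suffix
```

A separator check is the same positioning test with the separator passed as `pattern`, for example `positioning_match(info, "/", 15)`.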
In yet another optional implementation, the "tag attribute rule" may specifically include:
a tagged category and an untagged category. The third level of the m levels of child nodes is divided according to the tag attribute rule, and the character strings in the first data that belong to the tagged category and those that belong to the untagged category each correspond to one child node in the third level, where any child node in the third level is in a parent-child relationship with one child node in the second level.
In yet another optional implementation, the "tagged category" may be used to filter, from the webpage access information, information that includes a tag attribute, and the untagged category is used to filter information that does not include a tag attribute; where
the tagged category specifically includes at least one of the following: a category with only a host name, a category with only host information having an advertisement attribute, a category with two-level classification by host and domain name, a category with host and advertisement uniform resource locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
Because webpage access information is highly varied, this approach covers more possibilities and intercepts target information more precisely.
In yet another optional implementation, the "character rule" may include:
a first character string category and a preset character string category. The fourth level of the m levels of child nodes is divided according to the character rule, and the character strings in the first data that belong to the first character string category and those that belong to the preset character string category each correspond to one child node in the fourth level, where any child node in the fourth level is in a parent-child relationship with one child node in the third level.
In yet another optional implementation, the "first character string category" may be used to filter, from the webpage access information, information that has the same first character as a character string of the first character string category;
The preset character string category is used to filter, from the webpage access information, information that contains the same preset character string as a character string of the preset character string category.
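As a non-limiting illustration, the two fourth-level categories can be sketched as follows. The helper names are hypothetical; indexing by first character is one possible way to restrict comparison to strings that share the first character of the webpage access information.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_by_first_char(strings: Iterable[str]) -> Dict[str, List[str]]:
    """Group the strings of the first character string category by their first character."""
    index: Dict[str, List[str]] = defaultdict(list)
    for s in strings:
        if s:  # skip empty strings, which have no first character
            index[s[0]].append(s)
    return dict(index)


def first_char_candidates(info: str, index: Dict[str, List[str]]) -> List[str]:
    """Return only the strings whose first character equals that of `info`."""
    return index.get(info[:1], [])


def contains_preset(info: str, presets: Iterable[str]) -> bool:
    """True if `info` contains any string of the preset character string category."""
    return any(p in info for p in presets)
```

Only the candidates returned by `first_char_candidates` need to be compared against the webpage access information, which reduces the number of comparisons at this level.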
In yet another optional implementation, the "webpage access information" may include the URL of a page visited by the user or the URLs of individual elements of the visited webpage, and the target information is advertisement information.
In yet another optional implementation, the "first data" is obtained by the server by performing tree transformation processing on second data. The second data includes valid character strings and custom character strings of the browser, where the valid character strings are character strings whose usage rate is greater than a preset threshold, determined by filtering open-source character strings from open-source websites and historical data reported by the terminal within a preset time period.
Because the terminal downloads the first data from the server and the entire matching process described above is performed on the terminal, this approach greatly improves the speed at which the terminal matches information and avoids the prior-art problem that the server needs high performance in order to process page content quickly.
In a second aspect, an embodiment of the present invention provides a data processing server, including: one or more processors, a transceiver, a memory, a plurality of application programs, and one or more computer programs, where the one or more computer programs are stored in the memory and include instructions that, when executed by the server, cause the server to perform the following steps:
performing tree transformation processing on second data to determine first data;
sending, by the server, the first data to a terminal, so that the terminal determines, according to the first data, whether the visited webpage contains target information.
In this solution, tree transformation processing is performed on the second data. The resulting tree structure distinguishes the character strings in the second data in depth, with a very high degree of discrimination, which effectively reduces the number of times the webpage access information is matched against the first data.
In an optional implementation, the "target information" may be advertisement information;
The webpage access information includes at least one of the URL of a page visited by the user or the URLs of individual elements of the visited webpage.
In another optional implementation, the server may specifically perform the following steps: periodically obtaining at least one open-source character string from an open-source website;
selecting, from the at least one open-source character string and historical data reported by clients within a preset time period, a plurality of character strings whose visit count is greater than a first threshold as valid character strings;
obtaining custom character strings of a browser server;
determining the second data according to the valid character strings and the custom character strings, where the valid character strings and the custom character strings each include at least one character string.
Because browser servers generally apply different standards (for example, certain information may be defined as advertisement information on website A but not on website B), the custom character strings of the browser server are added when the second data is generated, so that the second data used for matching is flexible and widely applicable.
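As a non-limiting illustration, the assembly of the second data can be sketched as follows. The function name `build_second_data`, the representation of the reported historical data as a flat list of matched strings, and the example values are assumptions; the embodiment does not prescribe a particular data format.

```python
from collections import Counter
from typing import Iterable, List


def build_second_data(
    open_source: Iterable[str],
    reported_history: Iterable[str],
    custom: Iterable[str],
    first_threshold: int,
) -> List[str]:
    """Combine valid strings (visit count above the threshold) with custom strings."""
    visits = Counter(reported_history)  # visit count per string in the preset period
    candidates = set(open_source) | set(visits)
    # A string is valid only if its visit count is strictly greater than the threshold.
    valid = {s for s in candidates if visits[s] > first_threshold}
    return sorted(valid | set(custom))
```

Custom strings are added regardless of their visit count, which reflects that they express the browser server's own definition of target information.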
In yet another optional implementation, the server may specifically perform the following steps:
dividing a plurality of child nodes into m levels according to n preset rules, where each of the m levels uses a different preset rule;
Each of the n preset rules includes at least two character string categories, and each of the m levels is divided into at least two child nodes according to the character string categories;
The second data includes a plurality of character strings, each child node includes a plurality of character strings belonging to different character string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
Each child node in the k-th level has a parent-child relationship with one child node in the (k-1)-th level, where the k-th level is any one of the m levels and k is an integer greater than or equal to 1.
In yet another optional implementation, the "n preset rules" may include at least one of the following rules:
a black and white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule;
The server performs the following step:
dividing the plurality of child nodes into m levels of child nodes according to the black and white list rule, the positioning and preset matching rule, the tag attribute rule, and the character rule.
In yet another optional implementation, the server may specifically perform the following step:
When the black and white list rule includes a whitelist category and a blacklist category, the first level of the m levels of child nodes is divided into two child nodes according to the whitelist category and the blacklist category. One of the two child nodes includes the character strings in the second data that belong to the whitelist category, and the other includes the character strings in the second data that belong to the blacklist category.
In yet another optional implementation, the server may specifically perform the following step:
When the positioning and preset matching rule includes a positioning match category and a preset match category, the second level of the m levels of child nodes is divided into two child nodes according to the positioning match category and the preset match category. One of the two child nodes includes the character strings in the second data that belong to the positioning match category, and the other includes the character strings in the second data that belong to the preset match category, where the two child nodes in the second level are in a parent-child relationship with the first-level node that holds the character strings of the blacklist category.
In yet another optional implementation, the server may specifically perform the following step: when the tag attribute rule includes a tagged category and an untagged category, the third level of the m levels of child nodes is divided into two child nodes according to the tagged category and the untagged category. One of the two child nodes includes the character strings in the second data that belong to the tagged category, and the other includes the character strings in the second data that belong to the untagged category, where any child node in the third level is in a parent-child relationship with one child node in the second level.
In yet another optional implementation, the "tagged category" may include at least one of the following: a category with only a host name, a category with only host information having an advertisement attribute, a category with two-level classification by host and domain name, a category with host and advertisement uniform resource locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
In yet another optional implementation, the server may specifically perform the following step:
When the character rule includes a first character string category and a preset character string category, the fourth level of the m levels of child nodes is divided into two child nodes according to the first character string category and the preset character string category. One of the two child nodes includes the character strings in the second data that belong to the first character string category, and the other includes the character strings in the second data that belong to the preset character string category, where any child node in the fourth level is in a parent-child relationship with one child node in the third level.
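As a non-limiting illustration, the tree transformation can be sketched as follows for the four-level case. The `Node` class, the function `build_tree`, and the category labels are hypothetical. The sketch assumes that each character string of the second data has already been labelled with its category at each level; how those labels are derived is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    strings: List[str] = field(default_factory=list)         # strings held by this node
    children: Dict[str, "Node"] = field(default_factory=dict)  # category name -> child node


def build_tree(labelled: List[Tuple[str, List[str]]]) -> Node:
    """Place each string at the node reached by following its category path.

    Each entry is (string, [category at level 1, category at level 2, ...]).
    Whitelist strings stop at level 1; lower levels hang under the blacklist node.
    """
    root = Node()
    for string, path in labelled:
        node = root
        for category in path:
            # Create the child for this category on first use, then descend into it.
            node = node.children.setdefault(category, Node())
        node.strings.append(string)
    return root
```

For example, a string labelled `["blacklist", "positioning_match", "without_tag", "preset_string"]` ends up in a fourth-level node whose ancestors are the corresponding third-, second-, and first-level nodes.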
In a third aspect, an embodiment of the present invention provides an information interception method. The method may be performed by a terminal and may include the following steps:
starting a browser to access a webpage;
obtaining webpage access information;
matching the webpage access information against first data arranged in a tree structure, where the first data is used to determine whether the webpage access information includes target information;
intercepting the target information when the webpage access information includes the target information.
In this solution, the method intercepts target information in a browser page by using first data that has a tree structure. The tree structure distinguishes the character strings in the first data in depth and effectively reduces the number of times the webpage access information is matched against the first data. This avoids the problem that the number of matches grows when there are many character strings for intercepting target information and no rational matching scheme, and can increase the overall matching speed by more than 40%.
In an optional implementation, the "tree structure" may include a plurality of nodes. The plurality of nodes include a root node and at least one level of child nodes, and each of the at least one level includes at least two child nodes;
Each node has a parent-child relationship with its associated nodes in the next level, and the first data is distributed over the plurality of nodes of the tree structure according to preset rules.
In another optional implementation, the step of "matching the webpage access information against first data arranged in a tree structure" may specifically include:
matching the webpage access information level by level, from the first data of a parent node of the tree structure to the first data of the child nodes that are in a parent-child relationship with that parent node, until it is determined whether the webpage access information includes target information.
Because webpage access information varies in length, a longer piece of such information cannot be matched directly. The webpage access information is therefore matched level by level, which ensures that it can be matched completely and improves the accuracy of intercepting the target information.
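As a non-limiting illustration, level-by-level matching from a parent node to its child nodes can be sketched as follows. The `Node` class and the function `matches_level_by_level` are hypothetical. The sketch assumes that a node matches when the webpage access information contains one of the node's strings, and that target information is found when a matching path reaches a node without children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    strings: List[str]
    children: List["Node"] = field(default_factory=list)


def matches_level_by_level(info: str, node: Node) -> bool:
    """Descend from `node`, visiting a child only if its parent already matched."""
    if not any(s in info for s in node.strings):
        return False  # this branch is pruned; none of its descendants are compared
    if not node.children:
        return True   # matched all the way down to the last level
    return any(matches_level_by_level(info, child) for child in node.children)
```

Because a branch is abandoned as soon as one level fails to match, the strings stored under that branch are never compared, which is how the tree arrangement reduces the number of matches.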
In yet another optional implementation, the "tree structure" may include:
m levels of child nodes, where each of the m levels is divided according to a different one of n preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
The j-th level of child nodes is divided according to one preset rule selected from f preset rules, where the f preset rules are the rules remaining among the n preset rules after the first j-1 levels have made their selections. The (j-1)-th level is the level immediately above the j-th level, the j-th level is any one of the m levels, and j and f are integers greater than or equal to 1;
Each of the n preset rules includes at least two character string categories;
The first data includes a plurality of character strings, which are divided among the m levels of child nodes. Each child node in the m levels corresponds to a different character string category in the n preset rules, and each child node includes a plurality of character strings belonging to different character string categories.
Because each terminal or operator defines target information differently, this application provides multiple preset rules and categories, and the preset rules can be selected as required. This improves the flexibility of the tree structure and makes it applicable to more scenarios.
In yet another optional implementation, the "n preset rules" may include at least one of the following rules:
a black and white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
In yet another optional implementation, the "black and white list rule" may include a whitelist category and a blacklist category. The first level of the m levels of child nodes is divided according to the black and white list rule, and the character strings in the first data that belong to the whitelist category and those that belong to the blacklist category each correspond to one child node in the first level.
In yet another optional implementation, the step of "matching the webpage access information against first data arranged in a tree structure" may specifically include:
matching the webpage access information against the character strings of the whitelist category, and when the webpage access information includes a character string of the whitelist category, determining that the webpage access information does not include target information.
Some webpage access information may contain "ad" and yet not be target information for some operators. Character strings of the whitelist category are therefore set to exclude information that contains "ad" but is not target information (that is, not an advertisement), which improves the precision of interception.
In yet another optional implementation, the step of "matching the webpage access information against first data arranged in a tree structure" may specifically include: when the webpage access information does not include a character string of the whitelist category, matching the webpage access information against the character strings of the blacklist category;
when the webpage access information does not include a character string of the blacklist category, determining that the webpage access information does not include target information;
when the webpage access information includes a character string of the blacklist category, matching the webpage access information level by level against the child nodes that are in a parent-child relationship with the child node holding the character strings of the blacklist category, until the webpage access information has been completely matched, and then intercepting the target information in the webpage access information.
In yet another optional implementation, the "positioning and preset matching rule" may specifically include a positioning match category and a preset match category. The second level of the m levels of child nodes is divided according to the positioning and preset matching rule, and the character strings in the first data that belong to the positioning match category and those that belong to the preset match category each correspond to one child node in the second level, where any child node in the second level is in a parent-child relationship with the first-level child node that holds the character strings of the blacklist category.
In yet another optional implementation, the "positioning match category" may be used to filter, from the webpage access information, at least one of information in which a character string is present at a first preset position or information in which a separator is present at a second preset position;
The preset match category is used to filter, from the webpage access information, at least one of information that has a prefix or information that has a suffix.
In yet another optional implementation, the "tag attribute rule" may include a tagged category and an untagged category. The third level of the m levels of child nodes is divided according to the tag attribute rule, and the character strings in the first data that belong to the tagged category and those that belong to the untagged category each correspond to one child node in the third level, where any child node in the third level is in a parent-child relationship with one child node in the second level.
In yet another optional implementation, the "tagged category" may be used to filter, from the webpage access information, information that includes a tag attribute, and the untagged category is used to filter information that does not include a tag attribute;
The tagged category specifically includes at least one of the following: a category with only a host name, a category with only host information having an advertisement attribute, a category with two-level classification by host and domain name, a category with host and advertisement uniform resource locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
Because webpage access information is highly varied, this approach covers more possibilities and intercepts target information more precisely.
In yet another optional implementation, the "character rule" may include a first character string category and a preset character string category. The fourth level of the m levels of child nodes is divided according to the character rule, and the character strings in the first data that belong to the first character string category and those that belong to the preset character string category each correspond to one child node in the fourth level, where any child node in the fourth level is in a parent-child relationship with one child node in the third level.
In yet another optional implementation, the "first character string category" may be used to filter, from the webpage access information, information that has the same first character as a character string of the first character string category;
The preset character string category is used to filter, from the webpage access information, information that contains the same preset character string as a character string of the preset character string category.
In yet another optional implementation, the "webpage access information" may include the URL of a page visited by the user or the URLs of individual elements of the visited webpage, and the target information is advertisement information.
In yet another optional implementation, the "first data" is obtained by the server by performing tree transformation processing on second data. The second data includes valid character strings and custom character strings of the browser, where the valid character strings are character strings whose usage rate is greater than a preset threshold, determined by filtering open-source character strings from open-source websites and reported historical data within a preset time period.
In a fourth aspect, an embodiment of the present invention provides a data processing method. The method may be performed by a server and may specifically include the following steps:
performing tree transformation processing on second data to determine first data;
sending the first data to a terminal, so that the terminal determines, according to the first data, whether the visited webpage contains target information.
本方案中,通过对第二数据进行树形转化处理,该树形结构可以对第二数据中的字符串进行深度区分,转化为区分度非常高的树形结构,有效减少访问网页的信息与第一数据的匹配次数。In this solution, by performing a tree transformation process on the second data, the tree structure can deeply distinguish the character strings in the second data and transform it into a highly distinguished tree structure, which effectively reduces the information and Number of matches for the first data.
在一个可选的实现方式中,上述“目标信息”可以为广告信息;In an optional implementation manner, the above “target information” may be advertisement information;
访问网页的信息包括:用户访问页面的URL或者访问网页各个元素的URL中的至少一种。The information for accessing the webpage includes at least one of a URL of the user accessing the page or a URL of accessing each element of the webpage.
In another optional implementation, before the step of "performing tree transformation processing on the second data to determine the first data", the method may further include: periodically obtaining at least one open-source character string from an open-source website;
selecting, from the at least one open-source character string and the historical data reported by the terminal within a preset time period, a plurality of character strings whose visit count is greater than a first threshold as valid character strings;
obtaining custom character strings of the browser server;
determining the second data according to the valid character strings and the custom character strings, where the valid character strings and the custom character strings each include at least one character string.
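The steps above can be sketched as follows. This is a minimal illustration, not the implementation described in the embodiment: the function name build_second_data, its parameters, and the way the visit count is obtained (counting occurrences in the reported history) are all assumptions made for the example.

```python
from collections import Counter


def build_second_data(open_source_strings, reported_history, custom_strings, first_threshold):
    """Select valid strings, then merge them with the browser server's custom strings."""
    # Assumption: the visit count of a string is the number of times it
    # appears in the history reported by terminals within the time period.
    visits = Counter(reported_history)
    candidates = set(open_source_strings) | set(visits)
    # A string is "valid" when its visit count is greater than the first threshold.
    valid = sorted(s for s in candidates if visits[s] > first_threshold)
    # Second data = valid strings plus custom strings, without duplicates.
    return valid + [s for s in custom_strings if s not in valid]
```

For example, with a threshold of 2, an open-source string that was reported three times is kept, one that was never reported is dropped, and every custom string is appended regardless of its visit count.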
Because browser servers generally have different standards, that is, the same information may be defined as advertisement information on website A but not on website B, the custom character strings of the browser server are added when the second data is generated, so that the second data used for matching is flexible and can be widely used.
In yet another optional implementation, the step of "performing tree transformation processing on the second data to determine the first data" may specifically include: dividing a plurality of child nodes into m levels according to n preset rules, where each of the m levels of child nodes uses a different preset rule;
each of the n preset rules includes at least two character string categories, and each of the m levels is divided into at least two child nodes according to the character string categories;
the second data includes a plurality of character strings, each child node includes a plurality of character strings belonging to different character string categories, n and m are both integers greater than or equal to 1, and n is greater than or equal to m;
each child node in the k-th level has a parent-child relationship with one child node in the (k-1)-th level, where the k-th level is any one of the m levels of child nodes, and k is an integer greater than or equal to 1.
In yet another optional implementation, the "n preset rules" may include at least one of the following rules:
a blacklist and whitelist rule, a positioning and preset matching rule, a tag attribute rule, or a character rule;
the server performs the following step:
dividing the plurality of child nodes into m levels of child nodes according to the blacklist and whitelist rule, the positioning and preset matching rule, the tag attribute rule, and the character rule.
In yet another optional implementation, the step of "performing tree transformation processing on the second data to determine the first data" may specifically include:
when the blacklist and whitelist rule includes a whitelist category and a blacklist category, dividing the first level of the m levels of child nodes into two child nodes according to the whitelist category and the blacklist category, where one of the two child nodes includes the character strings in the second data that belong to the whitelist category, and the other child node includes the character strings in the second data that belong to the blacklist category.
In yet another optional implementation, the step of "performing tree transformation processing on the second data to determine the first data" may specifically include:
when the positioning and preset matching rule includes a positioning matching category and a preset matching category, dividing the second level of the m levels of child nodes into two child nodes according to the positioning matching category and the preset matching category, where one of the two child nodes includes the character strings in the second data that belong to the positioning matching category, the other child node includes the character strings in the second data that belong to the preset matching category, and the two child nodes in the second level have a parent-child relationship with the first-level node that holds the character strings of the blacklist category.
In yet another optional implementation, the step of "performing tree transformation processing on the second data to determine the first data" may specifically include:
when the tag attribute rule includes a category with a tag and a category without a tag, dividing the third level of the m levels of child nodes into two child nodes according to the category with a tag and the category without a tag, where one of the two child nodes includes the character strings in the second data that belong to the category with a tag, the other child node includes the character strings in the second data that belong to the category without a tag, and any child node in the third level has a parent-child relationship with one child node in the second level.
In yet another optional implementation, the "category with a tag" may specifically include at least one of: a category with only a host name, a category with only host information that has an advertisement attribute, a category with two-level classification by host and domain name, a category with host and advertisement uniform resource locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
In yet another optional implementation, the step of "performing tree transformation processing on the second data to determine the first data" may specifically include:
when the character rule includes a category of the first character string and a category of the preset character string, dividing the fourth level of the m levels of child nodes into two child nodes according to the category of the first character string and the category of the preset character string, where one of the two child nodes includes the character strings in the second data that belong to the category of the first character string, the other child node includes the character strings in the second data that belong to the category of the preset character string, and any child node in the fourth level has a parent-child relationship with one child node in the third level.
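The four-level division described in the fourth aspect can be pictured with the sketch below. It is an illustration under stated assumptions: the Node class, the build_tree function, the rule names ("list", "match", "tag", "char") and the category labels are hypothetical, and each character string is assumed to arrive already labeled with its category for every rule that applies to it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    category: str
    strings: list = field(default_factory=list)
    children: dict = field(default_factory=dict)


def build_tree(entries, rules):
    """entries: (string, {rule name: category}) pairs; rules: one rule name per level."""
    root = Node("root")
    for text, categories in entries:
        node = root
        for rule in rules:
            category = categories.get(rule)
            if category is None:
                # The string is not classified any deeper (for example, a
                # whitelist string stops at level 1).
                break
            # Each category of a rule becomes one child node of the current node.
            node = node.children.setdefault(category, Node(category))
        node.strings.append(text)
    return root
```

With rules ordered as ["list", "match", "tag", "char"], a whitelist string is stored directly under the level 1 whitelist node, while a blacklist string is pushed down through levels 2 to 4 under the blacklist node.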
In a fifth aspect, an embodiment of the present invention provides an apparatus, which may include:
a processing module, configured to start a browser to visit a web page;
a transceiver module, configured to obtain information of the visited web page;
the processing module is further configured to match the information of the visited web page against first data arranged in a tree structure, where the first data is used to determine whether the information of the visited web page includes target information, and to intercept the target information when the information of the visited web page includes the target information.
In this solution, the apparatus intercepts target information in a browser page by using first data that has a tree structure. The tree structure distinguishes the character strings in the first data in depth, which effectively reduces the number of times the information of the visited web page is matched against the first data. This avoids the increase in the number of matches that occurs when there are many character strings for intercepting target information and no rationalized matching method, and the overall matching speed can be increased by more than 40%.
In an optional implementation, the "tree structure" may include a plurality of nodes, the plurality of nodes include a root node and at least one level of child nodes, and each of the at least one level of child nodes includes at least two child nodes;
the nodes of each level have a parent-child relationship with the associated nodes of the next level, and the first data is distributed over the plurality of nodes of the tree structure according to preset rules.
In another optional implementation, the "processing module" may be specifically configured to match the information of the visited web page level by level, from the first data of a parent node of the tree structure to the first data of the child nodes that have a parent-child relationship with the parent node, until it is determined whether the information of the visited web page includes the target information.
Because the information of visited web pages varies in length, longer information cannot be matched directly. The information of the visited web page is therefore matched level by level, which ensures that it can be matched completely and improves the accuracy of intercepting target information.
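The level-by-level matching described above can be sketched as a recursive descent. This is only an illustration: the dictionary layout with "test" and "children" keys and the function name match_level_by_level are assumptions, and each node's test stands in for whatever comparison against that node's first data an implementation would use.

```python
def match_level_by_level(url, node):
    """Return True when url matches a path from this node down to a leaf."""
    # Compare against the data held by the current (parent) node first.
    if not node["test"](url):
        return False
    children = node.get("children", [])
    if not children:
        # A leaf was reached, so the information has been fully matched.
        return True
    # Otherwise continue with the child nodes of this parent.
    return any(match_level_by_level(url, child) for child in children)
```

A URL that fails at an upper level is rejected without being compared against any string stored below that level, which is where the reduction in the number of matches comes from.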
In yet another optional implementation, the "tree structure" may include m levels of child nodes, each of the m levels of child nodes is divided according to a different preset rule among n preset rules, n and m are both integers greater than or equal to 1, and n is greater than or equal to m;
the j-th level of child nodes is divided by selecting one preset rule from f preset rules, where the f preset rules are the preset rules that remain among the n preset rules after the first j-1 levels of child nodes have made their selections, the (j-1)-th level is the level above the j-th level, the j-th level is any one of the m levels of child nodes, and j and f are both integers greater than or equal to 1;
each of the n preset rules includes at least two character string categories;
the first data includes a plurality of character strings, which are divided among the m levels of child nodes; each child node in the m levels corresponds to a different character string category in the n preset rules, and each child node includes a plurality of character strings having different character string categories.
Because each terminal or operator defines target information differently, this application provides multiple preset rules and categories, and preset rules can be selected as required. This improves the flexibility of the tree structure and makes it applicable to more scenarios.
In yet another optional implementation, the "n preset rules" may include at least one of the following rules:
a blacklist and whitelist rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
In yet another optional implementation, the "blacklist and whitelist rule" may include a whitelist category and a blacklist category. The first level of the m levels of child nodes is divided according to the blacklist and whitelist rule, and the character strings in the first data that belong to the whitelist category and those that belong to the blacklist category each correspond to one child node in the first level.
In yet another optional implementation, the "processing module" may be specifically configured to match the information of the visited web page against the character strings of the whitelist category, and, when the information of the visited web page includes a character string of the whitelist category, determine that the information of the visited web page does not include the target information.
Some information of visited web pages may contain "ad" without being target information for some operators. Character strings of the whitelist category are therefore set to exclude the case in which the information contains "ad" but is not target information (that is, not an advertisement), which improves the accuracy of interception.
In yet another optional implementation, the "processing module" may be specifically configured to: when the information of the visited web page does not include a character string of the whitelist category, match the information of the visited web page against the character strings of the blacklist category;
when the information of the visited web page does not include a character string of the blacklist category, determine that the information of the visited web page does not include the target information;
when the information of the visited web page includes a character string of the blacklist category, match the information of the visited web page level by level against the child nodes that have a parent-child relationship with the child node holding the character strings of the blacklist category, until the information of the visited web page has been fully matched, and intercept the target information in the information of the visited web page.
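The whitelist-then-blacklist order described above can be summarized in a few lines. This sketch assumes that "includes a character string" means a plain substring test; the function name includes_target and the match_lower_levels callback, which stands in for the matching performed at levels 2 and below, are hypothetical.

```python
def includes_target(url, whitelist, blacklist, match_lower_levels):
    """Decide whether url carries target information, following the level 1 order."""
    # Whitelist child node: a hit means the URL carries no target information.
    if any(s in url for s in whitelist):
        return False
    # Blacklist child node: no hit also means no target information.
    if not any(s in url for s in blacklist):
        return False
    # Blacklist hit: continue through the child nodes below the blacklist node.
    return match_lower_levels(url)
```

For instance, with "adobe" on the whitelist and "ad" on the blacklist, a URL containing "adobe" is cleared at the first check even though it also contains "ad".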
In yet another optional implementation, the "positioning and preset matching rule" may include a positioning matching category and a preset matching category. The second level of the m levels of child nodes is divided according to the positioning and preset matching rule, and the character strings in the first data that belong to the positioning matching category and those that belong to the preset matching category each correspond to one child node in the second level, where any child node in the second level has a parent-child relationship with the first-level child node that holds the character strings of the blacklist category.
In yet another optional implementation, the "positioning matching category" may be used to select, from the information of the visited web page, at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position;
the preset matching category is used to select, from the information of the visited web page, at least one of information that has a prefix or information that has a suffix.
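The two second-level categories can be illustrated with simple string tests. The function names, the zero-based position parameters, and the sample values used in the usage note are assumptions made for this sketch; the embodiment does not specify how positions, prefixes, or suffixes are encoded.

```python
def string_at_position(url, token, position):
    """Positioning match: token occurs in url starting exactly at position."""
    return url.startswith(token, position)


def separator_at_position(url, separator, position):
    """Positioning match: separator occupies url at position."""
    return url[position:position + len(separator)] == separator


def preset_match(url, prefix=None, suffix=None):
    """Preset match: url has the given prefix or the given suffix."""
    has_prefix = prefix is not None and url.startswith(prefix)
    has_suffix = suffix is not None and url.endswith(suffix)
    return has_prefix or has_suffix
```

As a usage example, string_at_position("cdn.ads.example/x", "ads", 4) checks for "ads" right after the four-character "cdn." part, while preset_match("ad-banner.png", prefix="ad-") checks only the start of the string.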
In yet another optional implementation, the "tag attribute rule" may include a category with a tag and a category without a tag. The third level of the m levels of child nodes is divided according to the tag attribute rule, and the character strings in the first data that belong to the category with a tag and those that belong to the category without a tag each correspond to one child node in the third level, where any child node in the third level has a parent-child relationship with one child node in the second level.
In yet another optional implementation, the "category with a tag" may be used to select, from the information of the visited web page, information that includes a tag attribute, and the category without a tag is used to select information that does not include a tag attribute; where
the category with a tag specifically includes at least one of: a category with only a host name, a category with only host information that has an advertisement attribute, a category with two-level classification by host and domain name, a category with host and advertisement uniform resource locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
Because the information of visited web pages is diverse, this approach covers more cases and intercepts target information more precisely.
In yet another optional implementation, the "character rule" may include a category of the first character string and a category of the preset character string. The fourth level of the m levels of child nodes is divided according to the character rule, and the character strings in the first data that belong to the category of the first character string and those that belong to the category of the preset character string each correspond to one child node in the fourth level, where any child node in the fourth level has a parent-child relationship with one child node in the third level.
In yet another optional implementation, the "category of the first character string" may be used to select, from the information of the visited web page, information whose first character is the same as that of a character string in the category of the first character string;
the category of the preset character string is used to select, from the information of the visited web page, information that contains the same preset character string as a character string in the category of the preset character string.
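One way to read the first-character category is as an index keyed by first character, so that a piece of information is compared only with stored strings that begin with the same character. The sketch below follows that reading, which is an interpretation rather than something the embodiment states; the function names are hypothetical.

```python
from collections import defaultdict


def index_by_first_char(strings):
    """Group the stored strings by their first character."""
    buckets = defaultdict(list)
    for s in strings:
        if s:  # skip empty strings, which have no first character
            buckets[s[0]].append(s)
    return dict(buckets)


def first_char_candidates(token, buckets):
    """Return only the stored strings whose first character equals token's."""
    return buckets.get(token[:1], [])
```

Only the strings returned by first_char_candidates would then need a full comparison against the token.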
In yet another optional implementation, the "information of the visited web page" may include the URL of the page visited by the user or the URLs of the elements of the visited web page, and the target information is advertisement information.
In yet another optional implementation, the "first data" may be obtained after the server performs tree transformation processing on the second data. The second data includes valid character strings and custom character strings of the browser, where the valid character strings are character strings whose usage rate is greater than a preset threshold, determined by filtering the open-source character strings from open-source websites and the historical data reported within a preset time period.
In a sixth aspect, an embodiment of the present invention provides a data processing apparatus, including:
a processing module, configured to perform tree transformation processing on second data to determine first data;
a transceiver module, configured to send the first data to a terminal, so that the terminal determines, according to the first data, whether a visited web page contains target information.
In this solution, tree transformation processing is performed on the second data. The resulting tree structure distinguishes the character strings in the second data in depth and converts them into a highly discriminative tree structure, which effectively reduces the number of times the information of the visited web page is matched against the first data.
In an optional implementation, the "target information" may be advertisement information;
the information of the visited web page includes at least one of the URL of the page visited by the user or the URLs of the elements of the visited web page.
In another optional implementation, the "transceiver module" may be further configured to periodically obtain at least one open-source character string from an open-source website;
the "processing module" may be further configured to select, from the at least one open-source character string and the historical data reported by the client within a preset time period, a plurality of character strings whose visit count is greater than a first threshold as valid character strings;
the "transceiver module" may be further configured to obtain custom character strings of the browser server;
the "processing module" may be further configured to determine the second data according to the valid character strings and the custom character strings, where the valid character strings and the custom character strings each include at least one character string.
Because browser servers generally have different standards, that is, the same information may be defined as advertisement information on website A but not on website B, the custom character strings of the browser server are added when the second data is generated, so that the second data used for matching is flexible and can be widely used.
In yet another optional implementation, the "processing module" may be specifically configured to divide a plurality of child nodes into m levels according to n preset rules, where each of the m levels of child nodes uses a different preset rule;
each of the n preset rules includes at least two character string categories, and each of the m levels is divided into at least two child nodes according to the character string categories;
the second data includes a plurality of character strings, each child node includes a plurality of character strings belonging to different character string categories, n and m are both integers greater than or equal to 1, and n is greater than or equal to m;
each child node in the k-th level has a parent-child relationship with one child node in the (k-1)-th level, where the k-th level is any one of the m levels of child nodes, and k is an integer greater than or equal to 1.
In another optional implementation, the "n preset rules" may include at least one of the following rules: a blacklist and whitelist rule, a positioning and preset matching rule, a tag attribute rule, or a character rule;
the "processing module" may be further configured to divide the plurality of child nodes into m levels of child nodes according to the blacklist and whitelist rule, the positioning and preset matching rule, the tag attribute rule, and the character rule.
In yet another optional implementation, the "processing module" may be specifically configured to: when the blacklist and whitelist rule includes a whitelist category and a blacklist category, divide the first level of the m levels of child nodes into two child nodes according to the whitelist category and the blacklist category, where one of the two child nodes includes the character strings in the second data that belong to the whitelist category, and the other child node includes the character strings in the second data that belong to the blacklist category.
In yet another optional implementation, the "processing module" may be specifically configured to: when the positioning and preset matching rule includes a positioning matching category and a preset matching category, divide the second level of the m levels of child nodes into two child nodes according to the positioning matching category and the preset matching category, where one of the two child nodes includes the character strings in the second data that belong to the positioning matching category, the other child node includes the character strings in the second data that belong to the preset matching category, and the two child nodes in the second level have a parent-child relationship with the first-level node that holds the character strings of the blacklist category.
In yet another optional implementation, the "processing module" may be specifically configured to: when the tag attribute rule includes a category with a tag and a category without a tag, divide the third level of the m levels of child nodes into two child nodes according to the category with a tag and the category without a tag, where one of the two child nodes includes the character strings in the second data that belong to the category with a tag, the other child node includes the character strings in the second data that belong to the category without a tag, and any child node in the third level has a parent-child relationship with one child node in the second level.
In yet another optional implementation, the "category with a tag" may specifically include at least one of: a category with only a host name, a category with only host information that has an advertisement attribute, a category with two-level classification by host and domain name, a category with host and advertisement uniform resource locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
In yet another optional implementation, the "processing module" may be specifically configured to: when the character rule includes a category of the first character string and a category of the preset character string, divide the fourth level of the m levels of child nodes into two child nodes according to the category of the first character string and the category of the preset character string, where one of the two child nodes includes the character strings in the second data that belong to the category of the first character string, the other child node includes the character strings in the second data that belong to the category of the preset character string, and any child node in the fourth level has a parent-child relationship with one child node in the third level.
In a seventh aspect, an embodiment of the present invention provides a computer-readable storage medium, which may include instructions that, when run on a computer, cause the computer to perform the following steps:
starting a browser to visit a web page;
obtaining information of the visited web page;
matching the information of the visited web page against first data arranged in a tree structure, where the first data is used to determine whether the information of the visited web page includes target information;
intercepting the target information when the information of the visited web page includes the target information.
In an eighth aspect, an embodiment of the present invention provides a computer-readable storage medium, including instructions that, when run on a computer, cause the computer to perform the following steps:
performing tree transformation processing on second data to determine first data;
sending, by the server, the first data to a terminal, so that the terminal determines, according to the first data, whether a visited web page contains target information.
In a ninth aspect, an embodiment of the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the following steps:
launching a browser to access a web page;
obtaining information about the accessed web page;
matching the information about the accessed web page against first data arranged in a tree structure, where the first data is used to determine whether the information about the accessed web page includes target information; and
intercepting the target information when the information about the accessed web page includes the target information.
In a tenth aspect, an embodiment of the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the following steps:
performing tree transformation processing on second data to determine first data; and
sending, by the server, the first data to a terminal, so that the terminal determines, according to the first data, whether an accessed web page contains target information.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of an application scenario of advertisement blocking;
FIG. 2 is a schematic diagram of another application scenario of advertisement blocking;
FIG. 3 is a schematic diagram of an application scenario of advertisement blocking according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a data processing method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of matching results for the URLs of elements accessed by a browser client according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a tree structure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a tree structure divided based on the blacklist/whitelist rule according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a tree structure divided based on the positioning and preset matching rule according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a statistical classification structure divided based on the tag attribute rule or the character rule according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a tree structure based on rule division according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a tree structure based on sub-classification according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a tree structure divided based on the blacklist/whitelist rule, the positioning and preset matching rule, and the tag attribute rule according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a tree structure divided based on the character rule according to an embodiment of the present invention;
FIG. 14 is a flowchart of an information interception method according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a terminal for information interception according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of a server for data processing according to an embodiment of the present invention;
FIG. 17 is a schematic structural diagram of an apparatus for information interception according to an embodiment of the present invention;
FIG. 18 is a schematic structural diagram of an apparatus for data processing according to an embodiment of the present invention.
DETAILED DESCRIPTION
To facilitate understanding of the embodiments of the present invention, specific embodiments are further explained below with reference to the accompanying drawings. These embodiments do not limit the embodiments of the present invention.
At present, one advertisement blocking technique performs the blocking on the Opera server side. As shown in FIG. 1, the Opera server side may include a browser server, a web page cache library, and a page processing server. Specifically, when a client (for example, a mobile phone or a tablet computer) browses a web page with the Opera browser, the client sends a web page access request to the server side. The browser server receives the request and sends a web page query to the web page cache library. The web page cache library looks up the corresponding data according to the web page information and sends it to the browser server, which then returns the web page content.
The data stored in the web page cache library is produced by the page processing server, which periodically sends web page access requests, receives the web page content, and processes it. The processing may include at least one of image compression, text compression, or advertisement filtering. The processed content is then compressed and sent to the web page cache library for storage so that the browser server can query it. This method therefore hides advertisements on the server side and returns the web page content, with the advertisements already hidden, to the client for display. It requires the server side to cache a large number of pages and to parse the entire content of each web page. The servers need high performance to process page content quickly, so the demands on hardware performance and storage are very high.
Another advertisement blocking technique is applied in a browser system (as shown in FIG. 2). The browser server downloads the easylist rule list, and the browser client periodically downloads the ad-blocking strings from the browser server (note that the easylist rule list includes the ad-blocking strings). When the browser client accesses a web page, it matches the uniform resource locator (URL) of each element in the accessed page against the ad-blocking strings and hides the page elements that correspond to the matched strings.
Although this method blocks advertisements on the browser client, it has at least two problems. First, the easylist rule list downloaded through the browser server ages. The easylist rule list currently contains about 45,000 (4.5W) URLs, and volunteers keep adding more. Volunteers are willing only to "contribute" new rules and are unwilling to do work that adds no value for them, such as deleting outdated URLs from the easylist rule list; deleting old rules also carries risk. As a result, the number of URLs in the easylist rule list keeps growing. Note that the ad-blocking strings are determined by the URLs in the easylist rule list. Meanwhile, many URLs in the list were proposed long ago, and the original websites have since changed how their pages are implemented, so those URLs are outdated and cannot give the browser client effective ad-blocking strings. Second, matching the URLs of an accessed page against the URLs in the easylist rule list performs poorly. As noted above, the list contains about 45,000 URL rules, and the home pages of some large websites issue more than 100 network requests, some as many as 430. Blocking advertisements on such a page therefore takes 45,000 × 100 = 4.5 million, or even more than ten million, match operations, which noticeably affects the performance of the device that hosts the browser client.
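The cost estimate above can be checked with a short calculation. The sketch below assumes the worst case described in the text, in which every network request of a page is compared against every rule; the function name `match_operations` is hypothetical and used only for illustration.

```python
# Rough cost model for flat (non-tree) matching, assuming every request
# is checked against every rule in the list.
RULE_COUNT = 45_000  # "4.5W" rules in the easylist rule list


def match_operations(requests_per_page: int, rule_count: int = RULE_COUNT) -> int:
    """Return the number of rule comparisons needed for one page load."""
    return requests_per_page * rule_count


print(match_operations(100))  # 4500000
print(match_operations(430))  # 19350000
```

Under this assumption, a page with 100 requests needs 4.5 million comparisons, and a page with 430 requests needs more than 19 million.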
Therefore, to address the above problems, embodiments of the present invention provide a client-based information interception method, apparatus, and terminal. The terminal intercepts target information in a browser page by using first data that has a tree structure. The tree structure distinguishes the character strings in the first data in depth and effectively reduces the number of times the information about an accessed web page is matched against the first data. This avoids the growth in match operations that occurs when there are many strings for intercepting target information and no rational matching scheme.
For ease of description, the embodiments of the present invention use advertisement information as the example of target information in an accessed web page. The method provided in the embodiments can also be applied to information other than advertisement information, for example, consultation information and web page addresses.
FIG. 3 is a schematic diagram of an application scenario of advertisement blocking according to an embodiment of the present invention. As shown in FIG. 3, the scenario may include a client and a server side. The client may specifically be a browser client, and the server side may specifically be a browser server.
Specifically, the method may include two processes. In the first process, the browser server determines the first data. The browser server obtains a large number of URLs of pages accessed by users, URLs of page elements, or both, where a page element may include at least one of text, a link, or an image. The browser server also periodically obtains an open-source list (for example, the easylist rule list) from an open-source website; the open-source list may include ad-blocking strings. Based on the obtained URLs and the open-source list, the browser server applies a server-side learning mechanism (for example, the cloud-side learning mechanism in FIG. 3) to determine the valid strings, for example, the strings whose visit count within a preset number of days exceeds a preset threshold (shown in FIG. 3 as the strings with the top 10,000 visit counts over X days). The purpose of this step is to remove strings that are invalid or that very few users hit, reducing the number of rules and therefore the number of subsequent match operations.
The browser server merges the valid strings with the browser's custom strings (for example, the self-operated interception rules in FIG. 3) to determine the second data. The valid strings and the custom strings each include at least one character string. The browser server converts the second data into a private tree format to generate the first data, stores the first data (which has a tree structure) in a private-format optimized rule library, and synchronizes it to the browser client. The browser client periodically downloads the tree-structured first data from the browser server. When a third web page is accessed, the browser client matches the web page information of the third web page against the tree-structured first data to determine the matching result. If a match is found in the tree-structured first data, the browser client intercepts the matched target information in the web page information of the third web page. This target information is generally advertisement information.
In summary, on the one hand, the method uses statistics on data accessed by a large number of users to remove invalid or rarely hit strings from the original open-source list, which both keeps the rules valid and reduces the number of matching targets. On the other hand, based on an in-depth understanding of the second data, the method classifies the strings according to the corresponding rules into a tree structure, which greatly reduces the number of match operations for each piece of information (that is, the information of each element in the third web page, where elements generally refer to text, images, videos, and the like).
The information interception method provided in the embodiments of the present invention is further described below with reference to FIG. 4 to FIG. 13. First, the data processing performed by the browser server (that is, the process of determining the first data) is introduced, as shown in FIG. 4 to FIG. 13:
FIG. 4 is a schematic flowchart of a data processing method according to an embodiment of the present invention. As shown in FIG. 4, the method may include steps S410 to S470, as follows:
S410: The browser server receives instructions for accessing web pages from the browser client.
The instructions for accessing web pages may be instructions from a large number of users accessing multiple web pages through the browser, or instructions from a large number of users accessing the same web page through the browser. Specifically, based on the users' web page access instructions, the browser client records at least one of the URLs of the accessed pages or the URLs of page elements, and compresses the recorded URLs within a preset time. The compressed file is the instruction for accessing web pages that the browser client sends. The second message carries no user identifier, in order to protect user privacy.
Note that because a large number of users access web pages, the browser client sends instructions for accessing web pages to the browser server many times.
S420: The browser server periodically obtains the latest open-source list (for example, the easylist strings or a list containing the easylist strings) from the open-source website. For example, the server obtains the latest open-source list from the open-source website at midnight every day.
S430: The browser server determines the valid strings according to the open-source list and the instructions for accessing web pages from the browser client, that is, it selects the rules with a high hit rate.
Specifically, each time the browser client reports an instruction for accessing web pages, the server extracts from it at least one of the URLs of pages accessed by users or the URLs of page elements, and matches them against the strings in the open-source list. If a string is matched, its count is incremented by 1. This step is repeated until the browser server has matched every record in the reported URLs. The reported URLs can then be placed in a backup directory. In one possible implementation, the file can be set to be deleted within a preset time period.
The browser server saves the count for each string and then totals the access count of each string within a preset time period (for example, the most recent 30 days). The access counts are sorted in descending order, and the strings whose access count exceeds a first threshold are identified (for example, the strings whose access count exceeds a first threshold of N = 2000). These strings are the valid strings. Strings whose access count is below a second threshold (for example, a second threshold of N = 100) and strings that are no longer valid (for example, strings with an access count of zero) are removed directly.
As a concrete example, the easylist string list contains the following rules for sina.cn:
||mobile.sina.cn/public/files/image/600x150_
||mobile.sina.cn/public/files/image/620x300_
||sina.cn/api/article/news_banner?
||sina.cn/cm/sinaads_
||sina.cn^*/impress?
When the current sina.cn home page is opened, the browser client collects the URLs that the page needs to access, and the server matches these URLs against the strings in the easylist string list, as shown in FIG. 5. FIG. 5 shows that the rule ||sina.cn^*/impress? was hit four times and the rule ||sina.cn/cm/sinaads_ was hit once. This indicates which of the five strings in the easylist string list are hit frequently and which are hit rarely or never. FIG. 5 shows the result of a single visit. Once the access instructions of millions of users have been collected, it becomes clear which strings are valid and which are not.
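The counting and threshold selection in S430 can be sketched as follows. This is a minimal illustration, not the implementation in the embodiment: the function names `count_hits` and `select_valid` are hypothetical, the rule fragments and URLs are made up for the example, and rules are matched by plain substring containment rather than by full easylist filter semantics.

```python
from collections import Counter
from typing import Callable, Iterable, List


def count_hits(
    rules: Iterable[str],
    reported_urls: Iterable[str],
    matcher: Callable[[str, str], bool] = lambda rule, url: rule in url,
) -> Counter:
    """Increment a rule's counter each time a reported URL matches it."""
    rules = list(rules)
    hits = Counter({rule: 0 for rule in rules})  # keep zero-hit rules visible
    for url in reported_urls:
        for rule in rules:
            if matcher(rule, url):
                hits[rule] += 1
    return hits


def select_valid(hits: Counter, first_threshold: int) -> List[str]:
    """Return rules whose count exceeds the threshold, most-hit first."""
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    return [rule for rule, count in ranked if count > first_threshold]


# Hypothetical rule fragments and reported URLs (simplified substrings).
rules = ["sina.cn/cm/sinaads_", "/impress?", "sina.cn/api/article/news_banner?"]
urls = [f"https://a.sina.cn/x/impress?id={i}" for i in range(4)]
urls.append("https://sina.cn/cm/sinaads_1.js")

hits = count_hits(rules, urls)
valid = select_valid(hits, first_threshold=0)
print(valid)  # ['/impress?', 'sina.cn/cm/sinaads_']
```

With real traffic, the threshold would be a value such as the N = 2000 mentioned above, and the `matcher` argument would be replaced by a function that implements the actual rule syntax.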
S440: The browser server merges the valid strings with the browser's custom strings to determine the second data.
Specifically, the valid strings are selected from the open-source list (for example, the easylist rule list), so the valid strings are open source. In addition, each browser has some custom rules of its own in operation, that is, the browser's custom strings.
In another possible implementation, the method may further include, before step S440, obtaining the custom strings of the browser server.
S450: The browser server performs tree transformation processing on the second data to determine the first data.
The first data may include strings for matching target information. In this application, the target information refers to advertisement information, and the browser client uses the first data to intercept advertisement information in an accessed page.
Specifically, the browser server divides the second data into m levels according to n preset rules, where each of the m levels of child nodes uses a different preset rule. Each of the n preset rules includes at least two character string categories, and each of the m levels is divided into at least two child nodes according to those categories. The second data includes multiple character strings, and the child nodes respectively contain the character strings that belong to the different categories. n and m are both integers greater than or equal to 1, and n is greater than or equal to m.
In other words, the browser server divides the second data into m levels according to n preset rules (n and m are positive integers, and n is greater than or equal to m). Each of the m levels includes at least two child nodes, and each of the n preset rules includes at least two categories. The child nodes at each level are divided according to these categories (that is, each child node at a level corresponds to one category), and each child node contains multiple character strings of its category.
When the n preset rules are selected, each of the m levels of child nodes uses a different preset rule. One option is to apply the rules in their order among the n preset rules. Another option is to select any two or three of the n preset rules at random, but never fewer than two.
As an example, shown in FIG. 6, the browser server divides the second data into four levels according to four preset rules (the fourth level is not shown), and each of the four levels includes at least two child nodes. The four preset rules may include the blacklist/whitelist rule, the positioning and preset matching rule, the tag attribute rule, and the character rule. Note that the preset rules may also include other possibilities (for example, identifiers or fixed statements); this application uses the above rules only as examples and is not limited to these four. When only two or three of the division methods are selected for the tree transformation, fewer kinds of division are available and the partitioning is coarser, but matching is still faster than in the prior art.
Each of the four preset rules may include at least two categories, and each child node at a level corresponds to one category, so the at least two child nodes at each level are divided according to the at least two categories.
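The level-by-level division described above can be sketched as a recursive partition. This is an illustrative sketch only: `build_tree`, `by_list`, and `by_position` are hypothetical names, the rule `@@||example.com/ok.js` is invented, and treating a leading `@@` as the whitelist marker is an assumption borrowed from Adblock Plus style filter lists rather than something stated in this document.

```python
from typing import Callable, Dict, List, Union

# A node is either a leaf (list of strings) or a mapping from category to subtree.
Tree = Union[List[str], Dict[str, "Tree"]]


def build_tree(strings: List[str], classifiers: List[Callable[[str], str]]) -> Tree:
    """Partition strings level by level, one classifier (preset rule) per level."""
    if not classifiers:
        return list(strings)
    first, rest = classifiers[0], classifiers[1:]
    groups: Dict[str, List[str]] = {}
    for s in strings:
        groups.setdefault(first(s), []).append(s)
    return {category: build_tree(members, rest) for category, members in groups.items()}


def by_list(rule: str) -> str:
    """Level 1: blacklist vs. whitelist (assumes '@@' marks a whitelist rule)."""
    return "white" if rule.startswith("@@") else "black"


def by_position(rule: str) -> str:
    """Level 2: rules using '*' or '^' vs. rules without positional characters."""
    return "positional" if ("*" in rule or "^" in rule) else "preset"


second_data = [
    "||sina.cn/cm/sinaads_",
    "||sina.cn^*/impress?",
    "@@||example.com/ok.js",  # hypothetical whitelist rule
]
first_data = build_tree(second_data, [by_list, by_position])
```

A lookup then only needs to compare a URL against the strings in the branch it falls into, rather than against every string in the second data. Unlike the structure in the text, this sketch creates a child node only for categories that actually occur in the input.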
For example, when the blacklist/whitelist rule and the character rule are selected for the tree transformation, the blacklist/whitelist division is applied first and the character division second. When the positioning and preset matching rule, the tag attribute rule, and the character rule are selected, the positioning and preset matching rule is applied first, then the tag attribute rule, and finally the character rule. In other words, when all four rules are selected, they are applied level by level in the order given above. If the selection omits some of the rules or includes only part of them, the levels are arranged according to the actual selection.
For example, when the blacklist/whitelist rule includes a whitelist category and a blacklist category, level 1 of the m levels of child nodes is divided into two child nodes according to the whitelist category and the blacklist category (1a in FIG. 6 corresponds to the BLACK child node in FIG. 7, and 1b in FIG. 6 corresponds to the WHITE child node in FIG. 7). One of the two child nodes includes the character strings in the second data that belong to the whitelist category (the content in the box below the WHITE child node in FIG. 7), and the other includes the character strings in the second data that belong to the blacklist category (the content in the box below the BLACK child node in FIG. 7).
Specifically, the browser server divides the second data into a first child node and a second child node according to the whitelist category and the blacklist category of the blacklist/whitelist rule. The first child node (for example, child node 1a) includes the character strings that belong to the blacklist category, and the second child node (for example, child node 1b) includes the character strings that belong to the whitelist category.
When the positioning and preset matching rule includes a positioning-match category and a preset-match category, level 2 of the m levels of child nodes is divided into two child nodes according to these two categories (for example, child nodes 2a and 2b in FIG. 6). One of the two child nodes includes the character strings in the second data that belong to the positioning-match category, and the other includes the character strings in the second data that belong to the preset-match category. The two child nodes at level 2 have a parent-child relationship with the level-1 node that holds the character strings of the blacklist category. In another possible embodiment, a third child node at level 2 (for example, child node 2c in FIG. 6) has a parent-child relationship with the level-1 node that holds the character strings of the whitelist category. Note that in the embodiments of the present invention, the child nodes from level 2 onward are all nodes that have a parent-child relationship with the node that holds the character strings of the blacklist category. Specifically, a child node of the positioning-match category includes at least one of information in which a character string is present at a first preset position or information in which a separator is present at a second preset position. The preset-match category is used to select, from the information about the accessed web page, at least one of information that has a prefix or information that has a suffix.
The child nodes of the positioning-match category and of the preset-match category are described in detail below.
Child nodes of the positioning-match category are divided mainly according to characters at fixed positions. Specifically, the character * is present at the first preset position, where * indicates that any character string may appear at the first preset position; or the character ^ is present at the second preset position, where ^ indicates that a separator appears at the second preset position (a separator can be any character other than a letter, a digit, _, -, ., or %).
For example, in the following URL of a page accessed by the browser client, //, :, /, ?, &, and = can be regarded as separators:
http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82
Therefore, any of the filters ^example.com^, ^%D1%82%D0%B5%D1%81%D1%82^, or ^foo.bar^ in the positioning-match rule list matches this URL.
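The separator and wildcard semantics can be illustrated by translating a filter into a regular expression. The sketch below is a simplified assumption, not the matcher used in the embodiment: `rule_to_regex` and `matches` are hypothetical helpers, only `*` and `^` are interpreted, and the end of the URL is also accepted as a separator (which the last filter in the example requires in order to match).

```python
import re

# A separator is any character other than a letter, a digit, '_', '-', '.' or '%'.
# Assumption: the end of the URL also counts as a separator.
SEPARATOR = r"(?:[^A-Za-z0-9_\-.%]|$)"


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a filter containing '*' and '^' into a compiled regex."""
    parts = []
    for ch in rule:
        if ch == "*":
            parts.append(".*")        # any character string
        elif ch == "^":
            parts.append(SEPARATOR)   # one separator character (or end of URL)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches(rule: str, url: str) -> bool:
    """Return True if the filter matches anywhere in the URL."""
    return rule_to_regex(rule).search(url) is not None


url = "http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82"
for rule in ("^example.com^", "^%D1%82%D0%B5%D1%81%D1%82^", "^foo.bar^"):
    print(rule, matches(rule, url))
```

All three filters from the example match the URL under these assumptions, while a filter such as `^example^` does not, because `example` is followed by `.`, which is not a separator.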
In addition, the preset-match category is divided according to the plain pattern, and it may include at least one of prefix matching or suffix matching. The following describes the case in which both are present.
As can be seen from the above, white.plain and white.glob refer to the prefix-match category within the preset-match category, and white.plain and black.plain refer to the suffix-match category within the preset-match category. For example, the sina-related branch described above becomes the branch scenario shown in FIG. 8. Because of limited space, only three child nodes appear (for example, white.plain, black.plain and black.glob); the box below each of white.plain, black.plain and black.glob contains the character strings of the corresponding category included in that child node. That is, the second-level child nodes that have a parent-child relationship with the first-level node 1a may include 2a and 2b; at the same time, the second-level child nodes that have a parent-child relationship with the first-level node 1b may also include 2a and 2b, or 2c (this possibility is not shown in FIG. 6).
The preset-match category can be divided into two branches (namely prefix matching and suffix matching). In one possible embodiment, this level can be merged with the first level described above into a single level; that is, the root node (ROOT) may directly have four child nodes, for example white.plain, white.glob, black.plain and black.glob.
In another possible embodiment, four child nodes may appear at the second level. The four child nodes may include a child node having information in which a character string exists at the first preset position, a child node having information in which a character string exists at the second preset position, a child node having information with a prefix, and a child node having information with a suffix.
In yet another possible embodiment, eight child nodes may appear at the second level, and the eight child nodes may be divided into at least two groups. One group consists of four child nodes that have a parent-child relationship with the first-level node 1a: a child node having information in which a character string exists at the first preset position, a child node having information in which a character string exists at the second preset position, a child node having information with a prefix, and a child node having information with a suffix. The other group consists of four child nodes that have a parent-child relationship with the first-level node 1b, and these four child nodes may include the same four kinds of child node.
When the tag attribute rule includes a tagged category and an untagged category, the third level of the m levels of child nodes is divided into two child nodes (for example, 3a and 3b) according to the tagged category and the untagged category. One of the two child nodes includes the character strings in the second data that belong to the tagged category, and the other includes the character strings in the second data that belong to the untagged category. Any child node at the third level has a parent-child relationship with one child node at the second level; for example, as shown in FIG. 6, child nodes 3a and 3b have a parent-child relationship with child node 2a, and child node 3c has a parent-child relationship with child node 2b.
The following description refers to FIG. 9. The tag attribute rule may include many types, and two are provided in this embodiment of the present invention: the tagged category (for example, the content shown in FIG. 9) and the untagged category. The tagged category may specifically include at least one of: a category with only a host name, a category with only host information of the advertisement attribute, a category with two-level classification by host and domain name, a category with host and advertisement uniform resource locator (URL) information, or a category in which only the domain name and the advertisement URL information differ. The tagged category (for example, the content shown in FIG. 9) is introduced first, for example:
MIME_TYPE of request content
"other": 1
"xbl": 1
"ping": 1
"dtd": 1
"script": 2
"image": 4
"background": 4
"stylesheet": 8
"object": 16
"subdocument": 32
"document": 64
"xmlhttprequest": 2048
"object_subrequest": 4096
"media": 16384
"font": 32768
"popup": 0x1000000
The left column lists the tag categories by which the character strings of the tagged category are divided, and the right column gives the number corresponding to each tag category (the numbers are set in the standard).
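Every number in the list above is a power of two, so one plausible reading is that the numbers act as bit flags that can be combined with a bitwise OR. The sketch below works under that assumption; the text itself only says that the numbers are set in the standard, and the names `TYPE_BITS`, `type_mask` and `applies_to` are hypothetical.

```python
# Tag categories and their numbers, copied from the list above.
TYPE_BITS = {
    "other": 1, "xbl": 1, "ping": 1, "dtd": 1,
    "script": 2,
    "image": 4, "background": 4,
    "stylesheet": 8,
    "object": 16,
    "subdocument": 32,
    "document": 64,
    "xmlhttprequest": 2048,
    "object_subrequest": 4096,
    "media": 16384,
    "font": 32768,
    "popup": 0x1000000,
}


def type_mask(names):
    """Combine several tag categories into one integer mask (assumed bit-flag use)."""
    mask = 0
    for name in names:
        mask |= TYPE_BITS[name]
    return mask


def applies_to(mask, request_type):
    """Return True if the mask covers the given request type."""
    return bool(mask & TYPE_BITS[request_type])


mask = type_mask(["script", "image"])
print(mask, applies_to(mask, "image"), applies_to(mask, "font"))
```

Because "image" and "background" share the number 4, a mask built from "image" also covers "background" under this reading.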
The character strings above are further divided by tag category. Each of the four sub-categories above (for example, "script", "image" and "document" in FIG. 10) can be divided by tag category in this way; FIG. 10 shows the division only for black.plain as an example. Character strings in the second data that carry the tag category "image" are placed on the child node of that tag category, and so on. In FIG. 10, strings that do not contain a tag also form one class (for example, the box under the "*" node in FIG. 10 contains the character strings of the untagged category). Other tag-category nodes also exist (because of limited space in the figure, "……" denotes the other tag categories), and a large number of character strings are then attached to the nodes of their respective tag categories. It should be noted that strings tagged "script" and "image" generally occur in a high proportion, so FIG. 10 gives more example strings in the child nodes of the "script" and "image" tag categories.
Next, in one possible embodiment, the tagged category may further be divided into at least one of: a category with only a host name, a category with only host information of the advertisement attribute, a category with two-level classification by host and domain name, a category with host and advertisement uniform resource locator (URL) information, or a category in which only the domain name and the advertisement URL information differ. The specific categories within the tagged category (the content of serial numbers 2 to 6 in FIG. 9) are described in detail below.
The host classification may include the following four types:
Type 1: Direct classification
Specifically, the string contains only a host name (that is, the description of serial number 2 in FIG. 9: only host information is included), for example (the example of serial number 2 in FIG. 9): ||9377os.com^. Strings of this kind can subsequently be further divided according to the host-name string.
Type 2: Third classification
Specifically, the string includes only host information of the advertisement attribution (that is, the description of serial number 3 in FIG. 9: only information about a third-party website accessing the advertisement's home website is included), for example (serial number in FIG. 9): ||116b.com^$third-party. Within this classification, strings can subsequently be further divided according to the host-name string.
Type 3: Domain_Direct classification
Specifically, strings are subsequently divided according to the two-level classification of host and domain (that is, the description of serial number 4 in FIG. 9: the domain of the current web page and the host of the advertisement web page are included).
Type 4: Domain_Filter classification
Specifically, the string contains the host and the advertisement URL information (that is, the description of serial number 5 in FIG. 9: domain and advertisement content information are included), for example the following five strings:
/static/media/curl.swf$domain=duba.com
/banner.js$domain=28188.com|28188.net
/skin/tb12/$domain=17huohu.com|firefox.com.cn
.com/tps/$domain=ocucn.com
||cdndm.com/12/2016/$domain=1kkk.com|dm5.com
Based on the five strings above, a further division can be made within Domain_Filter:
A string that contains the advertisement host name may be: ||cdndm.com/12/2016/$domain=1kkk.com|dm5.com
A string that contains the advertisement path but no host name may be: .com/tps/$domain=ocucn.com
A string that contains the advertisement file name but no host may be:
/static/media/curl.swf$domain=duba.com
Based on this further division, Domain_Filter can be divided into three child nodes (for example, as shown in FIG. 11): a node for strings that contain the advertisement host name (111 in FIG. 11), a node for strings that contain the advertisement path but no host name (113 in FIG. 11), and a node for strings that contain the advertisement file name but no host (112 in FIG. 11).
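The three-way division of Domain_Filter strings can be sketched as follows. The heuristics are assumptions made for illustration and are not taken from the text: a string starting with "||" is treated as containing the advertisement host name, a string whose last path segment contains a dot is treated as naming a file, and anything else is treated as a path. The function names and the bucket labels are hypothetical.

```python
def split_domain_filter(rule):
    """Split a rule into its URL pattern and the list of domains after $domain=."""
    pattern, _, domains = rule.partition("$domain=")
    return pattern, (domains.split("|") if domains else [])


def classify_domain_filter(rule):
    """Assign a Domain_Filter string to one of three buckets (assumed heuristics)."""
    pattern, _ = split_domain_filter(rule)
    if pattern.startswith("||"):
        return "ad_host"   # contains the advertisement host name
    last_segment = pattern.rsplit("/", 1)[-1]
    if "." in last_segment:
        return "ad_file"   # no host, names an advertisement file
    return "ad_path"       # no host, names an advertisement path


RULES = [
    "/static/media/curl.swf$domain=duba.com",
    "/banner.js$domain=28188.com|28188.net",
    "/skin/tb12/$domain=17huohu.com|firefox.com.cn",
    ".com/tps/$domain=ocucn.com",
    "||cdndm.com/12/2016/$domain=1kkk.com|dm5.com",
]
for rule in RULES:
    print(classify_domain_filter(rule), split_domain_filter(rule))
```

For the three strings singled out in the text, this sketch reproduces the same assignment: the cdndm.com string lands in the host bucket, the .com/tps/ string in the path bucket, and the curl.swf string in the file bucket.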
The processing method for the above classification can also be applied in the same way to the domain_filter under the home-host attribute classification. For example, for advertisements containing pictures, this classification method can be used:
Type 5: THIRD_FILTERS classification
Specifically, for example, the string ||books.com.tw/exep/ap/$third-party.
The only difference from the fourth type, Domain_filter, is that the domain the user is currently visiting and the host of the advertisement are different; the advertisement-information processing part is the same. It can therefore be processed in the same way as the fourth type, Domain_filter.
In addition, a further type may be included: the Type_filter classification
Specifically, Domain_filter and Third_filter can be combined into Type_filter; when only advertisement content information is present, Type_filter may include the two subclasses Domain_filter and Third_filter.
In summary, according to the blacklist/whitelist division, the matching-pattern division and the rule-category division described above, tree transformation processing can be performed on the second data. The resulting tree structure may be as shown in FIG. 12; specifically, FIG. 12 takes the black.plain node as an example and combines the matching-pattern division and the rule-category division to form the tree structure.
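The tree transformation summarised above can be sketched as inserting each rule string into a tree, with one classifier per level. Everything below is an illustrative assumption rather than the method of the embodiment: the three classifiers (a leading "@@" marks a whitelist string, a "*" or "^" in the pattern marks positioning match, a "$" option marks a tagged string), the node labels and the `Node` type are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    strings: List[str] = field(default_factory=list)
    children: Dict[str, "Node"] = field(default_factory=dict)


def list_kind(rule):
    # Level 1 (assumption): "@@" prefix marks a whitelist string.
    return "white" if rule.startswith("@@") else "black"


def match_kind(rule):
    # Level 2 (assumption): "*" or "^" in the pattern means positioning match.
    text = rule[2:] if rule.startswith("@@") else rule
    pattern = text.split("$", 1)[0]
    return "positioning" if ("*" in pattern or "^" in pattern) else "preset"


def tag_kind(rule):
    # Level 3 (assumption): a "$" option marks a tagged string.
    return "tagged" if "$" in rule else "untagged"


LEVELS = (list_kind, match_kind, tag_kind)


def build_tree(rules, levels=LEVELS):
    """Place every rule on the leaf selected by the per-level classifiers."""
    root = Node()
    for rule in rules:
        node = root
        for classify in levels:
            node = node.children.setdefault(classify(rule), Node())
        node.strings.append(rule)
    return root


tree = build_tree([
    "@@||sina.com.cn/litong/*/close",
    "||sina.com.cn/litong/",
    "||116b.com^$third-party",
    "/banner.js$domain=28188.com|28188.net",
])
print(sorted(tree.children))
```

Swapping or dropping entries in `LEVELS` changes the depth and order of the levels without touching the insertion logic.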
For child nodes 120 to 129 in FIG. 12, the strings can be divided by domain, advertisement host, advertisement-object path and host name (name). Specifically, for the classification of at least one of the direct or third child nodes in FIG. 12, the host names can be divided; for example, they can be further divided by at least one category of first character among 0-9, a-z, A-Z or other (as shown in FIG. 13) into three child nodes, with the original four strings distributed among the three child nodes according to category. This embodiment of the present invention gives only one example (division by host name); the others (domain, advertisement host and advertisement-object path) can be divided in the same way, and details are not repeated here.
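Division by the first character of the host name can be sketched as a small bucketing function. The bucket labels follow the four categories named in the text (0-9, a-z, A-Z, other); the function names and the sample host list are hypothetical, with the first three hosts taken from examples elsewhere in this description.

```python
def first_char_bucket(host):
    """Return the first-character category of a host name."""
    ch = host[:1]
    if "0" <= ch <= "9":
        return "0-9"
    if "a" <= ch <= "z":
        return "a-z"
    if "A" <= ch <= "Z":
        return "A-Z"
    return "other"


def group_hosts(hosts):
    """Group host names into child nodes keyed by first-character category."""
    groups = {}
    for host in hosts:
        groups.setdefault(first_char_bucket(host), []).append(host)
    return groups


print(group_hosts(["9377os.com", "116b.com", "cdndm.com", "Example.com"]))
```

Only buckets that actually receive a string are created, which is consistent with the figure showing fewer child nodes than there are possible categories.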
The browser server may also further divide the third level, that is, child nodes 120 to 129 in FIG. 12, into a fourth level. The preset rule selected for this division may be the character rule. Specifically, when the character rule includes a first-character-string category and a preset-character-string category, the fourth level of the m levels of child nodes is divided into two child nodes according to those two categories. One of the two child nodes includes the character strings in the second data that belong to the first-character-string category, and the other includes the character strings in the second data that belong to the preset-character-string category. Any child node at the fourth level has a parent-child relationship with one child node at the third level.
It should be noted that each child node at the k-th level has a parent-child relationship with one child node at level k-1, where the k-th level is any level of the m levels of child nodes and k is an integer greater than or equal to 1. With reference to the example above, in the case n = 4, if k = 2 (k is therefore a positive integer less than or equal to n), each child node at the second level (for example, 2a and 2b) has a parent-child relationship with one child node at the first level (for example, 1b); or, if k = 3, each child node at the third level (for example, 3a, 3b and 3c) has a parent-child relationship with one child node at the second level (for example, 2a or 2b). Each child node may in turn have, at the next level, child nodes that have a parent-child relationship with it.
It should be noted that, in yet another possible embodiment, two or three of the four division manners above may be selected instead.
The functions of each child node in the tree structure include: matching at least one of the URL of the page accessed by the user or the URLs of the elements of the accessed web page against the character strings included in the child node; and assigning the matching child node at the next level according to the string features contained in the URL of the page accessed by the user or in the URLs of the elements of the accessed web page.
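The two node functions described above (match the URL against the node's own strings, then hand the URL to one next-level child chosen from a feature of the URL) can be sketched as follows. This is a minimal sketch under stated assumptions: plain substring containment stands in for the real matching, the routing feature is whether the host starts with a digit, and `Node`, `find_match` and `host_bucket` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional
from urllib.parse import urlsplit


@dataclass
class Node:
    strings: List[str] = field(default_factory=list)
    children: Dict[str, "Node"] = field(default_factory=dict)
    # Maps a URL to the key of the next-level child, or None for a leaf.
    route: Optional[Callable[[str], str]] = None


def find_match(node, url):
    """Return the first string that matches the URL on the routed path, else None."""
    for s in node.strings:
        if s in url:  # assumption: substring containment stands in for matching
            return s
    if node.route is None:
        return None
    child = node.children.get(node.route(url))
    return find_match(child, url) if child is not None else None


def host_bucket(url):
    """Illustrative routing feature: does the host start with a digit?"""
    host = urlsplit(url).hostname or ""
    return "digit" if host[:1].isdigit() else "other"


root = Node(
    route=host_bucket,
    children={
        "digit": Node(strings=["9377os.com"]),
        "other": Node(strings=["cdndm.com"]),
    },
)
print(find_match(root, "http://9377os.com/a.js"))
```

Routing means that only the strings on one root-to-leaf path are compared with the URL, rather than every string in the tree.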
Next, each time the browser client starts the browser to access a web page, the browser client performs the following steps to intercept target information (that is, advertisement information) in the accessed web page according to the first data having the tree structure. FIG. 14 is a flowchart of an information interception method according to an embodiment of the present invention. As shown in FIG. 14, the method includes steps S1410 to S1440, as follows:
S1410: The browser client starts the browser to access a web page.
Specifically, before this step, the method may further include receiving the first data. The first data may be downloaded from the browser server; in general, it is downloaded periodically (for example, automatically when the terminal is online at 12:00 every day). The first data is obtained after the server performs tree transformation processing on the second data. The second data includes valid character strings and custom character strings of the browser, where the valid character strings are character strings whose usage rate is greater than a preset threshold, determined by screening the open-source character strings from open-source websites against the historical data reported by the terminal within a preset time period.
S1420: The browser client obtains the information of the accessed web page.
Specifically, the accessed web page may also refer to URL information.
The information of the accessed web page may include the URL of the page accessed by the user or the URLs of the elements of the accessed web page. This information may or may not include target information; in the embodiments provided in this application, the target information generally refers to advertisement information.
S1430: The browser client matches the information of the accessed web page against the first data arranged in a tree structure, where the first data is used to determine whether the information of the accessed web page includes the target information.
The specific matching process may be as follows:
First, the tree structure is introduced (it may be the tree structure determined by the browser server through the tree transformation described above); it can be described in detail with reference to FIG. 6. The tree structure includes a plurality of nodes, including a root node (ROOT) and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes. The nodes at each level have a parent-child relationship with their associated nodes at the next level, and the first data is distributed over the plurality of nodes of the tree structure according to preset rules. The information of the accessed web page is matched level by level, from the first data of a parent node in the tree structure to the first data of the child nodes that have a parent-child relationship with that parent node, until it is determined whether the information of the accessed web page includes the target information.
Specifically, the tree structure may include m levels of child nodes, and each level of the m levels is divided according to a different preset rule among n preset rules, where n and m are integers greater than or equal to 1 and n is greater than or equal to m. The j-th level of child nodes is divided by one preset rule selected from f preset rules, where the f preset rules are the preset rules among the n preset rules that remain after the selections made for the first j-1 levels, level j-1 is the level immediately above level j, level j is any level of the m levels of child nodes, and j and f are integers greater than or equal to 1. Each of the n preset rules includes at least two character-string categories.
The first data includes a plurality of character strings, which are divided among the m levels of child nodes. Each child node of the m levels corresponds to a different character-string category of the n preset rules, and each child node includes a plurality of character strings of the corresponding character-string category.
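The constraint that each of the m levels uses a different one of the n preset rules (with n greater than or equal to m) can be sketched as a selection without replacement. The rule identifiers in `RULES` and the default policy of taking the first remaining rule are assumptions for illustration; the text only requires that level j choose from the rules not already used by levels 1 to j-1.

```python
RULES = ["blacklist_whitelist", "positioning_preset", "tag_attribute", "character"]


def assign_rules(rules, m, choose=lambda remaining: remaining[0]):
    """Pick one distinct rule per level for m levels out of n = len(rules)."""
    n = len(rules)
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= n")
    remaining = list(rules)      # the f rules still available at level j
    chosen = []
    for _ in range(m):
        rule = choose(remaining)
        remaining.remove(rule)   # a rule used by one level is not reused
        chosen.append(rule)
    return chosen


print(assign_rules(RULES, 3))
```

Passing a different `choose` callable changes which remaining rule each level takes, while the no-reuse constraint stays the same.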
The n preset rules may include at least one of the following rules: a blacklist/whitelist rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
The embodiments provided in this application perform division and matching using these four rules.
The first level of the tree structure includes two child nodes: the first child node at the first level contains a plurality of character strings of the whitelist category, and the second child node at the first level contains a plurality of character strings of the blacklist category. The two child nodes are divided according to the blacklist/whitelist rule.
During matching, if the information of the accessed web page is matched in the first child node, matching ends immediately, and there is no need to match against the large number of character strings in the second child node.
Take a sina-related website as an example. As shown in FIG. 7, a (black) child node and a (white) child node are determined. As can be seen from the figure, the (white) child node may include the string @@||sina.com.cn/litong/*/close, and the (black) child node may include the string ||sina.com.cn/litong/. For example, if the URL of a picture (an element) is https://sina.com.cn/litong/180528/close.jpg, it is first matched against the strings in the (white) child node; once it matches the string "@@||sina.com.cn/litong/*/close", there is no need to match it against the strings in the (black) child node. It follows that the strings of the (white) child node are used to screen out information that is not advertisement information: a match indicates that the information is not an advertisement, so the information is not intercepted, the tree structure is exited, and the matching process ends.
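The whitelist-first short circuit in the sina example can be sketched as follows. The meaning of "||" is not defined in this passage; the sketch assumes it anchors the rule at the start of the host name (after the scheme and any subdomains), and it handles only the "@@", "||" and "*" markers. The names `to_regex` and `decide` are hypothetical.

```python
import re


def to_regex(rule):
    """Translate a rule using '@@', '||' and '*' into a regex (assumed semantics)."""
    text = rule[2:] if rule.startswith("@@") else rule
    prefix = ""
    if text.startswith("||"):
        # Assumption: '||' anchors at the host, allowing any scheme and subdomains.
        prefix = r"^[a-z][a-z0-9+.\-]*://(?:[^/]*\.)?"
        text = text[2:]
    body = ".*".join(re.escape(part) for part in text.split("*"))
    return re.compile(prefix + body)


def decide(url, white_rules, black_rules):
    """Return 'allow' or 'block'; blacklist strings are consulted only if needed."""
    if any(to_regex(r).search(url) for r in white_rules):
        return "allow"   # whitelist hit: stop, blacklist is never examined
    if any(to_regex(r).search(url) for r in black_rules):
        return "block"
    return "allow"


WHITE = ["@@||sina.com.cn/litong/*/close"]
BLACK = ["||sina.com.cn/litong/"]
print(decide("https://sina.com.cn/litong/180528/close.jpg", WHITE, BLACK))
print(decide("https://sina.com.cn/litong/ad.js", WHITE, BLACK))
```

The first URL matches the whitelist string and is allowed without touching the blacklist; the second matches only the blacklist string and is blocked.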
However, when the information of the accessed web page does not include a character string of the whitelist category, it is matched against the character strings of the blacklist category, that is, in the second child node. When the information of the accessed web page does not include a character string of the blacklist category, the terminal determines that the information of the accessed web page does not include the target information and does not intercept it; this indicates that the information is not an advertisement, so the tree structure is exited and the matching process ends.
When the information of the accessed web page includes a character string of the blacklist category, the terminal matches the information of the accessed web page, level by level, against the child nodes that have a parent-child relationship with the child node of the blacklist-category character strings, until it determines that the information of the accessed web page has been completely matched, and the terminal intercepts the target information in the information of the accessed web page.
The second level of the tree structure includes two child nodes, and any child node at the second level has a parent-child relationship with the first-level child node whose character strings belong to the blacklist category.
The first child node at the second level includes character strings of the positioning-match category, and the second child node at the second level includes character strings of the preset-match category. The positioning-match category is used to filter, from the information of the accessed web page, at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position; the preset-match category is used to filter, from the information of the accessed web page, at least one of information having a prefix or information having a suffix.
For example, child nodes having the positioning-match category are divided mainly according to characters at fixed positions. Specifically, the character * exists at the first preset position, where * indicates that an arbitrary character string appears at the first preset position; or the character ^ exists at the second preset position, where ^ indicates that a separator appears at the second preset position (a separator may be any character other than a letter, a digit, _, -, . or %). In the following URL of a page accessed by the browser client, //, :, /, ?, & and = can be regarded as separators:
http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82
Therefore, a filter rule in the positioning-match rule list such as ^example.com^, ^%D1%82%D0%B5%D1%81%D1%82^ or ^foo.bar^ matches this URL.
The same applies to prefix information and suffix information: if the corresponding character string appears at the preset position, a match is established. For example, for white.plain, black.plain and black.glob, if the accessed web page includes characters identical to a prefix or suffix in white.plain, black.plain or black.glob, a match is established. When a match is found, it must be determined whether the URL of the accessed page has been completely matched. If not, matching continues at the third level. If so, the information is an advertisement, and the terminal intercepts the target information corresponding to the URL, thereby ending the matching process.
第3级中的第一个子节点包括具备标签的类别的字符串,第二个子节点包括不具备标签的字符串。其中,所述第3级子节点中的任意一个子节点与所述第2级子节点中的一个子节点呈父子关系。The first child node in level 3 includes a character string with a label category, and the second child node includes a character string without a label. Wherein, any child node of the third-level child node is in a parent-child relationship with one child node of the second-level child node.
其中,具备标签的类别用于筛选所述访问网页的信息中包括标签属性的信息,所述不具备标签的类别用于筛选所述访问网页的信息中不包括标签属性的信息。其中,具备标签的类别还可以进一步划分为:仅有主机名的类别、仅有广告属性的主机信息的类别、主机和域名两级分类的类别、主机和广告的统一资源定位符URL信息的类别或仅是域名和广告的URL信息不同的类别中的至少一种。Wherein, the category with a tag is used to filter the information of the visited web page including the information of the tag attribute, and the category without the tag is used to filter the information of the visited web page without the information of the tag attribute. Among them, the categories with tags can be further divided into: categories with only host names, categories with only host information for advertising attributes, categories with two levels of classification for hosts and domain names, and categories for URLs for uniform resource locators for hosts and advertisements Or at least one of the categories where the domain name and the URL information of the advertisement are different.
具体匹配过程,可以为根据下述分类方法进行匹配:The specific matching process can be based on the following classification methods:
Type 1: Direct matching
Specifically, the string contains only a host name (that is, the description for serial number 2 in FIG. 9: contains only host information), for example (the example for serial number 2 in FIG. 9): ||9377os.com^. Strings of this kind can subsequently be matched further against the host-name string.
Type 2: Third matching
Specifically, the string includes only the host information of the site that owns the advertisement (that is, the description for serial number 3 in FIG. 9: contains only the information that a third-party website accesses the website that owns the advertisement), for example (serial number in FIG. 9): ||116b.com^$third-party. Within this category, such strings can subsequently be matched further against the host-name string.
Type 3: Domain_Direct matching
Specifically, character matching is subsequently performed according to a two-level classification by host and domain (that is, the description for serial number 4 in FIG. 9: contains the domain of the current web page and the host of the advertisement web page).
Type 4: Domain_Filter matching
Specifically, the string includes the URL information of the host and the advertisement (that is, the description for serial number 5 in FIG. 9: contains the domain and the advertisement content), for example, the following five strings:
/static/media/curl.swf$domain=duba.com
/banner.js$domain=28188.com|28188.net
/skin/tb12/$domain=17huohu.com|firefox.com.cn
.com/tps/$domain=ocucn.com
||cdndm.com/12/2016/$domain=1kkk.com|dm5.com
Based on the above five strings, further matching can be performed within Domain_Filter:
A string that contains the advertisement host name may be: ||cdndm.com/12/2016/$domain=1kkk.com|dm5.com
A string that contains no host name but contains the advertisement path may be: .com/tps/$domain=ocucn.com
A string that contains no host but contains the advertisement's file name may be:
/static/media/curl.swf$domain=duba.com
The same matching method can also be applied to the domain_filter under the classification by owning-host attribute. For example, it can be used for advertisements that contain pictures.
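The three Domain_Filter sub-cases above (host name present, path only, file name only) can be illustrated with a minimal Python sketch. The function name split_domain_filter and the heuristics used to tell a path from a file name (a trailing slash) are assumptions made for illustration; the patent does not specify how the sub-cases are detected.

```python
def split_domain_filter(rule: str):
    """Split a Domain_Filter string into (kind, pattern, page_domains).

    Assumption: the part before "$domain=" is the advertisement pattern and
    the part after it is a "|"-separated list of page domains.
    """
    pattern, _, options = rule.partition("$domain=")
    domains = options.split("|") if options else []
    if pattern.startswith("||"):
        kind = "host"   # pattern names the advertisement host
    elif pattern.endswith("/"):
        kind = "path"   # no host, pattern ends in a directory path (heuristic)
    else:
        kind = "file"   # no host, pattern ends in a file name (heuristic)
    return kind, pattern, domains


for example in (
    "||cdndm.com/12/2016/$domain=1kkk.com|dm5.com",
    ".com/tps/$domain=ocucn.com",
    "/static/media/curl.swf$domain=duba.com",
):
    print(split_domain_filter(example))
```

Grouping strings by the returned kind would let a matcher compare a URL only against the sub-case that can apply to it.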
Type 5: THIRD_FILTERS matching
Specifically, for example, whether the information of the visited web page contains the string ||books.com.tw/exep/ap/$third-party.
The only difference from the fourth type, Domain_filter, is that the domain the user is currently visiting and the host of the advertisement are different; the processing of the advertisement information is the same. It can therefore be handled in the same way as the Domain_filter part of the fourth type.
In addition, a further type may be included: Type_filter matching
Specifically, Domain_filter and Third_filter can be combined into Type_filter. When the visited page includes advertisement content information, Type_filter can include two subclasses, Domain_filter and Third_filter.
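As a rough illustration of how strings might be sorted into the categories described above, the following Python sketch classifies a string by whether its pattern is a bare host and by the option that follows the "$" sign. The function name, the regular expression, and the Domain_Direct example string are hypothetical; the patent gives no example for Domain_Direct and does not prescribe a classification procedure.

```python
import re

# Assumption: a "host only" pattern looks like "||host^" with no path component.
HOST_ONLY = re.compile(r"^\|\|[^/^$]+\^?$")


def classify_tagged_rule(rule: str) -> str:
    """Return the category name for one string of the tagged class."""
    pattern, _, option = rule.partition("$")
    host_only = bool(HOST_ONLY.match(pattern))
    if not option:
        return "Direct" if host_only else "other"
    if option == "third-party":
        return "Third" if host_only else "Third_Filter"
    if option.startswith("domain="):
        return "Domain_Direct" if host_only else "Domain_Filter"
    return "other"


print(classify_tagged_rule("||9377os.com^"))
print(classify_tagged_rule("||116b.com^$third-party"))
print(classify_tagged_rule("||ads.example^$domain=news.example"))  # hypothetical string
print(classify_tagged_rule("/banner.js$domain=28188.com|28188.net"))
print(classify_tagged_rule("||books.com.tw/exep/ap/$third-party"))
```

Under this sketch, Type_filter would simply be the union of the strings labelled Domain_Filter and Third_Filter.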
When a match is found, it is necessary to determine whether the information of the visited web page has been completely matched. If it has not, matching continues at level 4. If it has, the information is an advertisement, and the terminal intercepts the target information corresponding to the URL, thereby terminating the matching process.
The first child node in level 4 includes the character strings of the first-character-string category, and the second child node includes the character strings of the preset-character-string category. Any child node among the level-4 child nodes is in a parent-child relationship with one child node among the level-3 child nodes.
Specifically, during matching, the first-character-string category is used to filter information of the visited web page whose first character is the same as that of a string in this category, and the preset-character-string category is used to filter information of the visited web page that contains the same preset character string as a string in this category.
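One way to read the level-4 split is as an index keyed by first character plus a list of preset substrings. The sketch below follows that reading; the function names and the sample preset string "adframe" are hypothetical, and the patent does not state how the two categories are stored.

```python
from collections import defaultdict


def index_by_first_char(strings):
    """Group the first-character-string category by each string's first character."""
    index = defaultdict(list)
    for s in strings:
        if s:
            index[s[0]].append(s)
    return index


def level4_candidates(first_char_index, preset_strings, text):
    """Return the level-4 strings that are worth comparing against `text`."""
    # Strings that share their first character with the text.
    by_first_char = list(first_char_index.get(text[:1], []))
    # Preset strings that occur anywhere in the text.
    by_preset = [p for p in preset_strings if p in text]
    return by_first_char + by_preset


index = index_by_first_char(["/banner.js", "/skin/tb12/", ".com/tps/"])
print(level4_candidates(index, ["adframe"], "/banner.js?x=1"))
```

Only the strings in the selected bucket need a full comparison, which is the kind of reduction in match count the tree structure is meant to provide.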
When a match is found, it is necessary to determine whether the information of the visited web page has been completely matched. If it has not, matching continues at level 5, and so on, until the information of the visited web page has been completely matched. If it has, the information is an advertisement, and the terminal intercepts the target information corresponding to the URL, thereby terminating the matching process.
In summary, each child node of the tree structure has two functions: matching at least one of the URL of the page the user visits or the URLs of the elements of the visited web page against the character strings the child node includes; and, based on the string features contained in those URLs, assigning the matching child node at the next level.
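The two node functions summarized above (match a URL, then hand it to a matching child) can be sketched as a small class. This is a minimal illustration that assumes plain substring matching; the names Node, matches, next_child, and descend are not from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    strings: list = field(default_factory=list)   # strings held by this node
    children: list = field(default_factory=list)  # next-level child nodes

    def matches(self, url: str) -> bool:
        """Function 1: match the URL against this node's strings."""
        return any(s in url for s in self.strings)

    def next_child(self, url: str):
        """Function 2: pick the next-level child whose strings match the URL."""
        for child in self.children:
            if child.matches(url):
                return child
        return None


def descend(root: Node, url: str) -> list:
    """Walk from the root level by level and return the names of matched nodes."""
    path, node = [], root.next_child(url)
    while node is not None:
        path.append(node.name)
        node = node.next_child(url)
    return path


root = Node("root", children=[
    Node("white", ["good.example"]),
    Node("black", ["ads."], [Node("position", ["/banner"]), Node("preset", [".swf"])]),
])
print(descend(root, "http://ads.example/banner.js"))
```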
S1440: When the information of the visited web page includes the target information, intercept the target information.
Specifically, when it is determined in S1430 that the information of the visited web page has been completely matched, the information is an advertisement. The terminal intercepts (which may include deleting or hiding) the target information corresponding to the URL in the visited web page, thereby terminating the matching process. In terms of the resulting effect, the user cannot perceive the existence of the advertisement.
If it is determined that the information of the visited web page has not been completely matched, that is, the length of the matched portion of the URL does not exceed the preset threshold, the page the user visits contains no advertisement information, and the browser client can display it directly to the user.
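The interception step can be pictured as partitioning the element URLs of a page into those that are displayed and those that are deleted or hidden. The sketch below assumes a caller-supplied predicate is_target that stands in for the tree matching of S1430; both names are hypothetical.

```python
def intercept(element_urls, is_target):
    """Split element URLs into (kept, blocked) using the matching result."""
    kept, blocked = [], []
    for url in element_urls:
        # A fully matched URL is treated as target (advertisement) information.
        (blocked if is_target(url) else kept).append(url)
    return kept, blocked


page = ["http://news.example/logo.png", "http://ads.example/banner.js"]
kept, blocked = intercept(page, lambda url: "ads." in url)
print(kept)     # elements the browser still renders
print(blocked)  # elements that are deleted or hidden
```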
In this solution, the target information in the browser page is intercepted by means of the first data, which has a tree structure. The tree structure distinguishes the character strings in the first data in depth, which effectively reduces the number of times the information of the visited web page is matched against the first data. This avoids the problem that a large number of strings for intercepting target information, combined with the lack of a rational matching method, increases the number of matches. In addition, statistics are collected on the strings obtained from the open-source list in order to remove strings that are invalid or rarely accessed, which reduces the number of rules and therefore the number of subsequent matches.
FIG. 15 is a schematic structural diagram of a terminal for information interception according to an embodiment of the present invention. As shown in FIG. 15, the terminal 15 may include: one or more processors 1502, a transceiver 1501, a memory 1503, a plurality of application programs (not shown in the figure), and one or more computer programs. The one or more computer programs are stored in the memory and include instructions that, when executed by the terminal, cause the terminal to perform the following steps:
starting a browser to access a web page;
obtaining the information of the visited web page;
matching the information of the visited web page against first data arranged in a tree structure, where the first data is used to determine whether the information of the visited web page includes target information; and
intercepting the target information when the information of the visited web page includes the target information.
The tree structure may include multiple nodes, including a root node and at least one level of child nodes, where each of the at least one level includes at least two child nodes. The nodes of each level have a parent-child relationship with the associated nodes of the next level, and the first data is distributed over the nodes of the tree structure according to preset rules.
The terminal may specifically perform the following step:
matching the information of the visited web page level by level, from the first data of a parent node in the tree structure to the first data of the child nodes that are in a parent-child relationship with that parent node, until it is determined whether the information of the visited web page includes the target information.
The tree structure may specifically include m levels of child nodes, where each level is divided according to a different one of n preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m. The child nodes of level j are divided according to one preset rule selected from f preset rules, where the f preset rules are those of the n preset rules that remain after the preceding j-1 levels have made their selections, level j-1 is the level immediately above level j, level j is any one of the m levels, and j and f are integers greater than or equal to 1. Each of the n preset rules includes at least two categories of character strings. The first data includes multiple character strings, which are divided among the m levels of child nodes; each child node in the m levels corresponds to a different string category of the n preset rules, and the child nodes respectively include the character strings of those different categories.
The n preset rules may include at least one of the following: a blacklist/whitelist rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
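The relationship between n, m, j, and f can be made concrete with a short Python sketch: each level takes one rule from those not yet used, so level j chooses among f = n - (j - 1) rules. The function name, the rule labels, and the choice of always taking the first remaining rule are assumptions for illustration only.

```python
def assign_rules(rules, m):
    """Assign one distinct preset rule to each of m levels.

    Returns a list of (level j, number of rules f still available, chosen rule).
    """
    if m > len(rules):
        raise ValueError("n must be greater than or equal to m")
    remaining = list(rules)
    levels = []
    for j in range(1, m + 1):
        f = len(remaining)         # f = n - (j - 1) rules are left for level j
        chosen = remaining.pop(0)  # assumption: take the first remaining rule
        levels.append((j, f, chosen))
    return levels


for level in assign_rules(["black_white", "position_preset", "tag", "character"], 4):
    print(level)
```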
Specifically, the blacklist/whitelist rule may include a whitelist category and a blacklist category. The level-1 child nodes of the m levels are divided according to the blacklist/whitelist rule, and the strings in the first data that belong to the whitelist category and those that belong to the blacklist category each correspond to one child node of level 1.
The terminal may perform the following step: matching the information of the visited web page against the strings of the whitelist category; when the information includes a string of the whitelist category, the terminal determines that the information does not include the target information and does not intercept.
The terminal may further perform the following steps: when the information of the visited web page does not include a string of the whitelist category, matching the information against the strings of the blacklist category; when the information does not include a string of the blacklist category, the terminal determines that the information does not include the target information and does not intercept; when the information includes a string of the blacklist category, the terminal matches the information level by level against the child nodes that are in a parent-child relationship with the child node holding the blacklist-category strings, until it determines that the information has been completely matched, and then intercepts the target information in the information of the visited web page.
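The level-1 decision order described above (whitelist first, then blacklist, then the deeper levels) can be sketched as follows. Substring matching and the callback match_deeper, which stands in for levels 2 and below, are assumptions; the names are hypothetical.

```python
def decide(url, whitelist, blacklist, match_deeper):
    """Return "allow" or "block" for one URL following the level-1 order."""
    if any(s in url for s in whitelist):
        return "allow"   # whitelist hit: no target information, no interception
    if not any(s in url for s in blacklist):
        return "allow"   # neither list hit: nothing to intercept
    # Blacklist hit: continue level by level under the blacklist node.
    return "block" if match_deeper(url) else "allow"


def deeper(url):
    return "/banner" in url  # placeholder for the level 2+ matching


print(decide("http://good.example/ads.js", ["good.example"], ["ads."], deeper))
print(decide("http://ads.example/banner.js", ["good.example"], ["ads."], deeper))
```

Checking the whitelist first means a whitelisted page is released after a single comparison pass, without ever touching the deeper levels.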
The positioning and preset matching rule may specifically include a positioning-match category and a preset-match category. The level-2 child nodes of the m levels are divided according to the positioning and preset matching rule, and the strings in the first data that belong to the positioning-match category and those that belong to the preset-match category each correspond to one child node of level 2, where any child node of level 2 is in a parent-child relationship with the level-1 child node that holds the blacklist-category strings.
The positioning-match category may be used to filter at least one of the following from the information of the visited web page: information in which a character string exists at a first preset position, or information in which a separator exists at a second preset position. The preset-match category is used to filter at least one of the following from the information of the visited web page: information that has a prefix, or information that has a suffix.
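The two level-2 categories can be illustrated with three small predicates: a string at a fixed position, a separator at a fixed position, and a prefix or suffix. The function names and the set of separator characters are assumptions; the patent does not list which separators are meant.

```python
def positional_match(text, token, position):
    """Positioning match: `token` occurs in `text` exactly at `position`."""
    return text.startswith(token, position)


def separator_at(text, position, separators="/?&="):
    """Positioning match: a separator character sits at `position`."""
    return 0 <= position < len(text) and text[position] in separators


def preset_match(text, prefix=None, suffix=None):
    """Preset match: `text` starts with `prefix` or ends with `suffix`."""
    has_prefix = prefix is not None and text.startswith(prefix)
    has_suffix = suffix is not None and text.endswith(suffix)
    return has_prefix or has_suffix


url = "example.com/ads/1.js"
print(positional_match(url, "/ads/", 11))
print(separator_at(url, 11))
print(preset_match("banner.swf", suffix=".swf"))
```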
The tag attribute rule may specifically include a category with a tag and a category without a tag. The level-3 child nodes of the m levels are divided according to the tag attribute rule, and the strings in the first data that belong to the category with a tag and those that belong to the category without a tag each correspond to one child node of level 3, where any child node of level 3 is in a parent-child relationship with one child node of level 2.
The category with a tag may be used to filter, from the information of the visited web page, information that includes a tag attribute, and the category without a tag is used to filter information that does not include a tag attribute. The category with a tag specifically includes at least one of the following: a category with only a host name, a category with only host information of the advertisement attribute, a category with a two-level classification by host and domain name, a category with the uniform resource locator (URL) information of the host and the advertisement, or a category in which only the domain name and the URL information of the advertisement differ.
The character rule may include a first-character-string category and a preset-character-string category. The level-4 child nodes of the m levels are divided according to the character rule, and the strings in the first data that belong to the first-character-string category and those that belong to the preset-character-string category each correspond to one child node of level 4, where any child node of level 4 is in a parent-child relationship with one child node of level 3.
The first-character-string category may be used to filter information of the visited web page whose first character is the same as that of a string in this category; the preset-character-string category is used to filter information of the visited web page that contains the same preset character string as a string in this category.
In the above steps, the information of the visited web page may include the URL of the page the user visits or the URLs of the elements of the visited web page, and the target information is advertisement information. The first data is obtained by the server by performing tree transformation on second data. The second data includes valid strings and the browser's custom strings, where the valid strings are strings whose usage rate is greater than a preset threshold, determined by filtering the open-source strings from open-source websites and the historical data reported by terminals within a preset time period.
Because the terminal downloads the first data from the server and the entire matching process runs on the terminal, this approach greatly increases the speed at which the terminal matches information and avoids the prior-art problem that the server needs high performance to process page content quickly.
In this solution, the terminal intercepts the target information in the browser page by means of the first data, which has a tree structure. The tree structure distinguishes the character strings in the first data in depth, which effectively reduces the number of times the information of the visited web page is matched against the first data. This avoids the problem that a large number of strings for intercepting target information, combined with the lack of a rational matching method, increases the number of matches.
FIG. 16 is a schematic structural diagram of a server for data processing according to an embodiment of the present invention. As shown in FIG. 16, the server 16 may include: one or more processors 1601, a transceiver 1602, a memory 1603, a plurality of application programs, and one or more computer programs. The one or more computer programs are stored in the memory and include instructions that, when executed by the server, cause the server to perform the following steps:
performing tree transformation on second data to determine first data; and
sending the first data to a terminal, so that the terminal determines, according to the first data, whether the visited web page contains target information.
The target information may be advertisement information, and the information of the visited web page includes at least one of the URL of the page the user visits or the URLs of the elements of the visited web page.
The server may specifically perform the following steps: periodically obtaining at least one open-source string from an open-source website; selecting, from the at least one open-source string and the historical data reported by clients within a preset time period, multiple strings whose access count is greater than a first threshold as valid strings; obtaining the custom strings of the browser server; and determining the second data according to the valid strings and the custom strings, where the valid strings and the custom strings each include at least one string.
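The server-side selection of valid strings can be sketched as a count-and-filter step followed by a merge with the custom strings. The sketch assumes the reported historical data is a flat list of the strings that were hit on clients; the function name, parameter names, and that data shape are hypothetical.

```python
from collections import Counter


def build_second_data(open_source, reported_hits, custom, threshold):
    """Keep open-source strings hit more than `threshold` times, then add custom strings."""
    counts = Counter(reported_hits)  # access count per string in the time window
    valid = [s for s in open_source if counts[s] > threshold]
    merged, seen = [], set()
    for s in valid + list(custom):   # de-duplicate while preserving order
        if s not in seen:
            seen.add(s)
            merged.append(s)
    return merged


second_data = build_second_data(
    open_source=["||a.example^", "||b.example^", "/x.js"],
    reported_hits=["||a.example^"] * 3 + ["/x.js"],
    custom=["||c.example^", "||a.example^"],
    threshold=1,
)
print(second_data)
```

Dropping rarely hit strings before the tree is built keeps the tree small, which is the stated purpose of the statistics step.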
The server may specifically perform the following steps: dividing multiple child nodes into m levels according to n preset rules, where the preset rule of each of the m levels is different; each of the n preset rules includes at least two categories of character strings, and each of the m levels is divided into at least two child nodes according to the string categories; the second data includes multiple character strings, and the child nodes respectively include the character strings that belong to the different string categories; n and m are integers greater than or equal to 1, and n is greater than or equal to m; each child node of level k has a parent-child relationship with one child node of level k-1, where level k is any one of the m levels and k is an integer greater than or equal to 1.
The n preset rules may include at least one of the following: a blacklist/whitelist rule, a positioning and preset matching rule, a tag attribute rule, or a character rule. The server performs the following step: dividing the multiple child nodes into m levels of child nodes according to the blacklist/whitelist rule, the positioning and preset matching rule, the tag attribute rule, and the character rule.
The server may specifically perform the following step: when the blacklist/whitelist rule includes a whitelist category and a blacklist category, dividing level 1 of the m levels into two child nodes according to these two categories, where one child node includes the strings in the second data that belong to the whitelist category and the other includes the strings in the second data that belong to the blacklist category.
The server may specifically perform the following step: when the positioning and preset matching rule includes a positioning-match category and a preset-match category, dividing level 2 of the m levels into two child nodes according to these two categories, where one child node includes the strings in the second data that belong to the positioning-match category and the other includes the strings in the second data that belong to the preset-match category, and the two child nodes of level 2 are in a parent-child relationship with the level-1 node that holds the blacklist-category strings.
The server may specifically perform the following step: when the tag attribute rule includes a category with a tag and a category without a tag, dividing level 3 of the m levels into two child nodes according to these two categories, where one child node includes the strings in the second data that belong to the category with a tag and the other includes the strings in the second data that belong to the category without a tag, and any child node of level 3 is in a parent-child relationship with one child node of level 2.
The category with a tag may include at least one of the following: a category with only a host name, a category with only host information of the advertisement attribute, a category with a two-level classification by host and domain name, a category with the uniform resource locator (URL) information of the host and the advertisement, or a category in which only the domain name and the URL information of the advertisement differ.
The server may specifically perform the following step: when the character rule includes a first-character-string category and a preset-character-string category, dividing level 4 of the m levels into two child nodes according to these two categories, where one child node includes the strings in the second data that belong to the first-character-string category and the other includes the strings in the second data that belong to the preset-character-string category, and any child node of level 4 is in a parent-child relationship with one child node of level 3.
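The tree transformation performed by the server can be sketched as a recursive partition: at each level a classifier assigns every string to a category, and each category becomes a child node. This is a simplified illustration that applies every level to every branch, whereas the description attaches level 2 only under the blacklist node. The function name and the two classifiers are assumptions; in particular, treating a leading "@@" as the whitelist marker is an assumption and is not stated in the text.

```python
def build_tree(strings, level_rules):
    """Partition `strings` level by level into nested dicts of category -> subtree."""
    if not level_rules:
        return list(strings)  # leaf: the strings that reached this node
    classify, rest = level_rules[0], level_rules[1:]
    groups = {}
    for s in strings:
        groups.setdefault(classify(s), []).append(s)
    return {category: build_tree(members, rest) for category, members in groups.items()}


def black_or_white(s):
    return "white" if s.startswith("@@") else "black"  # assumed whitelist marker


def tagged_or_not(s):
    return "tagged" if "$" in s else "untagged"


tree = build_tree(
    ["@@||ok.example^", "||ads.example^$third-party", "/banner.js"],
    [black_or_white, tagged_or_not],
)
print(tree)
```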
In this solution, tree transformation is performed on the second data. The tree structure distinguishes the character strings in the second data in depth and converts them into a highly discriminating tree, which effectively reduces the number of times the information of the visited web page is matched against the first data.
FIG. 17 is a schematic structural diagram of an apparatus for information interception according to an embodiment of the present invention. As shown in FIG. 17, the apparatus 17 may include:
a processing module 1702, configured to start a browser to access a web page; and
a transceiver module 1701, configured to obtain the information of the visited web page.
The processing module is further configured to match the information of the visited web page against first data arranged in a tree structure, where the first data is used to determine whether the information of the visited web page includes target information, and to intercept the target information when the information of the visited web page includes the target information.
The tree structure may include multiple nodes, including a root node and at least one level of child nodes, where each of the at least one level includes at least two child nodes. The nodes of each level have a parent-child relationship with the associated nodes of the next level, and the first data is distributed over the nodes of the tree structure according to preset rules.
The processing module may be specifically configured to match the information of the visited web page level by level, from the first data of a parent node in the tree structure to the first data of the child nodes that are in a parent-child relationship with that parent node, until it is determined whether the information of the visited web page includes the target information.
上述树形结构可以包括m级子节点,m级子节点中的每一级子节点按照n种预设规则中不同的预设规则划分,n、m均为大于等于1的整数,n大于等于m;第j级子节点从f种预 设规则中选择1种预设规则进行划分,f种预设规则为n种预设规则中前j-1级子节点选择剩余的预设规则,j-1级子节点为j级子节点的上一级子节点,j级子节点为m级子节点中的任意一级子节点,j和f均为大于等于1的整数;n种预设规则中的每一种分别包括至少两个字符串的类别;第一数据包括多个字符串,第一数据的字符串按m级子节点划分,m级子节点中的每个子节点分别对应n种预设规则中的不同的字符串的类别,每个子节点包括具有不同的字符串的类别的多个字符串。The above tree structure may include m-level child nodes, and each level of the m-level child nodes is divided according to different preset rules among the n types of preset rules, where n and m are integers greater than or equal to 1, and n is greater than or equal to m; the j-th level child node selects one preset rule from the f preset rules to divide, and the f preset rules are the remaining j-1 level sub-nodes from the n preset rules to select the remaining preset rules, j The -1 level child node is a level child node of the j level child node, the j level child node is any level child node of the m level child node, and j and f are integers greater than or equal to 1; n kinds of preset rules Each of them includes categories of at least two character strings; the first data includes multiple character strings, and the character string of the first data is divided by m-level child nodes, and each of the m-level child nodes corresponds to n types respectively Different types of character strings in a preset rule, and each child node includes multiple character strings having different types of character strings.
其中,n种预设规则可以包括下述至少一种规则:黑白名单规则、定位和预设匹配规则、标签属性规则或字符规则。The n preset rules may include at least one of the following rules: a black and white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
The blacklist/whitelist rule may include a whitelist category and a blacklist category. The first-level child nodes among the m levels of child nodes are divided according to the blacklist/whitelist rule, and the character strings in the first data that belong to the whitelist category and those that belong to the blacklist category each correspond to one of the first-level child nodes.
The processing module may be specifically configured to match the information of the visited web page against the character strings of the whitelist category; when the information of the visited web page includes a character string of the whitelist category, it is determined that the information does not include the target information, and no target information is intercepted.
The processing module may be specifically configured to: when the information of the visited web page does not include any character string of the whitelist category, match the information against the character strings of the blacklist category; when the information does not include any character string of the blacklist category, determine that it does not include the target information, and intercept no target information; and when the information includes a character string of the blacklist category, match the information level by level against the child nodes that have a parent-child relationship with the child node holding the blacklist-category character strings, until the information of the visited web page has been fully matched, and then intercept the target information in the information of the visited web page.
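The whitelist-first, then blacklist, then level-by-level order described above can be sketched as follows. This is an illustrative model, not the patented implementation: the `Node` class, the `accepts` predicate that decides whether a URL falls under a node's category, and the sample strings are all assumptions, and plain substring containment stands in for whatever matching the terminal actually performs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    accepts: Callable[[str], bool]                       # does the URL fall under this category?
    strings: List[str] = field(default_factory=list)     # blacklist strings stored at this node
    children: List["Node"] = field(default_factory=list)


def find_match(node: Node, url: str) -> Optional[str]:
    """Descend only into subtrees whose category accepts the URL."""
    if not node.accepts(url):
        return None
    for s in node.strings:
        if s in url:
            return s
    for child in node.children:
        hit = find_match(child, url)
        if hit is not None:
            return hit
    return None


def decide(url: str, whitelist: List[str], blacklist_root: Node) -> str:
    # A whitelist hit ends matching immediately: nothing is intercepted.
    if any(w in url for w in whitelist):
        return "allow"
    # Otherwise walk the blacklist subtree level by level.
    return "intercept" if find_match(blacklist_root, url) is not None else "allow"


# Hypothetical sample data.
whitelist = ["example.org/static"]
blacklist_root = Node(
    accepts=lambda u: True,
    children=[
        Node(accepts=lambda u: u.startswith("https://ads."), strings=["ads.tracker.test"]),
        Node(accepts=lambda u: u.endswith(".gif"), strings=["banner"]),
    ],
)
```

Because a subtree is skipped as soon as its category rejects the URL, the strings stored beneath it are never compared, which is the source of the reduced matching count the description refers to.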
The positioning and preset matching rule may include a positioning-match category and a preset-match category. The second-level child nodes among the m levels of child nodes are divided according to the positioning and preset matching rule, and the character strings in the first data that belong to the positioning-match category and those that belong to the preset-match category each correspond to one of the second-level child nodes, where any one of the second-level child nodes has a parent-child relationship with the first-level child node holding the blacklist-category character strings.
The positioning-match category may be used to select, from the information of the visited web page, at least one of information in which a character string is present at a first preset position or information in which a separator is present at a second preset position; the preset-match category is used to select, from the information of the visited web page, at least one of information that has a prefix or information that has a suffix.
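The two categories above amount to position-based checks and prefix/suffix checks on a URL. A minimal sketch follows; the function names, the separator set, and the sample URL are assumptions chosen for illustration and do not come from the description.

```python
def string_at(url: str, position: int, token: str) -> bool:
    """Positioning match: is `token` present at the given position of the URL?"""
    return url.startswith(token, position)


def separator_at(url: str, position: int, separators: str = "/?&=:") -> bool:
    """Positioning match: is a separator character present at the given position?"""
    return 0 <= position < len(url) and url[position] in separators


def preset_match(url: str, prefix: str = "", suffix: str = "") -> bool:
    """Preset match: does the URL carry the given prefix and/or suffix?"""
    if not prefix and not suffix:
        return False
    return url.startswith(prefix) and url.endswith(suffix)


url = "https://ads.example.test/a.gif"  # hypothetical URL
```

Here `string_at(url, 8, "ads.")` inspects the characters right after the scheme, and `preset_match(url, suffix=".gif")` checks only the tail of the URL.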
The tag attribute rule may include a tagged category and an untagged category. The third-level child nodes among the m levels of child nodes are divided according to the tag attribute rule, and the character strings in the first data that belong to the tagged category and those that belong to the untagged category each correspond to one of the third-level child nodes, where any one of the third-level child nodes has a parent-child relationship with one of the second-level child nodes.
The tagged category may be used to select information of the visited web page that includes a tag attribute, and the untagged category is used to select information of the visited web page that does not include a tag attribute. The tagged category specifically includes at least one of: a category with only a host name, a category with only host information carrying an advertisement attribute, a category with a two-level classification by host and domain name, a category with uniform resource locator (URL) information of the host and the advertisement, or a category in which only the domain name and the URL information of the advertisement differ.
The character rule may include a first-character-string category and a preset-character-string category. The fourth-level child nodes among the m levels of child nodes are divided according to the character rule, and the character strings in the first data that belong to the first-character-string category and those that belong to the preset-character-string category each correspond to one of the fourth-level child nodes, where any one of the fourth-level child nodes has a parent-child relationship with one of the third-level child nodes.
The first-character-string category may be used to select information of the visited web page whose first character is the same as that of a character string of the first-character-string category; the preset-character-string category is used to select information of the visited web page that has the same preset character string as a character string of the preset-character-string category.
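One way to picture the first-character category is as an index keyed by first character, so that a URL is compared only with strings that share its first character. The sketch below is an assumption-laden illustration: the description does not say which part of the URL supplies the first character, so the host name is used here, and the function names and sample strings are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_first_char_index(strings):
    """Group blacklist strings by their first character."""
    index = defaultdict(list)
    for s in strings:
        if s:
            index[s[0]].append(s)
    return index


def candidates(index, url):
    """Return only the strings whose first character equals that of the URL's host."""
    host = urlsplit(url).hostname or ""
    return index.get(host[:1], [])


index = build_first_char_index(["ads.test", "analytics.test", "banner.test"])
```

A URL on host `ads.test` is then compared with the two strings starting with `a` and never with `banner.test`.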
The information of the visited web page may include the URL of the page visited by the user or the URLs of the individual elements of the visited web page, and the target information is advertisement information. The first data may be obtained by a server performing tree transformation processing on second data, where the second data includes valid character strings and custom character strings of the browser; the valid character strings are character strings whose usage rate is determined to exceed a preset threshold by screening the open-source character strings from an open-source website together with the historical data reported within a preset time period.
In this solution, the apparatus intercepts target information in a browser page by means of first data having a tree structure. The tree structure allows the character strings in the first data to be finely differentiated, which effectively reduces the number of times the information of the visited web page is matched against the first data. This avoids the increase in the number of matching operations that occurs when there are many character strings for intercepting target information and no rational matching scheme; overall, matching speed can be improved by more than 40%.
FIG. 18 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. As shown in FIG. 18, the apparatus 18 includes:
a processing module 1802, configured to perform tree transformation processing on second data to determine first data; and
a transceiver module 1801, configured to send the first data to a terminal, so that the terminal determines accordingly whether a visited web page contains target information.
The target information may be advertisement information; the information of the visited web page includes at least one of the URL of the page visited by the user or the URLs of the individual elements of the visited web page.
The transceiver module may be further configured to periodically obtain at least one open-source character string from an open-source website, and to obtain custom character strings of a browser server. The processing module may be further configured to select, from the at least one open-source character string and the historical data reported by clients within a preset time period, a plurality of character strings whose access count is greater than a first threshold as valid character strings. The processing module may be further configured to: determine the second data according to the valid character strings and the custom character strings, each of which includes at least one character string; and divide a plurality of child nodes into m levels according to n preset rules, where the preset rule used at each of the m levels is different. Each of the n preset rules includes at least two character-string categories, and each of the m levels is divided into at least two child nodes according to the character-string categories. The second data includes a plurality of character strings, and each child node includes a plurality of character strings belonging to different character-string categories; n and m are integers greater than or equal to 1, and n is greater than or equal to m. Each child node in the k-th level has a parent-child relationship with one child node in the (k-1)-th level, where the k-th level is any one of the m levels and k is an integer greater than or equal to 1.
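The selection of valid character strings and the assembly of the second data can be sketched as below. The sketch assumes that the reported historical data is a flat list of the strings that matched on clients, so that the access count is simply the number of occurrences; the function names, the de-duplication step, and the sample values are likewise assumptions made for illustration.

```python
from collections import Counter


def select_valid_strings(open_source_strings, reported_hits, first_threshold):
    """Keep open-source strings whose reported access count exceeds the threshold."""
    counts = Counter(reported_hits)
    return [s for s in open_source_strings if counts[s] > first_threshold]


def build_second_data(valid_strings, custom_strings):
    """Combine valid and custom strings, dropping duplicates but keeping order."""
    seen = set()
    second_data = []
    for s in list(valid_strings) + list(custom_strings):
        if s not in seen:
            seen.add(s)
            second_data.append(s)
    return second_data


# Hypothetical sample data.
open_source = ["ads.test", "old.test", "banner"]
hits = ["ads.test"] * 3 + ["banner"] * 2 + ["old.test"]
valid = select_valid_strings(open_source, hits, first_threshold=1)
second_data = build_second_data(valid, ["promo.test", "banner"])
```

In this sample, `old.test` is reported only once and is dropped, while the browser's own `promo.test` is added regardless of its access count.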
The n preset rules may include at least one of the following: a blacklist/whitelist rule, a positioning and preset matching rule, a tag attribute rule, or a character rule. The processing module may be further configured to divide the plurality of child nodes into the m levels of child nodes according to the blacklist/whitelist rule, the positioning and preset matching rule, the tag attribute rule, and the character rule.
The processing module may be specifically configured to: when the blacklist/whitelist rule includes a whitelist category and a blacklist category, divide the first level of the m levels of child nodes into two child nodes according to the whitelist category and the blacklist category, where one of the two child nodes includes the character strings in the second data that belong to the whitelist category, and the other child node includes the character strings in the second data that belong to the blacklist category.
The processing module may be specifically configured to: when the positioning and preset matching rule includes a positioning-match category and a preset-match category, divide the second level of the m levels of child nodes into two child nodes according to the positioning-match category and the preset-match category, where one of the two child nodes includes the character strings in the second data that belong to the positioning-match category, and the other child node includes the character strings in the second data that belong to the preset-match category; the two child nodes in the second level have a parent-child relationship with the first-level node holding the blacklist-category character strings.
The processing module may be specifically configured to: when the tag attribute rule includes a tagged category and an untagged category, divide the third level of the m levels of child nodes into two child nodes according to the tagged category and the untagged category, where one of the two child nodes includes the character strings in the second data that belong to the tagged category, and the other child node includes the character strings in the second data that belong to the untagged category; any one of the child nodes in the third level has a parent-child relationship with one of the second-level child nodes.
The tagged category may specifically include at least one of: a category with only a host name, a category with only host information carrying an advertisement attribute, a category with a two-level classification by host and domain name, a category with uniform resource locator (URL) information of the host and the advertisement, or a category in which only the domain name and the URL information of the advertisement differ.
The processing module may be specifically configured to: when the character rule includes a first-character-string category and a preset-character-string category, divide the fourth level of the m levels of child nodes into two child nodes according to the first-character-string category and the preset-character-string category, where one of the two child nodes includes the character strings in the second data that belong to the first-character-string category, and the other child node includes the character strings in the second data that belong to the preset-character-string category; any one of the child nodes in the fourth level has a parent-child relationship with one of the third-level child nodes.
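The four-level division performed on the server side can be pictured as sorting each string of the second data into a nested structure, with one level per rule. The sketch below is hypothetical throughout: the description does not define how a string is recognized as belonging to a category, so the markers used here (`@@` for whitelist entries, a leading `|` or a `^` for positioning, `$` for a tag attribute, an alphanumeric first character for the first-character category) are assumed conventions chosen only to make the example runnable.

```python
def build_tree(second_data):
    """Sort strings into whitelist / blacklist, then three further levels."""
    tree = {"whitelist": [], "blacklist": {}}
    for s in second_data:
        # Level 1: blacklist/whitelist rule (assumed marker "@@").
        if s.startswith("@@"):
            tree["whitelist"].append(s[2:])
            continue
        # Level 2: positioning vs. preset match (assumed markers "|" and "^").
        level2 = "positioning" if s.startswith("|") or "^" in s else "preset"
        # Level 3: tagged vs. untagged (assumed marker "$").
        level3 = "tagged" if "$" in s else "untagged"
        # Level 4: first-character vs. preset-string (assumed test on the first char).
        level4 = "first_char" if s[:1].isalnum() else "preset_string"
        bucket = (
            tree["blacklist"]
            .setdefault(level2, {})
            .setdefault(level3, {})
            .setdefault(level4, [])
        )
        bucket.append(s)
    return tree


# Hypothetical second data.
tree = build_tree(["@@good.test", "||ads.test^", "banner.gif", "/track.js$script"])
```

Only the blacklist branch is subdivided further, which mirrors the statement that the second-level child nodes hang off the first-level node holding the blacklist-category strings.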
In this solution, tree transformation processing is performed on the second data. The tree structure allows the character strings in the second data to be finely differentiated and converts them into a highly discriminating tree structure, which effectively reduces the number of times the information of the visited web page is matched against the first data.
The embodiments of the present invention provide an information interception method, apparatus, and terminal. By compiling statistics on the large number of character strings in an open-source list, character strings that are invalid or rarely accessed are effectively removed, reducing the number of character strings. On this basis, the second data is converted into first data having a tree structure, which is used to intercept target information in a browser page. The tree structure allows the character strings in the first data to be finely differentiated, which effectively reduces the number of times the information of the visited web page is matched against the first data and avoids the increase in the number of matching operations that occurs when there are many character strings for intercepting target information and no rational matching scheme. In actual statistics, the overall matching speed can be improved by more than 40%. Specifically, when the tree structure is divided, the character strings are analyzed as a tree, and the second data is divided using the blacklist/whitelist rule, the positioning and preset matching rule, the tag attribute rule, or the character rule. This approach finely differentiates the character strings and converts them into a highly discriminating tree structure, which greatly increases the speed at which the browser client intercepts advertisements and effectively improves the user experience.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.

Claims (30)

  1. An information interception terminal, comprising: one or more processors, a transceiver, a memory, a plurality of application programs, and one or more computer programs, wherein the one or more computer programs are stored in the memory and include instructions that, when executed by the terminal, cause the terminal to perform the following steps:
    launching a browser to access a web page;
    obtaining information of the visited web page;
    matching the information of the visited web page against first data arranged in a tree structure, wherein the first data is used to determine whether the information of the visited web page includes target information; and
    intercepting the target information when the information of the visited web page includes the target information.
  2. The terminal according to claim 1, wherein the tree structure includes a plurality of nodes, the plurality of nodes include a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes;
    the nodes of each level have a parent-child relationship with the associated nodes of the next level, and the first data is distributed, according to preset rules, over the plurality of nodes forming the tree structure.
  3. The terminal according to claim 2, wherein the terminal performs the following step:
    matching the information of the visited web page level by level, from the first data of a parent node of the tree structure to the first data of the child nodes that have a parent-child relationship with the parent node, until it is determined whether the information of the visited web page includes the target information.
  4. The terminal according to claim 3, wherein the tree structure includes m levels of child nodes, each level of the m levels of child nodes is divided according to a different preset rule among n preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
    the j-th level of child nodes is divided according to one preset rule selected from f preset rules, the f preset rules are the preset rules remaining among the n preset rules after the preceding j-1 levels of child nodes have made their selections, the (j-1)-th level of child nodes is the level immediately above the j-th level of child nodes, the j-th level of child nodes is any one of the m levels of child nodes, and j and f are integers greater than or equal to 1;
    each of the n preset rules includes at least two character-string categories;
    the first data includes a plurality of character strings, the character strings of the first data are divided among the m levels of child nodes, each child node in the m levels of child nodes corresponds to a different character-string category of the n preset rules, and each child node includes a plurality of character strings belonging to different character-string categories.
  5. The terminal according to claim 4, wherein the n preset rules include at least one of the following rules:
    a blacklist/whitelist rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
  6. The terminal according to claim 5, wherein the blacklist/whitelist rule includes a whitelist category and a blacklist category, the first-level child nodes among the m levels of child nodes are divided according to the blacklist/whitelist rule, and the character strings in the first data that belong to the whitelist category and the character strings that belong to the blacklist category each correspond to one of the first-level child nodes.
  7. The terminal according to claim 6, wherein the terminal performs the following step:
    the terminal matches the information of the visited web page against the character strings of the whitelist category, and when the information of the visited web page includes a character string of the whitelist category, the terminal determines that the information of the visited web page does not include the target information.
  8. The terminal according to claim 7, wherein the terminal further performs the following steps: when the information of the visited web page does not include a character string of the whitelist category, matching the information of the visited web page against the character strings of the blacklist category;
    when the information of the visited web page does not include a character string of the blacklist category, the terminal determines that the information of the visited web page does not include the target information;
    when the information of the visited web page includes a character string of the blacklist category, the terminal matches the information of the visited web page level by level against the child nodes that have a parent-child relationship with the child node holding the character strings of the blacklist category, until it is determined that the information of the visited web page has been fully matched, and the terminal intercepts the target information in the information of the visited web page.
  9. The terminal according to claim 8, wherein the positioning and preset matching rule includes a positioning-match category and a preset-match category, the second-level child nodes among the m levels of child nodes are divided according to the positioning and preset matching rule, and the character strings in the first data that belong to the positioning-match category and the character strings that belong to the preset-match category each correspond to one of the second-level child nodes, wherein any one of the second-level child nodes has a parent-child relationship with the first-level child node holding the character strings of the blacklist category.
  10. The terminal according to claim 9, wherein the positioning-match category is used to select, from the information of the visited web page, at least one of information in which a character string is present at a first preset position or information in which a separator is present at a second preset position;
    the preset-match category is used to select, from the information of the visited web page, at least one of information that has a prefix or information that has a suffix.
  11. The terminal according to claim 9, wherein the tag attribute rule includes a tagged category and an untagged category, the third-level child nodes among the m levels of child nodes are divided according to the tag attribute rule, and the character strings in the first data that belong to the tagged category and the character strings that belong to the untagged category each correspond to one of the third-level child nodes, wherein any one of the third-level child nodes has a parent-child relationship with one of the second-level child nodes.
  12. The terminal according to claim 11, wherein the tagged category is used to select information of the visited web page that includes a tag attribute, and the untagged category is used to select information of the visited web page that does not include a tag attribute; wherein
    the tagged category specifically includes at least one of: a category with only a host name, a category with only host information carrying an advertisement attribute, a category with a two-level classification by host and domain name, a category with uniform resource locator (URL) information of the host and the advertisement, or a category in which only the domain name and the URL information of the advertisement differ.
  13. The terminal according to claim 11, wherein the character rule includes a first-character-string category and a preset-character-string category, the fourth-level child nodes among the m levels of child nodes are divided according to the character rule, and the character strings in the first data that belong to the first-character-string category and the character strings that belong to the preset-character-string category each correspond to one of the fourth-level child nodes, wherein any one of the fourth-level child nodes has a parent-child relationship with one of the third-level child nodes.
  14. The terminal according to claim 13, wherein the first-character-string category is used to select information of the visited web page whose first character is the same as that of a character string of the first-character-string category;
    the preset-character-string category is used to select information of the visited web page that has the same preset character string as a character string of the preset-character-string category.
  15. The terminal according to any one of claims 1 to 14, wherein the information of the visited web page includes the URL of the page visited by the user or the URLs of the individual elements of the visited web page, and the target information is advertisement information.
  16. The terminal according to claim 1, wherein the first data is obtained by a server performing tree transformation processing on the second data, the second data includes valid character strings and custom character strings of the browser, and the valid character strings are character strings whose usage rate is determined to be greater than a preset threshold by screening the open-source character strings from an open-source website and the historical data reported by the terminal within a preset time period.
  17. A data processing server, comprising: one or more processors, a transceiver, a memory, a plurality of application programs, and one or more computer programs, wherein the one or more computer programs are stored in the memory and include instructions that, when executed by the server, cause the server to perform the following steps:
    performing tree transformation processing on second data to determine first data; and
    sending, by the server, the first data to a terminal, so that the terminal determines accordingly whether a visited web page contains target information.
  18. The server according to claim 17, wherein the target information is advertisement information;
    the information of the visited web page includes at least one of the URL of the page visited by the user or the URLs of the individual elements of the visited web page.
  19. The server according to claim 17, wherein the server performs the following steps:
    periodically obtaining at least one open-source character string from an open-source website;
    selecting, from the at least one open-source character string and the historical data reported by the client within a preset time period, a plurality of character strings whose access count is greater than a first threshold as valid character strings;
    obtaining custom character strings of a browser server; and
    determining the second data according to the valid character strings and the custom character strings, wherein the valid character strings and the custom character strings each include at least one character string.
  20. The server according to claim 17 or 19, wherein the server performs the following steps:
    dividing the second data into m levels according to n preset rules, wherein the preset rule of each level of the m levels of child nodes is different;
    each of the n preset rules includes at least two character-string categories, and each level of the m levels is divided into at least two child nodes according to the character-string categories;
    the second data includes a plurality of character strings, each child node includes a plurality of character strings belonging to different character-string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
    each child node in the k-th level of child nodes has a parent-child relationship with one child node in the (k-1)-th level, the k-th level of child nodes is any one of the m levels of child nodes, and k is an integer greater than or equal to 1.
  21. The server according to claim 20, wherein the n preset rules include at least one of the following rules: a blacklist/whitelist rule, a positioning and preset matching rule, a tag attribute rule, or a character rule; and the server performs the following step:
    dividing a plurality of child nodes into the m levels of child nodes according to the blacklist/whitelist rule, the positioning and preset matching rule, the tag attribute rule, and the character rule.
  22. The server according to claim 21, wherein the server performs the following step:
    when the blacklist/whitelist rule includes a whitelist category and a blacklist category, dividing the first level of the m levels of child nodes into two child nodes according to the whitelist category and the blacklist category, wherein one of the two child nodes includes the character strings in the second data that belong to the whitelist category, and the other child node includes the character strings in the second data that belong to the blacklist category.
  23. The server according to claim 22, wherein the server performs the following step:
    when the positioning and preset matching rule includes a positioning-match category and a preset-match category, dividing the second level of the m levels of child nodes into two child nodes according to the positioning-match category and the preset-match category, wherein one of the two child nodes includes the character strings in the second data that belong to the positioning-match category, and the other child node includes the character strings in the second data that belong to the preset-match category, and wherein the two child nodes at the second level are children of the first-level node that holds the character strings of the blacklist category.
  24. The server according to claim 23, wherein the server performs the following step:
    when the tag attribute rule includes a tagged category and an untagged category, dividing the third level of the m levels of child nodes into two child nodes according to the tagged category and the untagged category, wherein one of the two child nodes includes the character strings in the second data that belong to the tagged category, and the other child node includes the character strings in the second data that belong to the untagged category, and wherein each child node at the third level is a child of one of the child nodes at the second level.
  25. The server according to claim 24, wherein the tagged category specifically includes at least one of: a category with only a host name, a category with only host information having an advertisement attribute, a category with a two-level classification by host and domain name, a category with host and advertisement uniform resource locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
  26. The server according to claim 24 or 25, wherein the server performs the following step:
    when the character rule includes a first-character-string category and a preset-character-string category, dividing the fourth level of the m levels of child nodes into two child nodes according to the first-character-string category and the preset-character-string category, wherein one of the two child nodes includes the character strings in the second data that belong to the first-character-string category, and the other child node includes the character strings in the second data that belong to the preset-character-string category, and wherein each child node at the fourth level is a child of one of the child nodes at the third level.
  27. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the following steps:
    launching a browser to access a web page;
    obtaining information of the accessed web page;
    matching the information of the accessed web page against first data arranged in a tree structure, wherein the first data is used to determine whether the information of the accessed web page includes target information;
    intercepting the target information when the information of the accessed web page includes the target information.
  28. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the following steps:
    performing tree conversion processing on second data to determine first data;
    sending, by the server, the first data to a terminal, so that the terminal can determine, according to the first data, whether an accessed web page contains target information.
  29. A computer program product containing instructions that, when run on a computer, cause the computer to perform the following steps:
    launching a browser to access a web page;
    obtaining information of the accessed web page;
    matching the information of the accessed web page against first data arranged in a tree structure, wherein the first data is used to determine whether the information of the accessed web page includes target information;
    intercepting the target information when the information of the accessed web page includes the target information.
  30. A computer program product containing instructions that, when run on a computer, cause the computer to perform the following steps:
    performing tree conversion processing on second data to determine first data;
    sending, by the server, the first data to a terminal, so that the terminal can determine, according to the first data, whether an accessed web page contains target information.
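The level-by-level division in claims 20 to 26 and the matching step in claims 27 and 29 can be illustrated with a minimal sketch. It is not the applicant's implementation. The names `Node`, `build_tree`, `should_block`, the category labels, and the sample rule strings are all hypothetical. The sketch also assumes that each rule string arrives already labeled with its per-level categories, and that matching is a plain substring test, which the claims do not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One child node of the tree; rule strings are stored where their path ends."""
    category: str
    strings: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

    def child(self, category):
        # Create the child node for this category on first use.
        return self.children.setdefault(category, Node(category))


def build_tree(rules):
    """Arrange (string, categories) pairs into a tree, one level per category.

    `categories` is a tuple such as
    ("black", "preset", "tagged", "preset_string"); whitelist entries carry
    only the level-1 category, mirroring claim 23, where levels 2 and below
    hang off the blacklist node.
    """
    root = Node("root")
    for string, categories in rules:
        node = root
        for category in categories:
            node = node.child(category)
        node.strings.append(string)
    return root


def iter_strings(node):
    """Yield every rule string stored at or below `node`."""
    yield from node.strings
    for child in node.children.values():
        yield from iter_strings(child)


def should_block(tree, url):
    """Return True if `url` hits a blacklist string and no whitelist string."""
    white = tree.children.get("white")
    if white and any(s in url for s in iter_strings(white)):
        return False
    black = tree.children.get("black")
    return bool(black) and any(s in url for s in iter_strings(black))


# Hypothetical second data: rule strings with assumed category labels.
RULES = [
    ("example.com/news", ("white",)),
    ("ads.example.net", ("black", "preset", "tagged", "preset_string")),
    ("/banner/", ("black", "positioning", "untagged", "first_string")),
]
tree = build_tree(RULES)
```

In this sketch the whitelist branch is consulted first, so a URL containing `example.com/news` is never blocked even if it also contains `/banner/`. A fuller implementation would presumably use the level categories to prune which branches are compared at all, rather than scanning every blacklist leaf.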
PCT/CN2019/106728 2018-09-27 2019-09-19 Information blocking method, device and terminal WO2020063448A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811132493.9 2018-09-27
CN201811132493.9A CN110955855B (en) 2018-09-27 2018-09-27 Information interception method, device and terminal

Publications (1)

Publication Number Publication Date
WO2020063448A1 true WO2020063448A1 (en) 2020-04-02

Family

ID=69951180

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/106728 WO2020063448A1 (en) 2018-09-27 2019-09-19 Information blocking method, device and terminal

Country Status (2)

Country Link
CN (1) CN110955855B (en)
WO (1) WO2020063448A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641911A (en) * 2021-08-19 2021-11-12 郑州阿帕斯数云信息科技有限公司 Method, device, equipment and storage medium for establishing advertisement interception rule base

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112073374B (en) * 2020-08-05 2023-03-24 长沙市到家悠享网络科技有限公司 Information interception method, device and equipment
CN117093777A (en) * 2023-08-22 2023-11-21 北京领雁科技股份有限公司 Method and device for intercepting browser page, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105100904A (en) * 2014-05-09 2015-11-25 深圳市快播科技有限公司 Video advertisement blocking method, device and browser
CN105824972A (en) * 2016-04-15 2016-08-03 广东欧珀移动通信有限公司 Method and device for blocking network advertisements
US20160261608A1 (en) * 2015-03-06 2016-09-08 International Business Machines Corporation Identifying malicious web infrastructures
CN107437026A (en) * 2017-07-13 2017-12-05 西北大学 A kind of malicious web pages commercial detection method based on advertising network topology
CN108170810A (en) * 2017-12-29 2018-06-15 南京邮电大学 A kind of commercial detection method based on dynamic behaviour

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9215212B2 (en) * 2009-06-22 2015-12-15 Citrix Systems, Inc. Systems and methods for providing a visualizer for rules of an application firewall
CN102332028B (en) * 2011-10-15 2013-08-28 西安交通大学 Webpage-oriented unhealthy Web content identifying method
JP6184857B2 (en) * 2013-12-17 2017-08-23 ケーディーアイコンズ株式会社 Information processing apparatus and program
CN106033450B (en) * 2015-03-17 2020-02-14 中兴通讯股份有限公司 Advertisement blocking method and device and browser
US20160335298A1 (en) * 2015-05-12 2016-11-17 Extreme Networks, Inc. Methods, systems, and non-transitory computer readable media for generating a tree structure with nodal comparison fields and cut values for rapid tree traversal and reduced numbers of full comparisons at leaf nodes
CN107193889A (en) * 2017-05-02 2017-09-22 努比亚技术有限公司 Ad blocking method, terminal and computer-readable recording medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641911A (en) * 2021-08-19 2021-11-12 郑州阿帕斯数云信息科技有限公司 Method, device, equipment and storage medium for establishing advertisement interception rule base
CN113641911B (en) * 2021-08-19 2024-03-08 郑州阿帕斯数云信息科技有限公司 Advertisement interception rule base establishing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110955855B (en) 2023-06-02
CN110955855A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
US11108807B2 (en) Performing rule-based actions for newly observed domain names
US9928301B2 (en) Classifying uniform resource locators
US10817663B2 (en) Dynamic native content insertion
CN109033358B (en) Method for associating news aggregation with intelligent entity
US8478701B2 (en) Locating a user based on aggregated tweet content associated with a location
CN102722563B (en) Method and device for displaying page
US8903800B2 (en) System and method for indexing food providers and use of the index in search engines
WO2015062366A1 (en) Webpage advertisement interception method, device, and browser
WO2020063448A1 (en) Information blocking method, device and terminal
US10311120B2 (en) Method and apparatus for identifying webpage type
CN102750352A (en) Method and device for classified collection of historical access records in browser
Mehta et al. A comparative study of various approaches to adaptive web scraping
Roumeliotis et al. An effective SEO techniques and technologies guide-map
Bakariya et al. An inclusive survey on data preprocessing methods used in web usage mining
Fatt et al. Phishdentity: Leverage website favicon to offset polymorphic phishing website
CN103577578B (en) A kind of tab file analysis method and device
CN107844537A (en) A kind of method and system of marking of web pages
CN111460307B (en) Mobile terminal accurate searching method and device
Akbar et al. Implementation of White Hat SEO Techniques to Improve Digital Promotion of Village Potentials Product (Case Study: Kebun Kelapa Village)
CN112597405A (en) Event external information source extraction method based on microblog platform
Roumeliotis et al. 2 SEO Techniques & On-Page Optimization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19868107

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19868107

Country of ref document: EP

Kind code of ref document: A1