CN110955855A - Information interception method, device and terminal - Google Patents

Information interception method, device and terminal

Info

Publication number
CN110955855A
Authority
CN
China
Prior art keywords
category
information
character string
level
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811132493.9A
Other languages
Chinese (zh)
Other versions
CN110955855B (en)
Inventor
付振中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Petal Cloud Technology Co Ltd
Original Assignee
Huawei Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Device Co Ltd filed Critical Huawei Device Co Ltd
Priority to CN201811132493.9A priority Critical patent/CN110955855B/en
Priority to PCT/CN2019/106728 priority patent/WO2020063448A1/en
Publication of CN110955855A publication Critical patent/CN110955855A/en
Application granted granted Critical
Publication of CN110955855B publication Critical patent/CN110955855B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the invention provides a terminal for intercepting information. The terminal may comprise a processor, a transceiver, a memory, and a plurality of applications, which cause the terminal to perform the following steps: starting a browser to access a webpage; acquiring information of the accessed webpage; matching the information of the accessed webpage against first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage includes target information; and, when the information of the accessed webpage includes the target information, intercepting the target information. In this scheme, the terminal intercepts the target information in the browser page by means of the tree-structured first data. The tree structure partitions the character strings in the first data in depth, which effectively reduces the number of times the information of the accessed webpage must be matched against the first data. This addresses the problem that the number of matches grows when there are many character strings for intercepting target information and no reasonable matching scheme is available.

Description

Information interception method, device and terminal
Technical Field
The embodiment of the invention relates to the technical field of webpage analysis and interception, in particular to a method, a device and a terminal for information interception.
Background
With the explosive growth of the internet, more and more web pages carry various advertisements. To spare users the inconvenience that advertisements cause while browsing web pages in a browser, the advertisements in the web pages need to be intercepted.
At present, a user's webpage access request is generally sent to a server for processing. While caching the page content, the server loads the EasyList rule list, hides advertisement elements according to the rule list, and then returns the page content with the advertisement elements hidden to the client for display. The EasyList rule list contains a plurality of character strings; it is an advertisement interception rule set published by an open-source organization, and it defines which elements in a webpage are advertisements and should be intercepted.
Disclosure of Invention
The embodiment of the invention provides an information interception method, an information interception device and a terminal. Based on an advertisement interception scheme implemented on the terminal and an optimized rule matching scheme, they address the problem that the number of matches grows when there are many advertisement interception rules and no reasonable matching scheme is available.
In a first aspect, an embodiment of the present invention provides a terminal for intercepting information, where the terminal may include: one or more processors, a transceiver, a memory, a plurality of applications, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the terminal, cause the terminal to perform the steps of:
starting a browser to access a webpage;
acquiring information of an access webpage;
matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information;
and when the target information is included in the information of the access webpage, intercepting the target information.
In this scheme, the terminal intercepts the target information in the browser page by means of the tree-structured first data. The tree structure partitions the character strings in the first data in depth, which effectively reduces the number of times the information of the accessed webpage must be matched against the first data. This addresses the problem that the number of matches grows when there are many character strings for intercepting target information and no reasonable matching scheme is available.
In an alternative implementation, the "tree structure" may include:
a plurality of nodes, wherein the plurality of nodes includes a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes;
the nodes of each level and the nodes of the next level related to each level have a parent-child relationship, and the first data are distributed on a plurality of nodes in a tree structure according to a preset rule.
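The tree structure described above can be sketched as a small data type. This is a minimal illustration, not the patented implementation: the class name `RuleNode`, its fields, and the sample strings are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RuleNode:
    """One node of the rule tree (hypothetical names, for illustration only)."""
    category: str                                      # string category held by this node
    strings: List[str] = field(default_factory=list)   # rule strings stored on the node
    children: List["RuleNode"] = field(default_factory=list)

    def add_child(self, child: "RuleNode") -> "RuleNode":
        """Attach `child` one level below this node and return it."""
        self.children.append(child)
        return child


def depth(node: RuleNode) -> int:
    """Number of child levels below `node` (0 for a leaf)."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


# A root node with one level of two child nodes, as the text requires.
root = RuleNode("root")
white = root.add_child(RuleNode("whitelist", ["example.com/adapter"]))
black = root.add_child(RuleNode("blacklist", ["/ads/", "adserver."]))
```

Each level holds at least two child nodes, and the rule strings are distributed over the nodes rather than kept in one flat list.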
In another alternative implementation, the terminal may specifically perform the following steps:
matching the information of the accessed webpage step by step, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information of the accessed webpage includes the target information.
Because the information of accessed webpages varies in length, longer information cannot be matched directly in a single step. Matching step by step allows the information of the accessed webpage to be matched completely, which improves the accuracy of intercepting the target information.
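Step-by-step matching from parent to child can be sketched as a recursive descent that only examines the children of nodes that already matched. The `Node` class, the substring test, and the sample strings below are assumptions made for illustration; the patent does not prescribe them.

```python
from typing import Iterable, List


class Node:
    """Hypothetical tree node: rule strings plus child nodes."""

    def __init__(self, strings: Iterable[str], children: Iterable["Node"] = ()):
        self.strings: List[str] = list(strings)
        self.children: List["Node"] = list(children)


def matches(url: str, node: Node) -> bool:
    """Match `url` level by level, descending only below nodes that hit."""
    if not any(s in url for s in node.strings):
        return False            # prune: no child of this node is examined
    if not node.children:
        return True             # reached a leaf: the URL is fully matched
    return any(matches(url, child) for child in node.children)


# Illustrative data only; these strings are not taken from the patent.
tree = Node(["ad"], [
    Node(["/ads/"], [Node(["banner"]), Node(["popup"])]),
    Node(["adserver."]),
])
```

With this data, `https://x.com/download` passes the first level (it contains "ad") but fails every second-level node, so it is not treated as target information.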
In another optional implementation manner, the tree structure may specifically include m levels of child nodes, each level of child nodes in the m levels of child nodes is divided according to different preset rules in n types of preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
the j-th-level child nodes are divided according to 1 preset rule selected from f preset rules, where the f preset rules are the preset rules remaining among the n preset rules after the first j-1 levels of child nodes have made their selections, the (j-1)-th level is the level immediately above the j-th level, the j-th level is any one of the m levels of child nodes, and j and f are integers greater than or equal to 1;
each of the n preset rules respectively comprises at least two character string categories;
the first data comprises a plurality of character strings, the character strings of the first data are divided according to m-level sub-nodes, each sub-node in the m-level sub-nodes corresponds to different character string categories in the n preset rules, and each sub-node comprises a plurality of character strings with different character string categories.
Because each terminal or operator has different definitions for the target information, the method and the system provide various preset rules and categories, the preset rules can be selected according to requirements, the flexibility of the tree structure can be improved, and the method and the system are suitable for more scenes.
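The selection of one unused preset rule per level can be sketched as follows. The rule names and the function `assign_rules` are hypothetical; the sketch only shows that level j chooses among the f = n - (j - 1) rules not yet used, which is why n must be at least m.

```python
from typing import List, Sequence, Tuple

# n = 4 preset rules; the identifiers are illustrative stand-ins.
RULES = ["black_white_list", "positioning_preset", "tag_attribute", "character"]


def assign_rules(all_rules: Sequence[str], picks: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Return (level j, number of rules f still available, chosen rule) per level.

    picks[j-1] is the rule chosen for level j and must not have been used above.
    """
    remaining = list(all_rules)
    levels = []
    for j, pick in enumerate(picks, start=1):
        f = len(remaining)                      # f = n - (j - 1)
        if pick not in remaining:
            raise ValueError(f"rule {pick!r} is unavailable at level {j}")
        levels.append((j, f, pick))
        remaining.remove(pick)                  # a rule is used by one level only
    return levels
```

For example, `assign_rules(RULES, RULES[:3])` builds m = 3 levels from n = 4 rules, with 4, 3 and 2 rules available at levels 1, 2 and 3 respectively.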
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and preset matching rules, tag attribute rules, or character rules.
In yet another alternative implementation, the "black and white list rule" may include:
a category of the white list and a category of the black list, where the level 1 child nodes among the m levels of child nodes are divided according to the black and white list rule, and the character strings in the first data belonging to the category of the white list and those belonging to the category of the black list each correspond to one of the level 1 child nodes.
In yet another alternative implementation, the terminal may perform the following steps:
and matching the information of the accessed webpage with the character string of the category of the white list, and when the information of the accessed webpage comprises the character string of the category of the white list, determining that the information of the accessed webpage does not comprise the target information by the terminal.
Some information of the accessed webpage may contain "ad" and yet not be target information for some operators. Setting character strings of the white list category excludes information that contains "ad" but is not target information (i.e., not an advertisement), which improves the accuracy of interception.
In yet another alternative implementation, the terminal may further perform the following steps:
when the information of the accessed webpage does not comprise the character string of the category of the white list, matching the information of the accessed webpage with the character string of the category of the black list;
when the information of the accessed webpage does not comprise the character string of the category of the blacklist, the terminal determines that the information of the accessed webpage does not comprise the target information;
when the information of the accessed webpage comprises the character strings of the category of the blacklist, the terminal matches the information of the accessed webpage with child nodes of the character strings belonging to the category of the blacklist in a parent-child relationship step by step until the information of the accessed webpage is completely matched, and the terminal intercepts target information in the information of the accessed webpage.
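The decision order described above (white list first, then black list, then descent into the child nodes of the matching black list string) can be sketched as below. The function name, the use of substring tests, and the convention that a black list string without children counts as a complete match are assumptions of this sketch.

```python
from typing import Dict, List


def should_intercept(url: str, whitelist: List[str],
                     blacklist_tree: Dict[str, List[str]]) -> bool:
    """Return True if `url` is judged to contain target information."""
    # Step 1: a white list hit ends matching at once (not target information).
    if any(s in url for s in whitelist):
        return False
    # Step 2: look for black list strings contained in the URL.
    for parent, children in blacklist_tree.items():
        if parent in url:
            # Step 3: descend to the child strings of the matching parent.
            # Assumption: a parent with no children is already a full match.
            if not children or any(c in url for c in children):
                return True
    # No black list string was fully matched.
    return False


# Illustrative data only.
wl = ["adapter"]
bt = {"/ads/": ["banner", "popup"], "adserver.": []}
```

A URL such as `https://a.com/adapter/ads/banner.png` is released by the white list check before any black list string is compared.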
In yet another alternative implementation, the "positioning and preset matching rule" may specifically include:
a category of positioning matching and a category of preset matching, where the level 2 child nodes among the m levels of child nodes are divided according to the positioning and preset matching rule, the character strings in the first data belonging to the category of positioning matching and those belonging to the category of preset matching each correspond to one of the level 2 child nodes, and any level 2 child node is in a parent-child relationship with the level 1 child node holding the character strings of the black list category.
In yet another alternative implementation manner, the "category of positioning matching" may be used to screen for at least one of: information of the accessed webpage in which a character string exists at a first preset position, or information in which a separator exists at a second preset position;
the category of preset matching is used to screen for at least one of: information of the accessed webpage that has a prefix, or information that has a suffix.
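One plausible reading of the two categories is sketched below with plain string operations. The helper names, the separator set, and the sample URL are assumptions; the patent does not specify which positions or separators are used.

```python
SEPARATORS = "/?&=:"   # assumed separator characters, for illustration only


def has_string_at(url: str, token: str, position: int) -> bool:
    """Positioning match: `token` occurs exactly at index `position`."""
    return position >= 0 and url.startswith(token, position)


def has_separator_at(url: str, position: int) -> bool:
    """Positioning match: a separator character sits at index `position`."""
    return 0 <= position < len(url) and url[position] in SEPARATORS


def has_prefix(url: str, prefix: str) -> bool:
    """Preset match on the beginning of the URL."""
    return url.startswith(prefix)


def has_suffix(url: str, suffix: str) -> bool:
    """Preset match on the end of the URL."""
    return url.endswith(suffix)


url = "https://ads.example.com/banner.js"
```

Checks of this kind are cheap because they inspect a fixed position instead of scanning the whole string.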
In yet another optional implementation manner, the "tag attribute rule" may specifically include:
a category with a tag and a category without a tag, where the level 3 child nodes among the m levels of child nodes are divided according to the tag attribute rule, the character strings in the first data belonging to the category with a tag and those belonging to the category without a tag each correspond to one of the level 3 child nodes, and any level 3 child node is in a parent-child relationship with one of the level 2 child nodes.
In yet another alternative implementation manner, the "category with a tag" may be used to screen for information of the accessed webpage that includes tag attribute information, and the category without a tag is used to screen for information of the accessed webpage that does not include tag attribute information; wherein
the category with a tag specifically includes at least one of the following distinct categories: a host-name-only category, a category of host information with advertisement attributes only, a two-level host and domain name classification category, a host and advertisement Uniform Resource Locator (URL) information category, or a domain-name-and-advertisement-URL-information-only category.
Due to the fact that the information for accessing the webpage is diversified, the method can provide more possibilities and can intercept the target information more accurately.
In yet another alternative implementation, the "character rule" may include:
the type of the first character string and the type of the preset character string, the 4 th-level child node in the m-level child nodes is divided according to character rules, the character string belonging to the type of the first character string and the character string belonging to the type of the preset character string in the first data respectively correspond to one child node in the 4 th-level child nodes, and any one child node in the 4 th-level child nodes and one child node in the 3 rd-level child nodes are in a parent-child relationship.
In yet another alternative implementation, the "category of the first character string" may be used to screen for information of the accessed webpage whose first character is the same as that of a character string in the category of the first character string;
the category of the preset character string is used to screen for information of the accessed webpage that contains a preset character string of that category.
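Screening by first character can be sketched as an index that buckets rule strings by their first character, so that only one bucket is compared against a given piece of text. The function names and sample strings are assumptions made for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List


def build_first_char_index(strings: Iterable[str]) -> DefaultDict[str, List[str]]:
    """Group rule strings by their first character."""
    index: DefaultDict[str, List[str]] = defaultdict(list)
    for s in strings:
        if s:                       # skip empty rule strings
            index[s[0]].append(s)
    return index


def candidates(index: DefaultDict[str, List[str]], text: str) -> List[str]:
    """Rule strings sharing the first character of `text`."""
    return index.get(text[:1], [])


def match_at_start(index: DefaultDict[str, List[str]], text: str) -> bool:
    """True if some rule string in the matching bucket begins `text`."""
    return any(text.startswith(s) for s in candidates(index, text))


idx = build_first_char_index(["ads.", "adserver", "banner", "track"])
```

Text beginning with "c" is rejected without comparing it against any rule string, since no bucket exists for that character.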
In yet another alternative implementation, the "information of the accessed webpage" may include the URL of the page accessed by the user or the URL of each element of that page, and the target information is advertisement information.
In another optional implementation manner, the "first data" is obtained after the server performs tree-form conversion processing on second data, where the second data includes effective character strings and custom character strings of the browser. An effective character string is a character string whose usage rate is determined to be greater than a preset threshold by screening the open-source character strings from an open-source website against historical data reported by terminals within a preset time period.
Because the terminal downloads the first data from the server and the whole matching process is carried out in the terminal, the method greatly improves the speed at which the terminal matches information, and it solves the prior-art problem that page content can be processed quickly only by a high-performance server.
In a second aspect, an embodiment of the present invention provides a data processing server, including: one or more processors, transceivers, and memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the server, cause the server to perform the steps of:
performing tree-shaped conversion processing on the second data to determine first data;
the server sends the first data to the terminal so that the terminal can determine, according to the first data, whether the accessed webpage contains the target information.
In this scheme, the second data undergoes tree-form conversion processing: the tree structure partitions the character strings in the second data in depth and converts them into a highly discriminative tree structure, which effectively reduces the number of times the information of the accessed webpage must be matched against the first data.
In an alternative implementation, the "target information" may be advertisement information;
the information of the accessed webpage includes at least one of: the URL of the page accessed by the user, or the URL of each element of that page.
In another alternative implementation, the server may perform the following steps: periodically acquiring at least one open source character string from an open source website;
selecting a plurality of character strings with the access quantity larger than a first threshold value from at least one open source character string and historical data reported by a client in a preset time period as effective character strings;
acquiring a custom character string of a browser server;
and determining second data according to the effective character string and the custom character string, wherein the effective character string and the custom character string respectively comprise at least one character string.
Different browser servers generally apply different criteria; for example, the same information may be defined as advertisement information at site A but not at site B. The custom character strings of the browser server are therefore added when generating the second data, so that the second data used for matching is flexible and can be widely used.
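The generation of the second data can be sketched as a filter followed by a merge. The function name, the shape of the reported history (one entry per reported hit of a rule string), and the threshold value are assumptions of this sketch.

```python
from collections import Counter
from typing import List, Sequence


def build_second_data(open_source: Sequence[str], history_hits: Sequence[str],
                      custom: Sequence[str], threshold: int) -> List[str]:
    """Keep open-source strings hit more than `threshold` times, then add custom ones."""
    counts = Counter(history_hits)                 # hits reported by clients
    effective = [s for s in open_source if counts[s] > threshold]
    # Merge effective and custom strings, keeping order and dropping duplicates.
    return list(dict.fromkeys(list(effective) + list(custom)))


# Illustrative data only.
open_source = ["/ads/", "adserver.", "/rare/"]
history = ["/ads/"] * 5 + ["adserver."] * 3 + ["/rare/"]
custom = ["promo.", "/ads/"]
second_data = build_second_data(open_source, history, custom, threshold=2)
```

Here the rarely hit string "/rare/" is dropped, which keeps the data sent to the terminal small, while the browser server's own string "promo." is retained regardless of hit count.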
In yet another alternative implementation, the server may perform the following steps:
dividing a plurality of sub-nodes into m levels according to n preset rules, wherein the preset rules of each level in the m levels of sub-nodes are different;
each of the n preset rules respectively comprises at least two character string categories, and each layer in the m levels is divided into at least two sub-nodes according to the character string categories;
the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers which are more than or equal to 1, and n is more than or equal to m;
each child node in the kth-level child node has a parent-child relationship with one child node in the k-1 level, the k-level child node is any one level child node in the m-level child nodes, and k is an integer greater than or equal to 1.
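The tree-form conversion can be sketched as routing each character string through one classifier per level. The nested-dictionary layout, the `"_strings"` key, and the two classifier functions are hypothetical stand-ins; in particular, treating a leading `@@` as a white list marker and `|` or `^` as positioning markers is an assumption borrowed from common ad-blocking filter syntax, not something stated in the patent.

```python
from typing import AbstractSet, Callable, Dict, Sequence

Rule = Callable[[str], str]   # maps a character string to its category at one level


def build_tree(strings: Sequence[str], level_rules: Sequence[Rule],
               leaf_categories: AbstractSet[str] = frozenset()) -> Dict:
    """Place each string under one category per level; stop early at leaf categories."""
    tree: Dict = {}
    for s in strings:
        node = tree
        for rule in level_rules:
            category = rule(s)
            node = node.setdefault(category, {})
            if category in leaf_categories:
                break                   # e.g. white list strings are not subdivided
        node.setdefault("_strings", []).append(s)
    return tree


def by_list(s: str) -> str:
    """Hypothetical level 1 rule."""
    return "whitelist" if s.startswith("@@") else "blacklist"


def by_anchor(s: str) -> str:
    """Hypothetical level 2 rule."""
    return "positioning" if "|" in s or "^" in s else "preset"


tree = build_tree(["@@good.com/ads", "|ads.", "banner"],
                  [by_list, by_anchor], {"whitelist"})
```

Passing `{"whitelist"}` as `leaf_categories` mirrors the arrangement in which the level 2 nodes hang only below the black list node.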
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and preset matching rules, tag attribute rules, or character rules;
the server performs the following steps:
dividing the plurality of child nodes into m levels of child nodes according to the black and white list rule, the positioning and preset matching rule, the tag attribute rule and the character rule.
In yet another alternative implementation, the server may perform the following steps:
when the black-and-white list rule comprises the category of the white list and the category of the black list, dividing the level 1 of the m-level child nodes into two child nodes according to the category of the white list and the category of the black list, wherein one child node of the two child nodes comprises a character string belonging to the category of the white list in the second data, and the other child node comprises a character string belonging to the category of the black list in the second data.
In yet another alternative implementation, the server may perform the following steps:
when the positioning and preset matching rule comprises a category of positioning matching and a category of preset matching, dividing level 2 of the m levels of child nodes into two child nodes according to the category of positioning matching and the category of preset matching, wherein one of the two child nodes comprises the character strings in the second data belonging to the category of positioning matching, the other child node comprises the character strings in the second data belonging to the category of preset matching, and both level 2 child nodes are in a parent-child relationship with the level 1 node holding the character strings of the black list category.
In yet another alternative implementation, the server may perform the following steps: when the label attribute rule comprises a category with a label and a category without the label, dividing the level 3 of the m-level child nodes into two child nodes according to the category with the label and the category without the label, wherein one child node of the two child nodes comprises a character string belonging to the category with the label in the second data, the other child node comprises a character string belonging to the category without the label in the second data, and any child node in the level 3 and one child node in the level 2 child node are in a parent-child relationship.
In yet another alternative implementation, the "category with a tag" may include at least one of the following distinct categories: a host-name-only category, a category of host information with advertisement attributes only, a two-level host and domain name classification category, a host and advertisement Uniform Resource Locator (URL) information category, or a domain-name-and-advertisement-URL-information-only category.
In yet another alternative implementation, the server may perform the following steps:
when the character rule comprises a category of the first character string and a category of the preset character string, dividing level 4 of the m levels of child nodes into two child nodes according to the category of the first character string and the category of the preset character string, wherein one of the two child nodes comprises the character strings in the second data belonging to the category of the first character string, the other child node comprises the character strings in the second data belonging to the category of the preset character string, and any level 4 child node is in a parent-child relationship with one of the level 3 child nodes.
In a third aspect, an embodiment of the present invention provides an information interception method, where the method may be executed based on a terminal, and the method may include the following steps:
starting a browser to access a webpage;
acquiring information of an access webpage;
matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information;
and when the target information is included in the information of the access webpage, intercepting the target information.
In this scheme, the target information in the browser page is intercepted by means of the tree-structured first data. The tree structure partitions the character strings in the first data in depth, which effectively reduces the number of times the information of the accessed webpage must be matched against the first data. This addresses the problem that the number of matches grows when there are many character strings for intercepting target information and no reasonable matching scheme is available, and the overall matching speed can be improved by more than 40%.
In an alternative implementation, the tree structure may include a plurality of nodes, where the plurality of nodes includes a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes;
the nodes of each level and the nodes of the next level related to each level have a parent-child relationship, and the first data are distributed on a plurality of nodes in a tree structure according to a preset rule.
In another optional implementation manner, in the step of "matching the information of the accessed webpage with the first data arranged in the tree structure", the method may specifically include:
matching the information of the accessed webpage step by step, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information of the accessed webpage includes the target information.
Because the information of accessed webpages varies in length, longer information cannot be matched directly in a single step. Matching step by step allows the information of the accessed webpage to be matched completely, which improves the accuracy of intercepting the target information.
In yet another alternative implementation, the "tree structure" may include:
each level of sub-nodes in the m levels of sub-nodes is divided according to different preset rules in n preset rules, n and m are integers which are more than or equal to 1, and n is more than or equal to m;
the j-th-level child nodes are divided according to 1 preset rule selected from f preset rules, where the f preset rules are the preset rules remaining among the n preset rules after the first j-1 levels of child nodes have made their selections, the (j-1)-th level is the level immediately above the j-th level, the j-th level is any one of the m levels of child nodes, and j and f are integers greater than or equal to 1;
each of the n preset rules respectively comprises at least two character string categories;
the first data comprises a plurality of character strings, the character strings of the first data are divided according to m-level sub-nodes, each sub-node in the m-level sub-nodes corresponds to different character string categories in the n preset rules, and each sub-node comprises a plurality of character strings with different character string categories.
Because each terminal or operator has different definitions for the target information, the method and the system provide various preset rules and categories, the preset rules can be selected according to requirements, the flexibility of the tree structure can be improved, and the method and the system are suitable for more scenes.
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and preset matching rules, tag attribute rules, or character rules.
In yet another optional implementation manner, the "black-and-white list rule" may include a category of a white list and a category of a black list, the level 1 child node in the m-level child nodes is divided according to the black-and-white list rule, and the character string belonging to the category of the white list and the character string belonging to the category of the black list in the first data correspond to one child node in the level 1 child node respectively.
In yet another optional implementation manner, in the step of "matching the information of the accessed webpage with the first data arranged in the tree structure", the method may specifically include:
and matching the information of the accessed webpage with the character string of the category of the white list, and determining that the information of the accessed webpage does not include the target information when the information of the accessed webpage includes the character string of the category of the white list.
Some information of the accessed webpage may contain "ad" and yet not be target information for some operators. Setting character strings of the white list category excludes information that contains "ad" but is not target information (i.e., not an advertisement), which improves the accuracy of interception.
In yet another optional implementation manner, the step of "matching the information of the accessed webpage with the first data arranged in the tree structure" may specifically include:
when the information of the accessed webpage does not include a character string of the white list category, matching the information of the accessed webpage against the character strings of the black list category;
when the information of the accessed webpage does not include the character string of the category of the blacklist, determining that the information of the accessed webpage does not include the target information;
when the information of the accessed webpage comprises the character strings of the category of the blacklist, matching the information of the accessed webpage with child nodes of the character strings belonging to the category of the blacklist in a parent-child relationship step by step until the information of the accessed webpage is completely matched, and intercepting target information in the information of the accessed webpage.
In yet another optional implementation manner, the "positioning and preset matching rule" may specifically include a category of positioning matching and a category of preset matching, a level 2 child node in the m-level child nodes is divided according to the positioning and preset matching rule, a character string belonging to the category of positioning matching and a character string belonging to the category of preset matching in the first data correspond to one child node in the level 2 child nodes, respectively, where any one child node in the level 2 child node and a child node of a character string belonging to the category of the blacklist in the level 1 child node are in a parent-child relationship.
In yet another alternative implementation manner, the "category of location matching" may be used to filter at least one of information that a character string exists at a first preset location in the information for accessing the web page, or information that a separator exists at a second preset location;
the preset matched category is used for screening at least one of information with prefixes or information with suffixes in the information for accessing the webpage.
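As a rough illustration of the two level-2 categories, the sketch below checks for a character string at a preset position, a separator at a preset position, and a prefix or suffix. The function names, the separator set "/?&=", and the use of zero-based indices are assumptions made for the example.

```python
from typing import Optional


def matches_at_position(url: str, token: str, position: int) -> bool:
    """Location matching: `token` must start at index `position` of the URL."""
    return position >= 0 and url.startswith(token, position)


def has_separator_at(url: str, position: int, separators: str = "/?&=") -> bool:
    """Location matching: a separator character must sit at index `position`."""
    return 0 <= position < len(url) and url[position] in separators


def matches_prefix_or_suffix(
    url: str, prefix: Optional[str] = None, suffix: Optional[str] = None
) -> bool:
    """Preset matching: the URL carries the given prefix or the given suffix."""
    by_prefix = prefix is not None and url.startswith(prefix)
    by_suffix = suffix is not None and url.endswith(suffix)
    return by_prefix or by_suffix
```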
In yet another optional implementation manner, the "tag attribute rule" may include a category with a tag and a category without a tag, a 3 rd-level child node in the m-level child nodes is divided according to the tag attribute rule, a character string belonging to the category with a tag and a character string belonging to the category without a tag in the first data correspond to one child node in the 3 rd-level child node, respectively, where any one child node in the 3 rd-level child node is in a parent-child relationship with one child node in the 2 nd-level child node.
In yet another optional implementation manner, the "category with a tag" may be used to filter information of the accessed web page that includes information of a tag attribute, and the category without a tag is used to filter information of the accessed web page that does not include information of a tag attribute;
the category with a tag specifically includes at least one of: a category with only a host name, a category with only host information of advertisement attributes, a two-level classification category of host and domain name, a category of host and advertisement Uniform Resource Locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
Due to the fact that the information for accessing the webpage is diversified, the method can provide more possibilities and can intercept the target information more accurately.
In yet another alternative implementation manner, the "character rule" may include a category of the first character string and a category of the preset character string, the level 4 child nodes in the m-level child nodes are divided according to the character rule, the character string belonging to the category of the first character string and the character string belonging to the category of the preset character string in the first data correspond to one child node in the level 4 child nodes, respectively, where any one child node in the level 4 child node is in a parent-child relationship with one child node in the level 3 child node.
In yet another alternative implementation, the "category of the first character string" may be used to screen for information for accessing the web page whose first character is the same as the first character of a character string of the category of the first character string;
the category of the preset character string is used to screen for information for accessing the web page that contains the same content as a preset character string of the category of the preset character string.
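One possible reading of the first-character category is an index keyed by first character, so that a piece of the URL is compared only against character strings that start with the same character. The sketch below assumes that reading; the helper names and the prefix comparison are illustrative choices, not taken from the described method.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_first_char_index(strings: List[str]) -> Dict[str, List[str]]:
    """Group character strings by their first character."""
    index: Dict[str, List[str]] = defaultdict(list)
    for s in strings:
        if s:  # skip empty strings, which have no first character
            index[s[0]].append(s)
    return dict(index)


def match_by_first_char(segment: str, index: Dict[str, List[str]]) -> Optional[str]:
    """Compare `segment` only with strings sharing its first character."""
    if not segment:
        return None
    for candidate in index.get(segment[0], []):
        if segment.startswith(candidate):
            return candidate
    return None
```

Grouping by first character means a lookup touches one bucket instead of the whole list, which is in line with the stated goal of reducing the number of matches.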
In yet another alternative implementation, the "information for accessing a web page" may include a URL of a page accessed by a user or a URL of each element of the web page, and the target information is advertisement information.
In yet another optional implementation manner, the "first data" is obtained after the server performs tree-form conversion processing according to second data, where the second data includes an effective character string and a custom character string of the browser, and the effective character string is a character string whose usage rate is determined, by screening open source character strings from an open source website and reported historical data within a preset time period, to be greater than a preset threshold value.
In a fourth aspect, an embodiment of the present invention provides a data processing method, where the method may be executed by a server, and the method may specifically include the following steps:
performing tree-shaped conversion processing on the second data to determine first data;
and sending the first data to the terminal, so that the terminal can determine, according to the first data, whether the accessed webpage contains the target information.
According to the scheme, through tree-shaped conversion processing of the second data, the character strings in the second data can be deeply distinguished through the tree-shaped structure, the character strings are converted into the tree-shaped structure with high distinguishing degree, and the matching times of the information of the access webpage and the first data are effectively reduced.
In an alternative implementation, the "target information" may be advertisement information;
the information for accessing the web page includes at least one of: the URL of the page accessed by the user, or the URL of each element of the page.
In another optional implementation manner, before the step of performing tree transformation processing on the second data and determining the first data, the method may further include: periodically acquiring at least one open source character string from an open source website;
selecting a plurality of character strings with the access quantity larger than a first threshold value from at least one open source character string and historical data reported by a terminal in a preset time period as effective character strings;
acquiring a custom character string of a browser server;
and determining second data according to the effective character string and the custom character string, wherein the effective character string and the custom character string respectively comprise at least one character string.
Since each browser server generally has different criteria, that is, the target information may be defined as advertisement information at site A but not at site B, the custom character string of the browser server is added when generating the second data, so that the second data is flexible and can be widely used.
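The assembly of the second data can be sketched as below. The sketch assumes that the historical data is a list of reported hit records (one entry per access of a character string) and that "access quantity" is the count of such records; the function name `build_second_data` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_second_data(
    open_source_strings: Iterable[str],
    reported_history: Iterable[str],
    custom_strings: Iterable[str],
    first_threshold: int,
) -> List[str]:
    """Combine effective strings and browser-server custom strings."""
    # Count how often each string was reported within the preset period.
    usage = Counter(reported_history)
    # Candidates come from the open source list and from the reported data.
    candidates = dict.fromkeys(list(open_source_strings) + list(usage))
    # Effective strings: access quantity strictly above the first threshold.
    valid = [s for s in candidates if usage[s] > first_threshold]
    # Append the custom strings, dropping duplicates but keeping order.
    return list(dict.fromkeys(valid + list(custom_strings)))
```

An open source string that never appears in the reported data has a count of zero and is therefore dropped, which keeps rarely used strings out of the tree.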
In yet another optional implementation manner, in the step of "performing tree transformation processing on the second data and determining the first data", the method may specifically include: dividing a plurality of sub-nodes into m levels according to n preset rules, wherein the preset rules of each level in the m levels of sub-nodes are different;
each of the n preset rules respectively comprises at least two character string categories, and each layer in the m levels is divided into at least two sub-nodes according to the character string categories;
the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers which are more than or equal to 1, and n is more than or equal to m;
each child node in the kth-level child node has a parent-child relationship with one child node in the k-1 level, the k-level child node is any one level child node in the m-level child nodes, and k is an integer greater than or equal to 1.
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and preset matching rules, label attribute rules or character rules;
the server performs the following step:
dividing the plurality of sub-nodes into m-level sub-nodes according to the black and white list rule, the positioning and preset matching rule, the label attribute rule and the character rule.
In yet another optional implementation manner, in the step of "performing tree transformation processing on the second data and determining the first data", the method may specifically include:
when the black-and-white list rule comprises the category of the white list and the category of the black list, dividing the level 1 of the m-level child nodes into two child nodes according to the category of the white list and the category of the black list, wherein one child node of the two child nodes comprises a character string belonging to the category of the white list in the second data, and the other child node comprises a character string belonging to the category of the black list in the second data.
In yet another optional implementation manner, in the step of "performing tree transformation processing on the second data and determining the first data", the method may specifically include:
when the positioning and preset matching rules comprise the positioning matching category and the preset matching category, dividing the level 2 of the m-level child nodes into two child nodes according to the positioning matching category and the preset matching category, wherein one child node of the two child nodes comprises a character string belonging to the positioning matching category in the second data, the other child node comprises a character string belonging to the preset matching category in the second data, and the two child nodes in the level 2 and the node where the character string belonging to the blacklist category in the level 1 are located are in a parent-child relationship.
In yet another optional implementation manner, in the step of "performing tree transformation processing on the second data and determining the first data", the method may specifically include:
when the label attribute rule comprises a category with a label and a category without the label, dividing the level 3 of the m-level child nodes into two child nodes according to the category with the label and the category without the label, wherein one child node of the two child nodes comprises a character string belonging to the category with the label in the second data, the other child node comprises a character string belonging to the category without the label in the second data, and any child node in the level 3 and one child node in the level 2 child node are in a parent-child relationship.
In yet another optional implementation manner, the "category with a tag" may specifically include at least one of: a category with only a host name, a category with only host information of advertisement attributes, a two-level classification category of host and domain name, a category of host and advertisement Uniform Resource Locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
In yet another optional implementation manner, in the step of "performing tree transformation processing on the second data and determining the first data", the method may specifically include:
when the character rule comprises the category of the first character string and the category of the preset character string, dividing the level 4 of the m-level child nodes into two child nodes according to the category of the first character string and the category of the preset character string, wherein one of the two child nodes comprises the character string in the second data, the character string belongs to the category of the first character string, the other child node comprises the character string in the second data, the character string belongs to the category of the preset character string, and any one of the child nodes in the level 4 and one of the child nodes in the level 3 are in a parent-child relationship.
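The level-by-level division described above can be sketched as a recursive grouping, where each preset rule is a function that assigns a character string to one category. Everything concrete in the sketch is an assumption: the function name `build_tree`, the nested-dict representation, the `stop_at` parameter (used so that the white list branch is not subdivided), and the marker characters used by the example classifiers in the usage note.

```python
from typing import Callable, Dict, FrozenSet, List, Union

Rule = Callable[[str], str]  # maps a character string to a category name
Tree = Union[List[str], Dict[str, "Tree"]]


def build_tree(
    strings: List[str], rules: List[Rule], stop_at: FrozenSet[str] = frozenset()
) -> Tree:
    """Divide `strings` level by level, one preset rule per level."""
    if not rules:
        return list(strings)  # leaf node: the strings themselves
    first, rest = rules[0], rules[1:]
    groups: Dict[str, List[str]] = {}
    for s in strings:
        groups.setdefault(first(s), []).append(s)
    # Each category becomes a child node; categories in `stop_at` stay leaves.
    return {
        category: list(members)
        if category in stop_at
        else build_tree(members, rest, stop_at)
        for category, members in groups.items()
    }
```

A usage example with three hypothetical classifiers (a "@@" prefix for white list entries, "^" for location matching, "$" for tagged entries) could look like this:

```python
black_white = lambda s: "white" if s.startswith("@@") else "black"
location_or_preset = lambda s: "location" if "^" in s else "preset"
tagged_or_not = lambda s: "tagged" if "$" in s else "untagged"

tree = build_tree(
    ["@@||good.example^", "||ads.example^", "/banner/", "/track.js$script"],
    [black_white, location_or_preset, tagged_or_not],
    stop_at=frozenset({"white"}),
)
```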
In a fifth aspect, an embodiment of the present invention provides an apparatus, where the apparatus may include:
the processing module is used for starting a browser to access a webpage;
the receiving and sending module is used for acquiring information of the access webpage;
the processing module is further used for matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information; and when the target information is included in the information of the access webpage, intercepting the target information.
In this scheme, the device intercepts the target information in the browser page by using the first data arranged in a tree structure. The tree structure can finely distinguish the character strings in the first data and effectively reduces the number of times the information for accessing the webpage is matched against the first data. This solves the problem that the number of matches increases because a large number of character strings for intercepting the target information are not matched in a reasonable manner, and the matching speed can be improved by more than 40%.
In an alternative implementation, the tree structure may include a plurality of nodes, where the plurality of nodes includes a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes;
the nodes of each level and the nodes of the next level related to each level have a parent-child relationship, and the first data are distributed on a plurality of nodes in a tree structure according to a preset rule.
In another alternative implementation, the "processing module" may be specifically configured to match the information of the accessed webpage from the first data of the parent node of the tree structure to the first data of the child node in parent-child relationship with the parent node step by step until it is determined whether the information of the accessed webpage includes the target information.
Because the information of accessed webpages varies in length, longer information cannot be matched directly in a single step. Matching the information of the accessed webpage step by step allows it to be matched completely, which improves the accuracy of intercepting the target information.
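Step-by-step matching from a parent node to its child nodes can be sketched as a depth-first descent. The `Node` class, its `accepts` predicate, and the substring test at the end are assumptions introduced for illustration; the described scheme does not prescribe this representation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    name: str
    accepts: Callable[[str], bool]  # category test applied at this node
    strings: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def match_step_by_step(
    url: str, node: Node, path: Tuple[str, ...] = ()
) -> Optional[Tuple[Tuple[str, ...], str]]:
    """Descend from parent to child; return (path, matched string) or None."""
    if not node.accepts(url):
        return None  # this branch cannot match, so its subtree is skipped
    path = path + (node.name,)
    for child in node.children:
        found = match_step_by_step(url, child, path)
        if found is not None:
            return found
    # No child matched: try the strings stored on this node itself.
    for s in node.strings:
        if s in url:
            return path, s
    return None
```

Because a failed category test prunes the whole subtree, the URL is compared only with the character strings along the branches it could belong to.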
In yet another optional implementation manner, the tree structure may include m levels of child nodes, each level of child nodes in the m levels of child nodes is divided according to different preset rules in n types of preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
the j-th-level child nodes are divided according to 1 preset rule selected from f preset rules, where the f preset rules are the preset rules remaining among the n preset rules after preset rules have been selected for the first j-1 levels of child nodes, the (j-1)-th-level child nodes are the level immediately above the j-th-level child nodes, the j-th-level child nodes are any one level of child nodes in the m-level child nodes, and j and f are integers greater than or equal to 1;
each of the n preset rules respectively comprises at least two character string categories;
the first data comprises a plurality of character strings, the character strings of the first data are divided according to m-level sub-nodes, each sub-node in the m-level sub-nodes corresponds to different character string categories in the n preset rules, and each sub-node comprises a plurality of character strings with different character string categories.
Because each terminal or operator has different definitions for the target information, the method and the system provide various preset rules and categories, the preset rules can be selected according to requirements, the flexibility of the tree structure can be improved, and the method and the system are suitable for more scenes.
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and preset matching rules, label attribute rules or character rules.
In yet another optional implementation manner, the "black-and-white list rule" may include a category of a white list and a category of a black list, the level 1 child node in the m-level child nodes is divided according to the black-and-white list rule, and the character string belonging to the category of the white list and the character string belonging to the category of the black list in the first data correspond to one child node in the level 1 child node respectively.
In yet another alternative implementation, the processing module may be specifically configured to match the information of the accessed web page with a character string of a category of a white list, and determine that the information of the accessed web page does not include the target information when the information of the accessed web page includes the character string of the category of the white list.
Some information for accessing the web page may carry "ad" without being target information for some operators. Setting character strings of the white list category excludes information that carries "ad" but is not target information (i.e., not an advertisement), which improves the accuracy of interception.
In yet another optional implementation manner, the processing module may be specifically configured to, when the information on the accessed webpage does not include a character string of a category of the white list, match the information on the accessed webpage with a character string of a category of the black list;
when the information of the accessed webpage does not include the character string of the category of the blacklist, determining that the information of the accessed webpage does not include the target information;
when the information of the accessed webpage comprises the character strings of the category of the blacklist, matching the information of the accessed webpage with child nodes of the character strings belonging to the category of the blacklist in a parent-child relationship step by step until the information of the accessed webpage is completely matched, and intercepting target information in the information of the accessed webpage.
In yet another optional implementation manner, the "positioning and preset matching rule" may include a category of positioning matching and a category of preset matching, a level 2 child node in the m-level child nodes is divided according to the positioning and preset matching rule, a character string belonging to the category of positioning matching and a character string belonging to the category of preset matching in the first data correspond to one child node in the level 2 child nodes, respectively, and any one child node in the level 2 child nodes and a child node of a character string belonging to the category of the blacklist in the level 1 child node are in a parent-child relationship.
In yet another alternative implementation manner, the "category of location matching" may be used to filter at least one of information that a character string exists at a first preset location in the information for accessing the web page, or information that a separator exists at a second preset location;
the preset matched category is used for screening at least one of information with prefixes or information with suffixes in the information for accessing the webpage.
In yet another optional implementation manner, the "tag attribute rule" may include a category with a tag and a category without a tag, a 3 rd-level child node in the m-level child nodes is divided according to the tag attribute rule, a character string belonging to the category with a tag and a character string belonging to the category without a tag in the first data correspond to one child node in the 3 rd-level child node, respectively, where any one child node in the 3 rd-level child node is in a parent-child relationship with one child node in the 2 nd-level child node.
In yet another optional implementation manner, the "category with a tag" may be used to filter information of the accessed web page that includes information of a tag attribute, and the category without a tag is used to filter information of the accessed web page that does not include information of a tag attribute;
the category with a tag specifically includes at least one of: a category with only a host name, a category with only host information of advertisement attributes, a two-level classification category of host and domain name, a category of host and advertisement Uniform Resource Locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
Due to the fact that the information for accessing the webpage is diversified, the method can provide more possibilities and can intercept the target information more accurately.
In yet another alternative implementation manner, the "character rule" may include a category of the first character string and a category of the preset character string, the level 4 child nodes in the m-level child nodes are divided according to the character rule, the character string belonging to the category of the first character string and the character string belonging to the category of the preset character string in the first data correspond to one child node in the level 4 child nodes, respectively, where any one child node in the level 4 child node is in a parent-child relationship with one child node in the level 3 child node.
In yet another alternative implementation, the "category of the first character string" may be used to screen for information for accessing the web page whose first character is the same as the first character of a character string of the category of the first character string;
the category of the preset character string is used to screen for information for accessing the web page that contains the same content as a preset character string of the category of the preset character string.
In yet another alternative implementation, the "information for accessing a web page" may include a URL of a page accessed by a user or a URL of each element of the web page, and the target information is advertisement information.
In yet another optional implementation manner, the "first data" may be obtained after the server performs tree-form conversion processing according to second data, where the second data includes an effective character string and a custom character string of the browser, where the effective character string is a character string whose usage rate is determined to be greater than a preset threshold value by screening an open source character string in an open source website and reported history data within a preset time period.
In a sixth aspect, an embodiment of the present invention provides an apparatus for data processing, where the apparatus includes:
the processing module is used for performing tree-shaped conversion processing on the second data and determining first data;
and the transceiver module is used for transmitting the first data to the terminal so that the terminal can determine whether the accessed webpage contains the target information.
According to the scheme, through tree-shaped conversion processing of the second data, the character strings in the second data can be deeply distinguished through the tree-shaped structure, the character strings are converted into the tree-shaped structure with high distinguishing degree, and the matching times of the information of the access webpage and the first data are effectively reduced.
In an alternative implementation, the "target information" may be advertisement information;
the information for accessing the web page includes at least one of: the URL of the page accessed by the user, or the URL of each element of the page.
In another optional implementation manner, the "transceiver module" may be further configured to periodically acquire at least one open source character string from an open source website;
the processing module may be further configured to select, as an effective string, a plurality of strings whose access amounts are greater than a first threshold from the at least one open source string and historical data reported by the client within a preset time period;
the transceiver module can also be used for acquiring a custom character string of the browser server;
the processing module may be further configured to determine the second data according to the valid character string and the custom character string, where the valid character string and the custom character string each include at least one character string.
Since each browser server generally has different criteria, that is, the target information may be defined as advertisement information at site A but not at site B, the custom character string of the browser server is added when generating the second data, so that the second data is flexible and can be widely used.
In yet another optional implementation manner, the processing module may be specifically configured to divide the plurality of sub-nodes into m levels according to n preset rules, where the preset rules of each level of the m levels of sub-nodes are different;
each of the n preset rules respectively comprises at least two character string categories, and each layer in the m levels is divided into at least two sub-nodes according to the character string categories;
the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers which are more than or equal to 1, and n is more than or equal to m;
each child node in the kth-level child node has a parent-child relationship with one child node in the k-1 level, the k-level child node is any one level child node in the m-level child nodes, and k is an integer greater than or equal to 1.
In another alternative implementation, the "n preset rules" may include at least one of the following rules: black and white list rules, positioning and preset matching rules, label attribute rules or character rules;
the processing module can be further used for dividing the plurality of sub-nodes into m-level sub-nodes according to the black and white list rule, the positioning and preset matching rule, the label attribute rule and the character rule.
In yet another optional implementation manner, the "processing module" may be specifically configured to, when the black-and-white list rule includes a category of a white list and a category of a black list, divide level 1 of the m-level child nodes into two child nodes according to the category of the white list and the category of the black list, where one of the two child nodes includes a character string in the second data, the character string belonging to the category of the white list, and the other child node includes a character string in the second data, the character string belonging to the category of the black list.
In yet another optional implementation manner, the "processing module" may be specifically configured to, when the positioning and preset matching rule includes a category of positioning matching and a category of preset matching, divide level 2 of the m-level child nodes into two child nodes according to the category of positioning matching and the category of preset matching, where one child node of the two child nodes includes a character string in the second data that belongs to the category of positioning matching, and the other child node includes a character string in the second data that belongs to the category of preset matching, and where the two child nodes in level 2 and a node where the character string in the category of the blacklist in level 1 is located are in a parent-child relationship.
In yet another optional implementation manner, the "processing module" may be specifically configured to, when the tag attribute rule includes a category with a tag and a category without a tag, divide level 3 of the m-level child nodes into two child nodes according to the category with a tag and the category without a tag, where one of the two child nodes includes a character string in the second data that belongs to the category with a tag, and the other child node includes a character string in the second data that belongs to the category without a tag, where any child node in level 3 and one child node in level 2 are in a parent-child relationship.
In yet another optional implementation manner, the "category with a tag" may specifically include at least one of: a category with only a host name, a category with only host information of advertisement attributes, a two-level classification category of host and domain name, a category of host and advertisement Uniform Resource Locator (URL) information, or a category in which only the domain name and the advertisement URL information differ.
In yet another alternative implementation, the "processing module" may be specifically configured to, when the character rule includes a category of the first character string and a category of the preset character string, divide the level 4 of the m-level child nodes into two child nodes according to the category of the first character string and the category of the preset character string, where one of the two child nodes includes a character string in the second data that belongs to the category of the first character string, and the other child node includes a character string in the second data that belongs to the category of the preset character string, where any one of the child nodes in the level 4 is in a parent-child relationship with one of the child nodes in the level 3.
In a seventh aspect, an embodiment of the present invention provides a computer-readable storage medium, which may include instructions that, when executed on a computer, cause the computer to perform the following steps:
starting a browser to access a webpage;
acquiring information of the access webpage;
matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information;
and intercepting the target information when the information of the access webpage comprises the target information.
In an eighth aspect, an embodiment of the present invention provides a computer-readable storage medium, including instructions, which when executed on a computer, cause the computer to perform the following steps:
performing tree-shaped conversion processing on the second data to determine first data;
and sending the first data to a terminal, so that the terminal can determine whether the accessed webpage contains target information.
In a ninth aspect, an embodiment of the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of:
starting a browser to access a webpage;
acquiring information of the access webpage;
matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information;
and intercepting the target information when the information of the access webpage comprises the target information.
In a tenth aspect, an embodiment of the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of:
performing tree-shaped conversion processing on the second data to determine first data;
and sending the first data to a terminal, so that the terminal can determine whether the accessed webpage contains target information.
Drawings
FIG. 1 is a schematic diagram of an application scenario of advertisement blocking;
FIG. 2 is a schematic diagram of another application scenario of advertisement blocking;
FIG. 3 is a schematic diagram of an application scenario of advertisement blocking according to an embodiment of the present invention;
FIG. 4 is a flowchart of a data processing method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a matching result of URLs of elements accessed by a browser client according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a tree structure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a tree structure partitioned based on black and white list rules according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a tree structure partitioned based on positioning and preset matching rules according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a statistical classification structure based on tag attribute rule or character rule division according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a tree structure based on rule division according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a tree structure based on sub-classification according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a tree structure partitioned based on a black and white list rule, a positioning and preset matching rule, and a tag attribute rule according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a tree structure partitioned based on character rules according to an embodiment of the present invention;
FIG. 14 is a flowchart of an information intercepting method according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a terminal for intercepting information according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of a data processing server according to an embodiment of the present invention;
FIG. 17 is a schematic structural diagram of an information intercepting apparatus according to an embodiment of the present invention;
FIG. 18 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
To facilitate understanding of the embodiments of the present invention, they are further explained below with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
One current advertisement interception technology performs interception on the Opera server side. As shown in FIG. 1, the Opera server side may include a browser server, a webpage cache library and a page processing server. Specifically, when a client (for example, a mobile phone or a tablet computer) browses a webpage with the Opera browser, the client sends a webpage access request to the server side. The browser server receives the request and sends a webpage information query to the webpage cache library; the webpage cache library looks up the corresponding data according to the webpage information and sends it to the browser server, and the browser server returns the webpage content.
The data stored in the webpage cache library is produced by the page processing server, which periodically sends webpage access requests, receives the webpage content information, and processes it. The processing may include at least one of picture compression, text compression or advertisement filtering. The processed content is then compressed and sent to the webpage cache library for storage so that the browser server can query it. This method therefore hides advertisements on the server side and returns the webpage content, with advertisements hidden, to the client for display. It requires the server side to cache a large number of pages and analyze all of their content, so the servers must have high performance to process page content quickly, which places very high demands on hardware performance and storage.
Another advertisement blocking technology is applied in a browser system (as shown in FIG. 2). The browser server downloads an easylist rule list, and the browser client periodically downloads advertisement blocking character strings from the browser server (note that the easylist rule list contains the advertisement blocking character strings). When the browser client accesses a webpage, the Uniform Resource Locator (URL) of each element in the accessed page is matched against the advertisement blocking character strings, and the page elements corresponding to a matched character string are hidden.
Although this approach intercepts advertisements on the browser client, it has at least two problems. First, the easylist rule list downloaded through the browser server ages. For example, the current easylist rule list contains about 4.5W (45,000) URLs and keeps growing: volunteers are willing to "contribute" new rules but not to do maintenance work such as deleting old URLs from the list, so the number of URLs in the list risks increasing continuously. Note that the advertisement blocking character strings are determined by the URLs in the easylist rule list. Moreover, many URLs in the list were proposed long ago; the original websites have since changed how their pages are implemented, so those URLs are outdated and can no longer provide the browser client with effective advertisement blocking character strings. Second, matching the URLs in a visited webpage against the URLs in the easylist rule list performs poorly. As mentioned above, the list contains about 4.5W URL rules, while the home page of some large websites issues more than 100 web requests, and in some cases as many as 430. Blocking advertisements on such a page therefore takes on the order of 4.5W × 100 = 450W (4.5 million) match operations, or even more, which inevitably has a significant impact on the performance of the device running the browser client.
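The scale of the second problem can be checked with simple arithmetic. The sketch below uses the figures quoted above (about 45,000 rules, and 100 to 430 requests for a home page) and assumes the worst case, in which every request URL is compared against every rule; the function name is illustrative only.

```python
def worst_case_matches(rule_count: int, request_count: int) -> int:
    """Comparisons needed when every request URL is checked against every rule."""
    return rule_count * request_count


RULES = 45_000  # "about 4.5W" URL rules in the easylist rule list

for requests in (100, 430):
    total = worst_case_matches(RULES, requests)
    print(f"{requests} requests x {RULES} rules = {total} comparisons")
```

Under this assumption a single page load costs between 4.5 million and roughly 19 million comparisons, which is the cost the tree structure described below is meant to reduce.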
Therefore, in view of the above problems, embodiments of the present invention provide a client-based information interception method, apparatus, and terminal. The terminal intercepts target information in a browser page by means of first data having a tree structure. Because the tree structure finely partitions the character strings in the first data, it effectively reduces the number of matches between the information of an accessed webpage and the first data, avoiding the increase in matching operations that results from a large number of target-information interception strings combined with an unstructured matching scheme.
For convenience of description, advertisement information is used below as the example of target information in an accessed webpage. The method provided by the embodiments of the present invention may also be applied to information other than advertisement information, for example consultations or web addresses.
Fig. 3 is a schematic view of an application scenario of advertisement blocking according to an embodiment of the present invention. As shown in fig. 3, the scenario may include a client and a server, where the client may specifically be a browser client, and the server may specifically be a browser server.
Specifically, the method may include two processes. In the first process, the browser server determines the first data. The browser server obtains at least one of the URLs of a large number of pages visited by users or the URLs of page elements, where a page element may include at least one of text, a link or a picture. The browser server also periodically acquires an open source list (for example, an easylist rule list) from an open source website. Using a server-side learning mechanism (for example, the cloud-side learning mechanism in FIG. 3), the browser server learns from the acquired URLs and the acquired open source list (which may include advertisement blocking character strings) and determines the valid character strings (for example, character strings whose access count within a preset number of days exceeds a preset threshold, shown in FIG. 3 as the top1w character strings by X-day access count). Character strings that are invalid or rarely accessed are removed, which reduces the number of rules and therefore the number of subsequent matches.
The browser server combines the valid character strings with the browser's custom character strings (for example, the self-operated interception rules in FIG. 3) to determine the second data. The valid character strings and the custom character strings each include at least one character string. The browser server converts the second data into a private tree format to generate the first data, stores the first data having the tree structure in a private-format preferred rule base, and synchronizes it to the browser client. In the second process, the browser client periodically downloads the first data having the tree structure from the browser server. When a third webpage is accessed, the webpage information of the third webpage is matched against the first data having the tree structure to determine a matching result. If the webpage information matches the first data, the browser client intercepts the matched target information in the webpage information of the third webpage; the target information is generally advertisement information.
In summary, on the one hand, the method removes invalid or rarely accessed character strings from the original open source list by collecting statistics over a large amount of user access data, which keeps the rules valid while reducing the number of matching targets. On the other hand, by analyzing the second data in depth, it classifies the character strings according to corresponding rules to form a tree structure, which greatly reduces the number of matches needed for a single piece of information (that is, the information of each element in the accessed third webpage, where an element generally refers to text, a picture, a video or the like).
The information interception method provided by the embodiments of the present invention is further described below with reference to FIG. 4 to FIG. 13. The process by which the browser server processes data (that is, determines the first data) is described first, as shown in FIG. 4 to FIG. 13:
FIG. 4 is a flowchart of a data processing method according to an embodiment of the present invention. As shown in FIG. 4, the method may include steps S410-S470, as follows:
S410: The browser server receives an instruction of the browser client to access a webpage.
The instruction of the browser client to access a webpage may be instructions of a large number of users accessing a plurality of webpages through the browser, or instructions of a large number of users accessing the same webpage through the browser. Specifically, according to the users' webpage access instructions, the browser client records at least one of the URLs of the accessed pages or the URLs of the page elements, and compresses the URLs recorded within a preset time; the compressed file is the instruction of the browser client to access a webpage. The compressed file carries no user identification, in order to protect user privacy.
It should be noted that, because a large number of users access the web pages, the browser client sends the instruction of accessing the web pages to the browser server many times.
S420: The browser server periodically acquires the latest open source list (for example, the easylist character strings or a list containing the easylist character strings) from the open source website. For example, the server acquires the latest open source list from the open source website at 12 am every day.
S430: The browser server determines the valid character strings, that is, selects the rules with a high hit rate, according to the open source list and the instructions of the browser client to access webpages.
Specifically, each time the browser client reports an instruction to access a webpage, the server extracts from it at least one of the URLs of the pages accessed by users or the URLs of the page elements and matches them against the character strings in the open source list. If a character string is matched, the count for that character string is incremented by 1. These steps are repeated until the browser server has matched all records in the reported URLs. After matching, the reported file may be placed in a backup directory; in one possible implementation, the file may be deleted within a preset time period.
The browser server stores the count for each character string, then computes the access count of each character string over a preset time period (for example, the latest 30 days) and sorts the access counts in descending order. A character string whose access count is greater than a first threshold (for example, a first threshold N = 2000) is a valid character string. A character string whose access count is less than a second threshold (for example, a second threshold N = 100), or which has become invalid (for example, a character string with zero accesses), is removed directly.
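The threshold step can be sketched as follows. This is a minimal illustration, not the server's actual implementation: the function name, the sample rule strings and their counts are hypothetical, and because the text does not say what happens to character strings whose count lies between the two thresholds, the sketch simply reports them separately.

```python
def select_valid_strings(hit_counts, first_threshold=2000, second_threshold=100):
    """Split rule strings by access count over the statistics window.

    Returns (valid, removed, undecided), each sorted by count, largest first.
    """
    ranked = sorted(hit_counts.items(), key=lambda item: item[1], reverse=True)
    valid = [rule for rule, hits in ranked if hits > first_threshold]
    removed = [rule for rule, hits in ranked if hits < second_threshold]
    # Handling of the middle band is not specified in the text (assumption: keep apart).
    undecided = [rule for rule, hits in ranked
                 if second_threshold <= hits <= first_threshold]
    return valid, removed, undecided


# Hypothetical 30-day access counts per rule string.
sample_counts = {
    "||ads.example.com^": 5200,
    "||cdn.example.net/banner_": 2600,
    "/promo/popup.js": 450,
    "||old.example.org/ad_": 30,
    "||gone.example.org^": 0,
}
valid, removed, undecided = select_valid_strings(sample_counts)
print("valid:", valid)
print("removed:", removed)
print("undecided:", undecided)
```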
A specific example follows. For example, sina.cn has the following rules in the easylist string list:
||mobile.sina.cn/public/files/image/600x150_
||mobile.sina.cn/public/files/image/620x300_
||sina.cn/api/article/news_banner?
||sina.cn/cm/sinaads_
||sina.cn^*/impress?
When the home page of sina.cn is opened, the server matches the URLs of the accessed page, as collected by the browser client, against the character strings in the easylist string list, as shown in FIG. 5. As can be seen from FIG. 5, the rule ||sina.cn^*/impress? is hit 4 times and ||sina.cn/cm/sinaads_ is hit once. This shows which of the 5 character strings given in the easylist are frequently accessed and which are rarely or never accessed. FIG. 5 shows the result of a single access; after the access instructions of millions of users have been collected, it is possible to determine which character strings are valid and which are invalid.
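The hit counting in this example can be sketched as below. The conversion from rule string to regular expression assumes the usual Adblock-style meaning of `||` (start of a domain), `^` (separator) and `*` (wildcard), and it is deliberately simplified. The request URLs are hypothetical stand-ins chosen to reproduce the hit pattern described for FIG. 5; they are not the URLs shown in that figure.

```python
import re
from collections import Counter

# "^": any character other than a letter, a digit, or _ - . %, or the end of the URL.
SEPARATOR = r"(?:[^A-Za-z0-9_\-.%]|$)"
# "||": scheme plus optional subdomains (assumed Adblock-style meaning).
DOMAIN_ANCHOR = r"^[A-Za-z][A-Za-z0-9+.\-]*://(?:[^/?#]+\.)?"


def rule_to_regex(rule):
    """Translate a simplified easylist-style rule into a compiled regex."""
    anchored = rule.startswith("||")
    body = rule[2:] if anchored else rule
    parts = []
    for ch in body:
        if ch == "*":
            parts.append(".*")
        elif ch == "^":
            parts.append(SEPARATOR)
        else:
            parts.append(re.escape(ch))
    return re.compile((DOMAIN_ANCHOR if anchored else "") + "".join(parts))


def count_hits(rules, urls):
    """Count, for each rule, how many of the given URLs it matches."""
    compiled = [(rule, rule_to_regex(rule)) for rule in rules]
    hits = Counter({rule: 0 for rule in rules})
    for url in urls:
        for rule, pattern in compiled:
            if pattern.search(url):
                hits[rule] += 1
    return hits


RULES = [
    "||mobile.sina.cn/public/files/image/600x150_",
    "||mobile.sina.cn/public/files/image/620x300_",
    "||sina.cn/api/article/news_banner?",
    "||sina.cn/cm/sinaads_",
    "||sina.cn^*/impress?",
]
# Hypothetical request URLs for one page load.
URLS = [
    "https://sax.sina.cn/wap/impress?a=1",
    "https://sax.sina.cn/wap/impress?a=2",
    "https://sax.sina.cn/native/impress?a=3",
    "https://sax.sina.cn/native/impress?a=4",
    "https://sina.cn/cm/sinaads_wap.js",
    "https://sina.cn/index.html",
]
hits = count_hits(RULES, URLS)
for rule in RULES:
    print(hits[rule], rule)
```

Accumulating such counters over the reports of many users gives the per-string access counts used in step S430.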
S440: The browser server combines the valid character strings with the browser's custom character strings to determine the second data.
Specifically, the valid character strings are filtered from an open source list (for example, an easylist rule list), so the valid character strings are open source. In addition, each browser defines some custom rules during operation, namely the browser's custom character strings.
In another possible embodiment, before the step S440, a custom character string of the browser server may be obtained.
S450: The browser server performs tree conversion processing on the second data to determine the first data.
The first data may include character strings for matching target information, which in the present application refers to advertisement information. The browser client uses the first data to intercept the advertisement information in an accessed page.
Specifically, the browser server divides the second data into m levels according to n preset rules, wherein the preset rules of each level in the m levels of child nodes are different; each of the n preset rules comprises at least two character string categories, and each layer in the m levels is divided into at least two sub-nodes according to the character string categories; the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers which are greater than or equal to 1, and n is greater than or equal to m.
Put another way, the browser server divides the second data into m levels (m is an integer greater than 0) according to n preset rules (n is an integer greater than 0, and n is greater than or equal to m). Each of the m levels includes at least two child nodes, and each of the n preset rules includes at least two categories. The child nodes in each level are divided according to those categories (that is, each child node in a level corresponds to one category), and each child node contains a plurality of character strings of its category.
When the preset rules are selected, each of the m levels uses a different preset rule. One option is to select the rules in their listed order among the n preset rules; another is to select any two or three of the n preset rules at random, provided that at least two rules are selected.
By way of example, as shown in FIG. 6, the browser server divides the second data into 4 levels (the fourth level is not shown) according to 4 preset rules, and each of the 4 levels includes at least two child nodes. The 4 preset rules may include a black and white list rule, a positioning and preset matching rule, a tag attribute rule and a character rule. The preset rules may also include other possibilities (for example, identification or fixed statements); the present application uses the above rules only as examples and is not limited to these 4. When only 2 or 3 of these division modes are selected for tree conversion, there are fewer kinds of division and the division is coarser, but the matching speed is still improved compared with the prior art.
Each preset rule in the 4 preset rules may include at least two categories, and each child node in each level corresponds to one category, where at least two child nodes in each level are divided according to at least two categories.
For example, when the black and white list rule and the character rule are selected for tree conversion processing, the black and white list division is applied first and the character rule division second. When the positioning and preset matching rule, the tag attribute rule and the character rule are selected, the positioning and preset matching rule is applied first, then the tag attribute rule, and finally the character rule. It should also be understood that when all 4 rules above are selected, they are applied in the order listed above; when rules are added to or removed from this set, the order is determined according to the actual situation.
For example, when the black and white list rule includes a white list category and a black list category, level 1 of the m levels is divided into two child nodes according to these two categories (1a in FIG. 6 corresponds to the BLACK list child node in FIG. 7, and 1b in FIG. 6 corresponds to the WHITE list child node in FIG. 7). One of the two child nodes includes the character strings in the second data that belong to the white list category (the content in the frame below the WHITE list child node in FIG. 7), and the other includes the character strings in the second data that belong to the black list category (the content in the frame below the BLACK list child node in FIG. 7).
Specifically, the browser server divides the second data into a first sub-node and a second sub-node according to the category of the white list and the category of the black list in the black and white list rule, wherein the first sub-node (for example, the 1a sub-node) comprises a character string belonging to the category of the black list, and the second sub-node (for example, the 1b sub-node) comprises a character string belonging to the category of the white list.
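A minimal sketch of this first-level split is given below. The text does not state how a character string is recognized as belonging to the white list; the sketch assumes the Adblock-style convention in which exception rules start with `@@`. The function name and the sample rules are hypothetical.

```python
def split_black_white(rules):
    """Divide rule strings into the two level-1 child nodes (1a: black, 1b: white)."""
    level1 = {"black": [], "white": []}
    for rule in rules:
        if rule.startswith("@@"):
            # Assumption: "@@" marks a white list (exception) rule; strip the marker.
            level1["white"].append(rule[2:])
        else:
            level1["black"].append(rule)
    return level1


# Hypothetical second data.
second_data = [
    "||ads.example.com^",
    "@@||example.com/ads.js",
    "/banner.js",
]
level1 = split_black_white(second_data)
print(level1)
```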
When the positioning and preset matching rule includes a positioning matching category and a preset matching category, level 2 of the m levels is divided into two child nodes according to these categories (for example, child nodes 2a and 2b in FIG. 6). One of the two child nodes includes the character strings in the second data that belong to the positioning matching category, and the other includes the character strings that belong to the preset matching category. These two level-2 child nodes are children of the level-1 node holding the character strings of the black list category. In another possible embodiment, a third child node in level 2 (for example, child node 2c in FIG. 6) is a child of the level-1 node holding the character strings of the white list category. It should be noted that, in this embodiment of the present invention, the level-2 child nodes described are all children of the node holding the character strings of the black list category. Specifically, the positioning matching category includes at least one of information that a character string exists at a first preset position or information that a separator exists at a second preset position; the preset matching category, which is used for screening the information of the accessed webpage, includes at least one of information having a prefix or information having a suffix.
The following details the child node of the category with location match and the child node of the category with preset match:
The child nodes of the positioning matching category are divided mainly according to characters at fixed positions. Specifically, either * exists at the first preset position, where * indicates that any character string may appear at the first preset position; or ^ exists at the second preset position, where ^ indicates that a separator appears at the second preset position (a separator may be any character other than a letter, a digit, or one of _ - . %).
For example, in the following address of a page accessed by the browser client, characters such as :, /, ? and & can be considered separators:
http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82
Therefore, this address is matched by positioning matching rules such as ^example.com^, ^%D1%82%D0%B5%D1%81%D1%82^ or ^foo.bar^.
In addition, the preset matching category is divided according to common patterns and may include at least one of prefix matching or suffix matching. The following description assumes that both are present.
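The separator rule can be checked directly against the example address. The sketch below encodes the definition given above (a separator is any character other than a letter, a digit, or one of `_ - . %`) and additionally assumes that the end of the address also counts as a separator, which is needed for the second rule to match. The helper name is illustrative.

```python
import re

SEPARATOR_CLASS = r"[^A-Za-z0-9_\-.%]"
# Assumption: the end of the address also satisfies "^".
SEPARATOR = rf"(?:{SEPARATOR_CLASS}|$)"

URL = "http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82"


def positioning_rule_matches(rule, url):
    """Match a rule in which "^" is a separator and "*" is any character string."""
    parts = []
    for ch in rule:
        if ch == "^":
            parts.append(SEPARATOR)
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.search("".join(parts), url) is not None


# Every separator character that occurs in the example address, in order.
separators = re.findall(SEPARATOR_CLASS, URL)
print(separators)

for rule in ("^example.com^", "^%D1%82%D0%B5%D1%81%D1%82^", "^foo.bar^"):
    print(rule, positioning_rule_matches(rule, URL))
```

A rule such as `^example^` does not match this address, because `example` is followed by `.`, which is not a separator.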
As can be seen from the above, the white. For example: for the above sina-related branches, the branch scenario shown in fig. 8 is changed, and due to the limited number, only 3 sub-nodes (for example, white. That is, the second level child nodes having a parent-child relationship with the first level 1a node under the first level 1a node may include 2a and 2b, while the second level child nodes having a parent-child relationship with the first level 1b node under the first level 1b node may also include 2a and 2b or 2c (this possibility is not shown in fig. 6).
The preset matching category may be divided into 2 branches (prefix matching and suffix matching). In a possible embodiment, the preset matching level may be merged with the first level into one layer, that is, the root node (ROOT) may have 4 child nodes at the same time, for example: plant, white, glob, black.
In another possible embodiment, 4 child nodes may appear in the second level: a child node for information that a character string exists at the first preset position, a child node for information that a separator exists at the second preset position, a child node for information having a prefix, and a child node for information having a suffix.
In yet another possible embodiment, 8 child nodes may be present in the second level, divided into two groups. One group consists of 4 child nodes whose parent is the first-level node 1a, and the other group consists of 4 child nodes whose parent is the first-level node 1b. Each group may include a child node for information that a character string exists at the first preset position, a child node for information that a separator exists at the second preset position, a child node for information having a prefix, and a child node for information having a suffix.
When the tag attribute rule includes a category with a tag and a category without a tag, according to the category with a tag and the category without a tag, dividing a level 3 of the m-level child nodes into two child nodes (e.g., 3a and 3b), where one child node of the two child nodes includes a character string in the second data that belongs to the category with a tag, and the other child node includes a character string in the second data that belongs to the category without a tag, where any child node in the level 3 and one child node of the level 2 child nodes are in a parent-child relationship, for example: as shown in FIG. 6, the 3a and 3b child nodes are in a parent-child relationship with the 2a child node, and the 3c child node is in a parent-child relationship with the 2b child node.
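The three levels described so far can be sketched as a nested dictionary. The classification tests are assumptions made for illustration: a leading `@@` marks a white list string, the presence of `^` or `*` marks positioning matching (anything else is treated as preset matching), and options after `$` are treated as the tag. The function names and the white list sample rule are hypothetical; the other sample strings are taken from this description.

```python
def classify(rule):
    """Return the (level 1, level 2, level 3) categories of one rule string."""
    list_kind = "white" if rule.startswith("@@") else "black"
    pattern = rule[2:] if list_kind == "white" else rule
    pattern, _, options = pattern.partition("$")
    match_kind = "positioning" if ("^" in pattern or "*" in pattern) else "preset"
    tag_kind = "tagged" if options else "untagged"
    return list_kind, match_kind, tag_kind


def build_tree(rules):
    """Group rule strings into a 3-level tree keyed by their categories."""
    tree = {}
    for rule in rules:
        *branches, leaf = classify(rule)
        node = tree
        for key in branches:
            node = node.setdefault(key, {})
        node.setdefault(leaf, []).append(rule)
    return tree


second_data = [
    "||sina.cn^*/impress?",
    "||sina.cn/cm/sinaads_",
    "||116b.com^$third-party",
    "/banner.js$domain=28188.com|28188.net",
    "@@||example.com/ads.js$script",  # hypothetical white list rule
]
tree = build_tree(second_data)
print(tree["black"]["positioning"])
```

With such a tree, a piece of webpage information only needs to be compared against the character strings in the leaves whose categories apply to it, rather than against every character string in the second data.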
The description continues with reference to FIG. 9. Tag attribute rules may include many types; in the embodiment of the present invention two are provided: a tagged category (for example, the content shown in FIG. 9) and an untagged category. The tagged category may specifically include at least one of: a category containing only a host name; a category containing only the host information of the site to which the advertisement belongs; a two-level classification category of host and domain; a category containing host and advertisement Uniform Resource Locator (URL) information; or a category of advertisement URL information in which only the domain and the advertisement host differ. The tagged category (for example, the content shown in FIG. 9) is described first, as follows:
MIME_TYPE of request content
"other":1
"xbl":1
"ping":1
"dtd":1
"script":2
"image":4
"background":4
"stylesheet":8
"object":16
"subdocument":32,
"document":64,
"xmlhttprequest":2048,
"object_subrequest":4096,
"media":16384,
"font":32768,
"popup":0x1000000,
The left column gives the tag category by which a tagged character string is classified, and the right column gives the number assigned to that tag category after classification (these numbers are set in the standard).
The character strings are further divided according to tag category. The 4 sub-categories selected above (for example, "script", "image" and "document" in FIG. 10) can each serve as a tag category. As shown in FIG. 10, the division is only performed for black. A character string in the second data whose tag type is image is placed on the corresponding child node, and so on. FIG. 10 also includes the untagged category as a class (for example, the untagged character strings in the box below the "+" node in FIG. 10). Other tag categories also exist (because of limited space in the figure, "… …" indicates the other tag categories), and a large number of character strings hang under the node of each tag category. Because "script" and "image" character strings generally account for a high proportion, FIG. 10 gives many example character strings for the "script" and "image" child nodes.
Secondly, in a possible embodiment, the tagged category may be further classified into at least one of: a category containing only a host name; a category containing only the host information of the site to which the advertisement belongs; a two-level classification category of host and domain; a category containing host and advertisement Uniform Resource Locator (URL) information; or a category of advertisement URL information in which only the domain and the advertisement host differ. These specific categories (see the contents of sequence numbers 2-6 in FIG. 9) are described in detail below.
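One way to use these numbers is sketched below. The values are copied from the listing above; treating them as bit flags that can be combined with a bitwise OR is an assumption suggested by the fact that they are powers of two, and the function names are hypothetical.

```python
# Tag category numbers as listed above.
TYPE_MAP = {
    "other": 1, "xbl": 1, "ping": 1, "dtd": 1,
    "script": 2, "image": 4, "background": 4, "stylesheet": 8,
    "object": 16, "subdocument": 32, "document": 64,
    "xmlhttprequest": 2048, "object_subrequest": 4096,
    "media": 16384, "font": 32768, "popup": 0x1000000,
}


def type_mask(options):
    """Combine the tag categories named in a comma-separated option list."""
    mask = 0
    for name in options.split(","):
        name = name.strip()
        if name in TYPE_MAP:  # options that are not tag categories are ignored
            mask |= TYPE_MAP[name]
    return mask


def applies_to(mask, request_type):
    """True if a rule with this mask covers the given request type."""
    return bool(mask & TYPE_MAP[request_type])


mask = type_mask("script,image")
print(mask, applies_to(mask, "image"), applies_to(mask, "stylesheet"))
```

Because "image" and "background" share the value 4 in the listing, a rule tagged with one of them also covers the other under this scheme.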
The host classification can include the following 4 types:
The first type: Direct classification
Specifically, only the host name is included (that is, the description part of sequence number 2 in FIG. 9 contains only host information), for example ||9377os.com^ (the example part of sequence number 2 in FIG. 9). Such character strings can subsequently be further divided according to the host name string.
The second type: Third classification
Specifically, only the host information of the site to which the advertisement belongs is included (that is, the description part of sequence number 3 in FIG. 9 contains only the information of a third-party website accessing the advertisement's home website), for example ||116b.com^$third-party (the example part in FIG. 9). Within this classification, character strings can subsequently be further divided according to the host name string.
The third type: Domain_Direct classification
Specifically, character division is subsequently performed according to the two-level classification of host and domain (that is, the description part of sequence number 4 in FIG. 9 contains the domain of the current webpage and the host of the advertisement webpage).
The fourth type: Domain_Filter classification
Specifically, the URL information of the host and the advertisement is included (that is, the part of sequence number 5 in FIG. 9 contains the domain and the advertisement content), for example the following 5 character strings:
/static/media/curl.swf$domain=duba.com
/banner.js$domain=28188.com|28188.net
/skin/tb12/$domain=17huohu.com|firefox.com.cn
.com/tps/$domain=ocucn.com
||cdndm.com/12/2016/$domain=1kkk.com|dm5.com
According to these 5 character strings, Domain_Filter can be further divided as follows:
A character string containing the advertisement host name may be: ||cdndm.com/12/2016/$domain=1kkk.com|dm5.com
A character string containing an advertisement path without the host name may be: .com/tps/$domain=ocucn.com
A character string containing an advertisement file name without the host may be:
/static/media/curl.swf$domain=duba.com
According to the above further division, Domain_Filter can be divided into 3 child nodes (e.g., as shown in fig. 11): a node for character strings containing the advertisement host name (e.g., 111 in fig. 11), a node for character strings containing an advertisement path without the host name (e.g., 113 in fig. 11), and a node for character strings containing an advertisement file name without the host (e.g., 112 in fig. 11).
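The three-way division of Domain_Filter described above can be sketched as follows. This is only an illustration: the function name `domain_filter_subnode`, the bucket labels, and the heuristics (a leading `||` marks a host name, a trailing `/` marks a path) are assumptions made for this example and are not stated in the description.

```python
def domain_filter_subnode(rule: str) -> str:
    """Return a hypothetical Domain_Filter sub-node label for one rule string."""
    # Drop the "$domain=..." option part; only the URL pattern is inspected.
    pattern = rule.split("$", 1)[0]
    if pattern.startswith("||"):
        return "ad_host"  # assumed: "||" anchors an advertisement host name
    if pattern.endswith("/"):
        return "ad_path"  # assumed: a trailing "/" means a path without host
    return "ad_file"      # otherwise treated as a file name without host


rules = [
    "/static/media/curl.swf$domain=duba.com",
    "/banner.js$domain=28188.com|28188.net",
    "/skin/tb12/$domain=17huohu.com|firefox.com.cn",
    ".com/tps/$domain=ocucn.com",
    "||cdndm.com/12/2016/$domain=1kkk.com|dm5.com",
]

# Distribute the five example strings over the three sub-nodes.
buckets = {}
for rule in rules:
    buckets.setdefault(domain_filter_subnode(rule), []).append(rule)
```

With these heuristics the three strings singled out in the text land in the host-name, path, and file-name nodes respectively; a real rule set would likely need more careful parsing.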
The above classification processing may likewise be applied to the Domain_Filter under the home-host attribute classification. For example, this classification method can be used for advertisements containing pictures:
Fifth: Third_Filter classification
Specifically, for example, the character string com.tw/exep/ap/$third-party.
The only difference from the fourth type (Domain_Filter) is that the domain currently accessed by the user and the host of the advertisement are different; the advertisement information is processed in the same way. It can therefore be handled in the same way as the Domain_Filter part above.
In addition, a sixth type may be included: Type_Filter classification
Specifically, Domain_Filter and Third_Filter may be combined into Type_Filter; when there is only advertisement content information, Type_Filter may include the two subclasses Domain_Filter and Third_Filter.
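One way to read the labeled types above is as a function of two observations: whether the part before `$` is a bare host, and which option follows `$`. The sketch below encodes that reading. The function name `label_category`, the "bare host" test, the `unlabeled` fallback, and the example host `ads.example` are assumptions for illustration; the description does not define the classification this precisely.

```python
def label_category(rule: str) -> str:
    """Guess the labeled type of a rule string (illustrative reading only)."""
    pattern, _, options = rule.partition("$")
    # Assumed test for "host only": anchored with "||" and no path component.
    host_only = pattern.startswith("||") and "/" not in pattern
    opts = options.split(",") if options else []
    if any(opt.startswith("domain=") for opt in opts):
        return "Domain_Direct" if host_only else "Domain_Filter"
    if "third-party" in opts:
        return "Third" if host_only else "Third_Filter"
    return "Direct" if host_only else "unlabeled"


examples = {
    "||ads.example^": "Direct",  # hypothetical host
    "||116b.com^$third-party": "Third",
    "/banner.js$domain=28188.com|28188.net": "Domain_Filter",
    "com.tw/exep/ap/$third-party": "Third_Filter",
}
results = {rule: label_category(rule) for rule in examples}
```

Under this reading, Type_Filter would simply be the union of the `Domain_Filter` and `Third_Filter` results.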
To sum up, according to the black and white list division, the matching mode division, and the rule category division, tree conversion processing may be performed on the second data, and the converted tree structure may be as shown in fig. 12. Specifically, the tree structure of fig. 12 is formed by combining the matching mode division and the rule category division, taking the blacklist as an example.
For the child nodes 120-129 in fig. 12, character string division may be performed on the domain, the advertisement host, the advertisement object path, and the host name. Specifically, for at least one child node classified as Direct or Third in fig. 12, the host name may be further divided by its initial character into at least one of the categories 0-9, a-Z, or other (as shown in fig. 13). This yields 3 child nodes, and the original 4 character strings are distributed among the 3 child nodes according to their categories.
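The division by initial character can be sketched as a simple bucketing step. The function name `initial_char_bucket` and the host `-ads.example` are hypothetical; the other three host names are taken from the rule strings quoted earlier.

```python
import string


def initial_char_bucket(host_name: str) -> str:
    """Return the child-node label for a host name based on its first character."""
    first = host_name[:1]
    if first and first in string.digits:
        return "0-9"
    if first and first in string.ascii_letters:
        return "a-Z"
    return "other"  # empty names and any other leading character


hosts = ["9377os.com", "116b.com", "cdndm.com", "-ads.example"]

# Four host names are distributed over three child nodes.
children = {}
for host in hosts:
    children.setdefault(initial_char_bucket(host), []).append(host)
```

A lookup then only needs to compare a URL's host against the strings in the single bucket that shares its initial character.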
The browser server may further divide the third level, that is, the child nodes 120-129 in fig. 12, into a fourth level. The preset division rule selected here may be a character rule. Specifically, when the character rule includes a first-character-string category and a preset-character-string category, the 4th level of the m levels of child nodes is divided into two child nodes according to these two categories: one child node includes the character strings in the second data that belong to the first-character-string category, and the other includes the character strings in the second data that belong to the preset-character-string category. Any child node in the 4th level is in a parent-child relationship with one child node in the 3rd level.
It should be noted that each child node in the k-th level has a parent-child relationship with one child node in the (k-1)-th level, where the k-th level is any level of the m levels of child nodes and k is an integer greater than or equal to 1. In combination with the above example, in the case where n is 4 (so that k is a positive integer less than or equal to n), if k is 2, each child node in the 2nd level (e.g., 2a and 2b) has a parent-child relationship with one child node in the 1st level (e.g., 1b); if k is 3, each child node in the 3rd level (e.g., 3a, 3b, and 3c) has a parent-child relationship with one child node in the 2nd level (e.g., 2a or 2b). Each child node may in turn have child nodes at the next level that are in a parent-child relationship with it.
It should be noted that, in another possible embodiment, only two or three of the above 4 division manners may be selected.
The functions of each child node of the tree structure described above include: matching at least one of the URL of the page accessed by the user or the URLs of the elements of that page against the character strings included in the child node; and dispatching to the matching child node at the next level according to the character string features contained in the URL of the accessed page or in the URLs of its elements.
Next, each time the browser client starts the browser to access a web page, it needs to intercept target information (i.e., advertisement information) in the accessed web page according to the first data with the tree structure. Fig. 14 is a flowchart of an information interception method provided in an embodiment of the present invention. As shown in fig. 14, the method includes steps S1410-S1440, as follows:
S1410: The browser client starts a browser to access a web page.
Specifically, before this step, the method may further include receiving the first data. The first data can be downloaded from the browser server, generally periodically (for example, automatically when the device is online at 12:00 every day). The first data is obtained after the server performs tree conversion processing on the second data, where the second data includes effective character strings and custom character strings of the browser. The effective character strings are character strings whose usage rate is greater than a preset threshold, determined by screening the open source character strings from an open source website against the historical data reported by the terminal within a preset time period.
S1420: The browser client acquires the information of the accessed web page.
Specifically, the accessed web page may also be referred to as website information.
The information of the accessed web page may include the URL of the page accessed by the user or the URLs of the elements of that page. The information of the accessed web page may or may not include target information; in the embodiments provided in the present application, the target information generally refers to advertisement information.
S1430: The browser client matches the information of the accessed web page against first data arranged in a tree structure, where the first data is used for determining whether the information of the accessed web page includes target information.
The specific matching process may be as follows:
First, the tree structure is introduced (it may be the tree structure determined by the browser server through tree conversion); details are described with reference to fig. 6. The tree structure includes a plurality of nodes, comprising a root node (ROOT) and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes. The nodes of each level have a parent-child relationship with the associated nodes of the next level, and the first data is distributed over the nodes of the tree structure according to preset rules. The information of the accessed web page is matched step by step, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information of the accessed web page includes the target information.
Specifically, the tree structure may include m levels of child nodes, each level being divided according to a different one of n preset rules, where n and m are integers greater than or equal to 1 and n is greater than or equal to m. The j-th level of child nodes is divided by 1 preset rule selected from f preset rules, where the f preset rules are those of the n preset rules that remain after the selections made for the preceding j-1 levels, the (j-1)-th level is the level immediately above the j-th level, the j-th level is any level of the m levels of child nodes, and j and f are integers greater than or equal to 1. Each of the n preset rules includes at least two character string categories.
The first data includes a plurality of character strings, which are divided among the m levels of child nodes; each child node in the m levels corresponds to a different character string category in the n preset rules, and each child node includes a plurality of character strings with different character string categories.
The n preset rules may include at least one of the following: a black and white list rule, a positioning and preset matching rule, a label attribute rule, or a character rule.
The embodiment provided by the present application performs division and matching according to the rules shown in fig. 4.
Level 1 of the tree structure includes two child nodes: the first child node contains the character strings of the white list category, and the second child node contains the character strings of the blacklist category. The two child nodes are divided according to the black and white list rule.
In the matching process, if the information of the accessed web page is matched in the first child node, matching ends immediately, and the large number of character strings in the second child node need not be matched.
For example, as shown in fig. 7, a (black) child node and a (white) child node are determined. As can be seen from the figure, the (white) child node may include @||si. The (black) child node includes the character strings used to identify advertisement information. If the URL of a picture (an element) is https://sina.com.cn/little/180528/close.jpg, it is first matched against the character strings in the (white) child node; if it matches a character string there, it is not matched against the character strings in the (black) child node. The character strings of the (white) child node are thus used to screen out information that is not advertisement information: when a URL matches the (white) child node, the information is not an advertisement, so it is not intercepted, the tree structure is exited, and the matching process terminates.
However, when the information of the accessed web page does not include a character string of the white list category, it is matched against the character strings of the blacklist category, that is, in the second child node. When it does not include a character string of the blacklist category either, the terminal determines that the information of the accessed web page does not include the target information: the information is not an advertisement, the terminal does not intercept it, the tree structure is exited, and the matching process terminates.
When the information of the accessed web page includes a character string of the blacklist category, the terminal matches it step by step against the child nodes that are in a parent-child relationship with the blacklist child node, until the information of the accessed web page is completely matched, and then intercepts the target information in it.
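The level-1 decision described above (white list first, then blacklist) can be sketched as follows. The function name `should_intercept`, the example strings, and the use of plain substring comparison are assumptions for illustration; in the described scheme a blacklist hit would be followed by further matching at deeper levels rather than an immediate decision.

```python
from typing import Sequence


def should_intercept(url: str, white: Sequence[str], black: Sequence[str]) -> bool:
    """Level-1 decision: check the white list node before the blacklist node."""
    # First child node: a white list hit ends matching at once, so the
    # (typically much larger) blacklist node is never consulted.
    if any(s in url for s in white):
        return False
    # Second child node: no blacklist hit means there is nothing to intercept.
    if not any(s in url for s in black):
        return False
    # Blacklist hit: the deeper levels of the tree would be matched here.
    return True


white_strings = ["safe.example"]                # hypothetical white list entries
black_strings = ["/banner.js", "ads.example"]   # hypothetical blacklist entries
```

For instance, `should_intercept("https://safe.example/banner.js", white_strings, black_strings)` returns `False` even though `/banner.js` is blacklisted, because the white list node is matched first.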
Level 2 of the tree structure includes two child nodes, each of which is in a parent-child relationship with the level-1 child node holding the character strings of the blacklist category.
The first child node in level 2 includes the character strings of the positioning matching category, and the second child node in level 2 includes the character strings of the preset matching category. The positioning matching category is used for screening, in the information of the accessed web page, at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position; the preset matching category is used for screening at least one of information with a prefix or information with a suffix.
For example, the child nodes of the positioning matching category are divided mainly according to characters at fixed positions. Specifically, a * exists at the first preset position, where * indicates that any character string may appear at that position; or a ^ exists at the second preset position, where ^ indicates that a separator occurs at that position (a separator may be any character other than a letter, a digit, or one of _, -, ., %). In the following address accessed by the browser client, the characters :, /, ?, =, and & can be considered separators:
http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82
Therefore, a rule in the positioning matching rule list such as ^example.com^, ^%D1%82%D0%B5%D1%81%D1%82^, or ^foo.bar^ can match this address.
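The separator placeholder ^ can be illustrated by translating a rule into a regular expression. This is a minimal sketch: the helper name `rule_to_regex` is hypothetical, the wildcard * is not handled, and treating the end of the address as a separator is an assumption made here so that a trailing ^ can match the last query value.

```python
import re

# A separator is any character other than a letter, a digit, "_", "-", "." or "%".
# Assumption: the end of the address is also accepted where "^" appears.
_SEPARATOR = r"(?:[^A-Za-z0-9_\-.%]|$)"


def rule_to_regex(rule: str):
    """Compile a rule containing "^" placeholders into a regular expression."""
    # Escape the literal parts and join them with the separator pattern.
    literal_parts = [re.escape(part) for part in rule.split("^")]
    return re.compile(_SEPARATOR.join(literal_parts))


url = "http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82"

matched = [
    rule
    for rule in ("^example.com^", "^%D1%82%D0%B5%D1%81%D1%82^", "^foo.bar^", "^example^")
    if rule_to_regex(rule).search(url)
]
```

The rule ^example^ does not match this address, because the character following "example" is ".", which is not a separator.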
The same applies to prefix information and suffix information: if the corresponding character string appears at the preset position, a match is established. That is, if the accessed web page includes characters with the same prefix or suffix, it is a match. After a match at this level, it must be determined whether the accessed web page has been completely matched; if not, matching continues at level 3. If matching is complete, the information is an advertisement, and the terminal intercepts the target information corresponding to the URL, thereby terminating the matching process.
The first child node in level 3 includes the character strings of the labeled category, and the second child node includes the character strings of the unlabeled category. Any child node in level 3 is in a parent-child relationship with one child node in level 2.
The labeled category is used for screening the information of the accessed web page that includes label attribute information, and the unlabeled category is used for screening the information of the accessed web page that does not. The labeled category can be further divided into at least one of the following: a category containing only a host name; a category containing only the host information of the site to which the advertisement belongs; a two-level host and domain name classification category; a category containing host and advertisement Uniform Resource Locator (URL) information; or a category in which only the domain name and the advertisement URL information differ.
The specific matching process may follow the classification below:
First: Direct matching
Specifically, only the host name is included (that is, the description part of serial number 2 in fig. 9 contains only host information), for example |9377os.com^ (the example part of serial number 2 in fig. 9). Such a character string can subsequently be further matched according to the character string of the host name.
Second: Third matching
Specifically, only the host information of the site to which the advertisement belongs is included (that is, the description part of serial number 3 in fig. 9 contains only the information of a third-party website accessing the advertisement's home website), for example ||116b.com^$third-party (the example part of serial number 3 in fig. 9). Under this classification, the character string can subsequently be further matched according to the character string of the host name.
Third: Domain_Direct matching
Specifically, character matching is subsequently performed according to the two-level classification of host and domain (that is, the description part of serial number 4 in fig. 9 contains the domain of the current web page and the host information of the advertisement web page).
Fourth: Domain_Filter matching
Specifically, the URL information of the host and the advertisement is included (that is, the part indicated by serial number 5 in fig. 9 contains the domain and the advertisement content), for example, the following 5 character strings:
/static/media/curl.swf$domain=duba.com
/banner.js$domain=28188.com|28188.net
/skin/tb12/$domain=17huohu.com|firefox.com.cn
.com/tps/$domain=ocucn.com
||cdndm.com/12/2016/$domain=1kkk.com|dm5.com
According to these 5 character strings, further matching can be carried out under Domain_Filter:
A character string containing the advertisement host name may be: ||cdndm.com/12/2016/$domain=1kkk.com|dm5.com
A character string containing an advertisement path without the host name may be: .com/tps/$domain=ocucn.com
A character string containing an advertisement file name without the host may be:
/static/media/curl.swf$domain=duba.com
The above matching method may likewise be applied to the Domain_Filter under the home-host attribute classification. For example, this matching method can be used for advertisements containing pictures:
Fifth: Third_Filter matching
Specifically, for example, it is determined whether the information of the accessed web page contains ||books.
The only difference from the fourth type (Domain_Filter) is that the domain currently accessed by the user and the host of the advertisement are different; the advertisement information is processed in the same way. It can therefore be handled in the same way as the Domain_Filter part above.
In addition, a sixth type may be included: Type_Filter matching
Specifically, Domain_Filter and Third_Filter may be combined into Type_Filter; when the accessed page includes advertisement content information, Type_Filter may include the two subclasses Domain_Filter and Third_Filter.
After a match at this level, it must be determined whether the accessed web page has been completely matched; if not, matching continues at level 4. If matching is complete, the information is an advertisement, and the terminal intercepts the target information corresponding to the URL, thereby terminating the matching process.
The first child node in level 4 includes the character strings of the first-character-string category, and the second child node includes the character strings of the preset-character-string category. Any child node in level 4 is in a parent-child relationship with one child node in level 3.
Specifically, in the matching process, the first-character-string category is used for screening the information of the accessed web page whose first character string is the same as that of a character string of this category; the preset-character-string category is used for screening the information of the accessed web page that has the same preset character string as a character string of this category.
After a match at this level, it must be determined whether the accessed web page has been completely matched; if not, matching continues at level 5, and so on, until the accessed web page is completely matched. If matching is complete, the information is an advertisement, and the terminal intercepts the target information corresponding to the URL, thereby terminating the matching process.
In summary, the functions of each child node of the tree structure described above include: matching at least one of the URL of the page accessed by the user or the URLs of the elements of that page against the character strings included in the child node; and dispatching to the matching child node at the next level according to the character string features contained in the URL of the accessed page or in the URLs of its elements.
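The two functions of a child node (match, then dispatch to the next level) can be sketched with a small recursive structure. The class `Node`, the function `descend`, the node names, and the example strings are all hypothetical, and substring comparison stands in for the richer per-level rules described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    strings: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

    def matches(self, url: str) -> bool:
        # Function 1: match the URL against the strings held by this node.
        return any(s in url for s in self.strings)


def descend(node: Node, url: str) -> List[str]:
    """Return the names of the nodes visited while matching level by level."""
    path = [node.name]
    for child in node.children:
        # Function 2: dispatch to the first next-level child that matches.
        if child.matches(url):
            return path + descend(child, url)
    return path


white = Node("white", ["safe.example"])
black = Node(
    "black",
    ["ads.example", "/banner.js"],
    [Node("positioning", ["ads.example"]), Node("preset", ["/banner.js"])],
)
root = Node("ROOT", children=[white, black])
```

Because the white node is listed first, a URL that hits it never reaches the blacklist subtree, and a URL that hits neither level-1 node stops at the root.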
S1440: When the information of the accessed web page includes target information, the target information is intercepted.
Specifically, when it is determined in S1430 that matching of the accessed web page is complete, which indicates that the information is an advertisement, the terminal intercepts (which may include deleting or hiding) the target information corresponding to the URL in the accessed web page, thereby terminating the matching process. As a result, the user does not perceive the existence of the advertisement.
If it is determined that the accessed web page is not matched, that is, the length of the matched URL does not exceed the preset threshold, no advertisement information exists in the page accessed by the user, and the browser client can display the page to the user directly.
According to this scheme, the target information in the browser page is intercepted by means of the first data with the tree structure. The tree structure distinguishes the character strings in the first data in depth, which effectively reduces the number of matches between the information of the accessed web page and the first data, and thereby solves the problem that the number of matches grows when there are many character strings for intercepting target information and no reasonable matching method. In addition, statistics on the character strings in the acquired open source list are used to remove character strings that are invalid or rarely accessed, reducing the number of rules and thus the number of subsequent matches.
Fig. 15 is a schematic structural diagram of a terminal for intercepting information according to an embodiment of the present invention. As shown in fig. 15, the terminal 15 may include: one or more processors 1502, a transceiver 1501, a plurality of applications (not shown in the figure) in memory 1503; and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the terminal, cause the terminal to perform the steps of:
starting a browser to access a webpage;
acquiring information of an access webpage;
matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information;
and when the target information is included in the information of the access webpage, intercepting the target information.
The tree structure may include a plurality of nodes, comprising a root node and at least one level of child nodes, where each level of the at least one level of child nodes includes at least two child nodes; the nodes of each level have a parent-child relationship with the associated nodes of the next level, and the first data is distributed over the nodes of the tree structure according to preset rules.
The terminal may specifically perform the following steps:
matching the information of the accessed web page step by step, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information of the accessed web page includes target information.
The tree structure may specifically include m levels of child nodes, each level being divided according to a different one of n preset rules, where n and m are integers greater than or equal to 1 and n is greater than or equal to m. The j-th level of child nodes is divided by 1 preset rule selected from f preset rules, where the f preset rules are those of the n preset rules that remain after the selections made for the preceding j-1 levels, the (j-1)-th level is the level immediately above the j-th level, the j-th level is any level of the m levels of child nodes, and j and f are integers greater than or equal to 1. Each of the n preset rules includes at least two character string categories. The first data includes a plurality of character strings, which are divided among the m levels of child nodes; each child node in the m levels corresponds to a different character string category in the n preset rules, and each child node includes a plurality of character strings with different character string categories.
Wherein, the n preset rules may include at least one of the following rules: black and white list rules, positioning and preset matching rules, label attribute rules or character rules.
Specifically, the black and white list rule may include a white list category and a blacklist category. The 1st level of the m levels of child nodes is divided according to the black and white list rule, and the character strings in the first data belonging to the white list category and those belonging to the blacklist category each correspond to one child node in the 1st level.
The terminal may perform the following steps: and matching the information of the accessed webpage with the character string of the category of the white list, and when the information of the accessed webpage comprises the character string of the category of the white list, determining that the information of the accessed webpage does not comprise target information by the terminal and not intercepting the target information by the terminal.
The terminal may further perform the steps of: when the information of the accessed webpage does not comprise the character string of the category of the white list, matching the information of the accessed webpage with the character string of the category of the black list; when the information of the accessed webpage does not comprise the character string of the category of the blacklist, the terminal determines that the information of the accessed webpage does not comprise the target information, and the terminal does not intercept the target information; when the information of the accessed webpage comprises the character strings of the category of the blacklist, the terminal matches the information of the accessed webpage with child nodes of the character strings belonging to the category of the blacklist in a parent-child relationship step by step until the information of the accessed webpage is completely matched, and the terminal intercepts target information in the information of the accessed webpage.
The positioning and preset matching rule may specifically include a positioning matching category and a preset matching category. The 2nd level of the m levels of child nodes is divided according to the positioning and preset matching rule; the character strings in the first data belonging to the positioning matching category and those belonging to the preset matching category each correspond to one child node in the 2nd level, and any child node in the 2nd level is in a parent-child relationship with the 1st-level child node holding the character strings of the blacklist category.
The positioning matching category can be used for screening, in the information of the accessed web page, at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position; the preset matching category is used for screening at least one of information with a prefix or information with a suffix in the information of the accessed web page.
The tag attribute rule may specifically include: the data processing method comprises the steps that a type with a label and a type without the label are provided, a 3 rd-level child node in m-level child nodes is divided according to a label attribute rule, a character string belonging to the type with the label and a character string belonging to the type without the label in first data respectively correspond to one child node in the 3 rd-level child node, and any one child node in the 3 rd-level child node and one child node in the 2 nd-level child node are in a parent-child relationship.
The category without the label is used for screening the information without the label attribute; the category with the label specifically includes: at least one of a host name only category, a host information only category for advertisement attributes only category, a host and domain name two-level classification category, a host and advertisement Uniform Resource Locator (URL) information category, or a domain name and advertisement URL information only category that is different.
The character rule may include: the type of the first character string and the type of the preset character string, the 4 th-level child node in the m-level child nodes is divided according to character rules, the character string belonging to the type of the first character string and the character string belonging to the type of the preset character string in the first data respectively correspond to one child node in the 4 th-level child nodes, and any one child node in the 4 th-level child nodes and one child node in the 3 rd-level child nodes are in a parent-child relationship.
The category of the first character string can be used for screening information of accessing the webpage, wherein the information of accessing the webpage is the same as that of the character string of the category of the first character string; the category of the preset character string is used for screening information that the information for accessing the webpage has the same information as the preset character string of the category of the preset character string.
In the above step, the accessing the information of the web page may include: the user accesses the URL of the page or accesses the URL of each element of the page, and the target information is advertisement information. The first data are obtained after the server side carries out tree-shaped conversion processing according to second data, and the second data comprise effective character strings and custom character strings of the browser, wherein the effective character strings are character strings with the utilization rate larger than a preset threshold value determined by screening the open source character strings in the open source website and historical data reported by the terminal in a preset time period.
Because the terminal downloads the first data from the server and the whole matching process is carried out in the terminal, the method greatly improves the speed at which the terminal matches information, and it avoids the prior-art limitation that page content could be processed quickly only by a higher-performance server.
According to this scheme, the terminal intercepts the target information in the browser page by means of the first data arranged in a tree structure. The tree structure distinguishes the character strings in the first data at a fine granularity and effectively reduces the number of times the information of the accessed webpage is matched against the first data. This solves the problem that the number of matching operations grows when many character strings are used to intercept target information and no reasonable matching scheme is available.
Fig. 16 is a schematic structural diagram of a data processing server according to an embodiment of the present invention. As shown in fig. 16, the server 16 may include: one or more processors 1601, transceivers 1602, and memory 1603; and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the server, cause the server to perform the steps of:
performing tree-shaped conversion processing on the second data to determine first data;
sending the first data to the terminal, so that the terminal determines, according to the first data, whether the accessed webpage contains the target information.
Wherein, the target information can be advertisement information; the information of the accessed webpage includes at least one of the URL of the page the user accesses or the URL of each element of that page.
The server may perform the following steps: periodically acquiring at least one open source character string from an open source website; selecting a plurality of character strings with the access quantity larger than a first threshold value from at least one open source character string and historical data reported by a client in a preset time period as effective character strings; acquiring a custom character string of a browser server; and determining second data according to the effective character string and the custom character string, wherein the effective character string and the custom character string respectively comprise at least one character string.
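The selection of effective character strings described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `build_second_data`, the representation of the reported historical data as a flat list of matched strings, and the use of a simple count as the "access quantity" are all assumptions made for the example.

```python
from collections import Counter


def build_second_data(open_source_strings, reported_history, custom_strings, first_threshold):
    """Hypothetical helper: combine effective strings with custom strings.

    open_source_strings: strings fetched periodically from an open source list.
    reported_history:    assumed to be one entry per access reported by clients
                         during the preset time period.
    custom_strings:      custom strings supplied by the browser server.
    first_threshold:     strings must be accessed more often than this to be kept.
    """
    access_counts = Counter(reported_history)
    # Keep only open source strings whose access count exceeds the first threshold.
    effective = [s for s in open_source_strings if access_counts[s] > first_threshold]
    # Merge with the custom strings, preserving order and dropping duplicates.
    return list(dict.fromkeys(effective + list(custom_strings)))
```

Strings that never appear in the reported history get a count of zero and are dropped, which corresponds to removing rarely used entries before the tree is built.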
The server may perform the following steps: dividing a plurality of sub-nodes into m levels according to n preset rules, wherein the preset rules of each level in the m levels of sub-nodes are different; each of the n preset rules respectively comprises at least two character string categories, and each layer in the m levels is divided into at least two sub-nodes according to the character string categories; the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers which are more than or equal to 1, and n is more than or equal to m; each child node in the kth-level child node has a parent-child relationship with one child node in the k-1 level, the k-level child node is any one level child node in the m-level child nodes, and k is an integer greater than or equal to 1.
The n preset rules may include at least one of the following rules: black and white list rules, positioning and preset matching rules, label attribute rules or character rules; the server performs the following steps: and dividing the plurality of sub-nodes into m-level sub-nodes according to the black and white list rule, the positioning and preset matching rule, the label attribute rule and the character rule.
The server may perform the following steps: when the black-and-white list rule comprises the category of the white list and the category of the black list, dividing the level 1 of the m-level child nodes into two child nodes according to the category of the white list and the category of the black list, wherein one child node of the two child nodes comprises a character string belonging to the category of the white list in the second data, and the other child node comprises a character string belonging to the category of the black list in the second data.
The server may perform the following steps: when the positioning and preset matching rules comprise the positioning matching category and the preset matching category, dividing the level 2 of the m-level child nodes into two child nodes according to the positioning matching category and the preset matching category, wherein one child node of the two child nodes comprises a character string belonging to the positioning matching category in the second data, the other child node comprises a character string belonging to the preset matching category in the second data, and the two child nodes in the level 2 and the node where the character string belonging to the blacklist category in the level 1 are located are in a parent-child relationship.
The server may perform the following steps: when the label attribute rule comprises a category with a label and a category without the label, dividing the level 3 of the m-level child nodes into two child nodes according to the category with the label and the category without the label, wherein one child node of the two child nodes comprises a character string belonging to the category with the label in the second data, the other child node comprises a character string belonging to the category without the label in the second data, and any child node in the level 3 and one child node in the level 2 child node are in a parent-child relationship.
The category with the label may include at least one of: a host-name-only category; a category of host information only for advertisement attributes; a category of two-level classification by host and domain name; a category of host and advertisement Uniform Resource Locator (URL) information; or a category in which only the domain name and the advertisement URL information differ.
The server may perform the following steps: when the character rule comprises the category of the first character string and the category of the preset character string, dividing the level 4 of the m-level child nodes into two child nodes according to the category of the first character string and the category of the preset character string, wherein one of the two child nodes comprises the character string in the second data, the character string belongs to the category of the first character string, the other child node comprises the character string in the second data, the character string belongs to the category of the preset character string, and any one of the child nodes in the level 4 and one of the child nodes in the level 3 are in a parent-child relationship.
According to the scheme, through tree-shaped conversion processing of the second data, the character strings in the second data can be deeply distinguished through the tree-shaped structure, the character strings are converted into the tree-shaped structure with high distinguishing degree, and the matching times of the information of the access webpage and the first data are effectively reduced.
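The four-level division performed by the server can be pictured with a small sketch. It assumes, purely for illustration, that every character string in the second data has already been annotated with its category under each rule; the `Rule` class, its field names and the nested-dictionary layout are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One character string of the second data with assumed category annotations."""
    pattern: str
    list_type: str                 # level 1: "white" or "black"
    match_type: str = "preset"     # level 2: "positioning" or "preset"
    tag_type: str = "untagged"     # level 3: "tagged" or "untagged"
    char_type: str = "preset"      # level 4: "first" or "preset"


def build_tree(rules):
    """Arrange the strings as first data: whitelist leaf plus a 3-level blacklist subtree."""
    tree = {"white": [], "black": {}}
    for rule in rules:
        if rule.list_type == "white":
            tree["white"].append(rule.pattern)
            continue
        # Levels 2 to 4 hang below the blacklist node, as in the description.
        leaf = (
            tree["black"]
            .setdefault(rule.match_type, {})
            .setdefault(rule.tag_type, {})
            .setdefault(rule.char_type, [])
        )
        leaf.append(rule.pattern)
    return tree
```

Only the blacklist branch is subdivided here, following the statement that the two level 2 child nodes are children of the node holding the blacklist strings.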
Fig. 17 is a schematic structural diagram of an information intercepting apparatus according to an embodiment of the present invention. As shown in fig. 17, the apparatus 17 may include:
a processing module 1702 for launching a browser to access a web page;
a transceiver module 1701 for acquiring information for accessing a web page;
the processing module is further used for matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information; and when the target information is included in the information of the access webpage, intercepting the target information.
Wherein the tree structure may include a plurality of nodes, the plurality of nodes including a root node and at least one level of child nodes, each level of the at least one level of child nodes including at least two child nodes; the nodes of each level and the nodes of the next level related to each level have a parent-child relationship, and the first data are distributed on a plurality of nodes in a tree structure according to a preset rule.
The processing module may be specifically configured to match the information of the accessed webpage from the first data of the parent node of the tree structure to the first data of the child node in parent-child relationship with the parent node step by step until it is determined whether the information of the accessed webpage includes the target information.
The tree structure can comprise m levels of child nodes, each level of child nodes in the m levels being divided according to a different preset rule among n preset rules, where n and m are integers greater than or equal to 1 and n is greater than or equal to m. The j-th-level child nodes are divided according to 1 preset rule selected from f preset rules, where the f preset rules are the preset rules remaining among the n preset rules after the selections made for the first j-1 levels of child nodes, the (j-1)-th-level child nodes are the level immediately above the j-th-level child nodes, the j-th-level child nodes are any one level of the m levels of child nodes, and j and f are integers greater than or equal to 1. Each of the n preset rules comprises at least two character string categories. The first data comprises a plurality of character strings, which are divided according to the m levels of child nodes; each child node in the m levels corresponds to a different character string category in the n preset rules, and each child node comprises a plurality of character strings of different character string categories.
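The relationship between n, m, j and f can be made concrete with a short sketch. Since each level consumes one rule, level j chooses among f = n - (j - 1) remaining rules; this formula is derived from the description rather than stated in it. The function name and the policy of always taking the first remaining rule are assumptions for illustration only.

```python
def assign_rules_to_levels(preset_rules, m):
    """Hypothetical helper: give each of the m levels one distinct preset rule.

    Returns a list of (level j, number of remaining rules f, chosen rule).
    """
    n = len(preset_rules)
    if not 1 <= m <= n:
        raise ValueError("requires 1 <= m <= n")
    remaining = list(preset_rules)
    plan = []
    for j in range(1, m + 1):
        f = len(remaining)         # equals n - (j - 1)
        chosen = remaining.pop(0)  # assumed policy: take the first remaining rule
        plan.append((j, f, chosen))
    return plan
```

With four rules and m = 3, the levels would see 4, 3 and 2 candidate rules respectively, and one rule would remain unused.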
Wherein, the n preset rules may include at least one of the following rules: black and white list rules, positioning and preset matching rules, label attribute rules or character rules.
The black-and-white list rule may include a category of a white list and a category of a black list, the level 1 child node of the m-level child nodes is divided according to the black-and-white list rule, and the character string belonging to the category of the white list and the character string belonging to the category of the black list in the first data correspond to one child node of the level 1 child node respectively.
The processing module may be specifically configured to match the information of the accessed webpage with the character string of the category of the white list, and when the information of the accessed webpage includes the character string of the category of the white list, determine that the information of the accessed webpage does not include the target information, and not intercept the target information.
The processing module may be specifically configured to, when the information on the accessed web page does not include the character string of the category of the white list, match the information on the accessed web page with the character string of the category of the black list; when the information of the accessed webpage does not comprise the character string of the category of the blacklist, determining that the information of the accessed webpage does not comprise target information, and not intercepting the target information; when the information of the accessed webpage comprises the character strings of the category of the blacklist, matching the information of the accessed webpage with child nodes of the character strings belonging to the category of the blacklist in a parent-child relationship step by step until the information of the accessed webpage is completely matched, and intercepting target information in the information of the accessed webpage.
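The whitelist-first, then level-by-level blacklist matching described above can be sketched as follows. The `Node` class, substring containment as the matching test, and the rule that a match must reach a leaf before interception are assumptions chosen to keep the example small; the embodiment does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node holding some character strings of the first data."""
    strings: list
    children: list = field(default_factory=list)


def node_matches(node, url):
    # Assumed matching test: the URL contains at least one string of the node.
    return any(s in url for s in node.strings)


def descend(node, url):
    """True if url matches this node and, level by level, a chain of children down to a leaf."""
    if not node_matches(node, url):
        return False
    if not node.children:
        return True
    return any(descend(child, url) for child in node.children)


def should_intercept(url, white, black):
    # A whitelist hit means no target information: do not intercept.
    if node_matches(white, url):
        return False
    # Otherwise intercept only if the blacklist subtree matches completely.
    return descend(black, url)
```

A URL that misses the blacklist node is rejected after two comparisons against level 1, without visiting any deeper node, which is where the reduction in matching operations comes from.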
The positioning and preset matching rules may include a category of positioning matching and a category of preset matching, the level 2 child node in the m-level child nodes is divided according to the positioning and preset matching rules, the character string belonging to the category of positioning matching and the character string belonging to the category of preset matching in the first data correspond to one child node in the level 2 child nodes, respectively, and any one child node in the level 2 child nodes and the child node of the character string belonging to the category of the blacklist in the level 1 child node are in a parent-child relationship.
The positioning matching category can be used for screening at least one of information that a character string exists at a first preset position in the information for accessing the webpage or information that a separator exists at a second preset position; the preset matched category is used for screening at least one of information with prefixes or information with suffixes in the information for accessing the webpage.
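The two categories can be illustrated with simple predicates. This is only a sketch under assumed semantics: "positioning matching" is taken to mean a string or separator found at a fixed index of the URL, and "preset matching" a prefix or suffix test. The function names and the separator set are hypothetical.

```python
def string_at_position(url, text, position):
    """Positioning match: text occurs in url starting exactly at the given index."""
    return url.startswith(text, position)


def separator_at_position(url, position, separators="/?&=:"):
    """Positioning match: one of the assumed separator characters sits at the given index."""
    return 0 <= position < len(url) and url[position] in separators


def preset_match(url, prefix=None, suffix=None):
    """Preset match: url carries the given prefix or the given suffix."""
    has_prefix = prefix is not None and url.startswith(prefix)
    has_suffix = suffix is not None and url.endswith(suffix)
    return has_prefix or has_suffix
```

For example, in `https://ads.example/x` the text `ads` starts at index 8 and the separator `/` sits at index 19, so both positioning predicates hold for those positions.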
The label attribute rule may include a type with a label and a type without a label, the 3 rd-level child node in the m-level child nodes is divided according to the label attribute rule, the character string belonging to the type with a label and the character string belonging to the type without a label in the first data correspond to one child node in the 3 rd-level child node, respectively, and any one child node in the 3 rd-level child node and one child node in the 2 nd-level child node are in a parent-child relationship.
The category without the label is used for screening the information without the label attribute. The category with the label specifically includes at least one of: a host-name-only category; a category of host information only for advertisement attributes; a category of two-level classification by host and domain name; a category of host and advertisement Uniform Resource Locator (URL) information; or a category in which only the domain name and the advertisement URL information differ.
The above character rule may include a category of the first character string and a category of the preset character string, the level 4 child node in the level m child nodes is divided according to the character rule, the character string belonging to the category of the first character string and the character string belonging to the category of the preset character string in the first data correspond to one child node in the level 4 child nodes, respectively, where any one child node in the level 4 child node is in a parent-child relationship with one child node in the level 3 child node.
The category of the first character string can be used for screening for information of the accessed webpage that has the same first character as a character string of the category of the first character string; the category of the preset character string is used for screening for information of the accessed webpage that contains the same preset character string as a character string of the category of the preset character string.
The information of the accessed webpage can comprise the URL of the page the user accesses or the URL of each element of that page, and the target information is advertisement information. The first data can be obtained after the server performs tree-shaped conversion processing on second data, wherein the second data comprises effective character strings and custom character strings of the browser, and the effective character strings are character strings whose usage rate is greater than a preset threshold, determined by screening the open source character strings of an open source website and the historical data reported within a preset time period.
In this scheme, the device intercepts the target information in the browser page by means of the first data arranged in a tree structure. The tree structure distinguishes the character strings in the first data at a fine granularity and effectively reduces the number of times the information of the accessed webpage is matched against the first data. This solves the problem that the number of matching operations grows when many character strings are used to intercept target information and no reasonable matching scheme is available, and the matching speed can be improved by more than 40%.
Fig. 18 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. As shown in fig. 18, the apparatus 18 includes:
the processing module 1802 performs tree-shaped conversion processing on the second data to determine first data;
the transceiving module 1801 sends the first data to the terminal, so that the terminal determines whether the accessed webpage contains the target information.
Wherein, the target information can be advertisement information; the information of the accessed webpage includes at least one of the URL of the page the user accesses or the URL of each element of that page.
The receiving and sending module can be used for periodically acquiring at least one open source character string from an open source website; and acquiring the custom character string of the browser server. The processing module can be further used for selecting a plurality of character strings with access quantity larger than a first threshold value from at least one open source character string and historical data reported by the client within a preset time period as effective character strings; the processing module can be further used for determining second data according to the effective character string and the custom character string, wherein the effective character string and the custom character string respectively comprise at least one character string; dividing a plurality of sub-nodes into m levels according to n preset rules, wherein the preset rules of each level in the m levels of sub-nodes are different; each of the n preset rules respectively comprises at least two character string categories, and each layer in the m levels is divided into at least two sub-nodes according to the character string categories; the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers which are more than or equal to 1, and n is more than or equal to m; each child node in the kth-level child node has a parent-child relationship with one child node in the k-1 level, the k-level child node is any one level child node in the m-level child nodes, and k is an integer greater than or equal to 1.
Wherein, the n preset rules may include at least one of the following rules: black and white list rules, positioning and preset matching rules, label attribute rules or character rules; the processing module can also be used for dividing the plurality of sub-nodes into m-level sub-nodes according to the black-and-white list rule, the positioning and preset matching rule, the label attribute rule and the character rule.
The processing module may be specifically configured to, when the black-and-white list rule includes a category of a white list and a category of a black list, divide level 1 of the m-level child nodes into two child nodes according to the category of the white list and the category of the black list, where one of the two child nodes includes a character string in the second data that belongs to the category of the white list, and the other child node includes a character string in the second data that belongs to the category of the black list.
The processing module may be specifically configured to, when the positioning and preset matching rule includes a positioning matching category and a preset matching category, divide level 2 of the m-level child nodes into two child nodes according to the positioning matching category and the preset matching category, where one of the two child nodes includes a character string in the second data that belongs to the positioning matching category, and the other child node includes a character string in the second data that belongs to the preset matching category, and where the two child nodes in the level 2 and a node in the level 1 where the character string that belongs to the category of the blacklist is located are in a parent-child relationship.
The processing module may be specifically configured to, when the tag attribute rule includes a category with a tag and a category without a tag, divide a level 3 of the m-level child nodes into two child nodes according to the category with the tag and the category without the tag, where one of the two child nodes includes a character string in the second data that belongs to the category with the tag, and the other child node includes a character string in the second data that belongs to the category without the tag, where any one of the child nodes in the level 3 is in a parent-child relationship with one of the child nodes in the level 2 child node.
The category with the label may specifically include at least one of: a host-name-only category; a category of host information only for advertisement attributes; a category of two-level classification by host and domain name; a category of host and advertisement Uniform Resource Locator (URL) information; or a category in which only the domain name and the advertisement URL information differ.
The processing module may be specifically configured to, when the character rule includes a category of the first character string and a category of the preset character string, divide a level 4 of the m-level child nodes into two child nodes according to the category of the first character string and the category of the preset character string, where one of the two child nodes includes a character string in the second data that belongs to the category of the first character string, and the other child node includes a character string in the second data that belongs to the category of the preset character string, and where any one of the child nodes in the level 4 is in a parent-child relationship with one of the child nodes in the level 3.
According to the scheme, through tree-shaped conversion processing of the second data, the character strings in the second data can be deeply distinguished through the tree-shaped structure, the character strings are converted into the tree-shaped structure with high distinguishing degree, and the matching times of the information of the access webpage and the first data are effectively reduced.
The embodiment of the invention provides a method, a device and a terminal for intercepting information. In the method, statistics over the large number of character strings in the open source list are used to remove character strings that are invalid or have a small access amount, which reduces the number of character strings. On this basis, the second data is converted into first data with a tree structure for intercepting target information in a browser page. The tree structure distinguishes the character strings in the first data at a fine granularity and effectively reduces the number of times the information of the accessed webpage is matched against the first data, which solves the problem that the number of matching operations grows when many character strings are used to intercept target information and no reasonable matching scheme is available; in actual statistics, the overall matching speed can be increased by more than 40%. Specifically, when the tree structure is built, the character strings are analysed and the second data is divided using the black and white list rule, the positioning and preset matching rule, the label attribute rule or the character rule. This divides the character strings in depth and converts them into a highly discriminating tree structure, so that the speed at which the browser client intercepts advertisements is greatly increased and the user experience is effectively improved.
The objects, technical solutions and advantages of the present invention have been described above in further detail with reference to the specific embodiments. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (30)

1. A terminal for intercepting information, comprising: one or more processors, a transceiver, a memory, a plurality of applications, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the terminal, cause the terminal to perform the steps of:
starting a browser to access a webpage;
acquiring information of the access webpage;
matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information;
and intercepting the target information when the information of the access webpage comprises the target information.
2. The terminal of claim 1, wherein the tree structure comprises a plurality of nodes including a root node and at least one level of child nodes, each level of the at least one level of child nodes comprising at least two child nodes;
the nodes of each level and the nodes of the next level related to each level have a parent-child relationship, and the first data are distributed on the nodes in the tree structure according to a preset rule.
3. The terminal according to claim 2, characterized in that the terminal performs the following steps:
matching the information of the accessed webpage step by step, from the first data of a parent node of the tree structure to the first data of the child node in a parent-child relationship with that parent node, until it is determined whether the information of the accessed webpage comprises the target information.
4. The terminal according to claim 3, wherein the tree structure comprises m levels of sub-nodes, each level of sub-nodes in the m levels of sub-nodes is divided according to different preset rules in n preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
selecting 1 preset rule from f preset rules for the j-level child node to divide, wherein the f preset rules are the remaining preset rules selected by the previous j-1-level child node in the n preset rules, the j-1-level child node is the previous-level child node of the j-level child node, the j-level child node is any one-level child node in the m-level child node, and j and f are integers greater than or equal to 1;
each of the n preset rules comprises at least two character string categories respectively;
the first data comprises a plurality of character strings, the character strings of the first data are divided according to the m-level sub-nodes, each sub-node in the m-level sub-nodes corresponds to different character string categories in the n preset rules, and each sub-node comprises a plurality of character strings with different character string categories.
5. The terminal according to claim 4, wherein the n preset rules include at least one of the following rules:
black and white list rules, positioning and preset matching rules, label attribute rules or character rules.
6. The terminal according to claim 5, wherein the black-and-white list rule includes a category of a white list and a category of a black list, the level 1 child node of the m-level child nodes is divided according to the black-and-white list rule, and the character string belonging to the category of the white list and the character string belonging to the category of the black list in the first data correspond to one child node of the level 1 child nodes, respectively.
7. The terminal according to claim 6, characterized in that the terminal performs the following steps:
and the terminal matches the information of the accessed webpage with the character string of the category of the white list, and when the information of the accessed webpage comprises the character string of the category of the white list, the terminal determines that the information of the accessed webpage does not comprise the target information.
8. The terminal according to claim 7, characterized in that the terminal further performs the steps of: when the information of the accessed webpage does not comprise the character string of the category of the white list, matching the information of the accessed webpage with the character string of the category of the black list;
when the information of the accessed webpage does not comprise the character string of the category of the blacklist, the terminal determines that the information of the accessed webpage does not comprise the target information;
when the information of the accessed webpage comprises a character string of the category of the blacklist, the terminal matches the information of the accessed webpage, step by step, with the child nodes that are in a parent-child relationship with the child node of the character strings belonging to the category of the blacklist, until the information of the accessed webpage is completely matched, and the terminal intercepts the target information in the information of the accessed webpage.
9. The terminal according to claim 8, wherein the positioning and pre-setting matching rules include a category of positioning matching and a category of pre-setting matching, the level 2 child nodes in the m-level child nodes are divided according to the positioning and pre-setting matching rules, the character string belonging to the category of positioning matching and the character string belonging to the category of pre-setting matching in the first data respectively correspond to one child node in the level 2 child nodes, and wherein any one of the level 2 child nodes is in a parent-child relationship with the child node of the character string belonging to the category of the blacklist in the level 1 child node.
10. The terminal according to claim 9, wherein the category of the positioning match is used to filter at least one of information that a character string exists at a first preset position or information that a separator exists at a second preset position in the information of the accessed webpage;
the preset matched category is used for screening at least one of information with prefixes or information with suffixes in the information of the accessed webpage.
11. The terminal according to claim 9, wherein the label attribute rule includes a labeled category and a non-labeled category, a 3 rd-level child node of the m-level child nodes is divided according to the label attribute rule, a character string belonging to the labeled category and a character string belonging to the non-labeled category in the first data respectively correspond to one child node of the 3 rd-level child nodes, and wherein any one child node of the 3 rd-level child nodes is in a parent-child relationship with one child node of the 2 nd-level child nodes.
12. The terminal according to claim 11, wherein the category with tags is used to filter information including tag attributes from the information of the accessed web page, and the category without tags is used to filter information not including tag attributes from the information of the accessed web page; wherein
the category with tags specifically includes at least one of: a host-name-only category; a category of host information only for advertisement attributes; a category of two-level classification by host and domain name; a category of host and advertisement Uniform Resource Locator (URL) information; or a category in which only the domain name and the advertisement URL information differ.
13. The terminal according to claim 11, wherein the character rule includes a category of a first character string and a category of a preset character string, the level-4 child nodes of the m-level child nodes are divided according to the character rule, the character strings in the first data belonging to the category of the first character string and the character strings belonging to the category of the preset character string each correspond to one of the level-4 child nodes, and each of the level-4 child nodes is in a parent-child relationship with one of the level-3 child nodes.
14. The terminal of claim 13, wherein the category of the first character string is used to filter, from the information of the accessed webpage, information that has the same first character as a character string of the category of the first character string; and
the category of the preset character string is used to filter, from the information of the accessed webpage, information that has the same preset character string as a character string of the category of the preset character string.
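To make the first-character category of claim 14 concrete, here is a small Python sketch that groups rule strings by their first character so that a lookup only compares against rules sharing that character. The helper names and sample strings are hypothetical, and the preset character string category is not covered.

```python
from collections import defaultdict


def bucket_by_first_char(rules):
    """Group rule strings by their first character; empty strings are skipped."""
    buckets = defaultdict(list)
    for rule in rules:
        if rule:
            buckets[rule[0]].append(rule)
    return dict(buckets)


def candidates_for(info, buckets):
    """Return only the rules whose first character equals that of `info`."""
    return buckets.get(info[0], []) if info else []
```

With this layout, information starting with a character that no rule starts with is dismissed after a single dictionary lookup.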
15. The terminal according to any of claims 1-14, wherein the information of the accessed webpage comprises the URL of the page accessed by the user or the URL of each element of the accessed page, and the target information is advertisement information.
16. The terminal according to claim 1, wherein the first data is obtained by the server performing tree-shaped conversion processing on second data, and the second data includes effective character strings and custom character strings of a browser, where an effective character string is a character string whose usage rate is determined to be greater than a preset threshold by screening open source character strings from an open source website and historical data reported by the terminal within a preset time period.
17. A server for data processing, comprising: one or more processors, a transceiver, a memory, a plurality of applications, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions that, when executed by the server, cause the server to perform the steps of:
performing tree-shaped conversion processing on the second data to determine first data;
and sending the first data to a terminal, so that the terminal can determine whether an accessed webpage contains target information.
18. The server according to claim 17, wherein the target information is advertisement information; and
the information of the accessed webpage includes at least one of the URL of the page accessed by the user or the URL of each element of the webpage.
19. The server according to claim 17, wherein the server performs the steps of:
periodically acquiring at least one open source character string from an open source website;
selecting a plurality of character strings with the access quantity larger than a first threshold value from the at least one open source character string and historical data reported by the client within a preset time period as effective character strings;
acquiring a custom character string of a browser server;
and determining the second data according to the effective character string and the custom character string, wherein the effective character string and the custom character string respectively comprise at least one character string.
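The steps of claim 19 can be sketched in Python as follows. This is one possible reading, assuming that the "access quantity" of an open source character string is the number of times it appears in the history reported by clients; the function and parameter names are hypothetical.

```python
from collections import Counter


def build_second_data(open_source_strings, reported_history, custom_strings, first_threshold):
    """Combine effective strings with the browser's custom strings.

    Assumption: a string is "effective" when it appears in the reported
    history more than `first_threshold` times within the chosen period.
    """
    hits = Counter(reported_history)
    effective = [s for s in open_source_strings if hits[s] > first_threshold]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(effective + list(custom_strings)))
```

Filtering before the tree conversion keeps rarely used strings out of the first data that is later sent to the terminal.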
20. A server according to claim 17 or 19, characterized in that the server performs the steps of:
dividing the second data into m levels of child nodes according to n preset rules, wherein a different preset rule applies to each of the m levels;
each of the n preset rules comprises at least two character string categories, and each of the m levels is divided into at least two child nodes according to the character string categories;
the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
each child node in the kth level has a parent-child relationship with one child node in the (k-1)th level, the kth level is any one of the m levels of child nodes, and k is an integer greater than or equal to 1.
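The level-by-level division of claim 20 can be illustrated with a short recursive Python sketch. It represents the tree as nested dictionaries and takes one classifier function per level; this representation and the name `build_tree` are assumptions, not the structure the patent prescribes.

```python
def build_tree(strings, classifiers):
    """Split `strings` level by level.

    classifiers[i] maps a string to its category name at level i + 1.
    Leaves are plain lists of strings; inner nodes are dicts keyed by category.
    """
    if not classifiers:
        return list(strings)
    head, rest = classifiers[0], classifiers[1:]
    groups = {}
    for s in strings:
        groups.setdefault(head(s), []).append(s)
    # Each group becomes a child node whose own children use the next rule.
    return {category: build_tree(members, rest) for category, members in groups.items()}
```

Because every group is split only by the rules that follow, each node at level k has exactly one parent at level k-1, which mirrors the parent-child relationship stated in the claim.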
21. The server according to claim 20, wherein the n preset rules include at least one of the following rules: black and white list rules, positioning and preset matching rules, label attribute rules or character rules; the server performs the steps of:
and dividing a plurality of sub-nodes into m-level sub-nodes according to the black and white list rule, the positioning and preset matching rule, the label attribute rule and the character rule.
22. The server according to claim 21, wherein the server performs the steps of:
when the black-and-white list rule includes a category of a white list and a category of a black list, dividing level 1 of the m-level child nodes into two child nodes according to the category of the white list and the category of the black list, where one of the two child nodes includes a character string in the second data that belongs to the category of the white list, and the other child node includes a character string in the second data that belongs to the category of the black list.
23. The server according to claim 22, wherein the server performs the steps of:
when the positioning and preset matching rule includes a positioning matching category and a preset matching category, dividing level 2 of the m-level child nodes into two child nodes according to the positioning matching category and the preset matching category, where one child node of the two child nodes includes a character string in the second data belonging to the positioning matching category, and the other child node includes a character string in the second data belonging to the preset matching category, and where the two child nodes in the level 2 and a node where the character string in the level 1 belonging to the category of the blacklist is located are in a parent-child relationship.
24. The server according to claim 23, wherein the server performs the steps of:
when the label attribute rule includes a labeled category and a non-labeled category, dividing level 3 of the m-level child nodes into two child nodes according to the labeled category and the non-labeled category, wherein one of the two child nodes includes the character strings in the second data belonging to the labeled category, the other child node includes the character strings in the second data belonging to the non-labeled category, and each child node in level 3 is in a parent-child relationship with one of the level-2 child nodes.
25. The server according to claim 24, wherein the labeled category specifically includes at least one of: a category with only a host name; a category with only host information for advertisement attributes; a category with two-level host and domain name classification; a category with host and advertisement Uniform Resource Locator (URL) information; or a category with only domain name and advertisement URL information.
26. A server according to claim 24 or 25, characterized in that the server performs the following steps:
when the character rule includes a category of a first character string and a category of a preset character string, dividing a level 4 of the m-level child nodes into two child nodes according to the category of the first character string and the category of the preset character string, where one of the two child nodes includes a character string in the second data that belongs to the category of the first character string, and the other child node includes a character string in the second data that belongs to the category of the preset character string, and where any one of the child nodes in the level 4 is in a parent-child relationship with one of the child nodes in the level 3.
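Claims 22 to 26 together fix the order of the four levels, with levels 2 to 4 hanging below the blacklist node. The Python sketch below computes the resulting path for one rule. The `Rule` record and its fields are hypothetical: the claims do not say how a rule string encodes these properties, so they are modelled here as explicit flags.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    pattern: str
    whitelisted: bool = False                   # level 1: white list or blacklist
    positional: bool = False                    # level 2: positioning or preset matching
    labels: dict = field(default_factory=dict)  # level 3: labeled or non-labeled
    by_first_char: bool = True                  # level 4: first or preset character string


def tree_path(rule: Rule) -> tuple:
    """Return the category chosen at each level for `rule`."""
    if rule.whitelisted:
        # Per claim 23, the level-2 nodes are children of the blacklist node only.
        return ("whitelist",)
    return (
        "blacklist",
        "positioning" if rule.positional else "preset",
        "labeled" if rule.labels else "non_labeled",
        "first_char" if rule.by_first_char else "preset_string",
    )
```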
27. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the steps of:
starting a browser to access a webpage;
acquiring information of the accessed webpage;
matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information;
and intercepting the target information when the information of the accessed webpage comprises the target information.
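The terminal-side flow of claim 27 can be sketched in Python with a deliberately simplified tree that has only the level-1 white list and blacklist nodes. Substring matching and the precedence of the white list over the blacklist are assumptions of this sketch; the function names and URLs are hypothetical.

```python
def find_blocked(element_urls, tree):
    """Return the URLs that hit a blacklist rule and no white list rule."""
    def hits(url, rules):
        return any(rule in url for rule in rules)

    return [
        url for url in element_urls
        if hits(url, tree["blacklist"]) and not hits(url, tree["whitelist"])
    ]


def load_page(element_urls, tree):
    """Intercept the target information: keep only the URLs that are not blocked."""
    blocked = set(find_blocked(element_urls, tree))
    return [url for url in element_urls if url not in blocked]
```

In a fuller version, `hits` would walk the deeper levels of the tree instead of scanning a flat list, so that only rules in the matching branch are compared.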
28. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the steps of:
performing tree-shaped conversion processing on the second data to determine first data;
and sending the first data to a terminal, so that the terminal can determine whether an accessed webpage contains target information.
29. A computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of:
starting a browser to access a webpage;
acquiring information of the accessed webpage;
matching the information of the accessed webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the accessed webpage comprises target information;
and intercepting the target information when the information of the accessed webpage comprises the target information.
30. A computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of:
performing tree-shaped conversion processing on the second data to determine first data;
and sending the first data to a terminal, so that the terminal can determine whether an accessed webpage contains target information.
CN201811132493.9A 2018-09-27 2018-09-27 Information interception method, device and terminal Active CN110955855B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811132493.9A CN110955855B (en) 2018-09-27 2018-09-27 Information interception method, device and terminal
PCT/CN2019/106728 WO2020063448A1 (en) 2018-09-27 2019-09-19 Information blocking method, device and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811132493.9A CN110955855B (en) 2018-09-27 2018-09-27 Information interception method, device and terminal

Publications (2)

Publication Number Publication Date
CN110955855A true CN110955855A (en) 2020-04-03
CN110955855B CN110955855B (en) 2023-06-02

Family

ID=69951180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811132493.9A Active CN110955855B (en) 2018-09-27 2018-09-27 Information interception method, device and terminal

Country Status (2)

Country Link
CN (1) CN110955855B (en)
WO (1) WO2020063448A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112073374A (en) * 2020-08-05 2020-12-11 长沙市到家悠享网络科技有限公司 Information interception method, device and equipment
CN117093777A (en) * 2023-08-22 2023-11-21 北京领雁科技股份有限公司 Method and device for intercepting browser page, electronic equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641911B (en) * 2021-08-19 2024-03-08 郑州阿帕斯数云信息科技有限公司 Advertisement interception rule base establishing method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100325588A1 (en) * 2009-06-22 2010-12-23 Anoop Kandi Reddy Systems and methods for providing a visualizer for rules of an application firewall
CN102332028A (en) * 2011-10-15 2012-01-25 西安交通大学 Webpage-oriented unhealthy Web content identifying method
JP2015118466A (en) * 2013-12-17 2015-06-25 ケーディーアイコンズ株式会社 Information processing apparatus and program
CN105824972A (en) * 2016-04-15 2016-08-03 广东欧珀移动通信有限公司 Method and device for blocking network advertisements
CN106033450A (en) * 2015-03-17 2016-10-19 中兴通讯股份有限公司 Method and device for blocking advertisement, and browser
CN107193889A (en) * 2017-05-02 2017-09-22 努比亚技术有限公司 Ad blocking method, terminal and computer-readable recording medium
CN107437026A (en) * 2017-07-13 2017-12-05 西北大学 Malicious webpage advertisement detection method based on advertising network topology
CN107835993A (en) * 2015-05-12 2018-03-23 极进网络公司 Methods, systems, and non-transitory computer readable media for generating a tree structure with nodal comparison fields and cut values for rapid tree traversal and reduced numbers of full comparisons at leaf nodes

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105100904A (en) * 2014-05-09 2015-11-25 深圳市快播科技有限公司 Video advertisement blocking method, device and browser
US9578042B2 (en) * 2015-03-06 2017-02-21 International Business Machines Corporation Identifying malicious web infrastructures
CN108170810A (en) * 2017-12-29 2018-06-15 南京邮电大学 Advertisement detection method based on dynamic behavior

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
信学峰: "基于流氓软件的检测与拦截技术的研究" (Research on Detection and Interception Technology Based on Rogue Software) *

Also Published As

Publication number Publication date
CN110955855B (en) 2023-06-02
WO2020063448A1 (en) 2020-04-02

Similar Documents

Publication Publication Date Title
US10817663B2 (en) Dynamic native content insertion
US9928292B2 (en) Classifying uniform resource locators
CN102722563B (en) Method and device for displaying page
US8903800B2 (en) System and method for indexing food providers and use of the index in search engines
CA2610208C (en) Learning facts from semi-structured text
CN109033358B (en) Method for associating news aggregation with intelligent entity
WO2015196907A1 (en) Search pushing method and device which mine user requirements
US8478701B2 (en) Locating a user based on aggregated tweet content associated with a location
CN102053983B (en) Method, system and device for querying vertical search
US7797350B2 (en) System and method for processing downloaded data
US20150244670A1 (en) Browser and method for domain name resolution by the same
CN104850546B (en) Display method and system of mobile media information
CN102037464A (en) Search results with most clicked next objects
US11423096B2 (en) Method and apparatus for outputting information
CN110955855B (en) Information interception method, device and terminal
US20120310941A1 (en) System and method for web-based content categorization
CN103186666A (en) Method, device and equipment for searching based on favorites
CN104065736A (en) URL redirection method, device, and system
CN113656737B (en) Webpage content display method and device, electronic equipment and storage medium
CN108108381B (en) Page monitoring method and device
CN105488218A (en) Method and device for loading waterfall flows based on search
JP2007122398A (en) Method for determining identity of fragment, and computer program
CN111680247A (en) Local calling method, device, equipment and storage medium of webpage character string
CN107977381B (en) Data configuration method, index management method, related device and computing equipment
CN112486796B (en) Method and device for collecting information of vehicle-mounted intelligent terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220507

Address after: 523799 Room 101, building 4, No. 15, Huanhu Road, Songshanhu Park, Dongguan City, Guangdong Province

Applicant after: Petal cloud Technology Co.,Ltd.

Address before: 523808 Southern Factory Building (Phase I) Project B2 Production Plant-5, New Town Avenue, Songshan Lake High-tech Industrial Development Zone, Dongguan City, Guangdong Province

Applicant before: HUAWEI DEVICE Co.,Ltd.

GR01 Patent grant