CN110069693A - Method and apparatus for determining target pages - Google Patents

Publication number
CN110069693A
Authority
CN
China
Prior art keywords
page
detected
target pages
candidate
queue
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910352767.3A
Other languages
Chinese (zh)
Other versions
CN110069693B (en)
Inventor
苏晓东
刘广
董晓康
耿志峰
杜昆
杨皓
段海新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910352767.3A
Publication of CN110069693A
Application granted
Publication of CN110069693B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a method, an apparatus, an electronic device, and a computer-readable medium for determining target pages. One specific embodiment of the method includes: extracting, based on the domain names of the pages in a set of pages to be detected, candidate pages that satisfy a preset condition from the set, and adding them to a candidate page queue; and performing a target-page search operation on the candidate pages in the candidate page queue. The search operation includes: classifying the candidate page; determining that a candidate page of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the candidate page queue, adding the associated page to the candidate page queue. This embodiment improves the efficiency of finding target pages.

Description

Method and apparatus for determining target pages
Technical field
The embodiments of the present application relate to the field of computer technology, specifically to network information processing, and more particularly to a method and apparatus for determining target pages.
Background
A search engine is a system that collects information from the Internet and provides search services to users. SEO (Search Engine Optimization) is a technique that uses search engine rules to improve a website's natural ranking in the relevant search engines.
Some websites now raise their ranking through fraudulent means such as spam links and hidden pages. One such means is to automatically generate a large number of interlinked pages containing promotional content so that search engines keep crawling them. These websites generally cannot provide content that meets users' search needs, yet they appear in search results with a high ranking.
Summary of the invention
The embodiments of the present application propose a method, an apparatus, an electronic device, and a computer-readable medium for determining target pages.
In a first aspect, embodiments of the present disclosure provide a method for determining target pages, comprising: extracting, based on the domain names of the pages in a set of pages to be detected, candidate pages that satisfy a preset condition from the set, and adding them to a candidate page queue; and performing a target-page search operation on the candidate pages in the candidate page queue. The search operation includes: classifying the candidate page; determining that a candidate page of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the candidate page queue, adding the associated page to the candidate page queue so that the search operation is performed on it as well.
In some embodiments, the preset condition includes at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the page's domain name is greater than a preset perplexity threshold; the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, where the historical set was acquired earlier than the set of pages to be detected.
In some embodiments, classifying the candidate page includes: randomly generating at least two subdomain names based on the domain name of the candidate page; obtaining the web page content of the websites corresponding to the at least two subdomain names and extracting features of the obtained content; and, in response to determining that the difference between the features of the web page content of those websites falls within a preset difference interval, determining that the category of the candidate page is the preset category.
In some embodiments, the candidate page queue includes a distributed candidate page queue, and performing the search operation on the candidate pages in the candidate page queue includes: using multiple processes, each of which performs the search operation on candidate pages in the distributed candidate page queue.
In some embodiments, the method further includes: storing the target pages in a database.
In some embodiments, the method further includes: blocking the target pages that have been found.
In some embodiments, blocking the target pages that have been found includes at least one of the following: adding the uniform resource locator of the target page to the blocked-page list of the search engine's crawler; deleting the index of the target page from the search engine's index database; in response to detecting that the pages in a search result include a target page, pushing risk warning information to the client that issued the search request.
In a second aspect, embodiments of the present disclosure provide an apparatus for determining target pages, comprising: an extraction unit configured to extract, based on the domain names of the pages in a set of pages to be detected, candidate pages that satisfy a preset condition from the set, and to add them to a candidate page queue; and a search unit configured to perform a target-page search operation on the candidate pages in the candidate page queue. The search operation includes: classifying the candidate page; determining that a candidate page of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the candidate page queue, adding the associated page to the candidate page queue so that the search operation is performed on it as well.
In some embodiments, the preset condition includes at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the page's domain name is greater than a preset perplexity threshold; the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, where the historical set was acquired earlier than the set of pages to be detected.
In some embodiments, the search unit is further configured to classify the candidate page as follows: randomly generate at least two subdomain names based on the domain name of the candidate page; obtain the web page content of the websites corresponding to the at least two subdomain names and extract features of the obtained content; and, in response to determining that the difference between the features of the web page content of those websites falls within a preset difference interval, determine that the category of the candidate page is the preset category.
In some embodiments, the candidate page queue includes a distributed candidate page queue, and the search unit is further configured to use multiple processes, each of which performs the search operation on candidate pages in the distributed candidate page queue.
In some embodiments, the apparatus further includes: a storage unit configured to store the target pages in a database.
In some embodiments, the apparatus further includes: a blocking unit configured to block the target pages that have been found.
In some embodiments, the blocking unit is further configured to block the target pages that have been found in at least one of the following ways: adding the uniform resource locator of the target page to the blocked-page list of the search engine's crawler; deleting the index of the target page from the search engine's index database; in response to detecting that the pages in a search result include a target page, pushing risk warning information to the client that issued the search request.
In a third aspect, embodiments of the present disclosure provide an electronic device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for determining target pages provided in the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium storing a computer program which, when executed by a processor, implements the method for determining target pages provided in the first aspect.
The method, apparatus, electronic device, and computer-readable medium for determining target pages of the above embodiments of the present disclosure extract, based on the domain names of the pages in a set of pages to be detected, candidate pages that satisfy a preset condition from the set, add them to a candidate page queue, and perform a target-page search operation on the candidate pages in the queue. The search operation includes: classifying the candidate page; determining that a candidate page of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page is not in the candidate page queue, adding it to the queue so that the search operation is performed on it as well. In this way, limited search engine resources are used to discover more pages of the preset category through domain name association, which improves the efficiency of finding pages of the preset category.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which embodiments of the present application can be applied;
Fig. 2 is a flow chart of one embodiment of the method for determining target pages according to the present application;
Fig. 3 is a flow chart of another embodiment of the method for determining target pages according to the present application;
Fig. 4 is a schematic diagram of the principle architecture of the method for determining target pages shown in Fig. 3;
Fig. 5 is a flow chart of another embodiment of the method for determining target pages according to the present application;
Fig. 6 is a structural schematic diagram of one embodiment of the apparatus for determining target pages of the present application;
Fig. 7 is a structural schematic diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.
Detailed description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture to which the method for determining target pages or the apparatus for determining target pages of the present application can be applied.
As shown in Fig. 1, the system architecture 100 may include a client 101, a network 102, a search engine server 103, and at least one website server 104. The network 102 provides the medium for communication links among the client 101, the search engine server 103, and the website server 104. The network may include various connection types, such as wired links, wireless communication links, or fiber-optic cables.
The client 101 may be a client with a user interface, through which the user can access network resources. The client 101 may be implemented as any of various electronic devices, including but not limited to smartphones, laptops, desktop computers, tablet computers, and smartwatches.
The search engine server 103 is a server that provides retrieval services. A search engine that collects and organizes information from the Internet runs on the search engine server 103.
The website server 104 may be a server that provides various resource information on the Internet. Different website servers 104 may provide resource information of different categories or from different sources. For example, a website server 104 may be a video resource server, the server of an enterprise website, or the server of a knowledge-sharing website.
The client 101 can establish a connection with the search engine server 103 through the network 102. A browser may be installed on the client 101, and the user can initiate a search request through the client 101. After receiving the search request, the search engine server 103 uses a crawler to fetch the content of each website server 104 over the network 102, analyzes and processes it, finds the websites that match the search request, and feeds the information on those websites back to the client 101. The client 101 receives the search results returned by the search engine server 103.
In the application scenario of the embodiments of the present disclosure, the search engine server 103 can crawl the pages provided by the website server 104 and detect whether those pages belong to a preset category, thereby obtaining a detection result for target pages of the preset category.
It should be noted that the search engine server 103 and the website server 104 may be hardware or software. When they are hardware, each may be implemented as a distributed server cluster composed of multiple servers or as a single server. When they are software, each may be implemented as multiple pieces of software or software modules (for example, multiple pieces of software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the method for determining target pages provided by the embodiments of the present disclosure may be executed by the search engine server 103; accordingly, the apparatus for determining target pages may be arranged in the search engine server 103.
It should be understood that the numbers of clients, networks, search engine servers, and website servers in Fig. 1 are merely illustrative. There may be any number of clients, networks, search engine servers, and website servers, as the implementation requires.
With continued reference to Fig. 2, it shows a flow 200 of one embodiment of the method for determining target pages according to the present application. The method for determining target pages includes the following steps:
Step 201: based on the domain names of the pages in the set of pages to be detected, extract the candidate pages that satisfy a preset condition from the set, and add them to the candidate page queue.
In this embodiment, the executing entity of the method for determining target pages can obtain a set of pages to be detected. The set of pages to be detected may be a set of pages collected from the Internet, for example the set of pages crawled by the search engine whose generation time or update time falls within a preset period (such as the most recent week). In some optional implementations, the pages in the set may also be obtained by the executing entity from other electronic devices; for example, the executing entity may receive problem pages reported by other electronic devices.
Whether a page in the set of pages to be detected satisfies the preset condition can be judged from the domain name of each page. If a page satisfies the preset condition, it can be added to the candidate page queue.
Here, the purpose of detecting pages is to find target pages. A target page may be a page that obtains a higher search ranking by fraudulent means that exploit the search engine's page-crawling rules. Specifically, a target page may be a page with particular SEO behavioral characteristics; more specifically, it may be a page that exhibits black hat SEO behavior, such as a spider pool page.
Black hat SEO is search engine optimization behavior that raises page ranking through fraudulent means such as spam links, hidden pages, and keyword stuffing. A spider pool is a means of raising a page's search ranking by automatically generating a large number of interlinked pages with massive data updates in order to attract search engine crawlers. The interlinked pages in a spider pool contain little valuable content, but once a search engine has crawled a spider pool page it easily gets trapped crawling a large number of low-value pages, which wastes substantial resources.
In some optional implementations of this embodiment, the preset condition can be determined from the characteristics of target pages, such as their SEO behavioral characteristics. For example, if the target pages are spider pool pages, they are interlinked with a large number of other pages, and the domain names of the interlinked pages are similar to one another, so the preset condition can be set according to the similarity of the domain names. The pages in the set of pages to be detected that satisfy the preset condition are then added to the candidate page queue as candidate pages.
In some optional implementations of this embodiment, a domain name list of trusted pages can be obtained, and all pages other than the trusted pages are added to the candidate page queue. In that case the preset condition may include: the domain name of the page is not in the trusted domain name list.
When the set of pages to be detected contains a large number of pages, screening candidate pages with the preset condition filters out many pages that do not need detection and so reduces the number of pages that must be detected.
Optionally, the preset condition may include at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the page's domain name is greater than a preset perplexity threshold; the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, where the historical set was acquired earlier than the set of pages to be detected.
The preset domain name whitelist may be the set of domain names of websites whose number of visits, as obtained by the search engine, exceeds a threshold, and/or a set of trusted domain names configured by the user.
In natural language processing, perplexity is a measure of the quality of a probabilistic language model that describes the probability distribution of each token in a sentence. The domain name perplexity of a page can serve as an indicator of how random the domain name is: the more random the domain name, the greater its perplexity.
Specifically, the perplexity of a domain name can be calculated according to the following formula (1):
Here, U denotes the domain name string of the page to be detected; P(U) denotes the perplexity of the domain name string U; N is the length of U, that is, the total number of characters U contains. P(ω0) is the probability that the 0th character appears in U; the 0th character is a placeholder that marks the start of the string. P(ωi|Ui) is the probability that the i-th character appears given the prefix Ui, where Ui denotes the prefix string formed by the 0th through (i-1)-th characters of U. P(ωi|ωi-1) denotes the probability that the i-th character appears given that the (i-1)-th character has appeared.
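The symbols defined above can be illustrated with a character-level bigram model. The following is only a minimal sketch: it assumes that formula (1) takes the standard bigram-perplexity form (the exponential of the negative mean log-probability of the characters), which the text does not confirm. The names train_bigram and perplexity, the start placeholder "^", and the add-one smoothing are hypothetical choices made for this illustration.

```python
import math
from collections import Counter

START = "^"  # assumed placeholder for the 0th character (start of string)


def train_bigram(domains):
    """Count character bigrams over a corpus of ordinary domain names."""
    pair_counts, prev_counts = Counter(), Counter()
    alphabet = set()
    for domain in domains:
        s = START + domain
        alphabet.update(s)
        for prev, cur in zip(s, s[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return pair_counts, prev_counts, len(alphabet)


def perplexity(domain, model):
    """Bigram perplexity of a domain string; higher means more random.

    Uses add-one smoothing (an assumption) so unseen bigrams keep a
    non-zero probability.
    """
    if not domain:
        raise ValueError("domain must be non-empty")
    pair_counts, prev_counts, vocab = model
    s = START + domain
    log_prob = 0.0
    for prev, cur in zip(s, s[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (prev_counts[prev] + vocab)
        log_prob += math.log(p)
    # exp of the negative average log-probability over the N characters
    return math.exp(-log_prob / len(domain))
```

With a model trained on ordinary domain names, a random-looking label such as "xqzjvk" would be expected to score higher than a familiar one, which is the property the perplexity threshold relies on.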
When the domain name perplexity of a page is greater than the preset perplexity threshold, the domain name of the page is highly random. Pages with highly random domain names can be added to the candidate page queue as candidate pages. Because the domain names of pages that cheat at search engine optimization, such as spider pool pages, are usually highly random, setting a perplexity threshold for domain names can screen out pages that may be cheating at search engine optimization.
The historical set of pages to be detected may be a set of pages collected before the current set of pages to be detected. For example, the search engine may collect pages to be detected at a fixed interval and determine the target pages after each collection. It can be judged whether a page in the current set of pages to be detected lies in the intersection of the current set and the historical set. If so, the page was already examined as part of the historical set and need not be detected again. If the page does not belong to that intersection, it can be determined to be a new page and added to the candidate page queue. In this way, an incremental computation can be performed against the historical set, so that only recently added pages are extracted as candidate pages.
Optionally, the preset condition can also be set using other page filtering methods. For example, the preset condition can be set from a trusted domain name list or domain name detection results provided by a third-party search engine or a third-party page analysis service, so that untrusted pages are extracted from the set of pages to be detected.
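The screening in step 201 can be sketched as a single filter over the set of pages to be detected. This is an illustrative sketch only. It assumes that all three conditions are applied together, whereas the text requires only at least one of them. The names select_candidates and domain_of are hypothetical, and perplexity_of is a caller-supplied function that scores a domain name.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host part of a URL, or an empty string if there is none."""
    return urlsplit(url).hostname or ""


def select_candidates(urls, whitelist, history_urls, perplexity_of, threshold):
    """Keep pages that pass the whitelist, history, and perplexity checks."""
    candidates = []
    for url in urls:
        domain = domain_of(url)
        if domain in whitelist:
            continue  # trusted domain, no detection needed
        if url in history_urls:
            continue  # already examined in an earlier collection round
        if perplexity_of(domain) <= threshold:
            continue  # domain name does not look random enough
        candidates.append(url)
    return candidates
```

Keeping history_urls as a set makes the incremental check a constant-time lookup per page, which matters when the set of pages to be detected is large.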
Step 202: perform the target-page search operation on the candidate pages in the candidate page queue.
The search operation includes: classifying the candidate page; determining that a candidate page of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the candidate page queue, adding the associated page to the candidate page queue so that the search operation is performed on it as well.
Specifically, for each candidate page in the candidate page queue, the category of the candidate page is determined first. Page categories can be divided according to the characteristics of the target pages to be found: pages that have the characteristics of target pages form one category, and pages that lack them form another. Specifically, when the target pages are pages that raise their search ranking through improper fraudulent means, the category of a page indicates whether the page exhibits such behavior, and the categories may be cheating pages and normal pages. Optionally, when the target pages are spider pool pages, the category of a page indicates whether it is a spider pool page, and the categories are spider pool pages and non-spider-pool pages.
In this embodiment, features of the content of the candidate page and/or link behavior features of the candidate page can be extracted. Link behavior features can be obtained by crawling the candidate page and capturing the crawler's behavior while it does so. Whether the candidate page is a page of the preset category is then determined from the extracted features. Here, the preset category may be the category of pages that raise their search ranking through improper fraudulent means or, more narrowly, the category of spider pool pages.
Optionally, the candidate page can be classified as follows:
First, at least two subdomain names are randomly generated based on the domain name of the candidate page. Then the web page content of the websites corresponding to the at least two subdomain names is obtained, and features of the obtained content are extracted. Suppose the websites corresponding to the two generated subdomain names are Pa and Pb, and the features of their web page content are denoted LSet(Pa) and LSet(Pb) respectively. The difference between the features of the web page content of the websites corresponding to the at least two subdomain names can then be calculated; for example, the difference diff between the features of the web page content of the two websites is calculated using formula (2):
Here, |LSet(Pa) - LSet(Pb)| denotes the size of the difference set of the two sets LSet(Pa) and LSet(Pb), that is, the number of features in that difference set; LSet(Pmin) is the smaller of the sets LSet(Pa) and LSet(Pb); and |LSet(Pmin)| is the size of that smaller set, that is, the number of features it contains.
Calculating the difference between the features of the web page content of the websites corresponding to the at least two subdomain names yields the difference between the websites of the two subdomain names generated from the candidate page. Then, in response to determining that this difference falls within a preset difference interval, the category of the candidate page is determined to be the preset category. When the difference between the web page content of the websites corresponding to the two generated subdomain names is small, the category of the candidate page can be determined to be the preset category.
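The classification step can be sketched as follows. This sketch assumes that formula (2) is the ratio of the two quantities defined above, diff = |LSet(Pa) - LSet(Pb)| / |LSet(Pmin)|, and that the difference set is taken one way (features of Pa absent from Pb); neither assumption is confirmed by the text. The names random_subdomains, feature_diff, and is_preset_category, the max_diff bound, and the caller-supplied fetch_features function are all hypothetical.

```python
import random
import string


def random_subdomains(domain, count=2, length=8, rng=random):
    """Generate `count` distinct random subdomain names under `domain`."""
    labels = set()
    while len(labels) < count:
        labels.add("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    return [f"{label}.{domain}" for label in sorted(labels)]


def feature_diff(features_a, features_b):
    """Assumed form of formula (2): |A - B| / size of the smaller set."""
    a, b = set(features_a), set(features_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        # No content on at least one subdomain: treat as maximally different
        return float("inf")
    return len(a - b) / smaller


def is_preset_category(domain, fetch_features, max_diff=0.2):
    """True when two random subdomains serve nearly identical features."""
    sub_a, sub_b = random_subdomains(domain)
    return feature_diff(fetch_features(sub_a), fetch_features(sub_b)) <= max_diff
```

The idea behind the check is that an ordinary site rarely answers for arbitrary random subdomains, while a site that serves near-identical generated content for any subdomain yields a small diff.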
A candidate page whose classification result is the preset category can be taken as a target page. More pages of the preset category can then be searched for based on the domain name of the target page that has been found, so that more target pages are discovered. Specifically, the top-level domain can be extracted from the domain name of the determined target page, and related pages can then be crawled according to that top-level domain. The URLs (Uniform Resource Locators) of the crawled pages can be parsed, and more related pages can be crawled from those URLs using a breadth-first traversal algorithm; the pages crawled in this way serve as the associated pages of the target page. It can then be judged whether an associated page of the target page is already in the candidate page queue. If not, the associated page can be added to the candidate page queue, so that performing the search operation on the queue continues the search for new target pages from the associated pages of known target pages. As the search operation is performed on the pages in the candidate page queue one after another, associated pages are continually crawled and added to the queue as candidate pages, which allows more target pages to be discovered.
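The queue-driven search described above can be sketched as a breadth-first loop. This is an illustrative sketch under several assumptions: is_target and crawl_related are caller-supplied stand-ins for the classifier and the crawler, registrable_domain is a hypothetical helper that naively keeps the last two labels of a host name (which would be wrong for suffixes such as "com.cn"), a seen set replaces the "not in the queue" test so that no page is examined twice, and max_pages is an added safety bound.

```python
from collections import deque
from urllib.parse import urlsplit


def registrable_domain(host):
    """Naive extraction: keep the last two labels of the host name."""
    return ".".join(host.split(".")[-2:])


def find_target_pages(candidates, is_target, crawl_related, max_pages=1000):
    """Process the candidate queue, enqueueing pages related to each target."""
    queue = deque(candidates)
    seen = set(candidates)  # everything that has ever been queued
    targets = []
    processed = 0
    while queue and processed < max_pages:
        url = queue.popleft()
        processed += 1
        if not is_target(url):
            continue  # not of the preset category, nothing to expand
        targets.append(url)
        host = urlsplit(url).hostname or ""
        for related in crawl_related(registrable_domain(host)):
            if related not in seen:
                seen.add(related)
                queue.append(related)
    return targets
```

Only pages classified as targets are expanded, so crawling effort is spent on domains that have already produced a target page.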
In step 202 above, associated pages are crawled based on the domain name of a target page, and the crawled pages are added to the queue of candidate pages to be detected for category determination, so as to determine whether they are target pages. This makes efficient use of search engine resources to discover more target pages and improves the efficiency of target page lookup.
In the method for determining target pages of the above embodiment of the present disclosure, candidate pages to be detected that satisfy a preset condition are extracted from a page set to be detected based on the domain names of the pages in the page set, and are added to a queue of candidate pages to be detected. The following operation of searching for target pages is then performed on the candidate pages in the queue: performing category determination on a candidate page to be detected; determining a candidate page of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue, so that the operation of searching for target pages is performed on it. The method thereby uses limited search engine resources to discover more target pages of the preset category that are associated by domain name, improves the efficiency of searching for target pages of the preset category, and can be applied to large-scale lookup of pages of the preset category.
Optionally, after the target pages are determined, the above process 200 of the method for determining target pages may further include: storing the target pages in a database. The target pages can be recorded by saving their URLs and/or their page contents in the database. Later, when a search service is provided by the search engine, the ranking of a target page can be corrected according to the URL and/or page content saved in the database, or the target pages saved in the database can be removed from the results.
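Storing target pages and later removing them from search results can be sketched with an in-memory SQLite database. This is only an illustrative sketch: the table name `target_pages`, its columns, and the helper names are assumptions, and the application does not specify any particular database.

```python
import sqlite3
from typing import List, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) target page store and create its table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    return conn


def save_target_page(conn: sqlite3.Connection, url: str,
                     content: Optional[str] = None) -> None:
    """Record a target page by URL and, optionally, its page content."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO target_pages (url, content) VALUES (?, ?)",
            (url, content),
        )


def is_target_page(conn: sqlite3.Connection, url: str) -> bool:
    row = conn.execute("SELECT 1 FROM target_pages WHERE url = ?", (url,)).fetchone()
    return row is not None


def filter_results(conn: sqlite3.Connection, urls: List[str]) -> List[str]:
    """Drop stored target pages from a list of search result URLs."""
    return [u for u in urls if not is_target_page(conn, u)]
```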
Referring to FIG. 3, a flow chart of another embodiment of the method for determining target pages according to the present application is shown. As shown in FIG. 3, the process 300 of the method for determining target pages includes the following steps:
Step 301: based on the domain names of the pages in a page set to be detected, extract from the page set to be detected the candidate pages to be detected that satisfy a preset condition, and add them to a distributed queue of candidate pages to be detected.
The execution body of the method for determining target pages can obtain a page set to be detected. The page set to be detected may be a set of pages collected from the Internet, for example the set of pages crawled by a search engine whose generation time or update time falls within a preset time period (such as the most recent week). In some optional implementations, the pages in the page set to be detected may also be obtained by the execution body from other electronic devices, for example by receiving problem pages reported by other electronic devices.
Whether a page in the page set to be detected satisfies the preset condition can be judged based on the domain name of the page. If the page satisfies the preset condition, the page can be added to the distributed queue of candidate pages to be detected.
The preset condition is a condition for preliminarily filtering the page set to be detected, and may be set according to the features of the target pages that are expected to be determined. For example, when the target pages are spider pool pages, the preset condition may be set according to the feature of similarity between domain names. As another example, when the target pages are virus pages, the preset condition may be set according to the features of virus pages.
Optionally, the preset condition may include at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the domain name of the page is greater than a preset perplexity threshold; the page does not belong to the intersection of the page set to be detected and a historical page set to be detected, where the acquisition time of the historical page set to be detected is earlier than that of the page set to be detected.
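The three optional conditions can be sketched as a simple filter. The sketch below rests on several assumptions that are not in the application: perplexity is computed by a character bigram model with add-one smoothing, the model is trained on a tiny hypothetical sample of ordinary domain labels, only the leftmost host label is scored, and all three conditions are required together (the text only requires at least one). The names `BigramModel` and `select_candidates` are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Set
from urllib.parse import urlsplit

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"
START, END = "^", "$"


class BigramModel:
    """Character bigram model with add-one smoothing (an assumed choice)."""

    def __init__(self, samples: Iterable[str]) -> None:
        self.pair: Counter = Counter()
        self.context: Counter = Counter()
        for word in samples:
            seq = START + word.lower() + END
            for a, b in zip(seq, seq[1:]):
                self.pair[(a, b)] += 1
                self.context[a] += 1
        self.vocab = len(ALPHABET) + 1  # possible next symbols, including END

    def perplexity(self, label: str) -> float:
        seq = START + label.lower() + END
        log_prob = 0.0
        for a, b in zip(seq, seq[1:]):
            log_prob += math.log((self.pair[(a, b)] + 1) / (self.context[a] + self.vocab))
        return math.exp(-log_prob / (len(seq) - 1))


def select_candidates(urls: Iterable[str], model: BigramModel, whitelist: Set[str],
                      history: Set[str], threshold: float) -> List[str]:
    """Keep pages that pass all three (assumed jointly required) conditions."""
    candidates = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if host in whitelist:            # trusted domain name
            continue
        if url in history:               # already present in the historical set
            continue
        if model.perplexity(host.split(".")[0]) <= threshold:
            continue                     # domain label looks ordinary
        candidates.append(url)
    return candidates
```

A label made of rare character transitions scores a higher perplexity than a label resembling the training sample, which is the property the threshold relies on.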
The operation in step 301 of this embodiment of extracting, from the page set to be detected, the candidate pages to be detected that satisfy the preset condition is consistent with the corresponding operation in step 201 of the previous embodiment, and is not described again here.
In this embodiment, the queue of candidate pages to be detected is a distributed queue of candidate pages to be detected. Multiple queues of candidate pages to be detected can be created in advance; when it is determined that a page to be detected satisfies the preset condition, the page can be added to one of these queues. Here, the queue of candidate pages to be detected may be a distributed queue such as Kafka or Redis.
Step 302: use multiple processes to respectively perform the operation of searching for target pages on the candidate pages to be detected in the distributed queue of candidate pages to be detected.
Multiple processes can be used to respectively perform category determination on the candidate pages to be detected in the distributed queue. For example, M processes are used to handle N queues of candidate pages to be detected, where M and N are positive integers and each process performs the operation of searching for target pages on at least one queue; alternatively, multiple processes may perform the operation of searching for target pages on different candidate pages to be detected in the same queue. The correspondence between a process and the queues it handles can be determined according to a preset resource scheduling strategy, which is not specifically limited here. Each process handles one candidate page to be detected in each operation of searching for target pages. In this way, the detection efficiency of the queue of candidate pages to be detected can be improved through asynchronous multi-processing, which in turn increases the speed of searching for target pages.
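One possible correspondence between M processes and N queues can be sketched as below. The application leaves the resource scheduling strategy open, so the round-robin rule and the name `assign_queues` are purely assumptions made for illustration.

```python
from typing import Dict, List


def assign_queues(num_processes: int, num_queues: int) -> Dict[int, List[int]]:
    """Map each process index to the queue indices it handles.

    Assumed strategy: round-robin. When there are more processes than
    queues, several processes share the same queue.
    """
    if num_processes <= 0 or num_queues <= 0:
        raise ValueError("num_processes and num_queues must be positive")
    assignment: Dict[int, List[int]] = {p: [] for p in range(num_processes)}
    if num_processes <= num_queues:
        for q in range(num_queues):
            assignment[q % num_processes].append(q)
    else:
        for p in range(num_processes):
            assignment[p].append(p % num_queues)
    return assignment
```

With 2 processes and 5 queues, process 0 handles queues 0, 2 and 4 and process 1 handles queues 1 and 3; with 3 processes and 2 queues, processes 0 and 2 both consume queue 0.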
The operation of searching for target pages performed on a candidate page to be detected may include: performing category determination on the candidate page to be detected in the distributed queue; determining a candidate page of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling associated pages of the target page based on the corresponding top-level domain; and, in response to determining that an associated page of the target page is not in the distributed queue of candidate pages to be detected, adding the associated page to the distributed queue, so that the operation of searching for target pages is performed on the associated page.
The operation of searching for target pages in this embodiment is the same as that in step 202 of the previous embodiment. The specific implementation of searching for target pages in step 202 also applies to step 302 of this embodiment and is not described again here.
Optionally, in the above operation of searching for target pages, after a candidate page to be detected of the preset category is determined to be a target page, the operations of extracting the top-level domain of the target page and crawling the associated pages of the target page based on the top-level domain may be executed by multiple threads. This helps further improve the efficiency of crawling the associated pages of target pages, thereby increasing the speed at which other associated target pages are found from a target page.
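Crawling associated pages with multiple threads can be sketched with a thread pool that expands one level of URLs in parallel. This is a sketch under assumptions: `get_links` is a hypothetical callback that returns the links of a page, and `crawl_level` is a hypothetical name; the application does not prescribe a threading API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def crawl_level(urls: List[str], get_links: Callable[[str], Iterable[str]],
                max_workers: int = 4) -> List[str]:
    """Fetch the links of all `urls` in parallel and return the new ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so the output is deterministic
        results = list(pool.map(get_links, urls))
    seen = set(urls)
    discovered: List[str] = []
    for links in results:
        for link in links:
            if link not in seen:
                seen.add(link)
                discovered.append(link)
    return discovered
```

Calling `crawl_level` repeatedly on its own output yields a level-by-level breadth-first crawl in which each level is fetched concurrently.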
In the process 300 of the method for determining target pages of this embodiment, the candidate pages to be detected are added to a distributed queue of candidate pages to be detected, and multiple processes respectively perform the operation of searching for target pages on the candidate pages in the distributed queue, which can effectively increase the speed at which target pages are discovered.
Optionally, the above process 300 of the method for determining target pages may further include: storing the target pages in a database. In this way, when pages are crawled while the search engine provides a search service, a corresponding filtering operation can be performed according to the target pages stored in the database.
With continued reference to FIG. 4, a schematic diagram of the architecture underlying the method for determining target pages shown in FIG. 3 is shown. As shown in FIG. 4, the URLs of the page set to be detected are first passed to a filter. The filter excludes trusted pages by means of the domain name whitelist and the domain name perplexity, and extracts the recently added pages as candidate pages to be detected based on a cached historical page set to be detected. The candidate pages to be detected are distributed to the distributed queue of candidate pages to be detected. Multiple target page lookup processes (Checker) then respectively handle the pages in the distributed queue. After each target page lookup process performs category determination on a candidate page to be detected, multiple crawling threads (Finder) crawl the associated pages of the target pages found and judge whether each crawled associated page is already in the distributed queue of candidate pages to be detected. If a crawled associated page is not in the distributed queue, it can be added to the queue, and the search for target pages continues in the manner described above. The target pages found can be synchronized to the database.
Referring to FIG. 5, a flow chart of yet another embodiment of the method for determining target pages according to the present application is shown. As shown in FIG. 5, the process 500 of the method for determining target pages includes the following steps:
Step 501: based on the domain names of the pages in a page set to be detected, extract from the page set to be detected the candidate pages to be detected that satisfy a preset condition, and add them to a queue of candidate pages to be detected.
Step 502: perform the operation of searching for target pages on the candidate pages to be detected in the queue of candidate pages to be detected.
The operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining a candidate page of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling associated pages of the target page based on the corresponding top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue, so that the operation of searching for target pages is performed on the associated page.
Steps 501 and 502 of this embodiment are consistent with steps 201 and 202 of the previous embodiment, respectively. For the specific implementation of steps 501 and 502, reference can be made to the description of steps 201 and 202 in the previous embodiment, which is not repeated here.
Optionally, the queue of candidate pages to be detected may be a distributed queue of candidate pages to be detected. When the operation of searching for target pages in step 502 is performed, multiple processes can respectively handle the candidate pages to be detected in the distributed queue.
Step 503: perform shielding processing on the target pages found.
In this embodiment, a target page may be a page that obtains a higher ranking by cheating against the page crawling strategy and ranking mode of a search engine, such as a spider pool page. Shielding processing can be performed on the target pages found, so as to prevent them from affecting search results. Specifically, the search engine can be controlled to skip target pages when crawling pages, or to filter target pages out of the search results.
Optionally, shielding processing can be performed on the target pages found in at least one of the following ways: adding the uniform resource locator of a target page to the shielded page list of the search engine's crawler; deleting the index of a target page from the index library of the search engine; in response to detecting that the pages in a search result include a target page, pushing risk prompt information to the client that issued the search request.
In the above shielding methods, when the crawler of the search engine crawls pages, it can skip the pages in its shielded page list. After the index of a target page is deleted from the index library of the search engine, the crawler of the search engine cannot crawl the target page that the deleted index pointed to.
When the crawler of the search engine has not handled the target pages and a search result includes a target page, risk prompt information indicating that the target page is a risky page can be pushed to the client that issued the search request. The client can decide, according to the risk prompt information, whether to show the page to the user. For example, a browser plug-in can filter the page content or block the entire page according to the risk prompt information.
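The three shielding options can be sketched against a toy search engine model. Everything here is hypothetical: `SearchEngineStub`, its `index` dictionary and `crawler_blocklist` set, and the shape of the risk prompt record are assumptions chosen only to make the three options concrete.

```python
from typing import Dict, Iterable, List, Set


class SearchEngineStub:
    """Toy stand-in for a search engine: an index plus a crawler blocklist."""

    def __init__(self, index: Dict[str, str]) -> None:
        self.index: Dict[str, str] = dict(index)   # url -> indexed document
        self.crawler_blocklist: Set[str] = set()   # shielded page list


def shield(engine: SearchEngineStub, target_url: str) -> None:
    """Add the URL to the shielded page list and delete its index entry."""
    engine.crawler_blocklist.add(target_url)
    engine.index.pop(target_url, None)


def should_crawl(engine: SearchEngineStub, url: str) -> bool:
    """The crawler skips pages that are in its shielded page list."""
    return url not in engine.crawler_blocklist


def risk_prompts(result_urls: Iterable[str], target_urls: Set[str]) -> List[dict]:
    """Build risk prompt records for target pages that appear in results."""
    return [
        {"url": url, "risk": True, "message": "This page may be a risky page."}
        for url in result_urls
        if url in target_urls
    ]
```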
Optionally, after step 502 and before step 503, the above process 500 of the method for determining target pages may further include: storing the target pages in a database. The target pages can be recorded by saving their URLs and/or their page contents in the database. In this way, the shielding processing in step 503 can be performed based on the target pages stored in the database.
In the method for determining target pages of this embodiment, shielding processing is performed on the target pages, which can effectively, quickly and comprehensively block the propagation of target pages through the search engine and reduce their influence on users' search results.
With further reference to FIG. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for determining target pages. The apparatus embodiment corresponds to the method embodiments shown in FIG. 2, FIG. 3 and FIG. 5, and the apparatus can be applied to various electronic devices.
As shown in FIG. 6, the apparatus 600 for determining target pages of this embodiment includes an extraction unit 601 and a lookup unit 602. The extraction unit 601 may be configured to extract, based on the domain names of the pages in a page set to be detected, the candidate pages to be detected that satisfy a preset condition from the page set to be detected, and to add them to a queue of candidate pages to be detected. The lookup unit 602 may be configured to perform the operation of searching for target pages on the candidate pages to be detected in the queue, where the operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining a candidate page of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling associated pages of the target page based on the corresponding top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue, so that the operation of searching for target pages is performed on the associated page.
In some embodiments, the preset condition includes at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the domain name of the page is greater than a preset perplexity threshold; the page does not belong to the intersection of the page set to be detected and a historical page set to be detected, where the acquisition time of the historical page set to be detected is earlier than that of the page set to be detected.
In some embodiments, the lookup unit 602 may be further configured to perform category determination on a candidate page to be detected as follows: randomly generating at least two subdomain names based on the domain name of the candidate page to be detected; obtaining the web page contents of the websites corresponding to the at least two subdomain names, and extracting the features of the obtained web page contents; and, in response to determining that the difference between the features of the web page contents of the websites corresponding to the at least two subdomain names falls within a preset difference interval, determining that the category of the candidate page to be detected is the preset category.
In some embodiments, the queue of candidate pages to be detected includes a distributed queue of candidate pages to be detected, and the lookup unit 602 may be further configured to: use multiple processes to respectively perform the operation of searching for target pages on the candidate pages to be detected in the distributed queue.
In some embodiments, the apparatus 600 may further include: a storage unit configured to store the target pages in a database.
In some embodiments, the apparatus 600 may further include: a shielding unit configured to perform shielding processing on the target pages found.
In some embodiments, the shielding unit may be further configured to perform shielding processing on the target pages found in at least one of the following ways: adding the uniform resource locator of a target page to the shielded page list of the search engine's crawler; deleting the index of a target page from the index library of the search engine; in response to detecting that the pages in a search result include a target page, pushing risk prompt information to the client that issued the search request.
It should be understood that the units recorded in the apparatus 600 correspond to the steps of the methods described with reference to FIG. 2, FIG. 3 and FIG. 5. The operations and features described above for the methods therefore also apply to the apparatus 600 and the units it contains, and are not described again here.
In the apparatus 600 for determining target pages of the above embodiment of the present application, candidate pages to be detected that satisfy a preset condition are extracted from a page set to be detected based on the domain names of the pages in the page set, and are added to a queue of candidate pages to be detected. The following operation of searching for target pages is then performed on the candidate pages in the queue: performing category determination on a candidate page to be detected; determining a candidate page of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling associated pages of the target page based on the corresponding top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue. The apparatus thereby uses limited search engine resources to discover more pages of the preset category that are associated by domain name, and improves the efficiency of detecting pages of the preset category.
Referring now to FIG. 7, a schematic structural diagram of an electronic device (for example, the search engine server in FIG. 1) 700 suitable for implementing embodiments of the present disclosure is shown. The electronic device shown in FIG. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 7, the electronic device 700 may include a processing unit (such as a central processing unit or a graphics processor) 701, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing unit 701, the ROM 702 and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
In general, the following devices can be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; an output device 707 including, for example, a liquid crystal display (LCD), a loudspeaker and a vibrator; a storage device 708 including, for example, a hard disk; and a communication device 709. The communication device 709 can allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 7 shows an electronic device 700 with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided. Each box shown in FIG. 7 can represent one device, or can represent multiple devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flow charts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication device 709, installed from the storage device 708, or installed from the ROM 702. When the computer program is executed by the processing unit 701, the above functions defined in the methods of the embodiments of the present disclosure are performed. It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus or device. In embodiments of the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. The program code contained on a computer-readable medium can be transmitted by any suitable medium, including but not limited to: an electric wire, an optical cable, RF (radio frequency), or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extract, based on the domain names of the pages in a page set to be detected, the candidate pages to be detected that satisfy a preset condition from the page set to be detected, and add them to a queue of candidate pages to be detected; and perform the operation of searching for target pages on the candidate pages to be detected in the queue, where the operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining a candidate page of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling associated pages of the target page based on the corresponding top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue, so that the operation of searching for target pages is performed on the associated page.
Computer program code for performing the operations of the embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In situations involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flow charts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each box in a flow chart or block diagram may represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flow charts, and combinations of boxes in the block diagrams and/or flow charts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including an extraction unit and a lookup unit. The names of these units do not, in certain cases, constitute a limitation on the units themselves. For example, the extraction unit may also be described as "a unit that extracts, based on the domain names of the pages in a page set to be detected, the candidate pages to be detected that satisfy a preset condition from the page set to be detected, and adds them to a queue of candidate pages to be detected".
The above description is only of the preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.

Claims (16)

1. A method for determining target pages, comprising:
extracting, based on the domain names of the pages in a page set to be detected, candidate pages to be detected that satisfy a preset condition from the page set to be detected, and adding them to a queue of candidate pages to be detected;
performing an operation of searching for target pages on the candidate pages to be detected in the queue of candidate pages to be detected, the operation of searching for target pages comprising:
performing category determination on a candidate page to be detected; determining a candidate page to be detected of a preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling associated pages of the target page based on the top-level domain corresponding to the target page; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page of the target page to the queue of candidate pages to be detected, so as to perform the operation of searching for target pages on the associated page of the target page.
2. The method according to claim 1, wherein the preset condition comprises at least one of the following:
the domain name of the page is not in a preset domain name whitelist;
the perplexity of the domain name of the page is greater than a preset perplexity threshold;
the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, the acquisition time of the historical set of pages to be detected being earlier than the acquisition time of the set of pages to be detected.
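As an illustration of the three conditions in claim 2, the sketch below scores a domain name with a character-bigram language model and then applies the whitelist, threshold, and history checks. The claim does not say which language model produces the score or how the conditions are combined; the add-one-smoothed bigram model, the training corpus, the threshold value, and the choice to require all three checks are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Iterable, Set

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."  # assumed character set


class CharBigramModel:
    """Character-bigram model with add-one smoothing (an assumed choice)."""

    def __init__(self, corpus: Iterable[str]) -> None:
        self.pair_counts: Counter = Counter()
        self.prev_counts: Counter = Counter()
        for name in corpus:
            padded = "^" + name.lower()  # "^" marks the start of the name
            for prev, cur in zip(padded, padded[1:]):
                self.pair_counts[(prev, cur)] += 1
                self.prev_counts[prev] += 1

    def perplexity(self, name: str) -> float:
        """exp of the average negative log-probability per character."""
        padded = "^" + name.lower()
        log_prob = 0.0
        for prev, cur in zip(padded, padded[1:]):
            p = (self.pair_counts[(prev, cur)] + 1) / (
                self.prev_counts[prev] + len(ALPHABET)
            )
            log_prob += math.log(p)
        n = max(len(padded) - 1, 1)
        return math.exp(-log_prob / n)


def is_candidate(
    domain: str,
    whitelist: Set[str],
    history: Set[str],
    model: CharBigramModel,
    threshold: float,
) -> bool:
    """Apply the three checks; requiring all of them is an assumption."""
    not_whitelisted = domain not in whitelist
    looks_random = model.perplexity(domain) > threshold
    # The domain already belongs to the current set, so "not in the
    # intersection with the historical set" reduces to "not in the history".
    is_new = domain not in history
    return not_whitelisted and looks_random and is_new
```

Names built from character sequences that the model has seen often score a lower perplexity than names made of unseen sequences, which is why a threshold can separate the two groups.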
3. The method according to claim 1, wherein performing category determination on the candidate page to be detected comprises:
randomly generating at least two subdomain names based on the domain name of the candidate page to be detected;
acquiring the web page content of the websites corresponding to the at least two subdomain names, and extracting features of the acquired web page content;
in response to determining that the difference between the features of the web page content of the websites corresponding to the at least two subdomain names falls within a preset difference interval, determining that the category of the candidate page to be detected is the preset category.
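The category determination of claim 3 might be sketched as follows. The claim does not specify the features or the difference measure, so the sketch assumes a bag of whitespace-separated tokens as the feature and one minus the Jaccard similarity as the difference; the interval `(0.0, 0.2)`, the label length, and the `fetch` callback (which returns page content or `None`) are likewise hypothetical.

```python
import random
import string
from itertools import combinations
from typing import Callable, FrozenSet, List, Optional, Tuple


def random_subdomains(
    domain: str, count: int = 2, label_len: int = 8,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Generate `count` distinct random subdomain names under `domain`."""
    rng = rng or random.Random()
    letters = string.ascii_lowercase + string.digits
    labels = set()
    while len(labels) < count:
        labels.add("".join(rng.choice(letters) for _ in range(label_len)))
    return [f"{label}.{domain}" for label in sorted(labels)]


def content_features(html: str) -> FrozenSet[str]:
    """Assumed feature: the set of lowercase whitespace-separated tokens."""
    return frozenset(html.lower().split())


def feature_difference(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Assumed difference measure: 1 - Jaccard similarity, in [0, 1]."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def is_preset_category(
    domain: str,
    fetch: Callable[[str], Optional[str]],  # hypothetical page fetcher
    interval: Tuple[float, float] = (0.0, 0.2),
    count: int = 2,
    rng: Optional[random.Random] = None,
) -> bool:
    """True if every pair of random subdomains differs within `interval`."""
    if count < 2:
        raise ValueError("at least two subdomain names are required")
    pages = [fetch(host) for host in random_subdomains(domain, count, rng=rng)]
    if any(page is None for page in pages):
        return False  # a random subdomain did not resolve or returned nothing
    features = [content_features(page) for page in pages]
    low, high = interval
    return all(
        low <= feature_difference(a, b) <= high
        for a, b in combinations(features, 2)
    )
```

With an interval near zero, the check passes only when arbitrary subdomains all serve nearly the same content, which is one plausible reading of the claim rather than its only one.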
4. The method according to claim 1, wherein the queue of candidate pages to be detected comprises a distributed queue of candidate pages to be detected; and
performing the operation of searching for target pages on the candidate pages to be detected in the queue of candidate pages to be detected comprises:
performing, using multiple processes respectively, the operation of searching for target pages on the candidate pages to be detected in the distributed queue of candidate pages to be detected.
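A minimal single-machine sketch of the multi-process consumption in claim 4 is given below. It uses Python's `multiprocessing.Queue` as a stand-in for the distributed queue; a real deployment would more likely use a queue shared across machines, and that substitution is an assumption. The `classify` callback is hypothetical, and the sketch omits the re-queuing of associated pages from claim 1 to stay short.

```python
import multiprocessing as mp
from typing import Callable, List

STOP = None  # sentinel that tells a worker to exit


def worker(tasks, results, classify: Callable[[str], bool]) -> None:
    """Take domains from `tasks` until STOP; report target pages on `results`."""
    while True:
        domain = tasks.get()
        if domain is STOP:
            break
        if classify(domain):
            results.put(domain)
    results.put(STOP)  # tell the parent that this worker has finished


def run_workers(
    domains: List[str], classify: Callable[[str], bool], n_procs: int = 2
) -> List[str]:
    """Run `n_procs` worker processes over a shared queue of candidates."""
    tasks, results = mp.Queue(), mp.Queue()
    for domain in domains:
        tasks.put(domain)
    for _ in range(n_procs):
        tasks.put(STOP)  # one sentinel per worker
    procs = [
        mp.Process(target=worker, args=(tasks, results, classify))
        for _ in range(n_procs)
    ]
    for proc in procs:
        proc.start()
    found: List[str] = []
    finished = 0
    while finished < n_procs:  # drain results before joining the workers
        item = results.get()
        if item is STOP:
            finished += 1
        else:
            found.append(item)
    for proc in procs:
        proc.join()
    return sorted(found)
```

When `run_workers` is used with the "spawn" start method, `classify` has to be a picklable module-level function and the call should sit under an `if __name__ == "__main__":` guard.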
5. The method according to claim 1, wherein the method further comprises:
storing the target pages in a database.
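Claim 5 does not name a database or a schema. As one possible illustration, the sketch below stores target page URLs in SQLite through Python's standard `sqlite3` module; the table name `target_pages` and its columns are hypothetical.

```python
import sqlite3
from typing import Iterable, List


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; the schema is an assumption of this sketch."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_pages ("
        "url TEXT PRIMARY KEY, "
        "found_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def store_target_pages(conn: sqlite3.Connection, urls: Iterable[str]) -> None:
    """Insert URLs, silently skipping those that are already stored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO target_pages (url) VALUES (?)",
            [(url,) for url in urls],
        )


def list_target_pages(conn: sqlite3.Connection) -> List[str]:
    """Return all stored target page URLs in alphabetical order."""
    rows = conn.execute("SELECT url FROM target_pages ORDER BY url")
    return [row[0] for row in rows]
```

Making the URL the primary key keeps repeated discoveries of the same target page from creating duplicate rows.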
6. The method according to any one of claims 1-5, wherein the method further comprises:
performing blocking processing on the target pages found.
7. The method according to claim 6, wherein performing blocking processing on the target pages found comprises at least one of the following:
adding the uniform resource locator of the target page to the blocked-page list of a search engine crawler;
deleting the index of the target page from the index database of a search engine;
in response to detecting that the pages in a search result include the target page, pushing risk prompt information to the client that issued the search request.
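The three blocking measures of claim 7 can be illustrated with in-memory stand-ins. The `SearchEngineState` class, its field names, and the wording of the prompt are hypothetical; a real search engine would hold the crawler list and the index in its own storage systems.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SearchEngineState:
    """In-memory stand-ins for a crawler block list and an index database."""
    crawler_blocklist: Set[str] = field(default_factory=set)
    index: Dict[str, str] = field(default_factory=dict)  # URL -> indexed document


def block_target_page(state: SearchEngineState, url: str) -> None:
    """Apply the first two measures: block future crawls and drop the index entry."""
    state.crawler_blocklist.add(url)
    state.index.pop(url, None)  # no error if the URL was never indexed


def risk_prompts(results: List[str], targets: Set[str]) -> List[str]:
    """Third measure: build one risk prompt per target page found in the results."""
    return [
        f"Warning: {url} has been flagged as a risky page."
        for url in results
        if url in targets
    ]
```

The first two measures act on the search engine itself, while the third acts at query time and still warns the user about a target page that reaches a result list.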
8. An apparatus for determining a target page, comprising:
an extraction unit configured to extract, based on the domain names of the pages in a set of pages to be detected, candidate pages to be detected that satisfy a preset condition from the set of pages to be detected, and add them to a queue of candidate pages to be detected;
a searching unit configured to perform an operation of searching for target pages on the candidate pages to be detected in the queue of candidate pages to be detected, the operation of searching for target pages comprising:
performing category determination on the candidate page to be detected; determining that a candidate page to be detected of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on the top-level domain corresponding to the target page; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page of the target page to the queue of candidate pages to be detected and performing the operation of searching for target pages on the associated page of the target page.
9. The apparatus according to claim 8, wherein the preset condition comprises at least one of the following:
the domain name of the page is not in a preset domain name whitelist;
the perplexity of the domain name of the page is greater than a preset perplexity threshold;
the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, the acquisition time of the historical set of pages to be detected being earlier than the acquisition time of the set of pages to be detected.
10. The apparatus according to claim 8, wherein the searching unit is further configured to perform category determination on the candidate page to be detected as follows:
randomly generating at least two subdomain names based on the domain name of the candidate page to be detected;
acquiring the web page content of the websites corresponding to the at least two subdomain names, and extracting features of the acquired web page content;
in response to determining that the difference between the features of the web page content of the websites corresponding to the at least two subdomain names falls within a preset difference interval, determining that the category of the candidate page to be detected is the preset category.
11. The apparatus according to claim 8, wherein the queue of candidate pages to be detected comprises a distributed queue of candidate pages to be detected; and
the searching unit is further configured to:
perform, using multiple processes respectively, the operation of searching for target pages on the candidate pages to be detected in the distributed queue of candidate pages to be detected.
12. The apparatus according to claim 8, wherein the apparatus further comprises:
a storage unit configured to store the target pages in a database.
13. The apparatus according to any one of claims 8-12, wherein the apparatus further comprises:
a blocking unit configured to perform blocking processing on the target pages found.
14. The apparatus according to claim 13, wherein the blocking unit is further configured to perform blocking processing on the target pages found in at least one of the following ways:
adding the uniform resource locator of the target page to the blocked-page list of a search engine crawler;
deleting the index of the target page from the index database of a search engine;
in response to detecting that the pages in a search result include the target page, pushing risk prompt information to the client that issued the search request.
15. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
16. A computer-readable medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
CN201910352767.3A 2019-04-29 2019-04-29 Method and device for determining target page Active CN110069693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910352767.3A CN110069693B (en) 2019-04-29 2019-04-29 Method and device for determining target page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910352767.3A CN110069693B (en) 2019-04-29 2019-04-29 Method and device for determining target page

Publications (2)

Publication Number Publication Date
CN110069693A true CN110069693A (en) 2019-07-30
CN110069693B CN110069693B (en) 2021-12-24

Family

ID=67369334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910352767.3A Active CN110069693B (en) 2019-04-29 2019-04-29 Method and device for determining target page

Country Status (1)

Country Link
CN (1) CN110069693B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704782A (en) * 2019-09-30 2020-01-17 北京字节跳动网络技术有限公司 Page response method and device, electronic equipment and storage medium
CN111562913A (en) * 2020-04-28 2020-08-21 北京字节跳动网络技术有限公司 Pre-creation method, device, equipment and computer readable medium of view component
CN113378027A (en) * 2021-07-13 2021-09-10 杭州安恒信息技术股份有限公司 Cable excavation method, device, equipment and computer readable storage medium
CN113407802A (en) * 2021-06-10 2021-09-17 杭州安恒信息技术股份有限公司 Spider pool website identification method and device, electronic device and storage medium
CN115858959A (en) * 2022-12-27 2023-03-28 中国电子产业工程有限公司 Data processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104125209A (en) * 2014-01-03 2014-10-29 腾讯科技(深圳)有限公司 Malicious website prompt method and router
CN104503962A (en) * 2014-06-18 2015-04-08 北京邮电大学 Method for detecting hidden link of webpage
CN107885820A (en) * 2017-11-07 2018-04-06 北京小度互娱科技有限公司 Breadth traversal orientation grasping means based on crawler system
CN108874802A (en) * 2017-05-09 2018-11-23 阿里巴巴集团控股有限公司 Page detection method and device
CN109274632A (en) * 2017-07-12 2019-01-25 ***通信集团广东有限公司 A kind of recognition methods of website and device

Also Published As

Publication number Publication date
CN110069693B (en) 2021-12-24

Similar Documents

Publication Publication Date Title
CN110069693A (en) Method and apparatus for determining target pages
CN107273409B (en) Network data acquisition, storage and processing method and system
US10055762B2 (en) Deep application crawling
US7861151B2 (en) Web site structure analysis
US10789366B2 (en) Security information management system and security information management method
CN102473190B (en) Keyword assignment to a web page
US11238233B2 (en) Artificial intelligence engine for generating semantic directions for websites for automated entity targeting to mapped identities
US20110282860A1 (en) Data collection, tracking, and analysis for multiple media including impact analysis and influence tracking
CN108228906B (en) Method and apparatus for generating information
JP2020515944A (en) System and method for direct in-browser markup of elements in Internet content
CN111523677B (en) Method and device for realizing interpretation of prediction result of machine learning model
WO2010120941A2 (en) Automatic mapping of a location identifier pattern of an object to a semantic type using object metadata
CN103617213B (en) Method and system for identifying newspage attributive characters
CN102663060B (en) Method and device for identifying tampered webpage
CN107885873A (en) Method and apparatus for output information
US11989247B2 (en) Indexing access limited native applications
CN111259220B (en) Data acquisition method and system based on big data
KR100987330B1 (en) A system and method generating multi-concept networks based on user's web usage data
CN108280102A (en) Internet behavior recording method, device and user terminal
JP2007172091A (en) Information recommendation device
CN107622125B (en) Information crawling method and device and electronic equipment
JP6749865B2 (en) INFORMATION COLLECTION DEVICE AND INFORMATION COLLECTION METHOD
Li et al. Research of network data mining based on reliability source under big data environment
CN110298006A (en) For detecting the method and apparatus for usurping the website of link
CN111222918A (en) Keyword mining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant