CN110069693A - Method and apparatus for determining target pages - Google Patents
- Publication number
- CN110069693A (application number CN201910352767.3A)
- Authority
- CN
- China
- Prior art keywords
- page
- detected
- target pages
- candidate
- queue
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present application disclose a method, an apparatus, an electronic device, and a computer-readable medium for determining target pages. One specific embodiment of the method includes: extracting, based on the domain names of the pages in a set of pages to be detected, candidate pages to be detected that meet a preset condition from the set, and adding them to a queue of candidate pages to be detected; and performing an operation of searching for target pages on the candidate pages to be detected in the queue. The operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining that a candidate page to be detected of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue. This embodiment improves the efficiency of searching for target pages.
Description
Technical field
The embodiments of the present disclosure relate to the field of computer technology, specifically to the field of network information processing, and more particularly to a method and apparatus for determining target pages.
Background
A search engine is a system that collects information from the internet and provides search services to users. SEO (Search Engine Optimization) is a technique that uses the rules of search engines to improve a website's natural ranking in the relevant search engines.
Some fraudulent means of boosting website rankings, such as spam links and hidden web pages, have now emerged. One example is automatically generating a large number of interlinked pages containing promotional content so that the search engine keeps crawling the content of these pages. Such websites generally cannot provide content that meets users' search needs, yet they appear in search results with high rankings.
Summary of the invention
The embodiments of the present application propose a method, an apparatus, an electronic device, and a computer-readable medium for determining target pages.
In a first aspect, an embodiment of the disclosure provides a method for determining target pages, comprising: extracting, based on the domain names of the pages in a set of pages to be detected, candidate pages to be detected that meet a preset condition from the set, and adding them to a queue of candidate pages to be detected; and performing an operation of searching for target pages on the candidate pages to be detected in the queue. The operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining that a candidate page to be detected of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue, so that the operation of searching for target pages is performed on the associated page.
In some embodiments, the preset condition includes at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the domain name of the page is greater than a preset perplexity threshold; the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, where the historical set was acquired earlier than the set of pages to be detected.
In some embodiments, performing category determination on the candidate page to be detected comprises: randomly generating at least two subdomain names based on the domain name of the candidate page to be detected; obtaining the web page content of the websites corresponding to the at least two subdomain names and extracting features of the obtained web page content; and, in response to determining that the difference between the features of the web page content of the websites corresponding to the at least two subdomain names lies within a preset difference interval, determining that the category of the candidate page to be detected is the preset category.
In some embodiments, the queue of candidate pages to be detected includes a distributed queue of candidate pages to be detected; and performing the operation of searching for target pages on the candidate pages to be detected in the queue comprises: using multiple processes to perform the operation of searching for target pages on the candidate pages to be detected in the distributed queue.
In some embodiments, the method further comprises: storing the target pages in a database.
In some embodiments, the method further comprises: blocking the target pages that have been found.
In some embodiments, blocking the target pages that have been found includes at least one of the following: adding the uniform resource locator of a target page to the blocked page list of the search engine crawler; deleting the index of the target page from the index database of the search engine; and, in response to detecting that the pages in a search result include a target page, pushing risk warning information to the client that issued the search request.
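The three blocking options listed above can be sketched as follows. This is a minimal illustration only: the in-memory `blocklist` set and `index` dictionary, and the function names `block_target` and `flag_results`, are hypothetical stand-ins for the crawler's blocked page list, the search engine's index database, and the risk warning step.

```python
def block_target(url, blocklist, index):
    """Apply the first two blocking options to one target page.

    blocklist: set of URLs the crawler should skip (hypothetical structure).
    index: dict mapping URL -> index entry (hypothetical structure).
    """
    blocklist.add(url)      # option 1: add the URL to the crawler's blocked list
    index.pop(url, None)    # option 2: delete the page's index entry, if any


def flag_results(result_urls, target_urls):
    """Option 3: mark which search results are known target pages,
    so that the caller can push a risk warning to the client."""
    return [(url, url in target_urls) for url in result_urls]
```

A caller would invoke `block_target` once per target page found, and `flag_results` when assembling a search result page.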
In a second aspect, an embodiment of the disclosure provides an apparatus for determining target pages, comprising: an extraction unit configured to extract, based on the domain names of the pages in a set of pages to be detected, candidate pages to be detected that meet a preset condition from the set, and to add them to a queue of candidate pages to be detected; and a searching unit configured to perform an operation of searching for target pages on the candidate pages to be detected in the queue. The operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining that a candidate page to be detected of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue, so that the operation of searching for target pages is performed on the associated page.
In some embodiments, the preset condition includes at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the domain name of the page is greater than a preset perplexity threshold; the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, where the historical set was acquired earlier than the set of pages to be detected.
In some embodiments, the searching unit is further configured to perform category determination on the candidate page to be detected as follows: randomly generating at least two subdomain names based on the domain name of the candidate page to be detected; obtaining the web page content of the websites corresponding to the at least two subdomain names and extracting features of the obtained web page content; and, in response to determining that the difference between the features of the web page content of the websites corresponding to the at least two subdomain names lies within a preset difference interval, determining that the category of the candidate page to be detected is the preset category.
In some embodiments, the queue of candidate pages to be detected includes a distributed queue of candidate pages to be detected; and the searching unit is further configured to: use multiple processes to perform the operation of searching for target pages on the candidate pages to be detected in the distributed queue.
In some embodiments, the apparatus further comprises: a storage unit configured to store the target pages in a database.
In some embodiments, the apparatus further comprises: a blocking unit configured to block the target pages that have been found.
In some embodiments, the blocking unit is further configured to block the target pages that have been found in at least one of the following ways: adding the uniform resource locator of a target page to the blocked page list of the search engine crawler; deleting the index of the target page from the index database of the search engine; and, in response to detecting that the pages in a search result include a target page, pushing risk warning information to the client that issued the search request.
In a third aspect, an embodiment of the disclosure provides an electronic device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for determining target pages provided in the first aspect.
In a fourth aspect, an embodiment of the disclosure provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method for determining target pages provided in the first aspect.
The method and apparatus for determining target pages, the electronic device, and the computer-readable medium of the above embodiments of the disclosure extract, based on the domain names of the pages in a set of pages to be detected, candidate pages to be detected that meet a preset condition from the set, add them to a queue of candidate pages to be detected, and perform an operation of searching for target pages on the candidate pages in the queue. The operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining that a candidate page to be detected of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue, so that the operation of searching for target pages is performed on the associated page. In this way, limited search engine resources are used to discover more pages of the preset category that are associated through their domain names, which improves the efficiency of searching for pages of the preset category.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture to which embodiments of the present application can be applied;
Fig. 2 is a flowchart of one embodiment of the method for determining target pages according to the present application;
Fig. 3 is a flowchart of another embodiment of the method for determining target pages according to the present application;
Fig. 4 is a schematic diagram of the principle and architecture of the method for determining target pages shown in Fig. 3;
Fig. 5 is a flowchart of a further embodiment of the method for determining target pages according to the present application;
Fig. 6 is a schematic structural diagram of one embodiment of the apparatus for determining target pages of the present application;
Fig. 7 is a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.
Detailed description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture to which the method for determining target pages or the apparatus for determining target pages of the present application can be applied.
As shown in Fig. 1, the system architecture 100 may include a client 101, a network 102, a search engine server 103, and at least one website server 104. The network 102 is the medium that provides communication links among the client 101, the search engine server 103, and the website server 104. The network may include various connection types, such as wired or wireless communication links or fiber-optic cables.
The client 101 may be a client with a user interface, through which a user can access network resources. The client 101 may be implemented as any of various electronic devices, including but not limited to a smartphone, a laptop, a desktop computer, a tablet computer, or a smartwatch.
The search engine server 103 is a server that provides retrieval services; a search engine that collects and organizes information from the internet may run on the search engine server 103.
The website server 104 may be a server that provides various kinds of resource information on the internet, and different website servers 104 may provide resource information of different categories or from different sources. For example, the website server 104 may be a video resource server, the server of an enterprise website, or the server of a knowledge-sharing website.
The client 101 can establish a connection with the search engine server 103 through the network 102. A browser may be installed on the client 101, and the user can initiate a search request through the client 101. After receiving the search request, the search engine server 103 uses a crawler to fetch the content of each website server 104 over the network 102, analyzes and processes it, finds the websites that match the search request, and feeds the information about those websites back to the client 101. The client 101 can then receive the search results returned by the search engine server 103.
In the application scenario of the embodiments of the present disclosure, the search engine server 103 may crawl the pages provided by the website server 104 and detect whether those pages are pages of a preset category, thereby obtaining the detection result for target pages of the preset category.
It should be noted that the search engine server 103 and the website server 104 may be hardware or software. When they are hardware, each may be implemented as a distributed server cluster composed of multiple servers or as a single server. When they are software, each may be implemented as multiple pieces of software or software modules (for example, multiple pieces of software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the method for determining target pages provided by the embodiments of the disclosure may be executed by the search engine server 103; correspondingly, the apparatus for determining target pages may be provided in the search engine server 103.
It should be understood that the numbers of clients, networks, search engine servers, and website servers in Fig. 1 are merely illustrative. There may be any number of clients, networks, search engine servers, and website servers, as the implementation requires.
With continued reference to Fig. 2, a process 200 of one embodiment of the method for determining target pages according to the present application is shown. The method for determining target pages comprises the following steps:
Step 201: based on the domain names of the pages in a set of pages to be detected, extract candidate pages to be detected that meet a preset condition from the set, and add them to a queue of candidate pages to be detected.
In this embodiment, the executing body of the method for determining target pages can obtain the set of pages to be detected. The set of pages to be detected may be a set of pages collected from the internet, for example the set of pages crawled by the search engine whose generation time or update time falls within a preset period (such as the most recent week). In some optional implementations, the pages in the set of pages to be detected may also be obtained by the executing body from other electronic devices; for example, the executing body may receive problem pages reported by other electronic devices.
Whether a page in the set of pages to be detected meets the preset condition can be judged based on the domain name of that page. If the page meets the preset condition, it can be added to the queue of candidate pages to be detected.
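Step 201 can be sketched as a filter over the page set. This is a minimal sketch under stated assumptions: pages are represented by their URLs, the preset condition is passed in as a predicate on the domain name, and the whitelist and URLs shown are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def extract_candidates(page_urls, meets_condition):
    """Return a queue holding the pages whose domain name meets the preset condition."""
    queue = deque()
    for url in page_urls:
        domain = urlsplit(url).hostname or ""
        if meets_condition(domain):
            queue.append(url)
    return queue


# Hypothetical whitelist and page set, for illustration only.
WHITELIST = {"www.example.org"}
pages = ["http://www.example.org/a", "http://xk3q9z.example.net/b"]
candidates = extract_candidates(pages, lambda domain: domain not in WHITELIST)
```

Passing the condition as a predicate keeps the queue-building step independent of which of the conditions described below is actually used.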
Here, the purpose of detecting pages is to find the target pages. A target page may be a page that obtains a higher search ranking through fraudulent means aimed at the web crawling rules of search engines. Specifically, a target page may be a page with particular SEO behavior characteristics; more specifically, it may be a page exhibiting black hat SEO behavior, such as a spider pool page.
Black hat SEO is search engine optimization behavior that boosts page rankings through fraudulent means such as spam links, hidden web pages, and keyword stuffing. A spider pool attracts search engine crawlers by automatically generating a large number of interlinked pages and updating massive amounts of data, and thereby boosts the search ranking of its pages. The interlinked pages in a spider pool contain little valuable content, but once a search engine has crawled a spider pool page it easily becomes trapped in crawling a large number of low-value pages, which wastes a great deal of resources.
In some optional implementations of this embodiment, the preset condition may be determined based on the characteristics of target pages, such as their SEO behavior characteristics. For example, when the target pages are spider pool pages, they are interlinked with a large number of pages, and the domain names of the interlinked pages are similar to one another, so the preset condition can be set based on this domain name similarity. The pages in the set of pages to be detected that meet the preset condition are then added to the queue of candidate pages to be detected as candidate pages to be detected.
In some optional implementations of this embodiment, a list of the domain names of trusted pages can be obtained, and all pages other than the trusted pages are added to the queue of candidate pages to be detected. In that case the preset condition may include: the domain name of the page is not in the trusted domain name list.
When the set of pages to be detected contains a large number of pages, screening candidate pages to be detected by the preset condition filters out a large number of pages that do not need to be detected, which reduces the number of pages that must be detected.
Optionally, the preset condition may include at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the domain name of the page is greater than a preset perplexity threshold; the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, where the historical set was acquired earlier than the set of pages to be detected.
The preset domain name whitelist may be the set of domain names of websites whose number of visits, as obtained by the search engine, exceeds a threshold, and/or a set of trusted domain names configured by a user.
In natural language processing, perplexity is a measure of the quality of a language probability model that describes the probability distribution of the tokens in a sentence. The perplexity of a page's domain name can serve as an indicator of how random the domain name is: the more random the domain name, the greater its perplexity.
Specifically, the perplexity of a domain name can be calculated according to formula (1):
where U denotes the domain name string of the page to be detected; P(U) denotes the perplexity of the domain name string U; and N is the length of U, i.e. the total number of characters that U contains. P(ω0) is the probability that the 0th character appears in U; here the 0th character is a placeholder that marks the start of the string. P(ωi|Ui) is the probability that the i-th character appears given the prefix Ui, where Ui denotes the prefix string formed by the 0th through (i-1)-th characters of U. P(ωi|ωi-1) denotes the probability that the i-th character appears given that the (i-1)-th character has appeared.
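The expression for formula (1) is not reproduced in the text above. A form consistent with the symbol definitions, assuming the standard definition of perplexity together with a bigram (first-order Markov) approximation of the prefix probabilities, would be:

```latex
P(U) = \left[ P(\omega_0) \prod_{i=1}^{N} P(\omega_i \mid U_i) \right]^{-\frac{1}{N}}
\approx \left[ P(\omega_0) \prod_{i=1}^{N} P(\omega_i \mid \omega_{i-1}) \right]^{-\frac{1}{N}}
```

Under this reading, the first expression is the exact chain-rule form and the second replaces each prefix U_i with its last character. The exponent -1/N and the product limits are assumptions taken from the standard definition, not from the source.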
When the perplexity of a page's domain name is greater than the preset perplexity threshold, the domain name of the page is highly random. Pages with highly random domain names can be added to the queue of candidate pages to be detected as candidate pages to be detected. Because the domain names of pages that cheat at search engine optimization, such as spider pool pages, are usually highly random, setting a perplexity threshold on the domain name makes it possible to screen out the pages that may be cheating at search engine optimization.
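The perplexity check can be sketched with a character bigram model. This is a minimal sketch, not the patented computation: it assumes perplexity is the inverse probability of the string raised to the power 1/N, takes the probability of the start placeholder as 1, uses add-one smoothing, and estimates the bigram probabilities from a small hypothetical list of reference domain names.

```python
import math
from collections import Counter

START = "^"  # assumed placeholder standing for the 0th character


def train_bigram(reference_names):
    """Count character bigrams over a list of reference domain names."""
    pair_counts, prev_counts, alphabet = Counter(), Counter(), set()
    for name in reference_names:
        chars = START + name
        alphabet.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return pair_counts, prev_counts, len(alphabet)


def perplexity(name, model):
    """Bigram perplexity of a domain name string; higher means more random."""
    if not name:
        raise ValueError("domain name must be non-empty")
    pair_counts, prev_counts, vocab_size = model
    chars = START + name
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        # Add-one smoothing keeps unseen bigrams at a non-zero probability.
        p = (pair_counts[(prev, cur)] + 1) / (prev_counts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(name))


# Hypothetical reference names; a real model would need far more data.
model = train_bigram(["google", "baidu", "example", "wikipedia"])
```

A page would then be treated as a candidate when `perplexity(label, model)` exceeds a threshold chosen by the implementer.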
The historical set of pages to be detected may be a set of pages collected before the current set of pages to be detected. For example, the search engine may collect pages to be detected at a fixed interval and determine target pages from the pages collected each time. It can be judged whether a page in the current set of pages to be detected lies in the intersection of the current set and the historical set. If it does, the page was already examined as part of the historical set and need not be detected again. If the page does not belong to the intersection of the current set and the historical set, it can be determined to be a new page and added to the queue of candidate pages to be detected. In this way, incremental computation can be performed against the historical set, and only recently added pages are extracted as candidate pages to be detected.
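The incremental step amounts to a set difference. A minimal sketch, assuming pages are identified by URL strings (the function name `new_pages` is hypothetical):

```python
def new_pages(current_pages, history_pages):
    """Pages of the current set that are not in its intersection with the historical set."""
    current, history = set(current_pages), set(history_pages)
    return current - (current & history)
```

Subtracting the intersection is equivalent to `current - history`; it is written this way to mirror the wording above.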
Optionally, the preset condition may also be set using other page filtering methods. For example, it may be set based on a trusted domain name list or domain name detection results provided by a third-party search engine or a third-party page analysis service, so that untrusted pages are extracted from the set of pages to be detected.
Step 202: perform the operation of searching for target pages on the candidate pages to be detected in the queue of candidate pages to be detected.
The operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining that a candidate page to be detected of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the queue of candidate pages to be detected, adding the associated page to the queue, so that the operation of searching for target pages is performed on the associated page.
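The control flow of step 202 can be sketched as a queue-driven loop. This is a sketch under stated assumptions: `classify` and `crawl` are hypothetical callables standing for the category determination and the associated-page crawl, and the `seen` set remembers every page ever queued, which is slightly stricter than checking only the pages currently in the queue.

```python
from collections import deque


def find_target_pages(initial_queue, classify, crawl):
    """Run the search operation until the candidate queue is empty.

    classify(page) -> True if the page is of the preset category.
    crawl(page)    -> iterable of associated pages of a target page.
    """
    queue = deque(initial_queue)
    seen = set(queue)
    targets = []
    while queue:
        page = queue.popleft()
        if not classify(page):
            continue
        targets.append(page)
        for associated in crawl(page):
            if associated not in seen:   # not yet queued: add as a new candidate
                seen.add(associated)
                queue.append(associated)
    return targets
```

Because newly found associated pages re-enter the same queue, each target page can lead to further target pages without a separate pass.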
Specifically, for each candidate page to be detected in the queue of candidate pages to be detected, the category of that candidate page can be determined first. Here, page categories can be divided according to the characteristics of the target pages to be found: pages that have the characteristics of target pages fall into one category, and pages that do not fall into another. Specifically, when the target pages are pages that boost their search ranking through improper, fraudulent means, the page category can indicate whether a page exhibits such behavior, and the categories may include cheating pages and normal pages. Optionally, when the target pages are spider pool pages, the page category can indicate whether a page is a spider pool page, and the categories include spider pool pages and non-spider-pool pages.
In this embodiment, features of the content of the candidate page to be detected can be extracted, and/or link behavior features of the page to be detected can be extracted. Link behavior features can be obtained by crawling the candidate page to be detected and capturing the crawler's behavior while it crawls that page. Whether the candidate page to be detected is a page of the preset category is then determined based on the extracted features. Here, the preset category may be the category of pages that boost their search ranking through improper, fraudulent means or, more narrowly, the category of spider pool pages.
Optionally, category determination can be performed on the candidate page to be detected as follows:
First, at least two subdomain names are randomly generated based on the domain name of the candidate page to be detected. Then the web page content of the websites corresponding to the at least two subdomain names is obtained, and features of the obtained web page content are extracted. Suppose the websites corresponding to the two generated subdomain names are Pa and Pb, and the features of their web page content are denoted LSet(Pa) and LSet(Pb). The difference between the features of the web page content of the websites corresponding to the at least two subdomain names can then be calculated, for example using formula (2) to calculate the difference diff between the features of the web page content of the websites corresponding to the two subdomain names:
$$\mathrm{diff} = \frac{|LSet(Pa) - LSet(Pb)|}{|LSet(Pmin)|} \qquad (2)$$
where |LSet(Pa) - LSet(Pb)| denotes the size of the difference set of the two sets LSet(Pa) and LSet(Pb), i.e. the number of features in that difference set; LSet(Pmin) is the smaller of LSet(Pa) and LSet(Pb); and |LSet(Pmin)| is the size of that smaller set, i.e. the number of features it contains.
Calculating the difference between the features of the web page content of the websites corresponding to the at least two subdomain names yields the difference between the websites corresponding to the subdomain names generated from the candidate page to be detected. Then, in response to determining that this difference lies within the preset difference interval, the category of the candidate page to be detected is determined to be the preset category. That is, when the web page content of the websites corresponding to the two subdomain names generated from the candidate page to be detected differs only slightly, the category of the candidate page to be detected can be determined to be the preset category.
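The subdomain probe and the difference test can be sketched as follows. This is a sketch under stated assumptions: the difference is taken as the size of the set difference divided by the size of the smaller feature set, the feature sets are plain Python sets (for example, hypothetically, the outgoing links of each page), fetching the pages is left out, and the interval bounds `low` and `high` are placeholder values.

```python
import random
import string


def random_subdomains(domain, count=2, length=8):
    """Generate random subdomain names under the candidate page's domain."""
    alphabet = string.ascii_lowercase + string.digits
    return ["".join(random.choices(alphabet, k=length)) + "." + domain
            for _ in range(count)]


def feature_diff(features_a, features_b):
    """Size of the set difference A - B divided by the size of the smaller set."""
    a, b = set(features_a), set(features_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        raise ValueError("both feature sets must be non-empty")
    return len(a - b) / smaller


def in_preset_category(features_a, features_b, low=0.0, high=0.3):
    """True when the difference falls inside the (assumed) preset interval."""
    return low <= feature_diff(features_a, features_b) <= high
```

The intuition is that a spider pool typically answers for any subdomain with near-identical generated content, so two random subdomains yield feature sets with a small difference, whereas an ordinary site does not.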
A candidate page to be detected whose category determination result is the preset category can be taken as a target page. The domain name of a target page that has been found can then be used to search further for pages of the preset category, so as to find more target pages. Specifically, the top-level domain can be extracted from the domain name of the determined target page, and the relevant pages can then be crawled according to that top-level domain. The URLs (Uniform Resource Locators) of the crawled pages can be parsed, and more related pages can be crawled from those URLs using a breadth-first traversal algorithm; the crawled pages serve as the associated pages of the target page. After that, it can be judged whether an associated page of the target page is in the queue of candidate pages to be detected. If not, the crawled associated page can be added to the queue, so that new target pages continue to be found from the associated pages of target pages by performing the operation of searching for target pages on the queue. Thus, while the operation of searching for target pages is performed in turn on the pages in the queue of candidate pages to be detected, associated pages are continually crawled and added to the queue as candidate pages to be detected, which makes it possible to discover more target pages.
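The domain extraction and breadth-first crawl can be sketched as follows. This is a sketch under stated assumptions: the "top-level domain" is approximated naively by the last two labels of the host name (which is wrong for multi-label suffixes such as `co.uk`), `fetch_links` is a hypothetical callable that returns the URLs found on a page, the crawl starts at the root of the extracted domain, and restricting the traversal to the same domain and capping it at `max_pages` are choices made for this illustration.

```python
from collections import deque
from urllib.parse import urlsplit


def registered_domain(url):
    """Naive extraction: the last two labels of the host name."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def crawl_associated(target_url, fetch_links, max_pages=100):
    """Breadth-first crawl of pages under the target page's domain."""
    domain = registered_domain(target_url)
    start = "http://" + domain + "/"
    visited = {start}
    frontier = deque([start])
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in visited and registered_domain(link) == domain:
                visited.add(link)
                frontier.append(link)
    return crawled
```

The returned list plays the role of the associated pages; each one would be added to the candidate queue only if it is not already there.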
In step 202 above, associated pages are crawled based on the domain name of the target page, and the crawled pages are added to the candidate page queue to be detected for category determination, to decide whether they are target pages. This makes effective use of search engine resources to find more target pages efficiently and improves the efficiency of target page search.
In the method for determining target pages of the above embodiment of the present disclosure, candidate pages to be detected that meet a preset condition are extracted from a page set to be detected based on the domain names of the pages in the page set, and are added to a candidate page queue to be detected. The following operation of searching for target pages is then executed on the candidate pages to be detected in the queue: category determination is performed on a candidate page to be detected, and a candidate page to be detected of the preset category is determined to be a target page; the corresponding top-level domain is extracted from the domain name of the target page; the associated pages of the target page are crawled based on that top-level domain; and, in response to determining that an associated page of the target page is not in the candidate page queue to be detected, the associated page is added to the queue, so that the operation of searching for target pages is also executed on the associated pages of the target page. The method thereby uses limited search engine resources to find more target pages of the preset category that are associated by domain name, improves the efficiency of searching for target pages of the preset category, and can be applied to large-scale searches for pages of the preset category.
Optionally, after the target pages are determined, the process 200 of the method for determining target pages may further include storing the target pages in a database. Target pages can be recorded by saving their URLs and/or page contents in the database. When a search service is subsequently provided by a search engine, the ranking of the target pages can be corrected according to the URLs and/or page contents saved in the database, or the target pages saved in the database can be rejected from the search service.
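Storing target pages and later rejecting them from search results might look like the sketch below. The application does not name a database or a schema; SQLite, the `target_pages` table and the function names are assumptions made purely for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical store of target pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    return conn


def save_target_page(conn, url, content=None):
    """Record a target page by its URL and, optionally, its page content."""
    conn.execute(
        "INSERT OR REPLACE INTO target_pages (url, content) VALUES (?, ?)",
        (url, content),
    )
    conn.commit()


def reject_target_pages(conn, result_urls):
    """Drop saved target pages from a result list, preserving the order."""
    kept = []
    for url in result_urls:
        row = conn.execute(
            "SELECT 1 FROM target_pages WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            kept.append(url)
    return kept
```

Correcting the ranking instead of rejecting the page would use the same lookup, with a demotion applied to matching results rather than their removal.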
Referring to FIG. 3, a flowchart of another embodiment of the method for determining target pages according to the present application is shown. As shown in FIG. 3, the process 300 of the method for determining target pages includes the following steps:
Step 301: based on the domain names of the pages in a page set to be detected, extract candidate pages to be detected that meet a preset condition from the page set to be detected, and add them to a distributed candidate page queue to be detected.
The executing body of the method for determining target pages can obtain the page set to be detected. The page set to be detected can be a set of pages collected from the Internet, for example the set of pages crawled by a search engine whose generation time or update time falls within a preset time period (such as the most recent week). In some optional implementations, the pages in the page set to be detected can also be obtained by the executing body from other electronic devices; for example, problem pages reported by other electronic devices can be received.
Whether a page in the page set to be detected meets the preset condition can be judged based on the domain name of the page. If the page meets the preset condition, the page can be added to the distributed candidate page queue to be detected.
The preset condition is a condition for preliminarily filtering the page set to be detected, and can be set according to the features of the target pages that are expected to be determined. For example, when the target pages are spider pool pages, the preset condition can be set according to the similarity between domain names. As another example, when the target pages are virus pages, the preset condition can be set according to the features of virus pages.
Optionally, the preset condition may include at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the domain name of the page is greater than a preset perplexity threshold; the page does not belong to the intersection of the page set to be detected and a historical page set to be detected, where the acquisition time of the historical page set to be detected is earlier than that of the page set to be detected.
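A filter combining these conditions could be sketched as follows. The application does not say how domain name perplexity is computed; the add-one-smoothed character unigram model below, the reference name list and the choice to require all three conditions at once (the text only requires at least one of them) are assumptions for illustration.

```python
import math
from collections import Counter


def char_perplexity(name, reference_names):
    """Perplexity of `name` under an add-one-smoothed character unigram model
    estimated from `reference_names` (a hypothetical list of ordinary names)."""
    counts = Counter("".join(reference_names))
    total = sum(counts.values())
    vocab = len(set(counts) | set(name))
    log_prob = sum(math.log((counts[ch] + 1) / (total + vocab)) for ch in name)
    return math.exp(-log_prob / max(len(name), 1))


def meets_preset_condition(url, domain, whitelist, history_urls, perplexity, threshold):
    """Keep a page only if it passes all three example conditions."""
    return (
        domain not in whitelist             # not a trusted domain name
        and perplexity(domain) > threshold  # domain name looks machine-generated
        and url not in history_urls         # page is new relative to the history set
    )
```

A character-level model is only one option; any language model that scores how unusual a domain name looks could supply the `perplexity` callable.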
The operation of extracting candidate pages to be detected that meet the preset condition from the page set to be detected in step 301 of this embodiment is consistent with the corresponding operation in step 201 of the previous embodiment, and is not described again here.
In this embodiment, the candidate page queue to be detected is a distributed candidate page queue to be detected. Multiple candidate page queues to be detected can be created in advance; when a page to be detected is determined to meet the preset condition, the page can be added to one of the candidate page queues to be detected. Here, the candidate page queue to be detected can be a distributed queue such as Kafka or Redis.
Step 302: use multiple processes to execute the operation of searching for target pages on the candidate pages to be detected in the distributed candidate page queue to be detected.
Multiple processes can be used to perform category determination on the candidate pages to be detected in the distributed page queue to be detected. For example, M processes handle N candidate page queues to be detected, where M and N are positive integers and each process executes the operation of searching for target pages on at least one candidate page queue to be detected; alternatively, multiple processes can execute the operation of searching for target pages on different candidate pages to be detected in the same candidate page queue to be detected. The correspondence between a process and the candidate page queue or queues it handles can be determined according to a preset resource scheduling strategy and is not specifically limited here. Each process handles one candidate page to be detected in each operation of searching for target pages. In this way, asynchronous multi-processing improves the detection efficiency of the candidate page queue to be detected and therefore the speed of target page search.
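One possible correspondence between M processes and N queues is sketched below. The application leaves the resource scheduling strategy open; the round-robin rule, and the fallback of letting surplus processes share a queue when M exceeds N, are hypothetical choices shown only to make the two cases in the text concrete.

```python
def assign_queues(num_processes, num_queues):
    """Map each process index to the queue indexes it handles (round-robin).

    When there are more processes than queues, the surplus processes share
    an existing queue, so several processes work on the same queue.
    """
    if num_processes < 1 or num_queues < 1:
        raise ValueError("M and N must be positive integers")
    assignment = {p: [] for p in range(num_processes)}
    for q in range(num_queues):
        assignment[q % num_processes].append(q)
    for p, queues in assignment.items():
        if not queues:  # more processes than queues: share a queue
            queues.append(p % num_queues)
    return assignment
```

Each worker process would then repeatedly take one candidate page from its assigned queues and run the operation of searching for target pages on it.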
The operation of searching for target pages executed on a candidate page to be detected may include: performing category determination on the candidate page to be detected in the distributed page queue to be detected; determining a candidate page to be detected of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the distributed candidate page queue to be detected, adding the associated page to the distributed candidate page queue to be detected, so that the operation of searching for target pages is also executed on the associated pages of the target page.
The operation of searching for target pages in this embodiment is the same as that in step 202 of the previous embodiment. The specific implementations of searching for target pages in step 202 also apply to the operation of searching for target pages in step 302 of this embodiment, and are not described again here.
Optionally, in the above operation of searching for target pages, after a candidate page to be detected of the preset category is determined to be a target page, the operations of extracting the top-level domain of the target page and crawling the associated pages of the target page based on the top-level domain can be executed by multiple threads. This further improves the efficiency of crawling the associated pages of the target page, and thus the speed at which other associated target pages are found through a target page.
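Crawling with multiple threads can be sketched with the standard-library thread pool. Here `fetch_links` is a hypothetical callable that returns the URLs found on a page, and the worker count is an arbitrary assumption; the application does not prescribe a threading library.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_with_threads(urls, fetch_links, max_workers=4):
    """Fetch the links of several pages concurrently and merge them,
    removing duplicates while keeping first-seen order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        link_lists = list(pool.map(fetch_links, urls))  # results keep input order
    seen = set()
    merged = []
    for links in link_lists:
        for link in links:
            if link not in seen:
                seen.add(link)
                merged.append(link)
    return merged
```

Threads suit this step because crawling is dominated by network waits rather than computation.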
In the process 300 of the method for determining target pages of this embodiment, the candidate pages to be detected are added to a distributed candidate page queue to be detected, and multiple processes are used to execute the operation of searching for target pages on the candidate pages to be detected in the distributed queue, which effectively increases the speed at which target pages are found.
Optionally, the process 300 of the method for determining target pages may further include storing the target pages in a database. In this way, when pages are crawled in the course of the search engine providing a search service, a corresponding filtering operation can be performed according to the target pages stored in the database.
With continued reference to FIG. 4, a schematic diagram of the architecture underlying the method for determining target pages shown in FIG. 3 is shown. As shown in FIG. 4, the URLs of the page set to be detected are first passed through a filter. The filter excludes trustworthy pages using the domain name whitelist and the domain name perplexity, and extracts the recently added (incremental) pages as candidate pages to be detected based on a cached historical page set to be detected. The candidate pages to be detected are distributed to the distributed candidate page queue to be detected. Multiple target page search processes (Checker) then each process pages in the distributed candidate page queue to be detected. After a target page search process performs category determination on a candidate page to be detected, it uses multiple crawling threads (Finder) to crawl the associated pages of the found target page and judges whether each crawled associated page is already in the distributed candidate page queue to be detected. If a crawled associated page is not in the distributed queue, it can be added to the queue, and the search for target pages continues in the manner described above. The found target pages can be synchronized to the database.
Referring to FIG. 5, a flowchart of yet another embodiment of the method for determining target pages according to the present application is shown. As shown in FIG. 5, the process 500 of the method for determining target pages includes the following steps:
Step 501: based on the domain names of the pages in a page set to be detected, extract candidate pages to be detected that meet a preset condition from the page set to be detected, and add them to a candidate page queue to be detected.
Step 502: execute the operation of searching for target pages on the candidate pages to be detected in the candidate page queue to be detected.
The operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining a candidate page to be detected of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the candidate page queue to be detected, adding the associated page to the candidate page queue to be detected, so that the operation of searching for target pages is also executed on the associated pages of the target page.
Steps 501 and 502 of this embodiment are consistent with steps 201 and 202 of the previous embodiment, respectively. For the specific implementations of steps 501 and 502, refer to the description of steps 201 and 202 in the previous embodiment; they are not described again here.
Optionally, the candidate page queue to be detected can be a distributed candidate page queue to be detected. When the operation of searching for target pages in step 502 is executed, the candidate pages to be detected in the distributed candidate page queue to be detected can be handled by multiple processes.
Step 503: perform shielding processing on the found target pages.
In this embodiment, the target pages can be pages that obtain a higher ranking by cheating against the page crawling strategy and ranking method of a search engine, such as spider pool pages. Shielding processing can be performed on the found target pages to prevent them from affecting search results. Specifically, the search engine can be controlled to skip target pages when crawling pages, or to filter target pages out of the search results.
Optionally, shielding processing can be performed on the found target pages in at least one of the following ways: adding the uniform resource locator of the target page to the shielded page list of the search engine's crawler program; deleting the index of the target page from the index library of the search engine; and, in response to detecting that the pages in the search results include a target page, pushing risk prompt information to the client that issued the search request.
In the above ways of shielding processing, the crawler program of the search engine can skip the pages in its shielded page list when crawling pages. After the index of a target page is deleted from the index library of the search engine, the crawler program of the search engine can no longer reach the target page that the deleted index pointed to.
When the crawler program of the search engine has not handled the target pages and the search results include a target page, risk prompt information can be pushed to the client that issued the search request, to indicate that the target page is a page with risk. The client can decide whether to show the page to the user according to the risk prompt information. For example, a browser plug-in can filter the page content, or shield the entire page, according to the risk prompt information.
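The three shielding options can be sketched with in-memory structures. The class and function names, the dictionary used as an index library and the "risk" label are all hypothetical; a real search engine would apply the same ideas to its own crawler configuration and index.

```python
class SearchEngineShield:
    """Hypothetical stand-in for a crawler's shielded page list and an index library."""

    def __init__(self):
        self.shielded_urls = set()  # shielded page list of the crawler program
        self.index = {}             # url -> index entry

    def shield(self, url):
        """Add the URL to the shielded page list and delete its index entry."""
        self.shielded_urls.add(url)
        self.index.pop(url, None)

    def should_crawl(self, url):
        """The crawler skips pages that are on the shielded page list."""
        return url not in self.shielded_urls


def annotate_results(result_urls, target_urls):
    """Attach a risk prompt to results that are known target pages."""
    return [(url, "risk" if url in target_urls else None) for url in result_urls]
```

The annotation step covers the case in which a target page still reaches the search results; the client can then filter the page or hide it entirely.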
Optionally, after step 502 and before step 503, the process 500 of the method for determining target pages may further include storing the target pages in a database. Target pages can be recorded by saving their URLs and/or page contents in the database. In this way, the shielding processing in step 503 can be performed based on the target pages stored in the database.
In the method for determining target pages of this embodiment, performing shielding processing on the target pages can effectively, quickly and comprehensively block target pages from spreading through the search engine, and reduces their influence on users' search results.
With further reference to FIG. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for determining target pages. The apparatus embodiment corresponds to the method embodiments shown in FIG. 2, FIG. 3 and FIG. 5, and the apparatus can be applied in various electronic devices.
As shown in FIG. 6, the apparatus 600 for determining target pages of this embodiment includes an extraction unit 601 and a searching unit 602. The extraction unit 601 can be configured to extract, based on the domain names of the pages in a page set to be detected, candidate pages to be detected that meet a preset condition from the page set to be detected, and to add them to a candidate page queue to be detected. The searching unit 602 can be configured to execute the operation of searching for target pages on the candidate pages to be detected in the candidate page queue to be detected. The operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining a candidate page to be detected of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the candidate page queue to be detected, adding the associated page to the candidate page queue to be detected, so that the operation of searching for target pages is also executed on the associated pages of the target page.
In some embodiments, the preset condition includes at least one of the following: the domain name of the page is not in a preset domain name whitelist; the perplexity of the domain name of the page is greater than a preset perplexity threshold; the page does not belong to the intersection of the page set to be detected and a historical page set to be detected, where the acquisition time of the historical page set to be detected is earlier than that of the page set to be detected.
In some embodiments, the searching unit 602 can be further configured to perform category determination on a candidate page to be detected as follows: randomly generate at least two subdomain names based on the domain name of the candidate page to be detected; obtain the web page contents of the websites corresponding to the at least two subdomain names, and extract the features of the obtained web page contents; and, in response to determining that the difference between the features of the web page contents of the websites corresponding to the at least two subdomain names is within a preset difference interval, determine that the category of the candidate page to be detected is the preset category.
In some embodiments, the page queue to be detected includes a distributed page queue to be detected, and the searching unit 602 can be further configured to use multiple processes to execute the operation of searching for target pages on the candidate pages to be detected in the distributed candidate page queue to be detected.
In some embodiments, the apparatus 600 may further include a storage unit configured to store the target pages in a database.
In some embodiments, the apparatus 600 may further include a shielding unit configured to perform shielding processing on the found target pages.
In some embodiments, the shielding unit can be further configured to perform shielding processing on the found target pages in at least one of the following ways: adding the uniform resource locator of the target page to the shielded page list of the search engine's crawler program; deleting the index of the target page from the index library of the search engine; and, in response to detecting that the pages in the search results include a target page, pushing risk prompt information to the client that issued the search request.
It should be understood that the units recorded in the apparatus 600 correspond to the steps of the methods described with reference to FIG. 2, FIG. 3 and FIG. 5. The operations and features described above for the methods therefore also apply to the apparatus 600 and the units it contains, and are not described again here.
The apparatus 600 for determining target pages of the above embodiment of the present application extracts, based on the domain names of the pages in a page set to be detected, candidate pages to be detected that meet a preset condition from the page set to be detected, adds them to a candidate page queue to be detected, and executes the following operation of searching for target pages on the candidate pages to be detected in the queue: performing category determination on a candidate page to be detected; determining a candidate page to be detected of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the candidate page queue to be detected, adding the associated page to the queue. The apparatus thereby uses limited search engine resources to find more pages of the preset category that are associated by domain name, and improves the efficiency of detecting pages of the preset category.
Referring now to FIG. 7, a schematic structural diagram of an electronic device 700 (for example, the search engine server in FIG. 1) suitable for implementing embodiments of the present disclosure is shown. The electronic device shown in FIG. 7 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present disclosure.
As shown in FIG. 7, the electronic device 700 may include a processing unit 701 (such as a central processing unit or a graphics processor), which can execute various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing unit 701, the ROM 702 and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
In general, the following devices can be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer or gyroscope; an output device 707 including, for example, a liquid crystal display (LCD), loudspeaker or vibrator; a storage device 708 including, for example, a hard disk; and a communication device 709. The communication device 709 can allow the electronic device 700 to communicate with other devices, wirelessly or by wire, to exchange data. Although FIG. 7 shows an electronic device 700 with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided. Each box shown in FIG. 7 can represent one device, or can represent multiple devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication device 709, installed from the storage device 708, or installed from the ROM 702. When the computer program is executed by the processing unit 701, the above functions defined in the method of the embodiments of the present disclosure are executed. It should be noted that the computer-readable medium described in the embodiments of the present disclosure can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media can include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present disclosure, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by, or in connection with, an instruction execution system, apparatus or device. In the embodiments of the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, and can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device. The program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: electric wire, optical cable, RF (radio frequency), or any suitable combination of the above.
The computer-readable medium can be included in the above electronic device, or can exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extract, based on the domain names of the pages in a page set to be detected, candidate pages to be detected that meet a preset condition from the page set to be detected, and add them to a candidate page queue to be detected; and execute the operation of searching for target pages on the candidate pages to be detected in the candidate page queue to be detected. The operation of searching for target pages includes: performing category determination on a candidate page to be detected; determining a candidate page to be detected of the preset category to be a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling the associated pages of the target page based on that top-level domain; and, in response to determining that an associated page of the target page is not in the candidate page queue to be detected, adding the associated page to the candidate page queue to be detected, so that the operation of searching for target pages is also executed on the associated pages of the target page.
The computer program code for executing the operations of the embodiments of the present disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In situations involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each box in a flowchart or block diagram can represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can occur in an order different from that marked in the drawings. For example, two boxes shown in succession can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that executes the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application can be implemented in software or in hardware. The described units can also be provided in a processor; for example, this can be described as: a processor includes an extraction unit and a searching unit. The names of these units do not, under certain circumstances, limit the units themselves; for example, the extraction unit can also be described as "a unit that, based on the domain names of the pages in a page set to be detected, extracts candidate pages to be detected that meet a preset condition from the page set to be detected and adds them to a candidate page queue to be detected".
The above description is only a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.
Claims (16)
1. A method for determining target pages, comprising:
extracting, based on the domain names of the pages in a page set to be detected, candidate pages to be detected that meet a preset condition from the page set to be detected, and adding them to a candidate page queue to be detected;
executing an operation of searching for target pages on the candidate pages to be detected in the candidate page queue to be detected, the operation of searching for target pages comprising:
performing category determination on a candidate page to be detected, determining a candidate page to be detected of a preset category to be a target page, extracting the corresponding top-level domain from the domain name of the target page, crawling the associated pages of the target page based on the top-level domain corresponding to the target page, and, in response to determining that an associated page of the target page is not in the candidate page queue to be detected, adding the associated page of the target page to the candidate page queue to be detected, so as to execute the operation of searching for target pages on the associated pages of the target page.
2. The method according to claim 1, wherein the preset condition comprises at least one of the following:
the domain name of the page is not in a preset domain name whitelist;
the perplexity of the domain name of the page is greater than a preset perplexity threshold;
the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, the acquisition time of the historical set of pages to be detected being earlier than the acquisition time of the set of pages to be detected.
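The three conditions of claim 2 could be checked along the following lines. This is a minimal sketch under stated assumptions: the claim does not say how perplexity is computed, so an add-one-smoothed character unigram model fitted on known benign domain names is used here purely for illustration. The claim requires only at least one of the conditions, whereas this sketch applies all three together as one possible configuration. The function names and the alphabet size of 38 (letters, digits, hyphen, dot) are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Set


def char_perplexity(domain: str, char_counts: Counter, alphabet_size: int = 38) -> float:
    """Perplexity of a domain name under a smoothed character unigram model.

    char_counts holds character frequencies from benign domain names
    (an assumed training source, not specified in the application).
    """
    total = sum(char_counts.values())
    log_prob = 0.0
    for ch in domain:
        # Add-one smoothing keeps unseen characters from having zero probability.
        p = (char_counts[ch] + 1) / (total + alphabet_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(domain), 1))


def select_candidates(
    domains: Iterable[str],
    whitelist: Set[str],
    history: Set[str],
    char_counts: Counter,
    threshold: float,
) -> List[str]:
    """Keep domains that pass all three example conditions."""
    selected = []
    for domain in domains:
        if domain in whitelist:
            continue  # condition 1: not in the whitelist
        if domain in history:
            continue  # condition 3: not also present in the earlier set
        if char_perplexity(domain, char_counts) <= threshold:
            continue  # condition 2: perplexity above the threshold
        selected.append(domain)
    return selected
```

A domain made of characters that are rare in the benign sample scores a higher perplexity than a familiar-looking one, which is the intuition behind using the threshold to surface machine-generated names.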
3. The method according to claim 1, wherein performing category determination on the candidate page to be detected comprises:
randomly generating at least two subdomain names based on the domain name of the candidate page to be detected;
obtaining the web page content of the websites corresponding to the at least two subdomain names, and extracting features of the obtained web page content;
in response to determining that the difference between the features of the web page content of the websites corresponding to the at least two subdomain names falls within a preset difference interval, determining that the category of the candidate page to be detected is the preset category.
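The category determination of claim 3 might be sketched as follows. The claim does not specify the feature or the difference measure, so this example assumes whitespace-separated token sets as features and Jaccard distance as the difference. The `fetch` callable is a hypothetical, injected page downloader (returning `None` on failure), and the interval `(0.0, 0.3)` is an arbitrary illustrative value. Treating an unreachable subdomain as "not the preset category" is also an assumption.

```python
import random
import string
from itertools import combinations
from typing import Callable, List, Optional, Set, Tuple


def random_subdomains(domain: str, count: int = 2, length: int = 8,
                      rng: Optional[random.Random] = None) -> List[str]:
    """Generate `count` distinct random subdomain names of `domain`."""
    rng = rng or random.Random()
    alphabet = string.ascii_lowercase + string.digits
    names: Set[str] = set()
    while len(names) < count:
        label = "".join(rng.choice(alphabet) for _ in range(length))
        names.add(f"{label}.{domain}")
    return sorted(names)


def token_features(content: str) -> Set[str]:
    """Assumed feature: the set of lowercase whitespace-separated tokens."""
    return set(content.lower().split())


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def is_preset_category(
    domain: str,
    fetch: Callable[[str], Optional[str]],
    interval: Tuple[float, float] = (0.0, 0.3),
    count: int = 2,
    rng: Optional[random.Random] = None,
) -> bool:
    """True if every pair of random subdomains differs by an amount in `interval`."""
    if count < 2:
        raise ValueError("at least two subdomain names are required")
    features = []
    for sub in random_subdomains(domain, count, rng=rng):
        content = fetch(sub)
        if content is None:
            return False  # assumption: an unreachable subdomain is not a match
        features.append(token_features(content))
    low, high = interval
    return all(low <= jaccard_distance(a, b) <= high
               for a, b in combinations(features, 2))
```

Passing `fetch` in as a parameter keeps the sketch free of network calls and lets the comparison logic be exercised with canned responses.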
4. The method according to claim 1, wherein the candidate to-be-detected page queue comprises a distributed candidate to-be-detected page queue; and
performing the operation of searching for target pages on the candidate pages to be detected in the candidate to-be-detected page queue comprises:
using multiple processes to respectively perform the operation of searching for target pages on the candidate pages to be detected in the distributed candidate to-be-detected page queue.
5. The method according to claim 1, wherein the method further comprises:
storing the target pages in a database.
6. The method according to any one of claims 1-5, wherein the method further comprises:
performing shielding processing on the found target pages.
7. The method according to claim 6, wherein performing shielding processing on the found target pages comprises at least one of the following:
adding the uniform resource locator of the target page to the shielded-page list of a search engine crawler;
deleting the index of the target page from the index library of a search engine;
in response to detecting that the pages in a search result include the target page, pushing risk prompt information to the client that issued the search request.
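The three shielding measures of claim 7 can be pictured with a small in-memory model. This is only a sketch: the class `Shield`, its fields, and the wording of the prompt are hypothetical, and a real search engine would keep the crawler list and index in its own storage systems. The claim requires at least one measure, and the example applies all three for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Shield:
    # URLs the crawler should skip (stand-in for the shielded-page list).
    crawler_blocklist: Set[str] = field(default_factory=set)
    # URL -> indexed document (stand-in for the search engine index library).
    index: Dict[str, str] = field(default_factory=dict)
    # URLs already identified as target pages.
    targets: Set[str] = field(default_factory=set)

    def shield(self, url: str) -> None:
        """Apply the first two measures to a found target page."""
        self.targets.add(url)
        self.crawler_blocklist.add(url)  # measure 1: stop crawling it
        self.index.pop(url, None)        # measure 2: drop it from the index

    def risk_prompts(self, result_urls: List[str]) -> List[str]:
        """Measure 3: build a risk prompt for each target page in a result list."""
        return [
            f"Risk warning: {url} was identified as a target page"
            for url in result_urls
            if url in self.targets
        ]
```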
8. A device for determining target pages, comprising:
an extraction unit, configured to extract, based on the domain names of pages in a set of pages to be detected, candidate pages to be detected that meet a preset condition from the set of pages to be detected, and add them to a candidate to-be-detected page queue;
a search unit, configured to perform an operation of searching for target pages on the candidate pages to be detected in the candidate to-be-detected page queue, the operation of searching for target pages comprising:
performing category determination on the candidate page to be detected; determining that a candidate page to be detected of a preset category is a target page; extracting the corresponding top-level domain from the domain name of the target page; crawling associated pages of the target page based on the top-level domain corresponding to the target page; and, in response to determining that an associated page of the target page is not in the candidate to-be-detected page queue, adding the associated page to the candidate to-be-detected page queue and performing the operation of searching for target pages on the associated page.
9. The device according to claim 8, wherein the preset condition comprises at least one of the following:
the domain name of the page is not in a preset domain name whitelist;
the perplexity of the domain name of the page is greater than a preset perplexity threshold;
the page does not belong to the intersection of the set of pages to be detected and a historical set of pages to be detected, the acquisition time of the historical set of pages to be detected being earlier than the acquisition time of the set of pages to be detected.
10. The device according to claim 8, wherein the search unit is further configured to perform category determination on the candidate page to be detected as follows:
randomly generating at least two subdomain names based on the domain name of the candidate page to be detected;
obtaining the web page content of the websites corresponding to the at least two subdomain names, and extracting features of the obtained web page content;
in response to determining that the difference between the features of the web page content of the websites corresponding to the at least two subdomain names falls within a preset difference interval, determining that the category of the candidate page to be detected is the preset category.
11. The device according to claim 8, wherein the candidate to-be-detected page queue comprises a distributed candidate to-be-detected page queue; and
the search unit is further configured to:
use multiple processes to respectively perform the operation of searching for target pages on the candidate pages to be detected in the distributed candidate to-be-detected page queue.
12. The device according to claim 8, wherein the device further comprises:
a storage unit, configured to store the target pages in a database.
13. The device according to any one of claims 8-12, wherein the device further comprises:
a shielding unit, configured to perform shielding processing on the found target pages.
14. The device according to claim 13, wherein the shielding unit is further configured to perform shielding processing on the found target pages in at least one of the following ways:
adding the uniform resource locator of the target page to the shielded-page list of a search engine crawler;
deleting the index of the target page from the index library of a search engine;
in response to detecting that the pages in a search result include the target page, pushing risk prompt information to the client that issued the search request.
15. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
16. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910352767.3A CN110069693B (en) | 2019-04-29 | 2019-04-29 | Method and device for determining target page |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910352767.3A CN110069693B (en) | 2019-04-29 | 2019-04-29 | Method and device for determining target page |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110069693A (en) | 2019-07-30 |
CN110069693B (en) | 2021-12-24 |
Family
ID=67369334
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910352767.3A Active CN110069693B (en) | 2019-04-29 | 2019-04-29 | Method and device for determining target page |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110069693B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110704782A (en) * | 2019-09-30 | 2020-01-17 | 北京字节跳动网络技术有限公司 | Page response method and device, electronic equipment and storage medium |
CN111562913A (en) * | 2020-04-28 | 2020-08-21 | 北京字节跳动网络技术有限公司 | Pre-creation method, device, equipment and computer readable medium of view component |
CN113378027A (en) * | 2021-07-13 | 2021-09-10 | 杭州安恒信息技术股份有限公司 | Cable excavation method, device, equipment and computer readable storage medium |
CN113407802A (en) * | 2021-06-10 | 2021-09-17 | 杭州安恒信息技术股份有限公司 | Spider pool website identification method and device, electronic device and storage medium |
CN115858959A (en) * | 2022-12-27 | 2023-03-28 | 中国电子产业工程有限公司 | Data processing method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104125209A (en) * | 2014-01-03 | 2014-10-29 | 腾讯科技(深圳)有限公司 | Malicious website prompt method and router |
CN104503962A (en) * | 2014-06-18 | 2015-04-08 | 北京邮电大学 | Method for detecting hidden link of webpage |
CN107885820A (en) * | 2017-11-07 | 2018-04-06 | 北京小度互娱科技有限公司 | Breadth traversal orientation grasping means based on crawler system |
CN108874802A (en) * | 2017-05-09 | 2018-11-23 | 阿里巴巴集团控股有限公司 | Page detection method and device |
CN109274632A (en) * | 2017-07-12 | 2019-01-25 | ***通信集团广东有限公司 | A kind of recognition methods of website and device |
Also Published As
Publication number | Publication date |
---|---|
CN110069693B (en) | 2021-12-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110069693A (en) | Method and apparatus for determining target pages | |
CN107273409B (en) | Network data acquisition, storage and processing method and system | |
US10055762B2 (en) | Deep application crawling | |
US7861151B2 (en) | Web site structure analysis | |
US10789366B2 (en) | Security information management system and security information management method | |
CN102473190B (en) | Keyword assignment to a web page | |
US11238233B2 (en) | Artificial intelligence engine for generating semantic directions for websites for automated entity targeting to mapped identities | |
US20110282860A1 (en) | Data collection, tracking, and analysis for multiple media including impact analysis and influence tracking | |
CN108228906B (en) | Method and apparatus for generating information | |
JP2020515944A (en) | System and method for direct in-browser markup of elements in Internet content | |
CN111523677B (en) | Method and device for realizing interpretation of prediction result of machine learning model | |
WO2010120941A2 (en) | Automatic mapping of a location identifier pattern of an object to a semantic type using object metadata | |
CN103617213B (en) | Method and system for identifying newspage attributive characters | |
CN102663060B (en) | Method and device for identifying tampered webpage | |
CN107885873A (en) | Method and apparatus for output information | |
US11989247B2 (en) | Indexing access limited native applications | |
CN111259220B (en) | Data acquisition method and system based on big data | |
KR100987330B1 (en) | A system and method generating multi-concept networks based on user's web usage data | |
CN108280102A (en) | Internet behavior recording method, device and user terminal | |
JP2007172091A (en) | Information recommendation device | |
CN107622125B (en) | Information crawling method and device and electronic equipment | |
JP6749865B2 (en) | INFORMATION COLLECTION DEVICE AND INFORMATION COLLECTION METHOD | |
Li et al. | Research of network data mining based on reliability source under big data environment | |
CN110298006A (en) | For detecting the method and apparatus for usurping the website of link | |
CN111222918A (en) | Keyword mining method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||