CN103116660A - Method and device for acquiring website authority values - Google Patents

Method and device for acquiring website authority values

Info

Publication number
CN103116660A
Authority
CN
China
Prior art keywords
website
authority
value
credible
territory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013100845997A
Other languages
Chinese (zh)
Inventor
白俊良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PEOPLE SEARCH NETWORK AG
Original Assignee
PEOPLE SEARCH NETWORK AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PEOPLE SEARCH NETWORK AG
Priority to CN2013100845997A
Publication of CN103116660A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a device for acquiring website authority values. The method comprises the following steps: obtaining a credible vote count for each site on the internet from the number of its inbound links that come from a credible site set; assigning all sites to different preset grades according to their credible vote counts, allocating a grade authority value to each grade, and setting the site authority value of every site in a grade to that grade authority value; and adding to the credible site set those sites whose site authority value exceeds an authority threshold and which do not already belong to the set. The method and the device keep spam and similar problems from distorting the authority value calculation while correctly reflecting the authority of newly launched sites.

Description

Method and device for acquiring website authority values
Technical field
The present invention relates to the field of computer information retrieval, and in particular to a method and a device for acquiring website authority values.
Background art
Search engines are the main way internet users obtain information, and a search engine should offer information that meets users' needs and is authoritative and truthful. The authority and credibility of search results are among the key factors in measuring search engine quality.
At present, most methods for evaluating the authority of web pages or websites iteratively compute page authority values from the link relationships between pages. PageRank was the first important method of evaluating page authority, and improved page-ranking and site-ranking algorithms have appeared continually since. The core ideas of PageRank are: if a page is cited by many other pages, it is probably an important page; a page that is not cited many times but is cited by an important page is probably also important; and the importance of a page is divided equally among, and passed on to, the pages it cites. Page importance can therefore be measured with PageRank. The goal of the HITS algorithm is to use an iterative computation to find the most valuable pages for a given query, that is, the highest-ranked authoritative pages.
As the web developed, links between pages lost much of their value as recommendations. Some commercial websites link to one another to obtain better rankings, so that pages near the top of some search results are irrelevant to what the user wants. A number of derivative algorithms have appeared in response. But anti-cheating is an arms race ("as virtue rises one foot, vice rises ten"), and some spam links or SEO links always affect the calculation of page authority values, making some results inaccurate.
In addition, because the number of pages on the internet is astronomical, a search engine cannot include them all in the authority value calculation. Search engines therefore generally design various strategies to filter out links of little value. Depending on the strategy chosen, some sites or pages without external links are inevitably filtered out and cannot obtain an authority value, which degrades the quality of search results.
The following description involves concepts such as site, main domain and subdomain, which are introduced here for ease of understanding:
Site: the part of a URL before the first '/'. For example, news.sina.com.cn and sports.sina.com.cn are regarded as two sites.
Main domain: the online name of the domain name registrant. For example, "jike.com" (for details see Http:// baike.***.com/view/3444440.htm).
Subdomain: a third-level domain, a fourth-level domain, or any domain at a still lower level than the main domain. For example, cq.soufun.com in homebbs.cq.soufun.com.
Take the host mil.news.sina.com.cn as an example: its main domain is sina.com.cn and its first-level subdomain is news.sina.com.cn. If a host xxx.mil.news.sina.com.cn also exists, then that host additionally has the second-level subdomain mil.news.sina.com.cn.
Refer to Fig. 1, a schematic diagram of a tree structure of a main domain and its subdomains according to the related art. As shown in Fig. 1, rectangular boxes represent sites and capsule-shaped boxes represent domains, with the main domain at the root of the tree. For any node, all the domains on the path from that node to the root, other than the node itself and the root node, are its subdomains, and the root node is its main domain.
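The site, main domain and subdomain definitions above can be illustrated with a minimal Python sketch. The function names site_of and subdomains_of are hypothetical and do not appear in the patent, and the main domain is passed in by the caller because the text does not say how it is derived from a host name.

```python
from typing import List


def site_of(url: str) -> str:
    """Return the site: the part of the URL before the first '/' (scheme removed)."""
    without_scheme = url.split("://", 1)[-1]
    return without_scheme.split("/", 1)[0].lower()


def subdomains_of(host: str, main_domain: str) -> List[str]:
    """Return the domains strictly between the main domain (root) and the host.

    The list is ordered from the first-level subdomain downwards and excludes
    both the host itself and the main domain, as in the Fig. 1 description.
    """
    if host == main_domain:
        return []
    suffix = "." + main_domain
    if not host.endswith(suffix):
        raise ValueError(f"{host} is not under {main_domain}")
    labels = host[: -len(suffix)].split(".")
    chain = []
    current = main_domain
    # Walk from the label closest to the main domain towards the host.
    for label in reversed(labels):
        current = f"{label}.{current}"
        chain.append(current)
    return chain[:-1]  # drop the host itself
```

For the host mil.news.sina.com.cn with main domain sina.com.cn, the function returns only news.sina.com.cn, which matches the example in the text.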
Credible site set: a set of websites that is generated automatically or set manually. The content of these sites is considered authoritative and credible, their outbound links carry recommendation value, and they have almost no spam or SEO outbound links.
Credible vote count: the number of main domains that link to a site, counting only links from credible sites. No matter how many inbound links come from the same domain, they are counted only once.
User-generated content (UGC): includes site types such as forums, blogs, sharing sites and microblogs, as well as page types such as comments and replies.
Main-domain transfer tree: a tree structure expressing whether the authority value of a main domain can be inherited by the subdomains and sites under it, and whether the authority value of a subdomain can be inherited by the sites under it.
Localized site: for example, "bj.ganji.com" is equivalent to the main site for users in the Beijing area and can be regarded as equal to "WWW.ganji.com".
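As an illustration of the credible vote count defined above, the following Python sketch counts, for each destination site, the distinct main domains of credible sites that link to it. The function name credible_votes, the (source site, destination site) link pairs and the main_domain_of lookup table are assumptions made for this example only.

```python
from typing import Dict, Iterable, Set, Tuple


def credible_votes(
    links: Iterable[Tuple[str, str]],
    credible_sites: Set[str],
    main_domain_of: Dict[str, str],
) -> Dict[str, int]:
    """Count distinct voting main domains per destination site.

    links: (source_site, dest_site) pairs.
    Only links whose source is in credible_sites count, and all links from
    the same main domain count as a single vote.
    """
    voters: Dict[str, Set[str]] = {}
    for source_site, dest_site in links:
        if source_site not in credible_sites:
            continue  # links from non-credible sites are not votes
        voters.setdefault(dest_site, set()).add(main_domain_of[source_site])
    return {dest: len(domains) for dest, domains in voters.items()}
```

Keeping a set of voting domains per destination, rather than a running counter, is one simple way to make repeated links from the same domain count once.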
High-quality sites or channels that have just been launched generally do not have enough hyperlinks pointing to them. New sites are therefore discriminated against: their calculated authority value is too low, or they have no authority value at all.
Neither PageRank nor HITS takes into account that a web page is part of a website. Traditionally, the web is regarded as consisting of pages and links, which correspond to its content and its structure respectively, and both PageRank and HITS study the structure of the web. In recent years, more and more researchers have recognized that websites are also components of the web and play an important role in it. Compared with a single page, a website provides more semantic information. First, pages of the same website are usually very similar in content, layout and links. Second, from a topological point of view, pages within the same website are usually linked more densely than pages of different websites.
As a unit at a higher level of abstraction, a website can represent more complete information and information structure than a page, and spam or SEO behavior aimed at a whole website is more difficult to carry out. Site ranking has therefore become a very important technique in search engines. Two site-ranking algorithms are common at present: SiteRank and AggregateRank. SiteRank is similar to PageRank, except that it applies the PageRank algorithm to the site link graph to rank domain names. AggregateRank is an approximation of PageRank that reduces the computational complexity. The website, as a structural unit of the web, is thus an indispensable dimension for assessing authority.
However, current methods of calculating website authority values are still based on PageRank and inevitably inherit some of its shortcomings: the authority calculation is affected by spam and similar problems, which makes some results inaccurate; and the calculation favors old sites and cannot reflect the authority of newly launched high-quality sites.
No effective solution has yet been proposed for these problems in the related art, namely that website authority calculations are affected by spam and similar problems and cannot reflect the authority of newly launched high-quality sites.
Summary of the invention
The invention provides a method and a device for acquiring website authority values, so as to solve at least the above problems.
According to one aspect of the invention, a method for acquiring website authority values is provided, comprising: obtaining a credible vote count for each site on the internet according to the number of its inbound links that come from the credible site set; assigning all sites to different preset grades according to the credible vote counts of all sites on the internet, allocating a grade authority value to each grade, and setting the site authority value of the sites in each grade to the grade authority value; and adding to the credible site set the sites whose site authority value is greater than an authority threshold and which do not belong to the credible site set.
Preferably, before obtaining the credible vote count of each site, the method comprises: extracting link data for the whole web to form raw data in the format of destination URL (Dest URL), source URL (Source URL) and anchor text (Anchor Text); obtaining URL list pages in a plurality of fields from the raw data and aggregating them into a seed site set; and removing low-quality sites, search engine optimization (SEO) sites and cheating (SPAM) sites from the seed site set to obtain the credible site set.
Preferably, after adding those sites to the credible site set, the method comprises: obtaining site information for all sites under a main domain and determining a domain authority value of the main domain according to the site information; generating a main-domain inheritance tree of the main domain according to the site information and the domain authority value; and determining, according to the main-domain inheritance tree, the domain authority value and a predetermined rule for decrementing authority values, the authority values of the subdomains contained in the main domain and of the sites contained in those subdomains.
Preferably, after adding those sites to the credible site set, the method comprises: mining other main domains that have a site relationship with the current main domain, wherein the site relationship comprises a redirect or a site group; and determining how the authority value of the current main domain is transferred between the sites of the current main domain and the sites of the other main domains.
Preferably, before obtaining the credible vote count of each site, the method comprises: mining other main domains that have a site relationship with the current main domain, wherein the site relationship comprises a redirect or a site group; and determining how the authority value of the current main domain is transferred between the sites of the current main domain and the sites of the other main domains.
Preferably, the site information comprises: the credible vote count of the site, the grade of the site's grade authority value, the number of sites under the main domain to which the site belongs, and the number of sites under the subdomain to which the site belongs.
According to another aspect of the invention, a device for acquiring website authority values is provided, comprising: an acquisition module, configured to obtain a credible vote count for each site on the internet according to the number of its inbound links that come from the credible site set; a processing module, configured to assign all sites to different preset grades according to the credible vote counts of all sites on the internet, allocate a grade authority value to each grade, and set the site authority value of the sites in each grade to the grade authority value; and an adding module, configured to add to the credible site set the sites whose site authority value is greater than an authority threshold and which do not belong to the credible site set.
Preferably, the device further comprises: an obtaining and determining module, configured to obtain site information for all sites under a main domain and determine a domain authority value of the main domain according to the site information; a generation module, configured to generate a main-domain inheritance tree of the main domain according to the site information and the domain authority value; and a first determination module, configured to determine, according to the main-domain inheritance tree, the domain authority value and a predetermined rule for decrementing authority values, the authority values of the subdomains contained in the main domain and of the sites contained in those subdomains.
Preferably, the device further comprises: a mining module, configured to mine other main domains that have a site relationship with the current main domain, wherein the site relationship comprises a redirect or a site group; and a second determination module, configured to determine how the authority value of the current main domain is transferred between the sites of the current main domain and the sites of the other main domains.
Preferably, the site information comprises: the credible vote count of the site, the grade of the site's grade authority value, the number of sites under the main domain to which the site belongs, and the number of sites under the subdomain to which the site belongs.
With the invention, a credible site set containing high-quality sites is selected first; the authority values of sites are determined by the votes of the high-quality sites in the credible site set; and other sites that reach the required authority value but are not yet in the credible site set are then added to it. This solves the problems in the related art that website authority calculations are affected by spam and similar problems and cannot reflect the authority of newly launched high-quality sites. The authority value calculation is thereby protected from spam and similar problems, and the authority of newly launched sites is reflected correctly.
Brief description of the drawings
The accompanying drawings described here are provided for further understanding of the invention and form part of this application. The exemplary embodiments of the invention and their description are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of a tree structure of a main domain and its subdomains according to the related art;
Fig. 2 is a flowchart of a method for acquiring website authority values according to an embodiment of the invention;
Fig. 3 is a schematic diagram of main-domain transfer according to a preferred embodiment of the invention;
Fig. 4 is a flowchart of a website authority value calculation method based on voting and transfer according to a preferred embodiment of the invention;
Fig. 5 is a structural block diagram of a device for acquiring website authority values according to an embodiment of the invention;
Fig. 6 is a structural block diagram of a device for acquiring website authority values according to a preferred embodiment of the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings and in conjunction with embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.
Fig. 2 is a flowchart of a method for acquiring website authority values according to an embodiment of the invention. As shown in Fig. 2, the method mainly comprises the following steps (step S202 to step S206):
Step S202: obtain a credible vote count for each site on the internet according to the number of its inbound links that come from the credible site set;
Step S204: assign all sites to different preset grades according to the credible vote counts of all sites on the internet, allocate a grade authority value to each grade, and set the site authority value of the sites in each grade to the grade authority value;
Step S206: add to the credible site set the sites whose site authority value is greater than an authority threshold and which do not belong to the credible site set.
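Steps S204 and S206 can be sketched as follows in Python, taking the credible vote counts of step S202 as input. This is a minimal illustration under assumptions: the function names, the ascending list of grade boundaries and the concrete numbers in the usage note are hypothetical, since the patent does not fix them.

```python
from bisect import bisect_right
from typing import Dict, List, Set


def assign_grade_values(
    votes: Dict[str, int],
    boundaries: List[int],
    grade_values: List[int],
) -> Dict[str, int]:
    """Step S204 sketch: map each site's credible vote count to a grade authority value.

    boundaries must be sorted ascending; a site with v votes falls into grade
    index bisect_right(boundaries, v), so grade_values needs
    len(boundaries) + 1 entries.
    """
    if len(grade_values) != len(boundaries) + 1:
        raise ValueError("need one grade value per interval")
    return {site: grade_values[bisect_right(boundaries, v)] for site, v in votes.items()}


def expand_credible_set(
    credible_sites: Set[str],
    authority: Dict[str, int],
    authority_threshold: int,
) -> Set[str]:
    """Step S206 sketch: add sites whose authority value exceeds the threshold."""
    promoted = {site for site, value in authority.items() if value > authority_threshold}
    return credible_sites | promoted
```

With boundaries [10, 50, 100, 200] and grade values [0, 1, 2, 3, 4], a site with 12 credible votes receives grade authority value 1 and a site with 250 votes receives 4. With an authority threshold of 3, only the latter would be added to the credible site set.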
In this embodiment, before obtaining the credible vote count of each site, the method comprises: extracting link data for the whole web to form raw data in the format of destination URL (Dest URL), source URL (Source URL) and anchor text (Anchor Text); obtaining URL list pages in a plurality of fields from the raw data and aggregating them into a seed site set; and removing low-quality sites, search engine optimization (SEO) sites and cheating (SPAM) sites from the seed site set to obtain the credible site set.
In this embodiment, after the sites whose site authority value is greater than the authority threshold and which do not belong to the credible site set have been added to the set, the method comprises: obtaining site information for all sites under a main domain and determining a domain authority value of the main domain according to the site information; generating a main-domain inheritance tree of the main domain according to the site information and the domain authority value; and determining, according to the main-domain inheritance tree, the domain authority value and a predetermined rule for decrementing authority values, the authority values of the subdomains contained in the main domain and of the sites contained in those subdomains.
In this embodiment, after those sites have been added to the credible site set, the method comprises: mining other main domains that have a site relationship with the current main domain, wherein the site relationship comprises a redirect or a site group; and determining how the authority value of the current main domain is transferred between the sites of the current main domain and the sites of the other main domains.
In this embodiment, before obtaining the credible vote count of each site, the method comprises: mining other main domains that have a site relationship with the current main domain, wherein the site relationship comprises a redirect or a site group; and determining how the authority value of the current main domain is transferred between the sites of the current main domain and the sites of the other main domains.
In this embodiment, the site information comprises: the credible vote count of the site, the grade of the site's grade authority value, the number of sites under the main domain to which the site belongs, and the number of sites under the subdomain to which the site belongs.
The above embodiment is explained further below. For example, the method for acquiring website authority values provided by this embodiment can be implemented with the following steps:
(1) Extract link data for the whole web to form raw data in destURL sourceURL Anchor format;
(2) Mine the hub pages of each field on the internet and aggregate them into an initial list of sites, i.e. the seed site set;
(3) Clean out of the seed site set the low-quality sites and the sites that are prone to SEO or spam, forming the credible site set;
(4) For each site, count the inbound links that come from the credible sites of (3), and calculate the site's credible vote count;
(5) According to the result of (4), design grading thresholds, divide the sites into several grades, and give each grade an authority value. Further, sites that have enough votes from credible sites but were not originally in the credible site set can also be added to it;
(6) Collect statistics on all sites within a domain and aggregate them into the authority value of the main domain;
(7) According to the intermediate result of (6), further calculate the features among main domain, subdomains and sites, and generate the main-domain inheritance tree;
(8) According to the main-domain inheritance tree of (7), determine how the authority value of the main domain is inherited by the sites in the domain;
(9) Mine site relationships between different main domains (including redirects, site groups, etc.) and determine how authority values are transferred between sites of different main domains;
(10) According to the results of (5), (8) and (9), combine them to give the authority value of each site.
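Step (1) of the list above produces records in destURL sourceURL Anchor form. The following Python sketch shows one way such a record could be reduced to a site-level link. The tab separator, the LinkRecord type and the function name are assumptions made for illustration; the patent names the three fields but not their encoding.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class LinkRecord(NamedTuple):
    dest_site: str    # site of the page being linked to
    source_site: str  # site of the page containing the link
    anchor: str       # anchor text of the link


def parse_link_record(line: str) -> LinkRecord:
    """Parse one 'destURL<TAB>sourceURL<TAB>anchor' line (assumed layout)."""
    dest_url, source_url, anchor = line.rstrip("\n").split("\t", 2)
    # urlsplit(...).hostname is the lower-cased host, i.e. the part of the
    # URL before the first '/', which the text calls the site.
    return LinkRecord(
        dest_site=urlsplit(dest_url).hostname or "",
        source_site=urlsplit(source_url).hostname or "",
        anchor=anchor,
    )
```

Reducing page-level links to site-level pairs at this point keeps the later vote counting independent of how many pages each site has.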
In practical applications, the credible site set can be called the "expert sites". The base grade of a site is determined from the outbound links of the "expert sites" that point to it and from the information inherited from its main domain, and a number of adjustment strategies are then applied to give the site's final authority grade score.
The method for acquiring website authority values provided by the above embodiment is described in more detail below with reference to Fig. 3, Fig. 4 and a preferred embodiment.
Refer to Fig. 3 and Fig. 4 together. Fig. 3 is a schematic diagram of main-domain transfer according to a preferred embodiment of the invention, and Fig. 4 is a flowchart of a website authority value calculation method based on voting and transfer according to a preferred embodiment of the invention. As shown in Fig. 4, the flow mainly comprises the following steps (step S402 to step S422):
Step S402: extract link data;
Step S404: mine seed sites;
Step S406: clean out low-quality sites;
Step S408: compute the "credible vote count" of each site. This comprises the following sub-steps:
Step S4082: assign voting weights. The voting power of each "credible site" is differentiated according to its own authority, for example into five weights from 0 to 4. The initial weight is 1, and subsequent weights are determined by iterative calculation;
Step S4084: limit the number of votes of each "credible site" to avoid the influence of SPAM. For example, each "credible site" can cast only one vote: no matter how many outbound links come from the main domain of an "expert site", they are regarded as a single vote;
Step S4086: sites that are not "credible sites" cannot vote. Outbound links from sites that are not "credible sites" are not counted as votes; outbound links from UGC sites are not counted either, nor are outbound links from UGC pages of "credible sites".
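Sub-steps S4082 to S4086 can be sketched together in Python. The text does not say how weights are combined into a vote count, so this sketch assumes that each voting main domain contributes its weight once and that the weights are summed. The Link type, the from_ugc flag and the per-domain weight table are likewise hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Link(NamedTuple):
    source_site: str    # site containing the link
    source_domain: str  # main domain of the source site
    dest_site: str      # site being linked to
    from_ugc: bool      # True if the link sits on a UGC site or UGC page


def weighted_credible_votes(
    links: Iterable[Link],
    credible_sites: Set[str],
    domain_weight: Dict[str, int],
) -> Dict[str, int]:
    """Sum the weights of distinct credible main domains voting for each site."""
    seen: Set[Tuple[str, str]] = set()  # (dest_site, source_domain) pairs already counted
    totals: Dict[str, int] = {}
    for link in links:
        # S4086: non-credible sources and UGC links cannot vote.
        if link.source_site not in credible_sites or link.from_ugc:
            continue
        # S4084: one vote per main domain, however many links it sends.
        key = (link.dest_site, link.source_domain)
        if key in seen:
            continue
        seen.add(key)
        # S4082: weight defaults to the initial weight 1 (assumed range 0-4).
        weight = domain_weight.get(link.source_domain, 1)
        totals[link.dest_site] = totals.get(link.dest_site, 0) + weight
    return totals
```

In an iterative setting, domain_weight would be refreshed from the previous round's authority levels before the next pass. That update rule is not specified in the text and is left out here.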
Step S410: grade the credible vote counts, i.e. calculate authority levels. According to the size of each site's "credible vote count", site authority is divided into several levels, for example five levels from 0 to 4. This comprises the following sub-steps:
Step S4102: update the grading thresholds for site authority. The initial thresholds are set from human experience, and subsequent thresholds are obtained by iterative calculation: the thresholds are recalculated according to how the "credible vote counts" of the sites near the previous grade boundaries (thresholds) have changed;
Step S4104: grade the site authority values. According to the relation between a site's "credible vote count" and the grading thresholds, the site is assigned to the corresponding authority grade. If the site was previously in the "credible site" list, it is graded normally; otherwise its authority level is the normal grade minus 1, and in the next round of calculation it is adjusted to the normal level according to its new "credible vote count";
Step S4106: remove sites that are no longer credible. If the "credible vote count" of a site in the "credible site" set is below a certain threshold, for example 10, the site is removed from the set;
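The grading rule of step S4104 and the removal rule of step S4106 can be sketched in Python as below. The boundary values, the clamping of the reduced grade at 0 and the function names are assumptions of this sketch. The default removal threshold of 10 is the example value given in the text.

```python
from bisect import bisect_right
from typing import Dict, List, Set


def grade_site(votes: int, boundaries: List[int], was_credible: bool) -> int:
    """S4104 sketch: normal grade for previously credible sites, else one lower.

    boundaries is an ascending list of grading thresholds; the normal grade
    is the number of boundaries that votes has reached.
    """
    normal = bisect_right(boundaries, votes)
    if was_credible:
        return normal
    return max(normal - 1, 0)  # clamping at 0 is an assumption


def drop_weak_credible_sites(
    credible_sites: Set[str],
    votes: Dict[str, int],
    min_votes: int = 10,
) -> Set[str]:
    """S4106 sketch: keep only credible sites with at least min_votes credible votes."""
    return {site for site in credible_sites if votes.get(site, 0) >= min_votes}
```

A site that enters the credible site set in one round would be passed was_credible=True in the next round, which is how the one-grade penalty disappears.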
Step S412: collect site statistics within each domain and compute the authority of the main domain. This step aggregates all sites under a main domain and selects the best site to represent the authority level of the main domain. It comprises the following sub-steps:
Step S4122: select the site under the main domain with the highest "total votes from expert sites" to represent the main domain; its vote total is also used as the vote count of the main domain;
Step S4124: take the maximum authority of all sites under the main domain as the authority level of the main domain;
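The two sub-steps of step S412 amount to two maximum operations over the sites of a main domain, as in this minimal Python sketch. The SiteStats and DomainSummary types and the function name are hypothetical.

```python
from typing import Dict, NamedTuple


class SiteStats(NamedTuple):
    votes: int  # credible vote count of the site
    level: int  # authority level of the site


class DomainSummary(NamedTuple):
    representative: str  # site chosen to represent the main domain
    votes: int           # vote count attributed to the main domain
    level: int           # authority level of the main domain


def summarize_main_domain(sites: Dict[str, SiteStats]) -> DomainSummary:
    """Pick the most-voted site as representative and the highest level as the domain level."""
    if not sites:
        raise ValueError("a main domain needs at least one site")
    representative = max(sites, key=lambda name: sites[name].votes)
    return DomainSummary(
        representative=representative,
        votes=sites[representative].votes,
        level=max(stats.level for stats in sites.values()),
    )
```

Note that the representative site and the site with the highest authority level need not be the same site, for example when a site that was not previously credible has its level reduced by one.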
Step S414: generate the main-domain transfer tree. This step determines whether the authority level of the main domain can be inherited by the sites under it, then whether the authority of each site can be inherited by the sub-sites under it, and so on, forming a "main-domain inheritance tree". It comprises the following sub-steps:
Step S4142: count the valid sites. Data such as DNS records and dead links are used to filter out sites under the main domain that cannot be accessed;
Step S4144: determine transferability. For each non-leaf node of the "main-domain transfer tree", the ratio between the node's "credible vote count" and the number of nodes at the next level below it is used to decide whether the authority of the node can be inherited by each of its child nodes. The threshold for the comparison is set manually;
Step S4146: mine channel sites. For platform sites on which users can register, the "main-domain transfer tree" concludes that authority cannot be passed to child nodes. In practice, however, some child nodes are maintained by the site owner and can inherit the authority. The invention mines the channel sites that can inherit authority by methods such as mining site maps and links on the home page;
Step S416: grade transferability, i.e. determine the mode of inheritance. This step uses statistics on site types to decide between inheritance that decays level by level and inheritance without decay. It comprises the following sub-steps:
Step S4162: mine "localized" sites. If the localized sites in a domain account for more than a certain proportion, for example 0.25, of the total number of sites in the domain, the domain is considered a "localized" domain;
Step S4164: if a site is a "localized" site and its domain is a "localized" domain, the authority value of the domain can be passed directly to the localized site; otherwise the site receives the authority value of the domain at a discount, for example minus fifteen;
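The inheritance rule of step S416 can be sketched in Python as follows. The text gives 0.25 as an example proportion; treating the comparison as inclusive, modelling the discount as a subtraction clamped at 0, and the function names are all assumptions of this sketch. The size of the discount is left as a caller-supplied parameter.

```python
def is_localized_domain(localized_sites: int, total_sites: int, min_ratio: float = 0.25) -> bool:
    """S4162 sketch: a domain is 'localized' if enough of its sites are localized."""
    if total_sites <= 0:
        return False
    return localized_sites / total_sites >= min_ratio  # inclusive comparison is an assumption


def inherited_authority(
    domain_authority: int,
    site_is_localized: bool,
    domain_is_localized: bool,
    discount: int,
) -> int:
    """S4164 sketch: pass the domain authority unchanged or at a discount."""
    if site_is_localized and domain_is_localized:
        return domain_authority  # inheritance without decay
    return max(domain_authority - discount, 0)  # decayed inheritance, never negative
```

For a domain with 30 localized sites out of 100, a localized site such as a city sub-site would receive the domain's authority value unchanged, while a non-localized site under the same domain would receive the discounted value.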
Step S418: clean out low-quality sites;
Step S420: adjust grades according to relationships between sites, i.e. adjust how authority values are transferred between related sites. This comprises the following sub-steps:
Step S4202: if site1 redirects to site2 and the authority value of site1 is higher than that of site2, site2 can obtain the authority value of site1;
Step S4204: if siteA, siteB, ..., siteN form a site group and the authority value of siteM is known, the other sites in the group can obtain an authority value by discounting the authority value of siteM, for example minus fifteen;
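The two relationship rules of step S420 can be sketched in Python. The function names are hypothetical. The discount is modelled as a caller-supplied subtraction clamped at 0. Keeping a group member's existing value when it is already higher than the discounted one is an assumption that the text does not state.

```python
from typing import Dict, Iterable


def apply_redirect(authority: Dict[str, int], source: str, target: str) -> Dict[str, int]:
    """S4202 sketch: the redirect target takes over the source's higher authority value."""
    updated = dict(authority)
    if updated.get(source, 0) > updated.get(target, 0):
        updated[target] = updated[source]
    return updated


def apply_site_group(
    authority: Dict[str, int],
    group: Iterable[str],
    known_site: str,
    discount: int,
) -> Dict[str, int]:
    """S4204 sketch: other group members get the known site's value minus a discount."""
    updated = dict(authority)
    discounted = max(updated[known_site] - discount, 0)
    for site in group:
        if site == known_site:
            continue
        # Never lower a value the member already has (assumption).
        updated[site] = max(updated.get(site, 0), discounted)
    return updated
```

Both functions return a new mapping instead of mutating their input, so the results of this step can later be compared with the results of the other steps.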
Step S422: according to the results of steps S410, S416 and S420, combine them to give the authority value of each site, for example by taking the maximum of the authority values from those three steps.
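The example combination rule of step S422, taking the maximum of the three results, can be written as a short Python sketch. The function and parameter names are hypothetical, and treating a site that is missing from one result as having value 0 there is an assumption.

```python
from typing import Dict


def final_authority(
    graded: Dict[str, int],      # result of the vote grading (step S410)
    inherited: Dict[str, int],   # result of main-domain inheritance (step S416)
    relational: Dict[str, int],  # result of inter-site relations (step S420)
) -> Dict[str, int]:
    """Return, for every site seen in any input, the maximum of its three values."""
    sites = set(graded) | set(inherited) | set(relational)
    return {
        site: max(graded.get(site, 0), inherited.get(site, 0), relational.get(site, 0))
        for site in sites
    }
```

A newly launched site with few credible votes can therefore still end up with a high authority value if it inherits one from its main domain or from a related site.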
The acquisition methods of website authority's value that employing above-described embodiment provides, can improve the accuracy rate of search engine, can improve the sequence quality of Search Results simultaneously, reduce the quantity of inauthoritativeness website in Search Results, make when Search Results is offered the user, the user obtains better experience.
Fig. 5 is the structured flowchart according to the deriving means of website authority's value of the embodiment of the present invention, this device is in order to the acquisition methods of website authority's value of realizing above-described embodiment and providing, as shown in Figure 5, this device mainly comprises: acquisition module 10, processing module 20 and interpolation module 30.Wherein, acquisition module 10 is used for entering chain according to all of each website on the internet and comes from the quantity that enters chain in the set of credible station, obtains the credible votes of each website; Processing module 20 is used for different gears all website filings are extremely default according to the credible votes of all websites on the internet, and is that each gear distributes a gear authority value, and the authoritative value of the website of website in each gear is set as gear authority value; Add module 30, be used for website authority value is worth threshold value greater than authority and does not belong to the website of gathering at credible station adding the set of credible station to.
Fig. 6 is a structural block diagram of a device for acquiring website authority values according to a preferred embodiment of the present invention. As shown in Fig. 6, the device further comprises: an acquiring and determining module 40, configured to obtain the site information of all websites under a main domain and determine the domain authority value of the main domain according to the site information; a generation module 50, configured to generate a main domain inheritance tree of the main domain according to the site information and the domain authority value; and a first determination module 60, configured to determine the authority values of the subdomains contained in the main domain and the authority values of the websites contained in those subdomains according to the main domain inheritance tree, the domain authority value and a predetermined authority value decrement rule.
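One way to picture the main domain inheritance tree and the decrement rule is the sketch below. The tree layout (main domain, then subdomains, then the websites under them), the fixed decrement per level and the names `DomainNode` and `propagate_authority` are assumptions for illustration; the text only states that a predetermined decrement rule is applied.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DomainNode:
    """A node of the inheritance tree: a main domain, a subdomain or a website."""
    name: str
    children: List["DomainNode"] = field(default_factory=list)


def propagate_authority(root: DomainNode, domain_value: float,
                        step: float = 10.0) -> Dict[str, float]:
    """Walk the tree from the main domain down, lowering the value by `step` per level."""
    values: Dict[str, float] = {}
    stack: List[Tuple[DomainNode, float]] = [(root, domain_value)]
    while stack:
        node, value = stack.pop()
        values[node.name] = value
        # Assumption: the decrement rule is a fixed subtraction, floored at zero.
        child_value = max(value - step, 0.0)
        for child in node.children:
            stack.append((child, child_value))
    return values
```

For example, with a domain authority value of 80 and a step of 10, each subdomain would receive 70 and each website under a subdomain 60.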
In the preferred embodiment, the device further comprises: a mining module 70, configured to mine other main domains that have a website relationship with the current main domain, wherein the website relationship comprises a redirect or a site group; and a second determination module 80, configured to determine how the authority value of the current main domain is transferred between the websites in the current main domain and the websites in the other main domains.
In the preferred embodiment, the site information comprises: the credible vote count of the website, the grade of the grade authority value to which the website belongs, the number of websites under the main domain to which the website belongs, and the number of websites under the subdomain to which the website belongs.
Of course, the device for acquiring website authority values provided by this preferred embodiment represents only one preferred structure. Practical applications need not be limited to it, as long as the method for acquiring website authority values provided by the above embodiment can be implemented. For example, the method can also be implemented by a device composed of the following functional modules:
(1) Preprocessing module: this module processes the external link data of the whole web to form raw data in the format "destURL sourceURL Anchor".
(2) Link feature statistics module: this module counts, among the external links pointing to each website, the inlinks that come from credible sites, and calculates the credible vote count.
(3) Link grading module: this module automatically sets the thresholds of the authority value grades and, according to the statistics of the link feature statistics module, grades the authority value of each website into different grades. It is also responsible for improving the recall of credible sites.
(4) Low-quality site cleaning module: this module cleans up low-quality sites among the credible sites, as well as sites that are prone to SEO and spam. It improves the purity of the credible sites and guarantees the authority of the votes cast by credible sites.
(5) Main domain transfer module: this module aggregates authority values at the main domain and subdomain levels, and then determines how the authority value is passed to other websites within the domain.
(6) Inter-site transfer module: this module determines how authority values are transferred between websites. It comprises submodules such as redirect transfer and site group transfer, which handle the redirect relationships and the site group relationships between sites respectively.
(7) Composite rating module: this module aggregates the output of the above modules and gives the final authority value of each website.
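The first two modules in the list above can be sketched as follows. The sketch assumes tab-separated records in the order destination URL, source URL, anchor text, identifies a site by its hostname, and skips links within the same site; none of these details are fixed by the text, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Assumption: a site is identified by the lowercase hostname of its URL."""
    return (urlsplit(url).hostname or "").lower()


def count_credible_votes(records: Iterable[str], credible: Set[str]) -> Counter:
    """Count, per destination site, the inlinks whose source site is in the credible set."""
    votes: Counter = Counter()
    for line in records:
        # Assumed record layout: destURL <TAB> sourceURL <TAB> anchor text.
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) < 2:
            continue  # skip malformed records
        dest, source = site_of(parts[0]), site_of(parts[1])
        if not dest or not source or dest == source:
            continue  # assumption: only links between different sites count
        if source in credible:
            votes[dest] += 1
    return votes
```

The resulting counter can be passed directly to a grading step, since sites that never appear as a destination simply have a vote count of zero.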
With the device for acquiring website authority values provided by the above embodiment, the accuracy of a search engine and the ranking quality of its search results can be improved, and the number of non-authoritative websites in the search results can be reduced, so that users get a better experience when the search results are presented to them.
From the above description, it can be seen that the present invention achieves the following technical effects. A credible site set comprising high-quality websites is selected first, and the authority values of websites are determined by the votes of the high-quality websites in the credible site set; other websites that satisfy the authority value threshold but are not yet in the credible site set are then added to it. This solves the problems in the related art that website authority value calculation is affected by spam and similar issues and cannot reflect the authority of newly launched high-quality resource websites. As a result, the influence of spam and similar issues on authority value calculation is avoided, and the authority of newly launched websites is correctly reflected.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be performed in an order different from the one described herein. Alternatively, they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for acquiring website authority values, characterized by comprising:
obtaining the credible vote count of each website on the internet according to the number of inlinks, among all inlinks of the website, that come from a credible site set;
filing all websites into different preset grades according to the credible vote counts of all websites on the internet, assigning a grade authority value to each said grade, and setting the website authority value of the websites in each said grade to said grade authority value;
adding websites whose website authority value is greater than an authority value threshold and which do not belong to said credible site set to said credible site set.
2. The method according to claim 1, characterized in that, before obtaining the credible vote count of said each website according to the number of inlinks, among all inlinks of each website on the internet, that come from the credible site set, the method comprises:
extracting link data of the whole web to form raw data in the format of destination uniform resource locator (Dest URL), source uniform resource locator (Source URL) and anchor text (Anchor Text);
obtaining URL list pages of a plurality of fields from said raw data, and aggregating said URL list pages into a seed site set;
removing low-quality websites, search engine optimization (SEO) websites and spam websites from said seed site set to obtain said credible site set.
3. The method according to claim 1 or 2, characterized in that, after adding the websites whose website authority value is greater than the authority value threshold and which do not belong to said credible site set to said credible site set, the method comprises:
obtaining the site information of all websites under a main domain, and determining the domain authority value of said main domain according to said site information;
generating a main domain inheritance tree of said main domain according to said site information and said domain authority value;
determining the authority values of the subdomains contained in said main domain and the authority values of the websites contained in said subdomains according to said main domain inheritance tree, said domain authority value and a predetermined authority value decrement rule.
4. The method according to claim 1 or 2, characterized in that, after adding the websites whose website authority value is greater than the authority value threshold and which do not belong to said credible site set to said credible site set, the method comprises:
mining other main domains that have a website relationship with a current main domain, wherein said website relationship comprises: a redirect or a site group;
determining how the authority value of said current main domain is transferred between the websites in said current main domain and the websites in said other main domains.
5. The method according to claim 3, characterized in that, before obtaining the credible vote count of said each website according to the number of inlinks, among all inlinks of each website on the internet, that come from the credible site set, the method comprises:
mining other main domains that have a website relationship with a current main domain, wherein said website relationship comprises: a redirect or a site group;
determining how the authority value of said current main domain is transferred between the websites in said current main domain and the websites in said other main domains.
6. The method according to claim 4, characterized in that said site information comprises: the credible vote count of the website, the grade of the grade authority value to which the website belongs, the number of websites under the main domain to which the website belongs, and the number of websites under the subdomain to which the website belongs.
7. A device for acquiring website authority values, characterized by comprising:
an acquisition module, configured to obtain the credible vote count of each website on the internet according to the number of inlinks, among all inlinks of the website, that come from a credible site set;
a processing module, configured to file all websites into different preset grades according to the credible vote counts of all websites on the internet, assign a grade authority value to each said grade, and set the website authority value of the websites in each said grade to said grade authority value;
an adding module, configured to add websites whose website authority value is greater than an authority value threshold and which do not belong to said credible site set to said credible site set.
8. The device according to claim 7, characterized in that said device further comprises:
an acquiring and determining module, configured to obtain the site information of all websites under a main domain and determine the domain authority value of said main domain according to said site information;
a generation module, configured to generate a main domain inheritance tree of said main domain according to said site information and said domain authority value;
a first determination module, configured to determine the authority values of the subdomains contained in said main domain and the authority values of the websites contained in said subdomains according to said main domain inheritance tree, said domain authority value and a predetermined authority value decrement rule.
9. The device according to claim 7 or 8, characterized in that said device further comprises:
a mining module, configured to mine other main domains that have a website relationship with a current main domain, wherein said website relationship comprises: a redirect or a site group;
a second determination module, configured to determine how the authority value of said current main domain is transferred between the websites in said current main domain and the websites in said other main domains.
10. The device according to claim 9, characterized in that said site information comprises: the credible vote count of the website, the grade of the grade authority value to which the website belongs, the number of websites under the main domain to which the website belongs, and the number of websites under the subdomain to which the website belongs.
CN2013100845997A 2013-03-15 2013-03-15 Method and device for acquiring website authority values Pending CN103116660A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013100845997A CN103116660A (en) 2013-03-15 2013-03-15 Method and device for acquiring website authority values

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013100845997A CN103116660A (en) 2013-03-15 2013-03-15 Method and device for acquiring website authority values

Publications (1)

Publication Number Publication Date
CN103116660A true CN103116660A (en) 2013-05-22

Family

ID=48415033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013100845997A Pending CN103116660A (en) 2013-03-15 2013-03-15 Method and device for acquiring website authority values

Country Status (1)

Country Link
CN (1) CN103116660A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1996299A (en) * 2006-12-12 2007-07-11 孙斌 Ranking method for web page and web site
JP2010108363A (en) * 2008-10-31 2010-05-13 Yahoo Japan Corp Retrieval processor, retrieval processing method and program which perform seed selection of crawler for specialty retrieval by utilizing click log
CN102541949A (en) * 2010-12-31 2012-07-04 百度在线网络技术(北京)有限公司 Method and equipment for determining authority values on basis of preset link relation of pages
US20120246134A1 (en) * 2011-03-22 2012-09-27 Brightedge Technologies, Inc. Detection and analysis of backlink activity
CN102915369A (en) * 2012-11-01 2013-02-06 吉林大学 Method for ranking web pages on basis of hyperlink source analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吕林涛等: "面向垂直搜索引擎的主题提取算法", 《计算机工程》 *
李绍华等: "基于层次分类的页面排序算法", 《计算机工程》 *
李绍华等: "搜索引擎页面排序算法研究综述", 《计算机应用研究》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287444A (en) * 2019-07-02 2019-09-27 郑州悉知信息科技股份有限公司 Website detection method, device and storage medium
CN110287444B (en) * 2019-07-02 2021-06-25 郑州悉知信息科技股份有限公司 Website detection method and device and storage medium
CN111523049A (en) * 2020-04-15 2020-08-11 苏州跃盟信息科技有限公司 Method and device for determining authority value of object, storage medium and processor
CN111523049B (en) * 2020-04-15 2023-06-13 苏州跃盟信息科技有限公司 Method, device, storage medium and processor for determining authority value of object
CN111966946A (en) * 2020-09-10 2020-11-20 北京百度网讯科技有限公司 Method, device, equipment and storage medium for identifying authority value of page
CN113360798A (en) * 2021-06-02 2021-09-07 北京百度网讯科技有限公司 Flooding data identification method, device, equipment and medium
CN113360798B (en) * 2021-06-02 2024-02-27 北京百度网讯科技有限公司 Method, device, equipment and medium for identifying flooding data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130522