CN108255866A - Method and apparatus for checking links in a website - Google Patents

Method and apparatus for checking links in a website

Info

Publication number
CN108255866A
Authority
CN
China
Prior art keywords
page
data set
text
linked object
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611248655.6A
Other languages
Chinese (zh)
Other versions
CN108255866B (en
Inventor
潘峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201611248655.6A
Publication of CN108255866A
Application granted
Publication of CN108255866B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3688 Test management for test execution, e.g. scheduling of test suites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and apparatus for checking links in a website. The method includes: obtaining a first page of a website to be checked and a linked object in the first page, where the linked object is used to jump to a second page; obtaining a first data set contained in the linked object and a second data set contained in the second page; comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result; and determining, according to the comparison result, whether the linked object is a false link. The invention solves the technical problem that the prior art checks a website for false links manually, which results in low efficiency and low accuracy.

Description

Method and apparatus for checking links in a website
Technical field
The present invention relates to the field of website testing, and in particular to a method and apparatus for checking links in a website.
Background
With the development of Internet technology, websites have become the main tool people use to obtain information from the Internet, so the quality of a website directly affects the user experience. In website quality testing, an important indicator is the number of false links within the website, which directly affects the user's experience on the site. Suppose a user opens a website and sees a web page as shown in Fig. 1(a). A column of link titles is displayed on the left side of the page, and by clicking any link title in the column the user can jump to the web page corresponding to that title. In actual operation, suppose the user clicks the link title "Build the dream, chase the dream, fulfill the dream: Heavenly Palace No. 2 successfully continues the dream journey" in the column, and the page reached is, as shown in Fig. 1(b), a news story about people across the country celebrating the Mid-Autumn Festival. In this case the user will conclude that the website's links are designed to deceive users into clicking, and in serious cases this can also cause the website to lose users. Checking a website for false links is therefore particularly important.
At present, the prior art relies on manual work to check a website for false links: a person clicks each link title in a web page and checks whether the title is consistent with the content of the page actually opened, in order to judge whether the link is a false link. The drawback is that manual inspection is severely limited. Current websites usually contain many web pages, so the check consumes a great deal of labor and is inefficient. In addition, manual inspection depends heavily on subjective human judgment, and interference from various factors may affect the result, so accuracy is not high.
No effective solution has yet been proposed for the above problem that the prior art checks a website for false links manually, which results in low efficiency and low accuracy.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for checking links in a website, to at least solve the technical problem that the prior art checks a website for false links manually, which results in low efficiency and low accuracy.
According to one aspect of the embodiments of the present invention, a method for checking links in a website is provided, including: obtaining a first page of a website to be checked and a linked object in the first page, where the linked object is used to jump to a second page; extracting a first data set contained in the linked object and a second data set contained in the second page; comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result; and determining, according to the comparison result, whether the linked object is a false link.
Further, comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result includes: finding the data elements that the first data set and the second data set have in common; counting the number of identical data elements; and calculating the ratio of the number of identical data elements to the number of data elements contained in the first data set.
Further, determining whether the linked object is a false link according to the comparison result includes: if the ratio is greater than or equal to a preset threshold, determining that the linked object is a normal link; and if the ratio is less than the preset threshold, determining that the linked object is a false link.
Further, obtaining the first page of the website to be checked and the linked object in the first page includes: crawling the website to be checked with a crawler to obtain the first page of the website to be checked and the linked object in the first page.
Further, obtaining the first data set contained in the linked object and the second data set contained in the second page includes: extracting a first text string contained in the linked object and a second text string contained in the second page; performing word segmentation on the first text string and the second text string to obtain a third data set and a fourth data set; and, according to a preset algorithm model, extracting first target data elements from the third data set and putting them into the first data set, and extracting second target data elements from the fourth data set and putting them into the second data set.
Further, before extracting the first text string contained in the linked object and the second text string contained in the second page, the method further includes: extracting the page content of the second page based on a text density extraction algorithm. This step includes: obtaining the document tree of the second page; extracting the text characters in each tag node of the document tree, and counting the number of text characters of each tag node in the document tree; calculating the text density of each tag node, where the text density is the ratio of the number of text characters of the tag node to the total number of text characters of the document tree; and extracting the text content of the tag node with the highest text density as the page content of the second page.
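As an illustration only, the text density step described above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes that a node's character count covers only the text placed directly inside that tag (not text in its descendants), that script and style content is ignored, and that the standard-library HTML parser is an acceptable stand-in for building the document tree. The names `DensityParser` and `densest_node_text` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag
SKIP_TAGS = {"script", "style"}                            # content is not page text


class DensityParser(HTMLParser):
    """Collects the text found directly inside each tag node."""

    def __init__(self):
        super().__init__()
        self.nodes = []        # one [tag, text_parts] entry per tag node
        self.open_nodes = []   # indices into self.nodes for currently open tags

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.nodes.append([tag, []])
        self.open_nodes.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Close the innermost open node with this tag name, if any.
        for i in range(len(self.open_nodes) - 1, -1, -1):
            if self.nodes[self.open_nodes[i]][0] == tag:
                del self.open_nodes[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_nodes:
            tag, parts = self.nodes[self.open_nodes[-1]]
            if tag not in SKIP_TAGS:
                parts.append(text)


def densest_node_text(html):
    """Return (text, density) for the tag node with the highest text density."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    counted = [(" ".join(parts), sum(len(p) for p in parts))
               for _, parts in parser.nodes]
    total = sum(count for _, count in counted)
    if total == 0:
        return "", 0.0
    text, count = max(counted, key=lambda item: item[1])
    return text, count / total
```

For a page with a short navigation block and one long article block, `densest_node_text` returns the article text, since that node holds the largest share of the page's text characters.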
According to another aspect of the embodiments of the present invention, an apparatus for checking links in a website is also provided, including: an acquisition module, configured to obtain a first page of a website to be checked and a linked object in the first page, where the linked object is used to jump to a second page; an extraction module, configured to extract a first data set contained in the linked object and a second data set contained in the second page; a comparing module, configured to compare the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result; and a determining module, configured to determine, according to the comparison result, whether the linked object is a false link.
Further, the comparing module includes: a searching module, configured to find the data elements that the first data set and the second data set have in common; a statistical module, configured to count the number of identical data elements; and a first computing module, configured to calculate the ratio of the number of identical data elements to the number of data elements contained in the first data set.
Further, the determining module includes: a second determining module, configured to determine that the linked object is a normal link if the ratio is greater than or equal to a preset threshold; and a third determining module, configured to determine that the linked object is a false link if the ratio is less than the preset threshold.
Further, the acquisition module includes: a third acquisition module, configured to crawl the website to be checked with a crawler to obtain the first page of the website to be checked and the linked object in the first page.
Further, the extraction module includes: a first extraction module, configured to extract a first text string contained in the linked object and a second text string contained in the second page; a first processing module, configured to perform word segmentation on the first text string and the second text string to obtain a third data set and a fourth data set; and a second extraction module, configured to, according to a preset algorithm model, extract first target data elements from the third data set and put them into the first data set, and extract second target data elements from the fourth data set and put them into the second data set.
Further, the apparatus also includes: a third extraction module, configured to extract the page content of the second page based on a text density extraction algorithm. The third extraction module includes: a fourth acquisition module, configured to obtain the document tree of the second page; a second processing module, configured to extract the text characters in each tag node of the document tree and count the number of text characters of each tag node in the document tree; a second computing module, configured to calculate the text density of each tag node, where the text density is the ratio of the number of text characters of the tag node to the total number of text characters of the document tree; and a fourth extraction module, configured to extract the text content of the tag node with the highest text density as the page content of the second page.
In the embodiments of the present invention, a first page of a website to be checked and a linked object in the first page are obtained, where the linked object is used to jump to a second page; a first data set contained in the linked object and a second data set contained in the second page are extracted; the data elements contained in the first data set are compared with the data elements contained in the second data set to obtain a comparison result; and whether the linked object is a false link is determined according to the comparison result. This achieves the purpose of checking a website for false links by comparing whether each page of the website to be checked describes the same thing as the link title it belongs to. It thereby achieves the technical effect of improving the efficiency and accuracy of link checking in a website, and solves the technical problem that the prior art checks a website for false links manually, which results in low efficiency and low accuracy.
Brief description of the drawings
The drawings described here are provided for further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1(a) is a schematic diagram of a website page according to the prior art;
Fig. 1(b) is a schematic diagram of a website page according to the prior art;
Fig. 2 is a flowchart of a method for checking links in a website according to an embodiment of the present invention;
Fig. 3 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention;
Fig. 4 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention;
Fig. 5 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention;
Fig. 6 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention;
Fig. 7 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention; and
Fig. 8 is a schematic diagram of an apparatus for checking links in a website according to an embodiment of the present invention.
Detailed description
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in an order other than those illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
First, some of the nouns and terms that appear in the description of the embodiments of this application are explained as follows:
False link: a link within a website whose title does not match the actual content of the page it points to, in other words a link with a mismatched title. In the embodiments of this application a false link is distinct from a broken link. A broken link is a link that cannot be accessed or whose access is interrupted, whereas a false link is a link whose title is inconsistent with the content described by the page it points to.
Embodiment 1
According to an embodiment of the present invention, a method embodiment for checking links in a website is provided. It should be noted that the steps illustrated in the flowcharts of the drawings can be executed in a computer system, such as one running a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in an order different from the one here.
Fig. 2 is a flowchart of a method for checking links in a website according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step S202: obtain a first page of a website to be checked and a linked object in the first page, where the linked object is used to jump to a second page.
Specifically, in the above step, the first page may be any one or more of the pages included in the website to be checked. The linked object may be a piece of text with an embedded link address, or a picture, displayed in the first page; clicking the text or picture jumps to another web page of the website to be checked, that is, the second page. To check a website for false links, it is first necessary to obtain all pages included in the website to be checked and all linked objects contained in those pages.
In an optional embodiment, taking the web pages shown in Fig. 1(a) and Fig. 1(b) as an example, the web page shown in Fig. 1(a) may be the first page. The page contains six linked objects, each of which is a link title: "China's railways expected to carry 7.8 million passengers on the second day of the Mid-Autumn holiday", "Preview of following the Premier to America: a three-country trip with highlights worth looking forward to", "Build the dream, chase the dream, fulfill the dream: Heavenly Palace No. 2 successfully continues the dream journey", "China's space station may stay in orbit for more than ten years", "Sharing one moon, sharing thoughts of home and country: people across the country celebrate the Mid-Autumn Festival", and "Making urban and rural environments cleaner and more beautiful". Each link title points to a page, that is, a second page. The user can enter the corresponding second page by clicking any link title in the first page. For example, clicking the link title "Sharing one moon, sharing thoughts of home and country: people across the country celebrate the Mid-Autumn Festival" leads to the second page shown in Fig. 1(b).
It should be noted here that a complete website generally includes multiple web pages, and each web page usually contains one or more linked objects that point to a hyperlink target; the target may be another page or another position on the same page. When checking a website for false links, it is necessary to check whether the linked objects on all pages of the website are false links.
Step S204: obtain the first data set contained in the linked object and the second data set contained in the second page.
Specifically, in the above step, taking a linked object that is a piece of text as an example, the first data set may be the entity set extracted from a link title in the first page; it may be one or several words that represent the meaning of the link title. The second data set may be the entity set extracted from the second page to which the link title points; it too may be one or several words, used to characterize the information contained in the second page.
In an optional embodiment, still taking the web pages shown in Fig. 1(a) and Fig. 1(b) as an example, entities such as "Heavenly Palace No. 2" can be extracted from the first page's link title "Build the dream, chase the dream, fulfill the dream: Heavenly Palace No. 2 successfully continues the dream journey", so the first data set contained in that link title is "Heavenly Palace No. 2". Entities such as "moon" and "Mid-Autumn Festival" can be extracted from the first page's link title "Sharing one moon, sharing thoughts of home and country: people across the country celebrate the Mid-Autumn Festival", so the first data set contained in that link title is "moon, Mid-Autumn Festival". Entities such as "moon", "Mid-Autumn Festival", "university", and "school-badge mooncake" can be extracted from the page content of the second page shown in Fig. 1(b), so the second data set contained in the second page is "moon, Mid-Autumn Festival, university, school-badge mooncake, ...".
It should be noted here that because a linked object can be text or a picture, the entities contained in the first data set and the second data set are not limited to words and can also take other forms, such as pictures.
Step S206: compare the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result.
Specifically, in the above step, the data elements may be the entities contained in the first data set and the second data set. After the first data set contained in a linked object in the first page and the second data set contained in the second page to which that linked object points have been extracted, the data elements contained in the first data set are compared with the data elements contained in the second data set to obtain the corresponding comparison result.
It should be noted here that the number of main entities extracted from a page is generally far greater than the number extracted from a link title. Thus, in an optional embodiment, the comparison result may be whether the second data set contains all the data elements of the first data set.
Step S208: determine, according to the comparison result, whether the linked object is a false link.
Specifically, in the above step, after the data elements contained in the first data set have been compared with those contained in the second data set, in an optional embodiment, whether the link title is a false link can be determined by checking whether the second data set extracted from the second page contains all the data elements of the first data set extracted from the link title. If the second data set extracted from the second page contains all the data elements of the first data set extracted from the link title, that is, the entity set extracted from the page content of the second page includes the entities extracted from the link title, the link title is determined to be a normal link. If the second data set extracted from the second page does not contain the data elements of the first data set extracted from the link title, that is, the entity set extracted from the page content of the second page does not include the entities extracted from the link title, the link title is determined to be a false link.
In an optional embodiment, still taking the web pages shown in Fig. 1(a) and Fig. 1(b) as an example, the second data set contained in the second page shown in Fig. 1(b) is "moon, Mid-Autumn Festival, university, school-badge mooncake, ...". This data set contains the first data set "moon, Mid-Autumn Festival" extracted from the first page's link title "Sharing one moon, sharing thoughts of home and country: people across the country celebrate the Mid-Autumn Festival". It does not contain the first data set "Heavenly Palace No. 2" extracted from the first page's link title "Build the dream, chase the dream, fulfill the dream: Heavenly Palace No. 2 successfully continues the dream journey". Therefore, if the link title "Sharing one moon, sharing thoughts of home and country: people across the country celebrate the Mid-Autumn Festival" points to the second page shown in Fig. 1(b), the link title and the page it points to describe the same thing, which shows that the link title is a normal link. If the link title "Build the dream, chase the dream, fulfill the dream: Heavenly Palace No. 2 successfully continues the dream journey" points to the second page shown in Fig. 1(b), the link title and the page it points to do not describe the same thing, which shows that the link title is a false link.
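The containment check in this example can be sketched with Python sets. This is only an illustrative sketch: the function name `is_false_link_by_containment` is hypothetical, and treating a partially matched title as a false link is an assumption, since the text only describes the two cases "contains all" and "does not contain".

```python
def is_false_link_by_containment(title_entities, page_entities):
    """Return True when the page entities do not include every title entity.

    Assumption: a title whose entities are only partly found in the page
    is also reported as a false link.
    """
    return not set(title_entities) <= set(page_entities)


# Entity sets taken from the Fig. 1(a) / Fig. 1(b) example in the text.
page_entities = {"moon", "Mid-Autumn Festival", "university", "school-badge mooncake"}
moon_title = {"moon", "Mid-Autumn Festival"}
space_title = {"Heavenly Palace No. 2"}

print(is_false_link_by_containment(moon_title, page_entities))   # False: normal link
print(is_false_link_by_containment(space_title, page_entities))  # True: false link
```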
As can be seen from the above, in the above embodiment of this application, the content of all pages in the website to be checked and the link titles they belong to are obtained, and entity extraction is performed on both the page content and the link titles. After the main entity objects of the page content and of the link title it belongs to have been obtained, the main entity objects of the two are compared, and whether the link title is a false link is determined according to the comparison result. It should be noted that because the number of main entities extracted from page content generally exceeds the number extracted from a link title, in an optional embodiment, whether the link title a page belongs to is a false link can be determined by checking whether the entity set extracted from the page content contains the entities extracted from that link title. The solution disclosed in the above embodiment achieves the purpose of checking a website for false links by comparing whether each page of the website to be checked describes the same thing as the link title it belongs to. It thereby achieves the technical effect of improving the efficiency and accuracy of link checking in a website, and solves the technical problem that the prior art checks a website for false links manually, which results in low efficiency and low accuracy.
In an optional embodiment, as shown in Fig. 3, comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result may include the following steps:
Step S302: find the data elements that the first data set and the second data set have in common;
Step S304: count the number of identical data elements;
Step S306: calculate the ratio of the number of identical data elements to the number of data elements contained in the first data set.
Specifically, in the above embodiment, after the first data set has been extracted from a link title in the first page and the second data set has been extracted from the second page to which the link title points, the data elements contained in the first data set can be compared with those contained in the second data set to find the data elements the two sets have in common, that is, the entities shared by the link title and the second page it points to, and the number of identical entities can be counted. In an optional embodiment, the ratio of the number of identical entities to the number of entities contained in the first data set can then be calculated.
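Steps S302 to S306 can be sketched as a set intersection followed by a division. In this minimal sketch the function name `overlap_ratio` is hypothetical, and returning 0.0 when the first data set is empty is an assumption, since the text does not say how an empty title entity set should be handled.

```python
def overlap_ratio(first_data_set, second_data_set):
    """Ratio of shared elements to the size of the first data set."""
    first, second = set(first_data_set), set(second_data_set)
    if not first:
        return 0.0  # assumption: no title entities means nothing matched
    shared = first & second          # S302: identical data elements
    return len(shared) / len(first)  # S304 and S306: count, then divide
```

With the example sets from the text, a title set of "moon" and "Mid-Autumn Festival" is fully contained in the page set, so all of its elements are shared, while a title set holding only "Heavenly Palace No. 2" shares none.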
In an optional embodiment, as shown in Fig. 4, determining whether the linked object is a false link according to the comparison result includes:
Step S402: if the ratio is greater than or equal to a preset threshold, determine that the linked object is a normal link.
Step S404: if the ratio is less than the preset threshold, determine that the linked object is a false link.
Specifically, in the above embodiment, after the ratio of the number of identical entities to the number of entities contained in the first data set has been calculated, the ratio is compared with the preset threshold to determine whether the linked object is a false link. If the ratio is greater than or equal to the preset threshold, the link title is determined to be consistent with what the page it points to describes, and the link title is a normal link. If the ratio is less than the preset threshold, the link title is determined to be inconsistent with what the page it points to describes, and the link title is a false link.
In an optional embodiment, the preset threshold may be any value greater than 0.5 and no greater than 1.
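The threshold decision of steps S402 and S404 could be sketched as below. The function name `classify_link` and the default threshold of 0.8 are assumptions chosen for illustration; the text only states that the threshold lies above 0.5 and at most 1.

```python
def classify_link(ratio, threshold=0.8):
    """Label a link 'normal' or 'false' from its entity overlap ratio.

    The default threshold of 0.8 is a hypothetical value inside the
    range the text allows (greater than 0.5, at most 1).
    """
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must be greater than 0.5 and at most 1")
    return "normal" if ratio >= threshold else "false"
```

A stricter threshold of 1.0 reproduces the earlier containment rule, in which every entity of the link title must appear in the page.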
By above-described embodiment, it can realize that machine judges whether links header and the webpage that it is directed toward are consistent, avoid The subjective factor of artificial judgment influences, and makes Rule of judgment more standardized.
In an optional embodiment, as shown in Fig. 5, obtaining the first page of the website to be checked and the linked object in the first page in step S202 may include:
Step S502, crawling the website to be checked with a crawler to obtain the first page of the website to be checked and the linked object in the first page.
Specifically, in the above embodiment, crawling is used to obtain all pages of the website to be checked and the linked objects on all of those pages.
It should be noted that, during crawling, the source title of each page needs to be recorded, that is, the link title to which the page belongs; clicking that link title jumps to the page.
Through the above embodiment, because the crawler crawls the content of the entire website, the coverage of the check is more comprehensive than that of a manual check.
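The crawling step, including the recording of each page's source link title, might look like the following sketch. It uses only the Python standard library and takes a caller-supplied `fetch` function, so it can run against an in-memory site. The names `LinkCollector`, `crawl`, and `fetch` are hypothetical, and URL normalization, same-site filtering, and network error handling are deliberately left out.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link title) pairs from the <a> tags of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(fetch, start_url):
    """Breadth-first crawl returning (source page, link title, target page) records.

    fetch(url) is assumed to return the page HTML as a string, or None.
    """
    visited = {start_url}
    queue = deque([start_url])
    records = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        collector = LinkCollector()
        collector.feed(html)
        for href, title in collector.links:
            records.append((url, title, href))
            if href not in visited:
                visited.add(href)
                queue.append(href)
    return records
```

Each record keeps the link title together with the page it points to, which is the pairing the later comparison steps need.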
In an optional embodiment, as shown in Fig. 6, obtaining the first data set contained in the linked object and the second data set contained in the second page includes:
Step S602, extracting the first text string contained in the linked object and the second text string contained in the second page;
Step S604, performing word segmentation on the first text string and the second text string to obtain a third data set and a fourth data set;
Step S606, according to a preset algorithm model, extracting first target data elements from the third data set and putting them into the first data set, and extracting second target data elements from the fourth data set and putting them into the second data set.
Specifically, in the above embodiment, natural language analysis is used to perform entity extraction on the page content and the link title separately, obtaining the main entity objects of each. First, the text string contained in the linked object and the text string contained in the second page are obtained. Word segmentation is then performed on the first text string and the second text string, giving a third data set containing all words in the link title and a fourth data set containing all words in the second page. Then, according to a preset extraction algorithm model, the entity objects of the link title are extracted and put into the first data set, and the entity objects of the second page are extracted and put into the second data set. These entity objects characterize the semantic information contained in the link title and in the second page.
In an optional embodiment, natural language analysis may be used to extract the first target data elements from the third data set into the first data set and the second target data elements from the fourth data set into the second data set. The preset algorithm model includes, but is not limited to, the following: the KNN algorithm, the naive Bayes algorithm, decision tree algorithms, neural network methods, the linear least squares method, the K-Means algorithm, and cosine similarity.
With natural language analysis, the information contained in the page content and the link title can be obtained more intelligently, so that it can be determined whether what the page content describes is consistent with what its link title describes, which improves the efficiency and accuracy of the check.
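Steps S602 to S606 can be sketched as a two-stage pipeline: segmentation, then entity selection. The sketch below is a stand-in only. It tokenizes English text with a regular expression and keeps the most frequent non-stopword tokens, which is not one of the algorithm models listed above; Chinese text would need a dedicated word segmenter. The names `segment`, `extract_entities`, `STOPWORDS`, and the `top_k` parameter are all hypothetical.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def segment(text):
    """Word segmentation stand-in: lowercase alphanumeric tokens.

    The result plays the role of the third or fourth data set.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_entities(tokens, top_k=5):
    """Entity extraction stand-in: the top_k most frequent non-stopword tokens.

    The result plays the role of the first or second data set.
    """
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return {word for word, _ in counts.most_common(top_k)}
```

In a real system the second function is where a trained model would be plugged in; the surrounding steps would stay the same.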
In an optional embodiment, as shown in Fig. 7, before extracting the first text string contained in the linked object and the second text string contained in the second page, the method may further include:
Step S702, extracting the page content of the second page based on a text density extraction algorithm, which includes:
Step S7021, obtaining the document tree of the second page;
Step S7023, extracting the text characters in each tag node of the document tree, and counting the number of text characters in each tag node;
Step S7025, calculating the text density of each tag node, where the text density is the ratio of the number of text characters in the tag node to the total number of text characters in the document tree;
Step S7027, extracting the text content of the tag node with the highest text density as the page content of the second page.
Specifically, in the above embodiment, before the entity objects contained in a web page are extracted, the page content of the page must first be obtained. In an optional embodiment, the page content may be extracted with a text density extraction algorithm. A tree structure conforming to the DOM (Document Object Model) standard published by the W3C is built from the HTML content of the web page. Each tag node of the page's DOM tree is then traversed, the tag containing the body text is located using Chinese punctuation and link information, and a second extraction is performed on the content of that tag to obtain accurate body text. After the text content in each tag node has been extracted, the number of text characters in each tag node is counted and the text density of each tag node is calculated. The text content in the tag node with the highest text density is most likely the body text of the page, so it is taken as the page content of the second page.
It should be noted that data on a Web page appears in the form of HTML. An HTML document consists of tags and elements, and most HTML tags appear in pairs, serving as a start tag and an end tag. For example, the title of the content shown on a web page is usually marked by <TITLE></TITLE>, and the main content of the web page is mainly marked by several <P></P> pairs. Therefore, during information extraction, the way HTML documents are written can be exploited to extract the <TITLE></TITLE> and <P></P> tags and the content between them.
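The tag-based extraction described in the preceding paragraph can be sketched with Python's standard `html.parser` module, which reports tag names in lowercase regardless of how they are written in the document. The class name `TitleAndParagraphs` is hypothetical, and the sketch assumes reasonably well-formed markup in which `<p>` elements are explicitly closed.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collects the text of the <title> element and of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None   # "title" or "p" while inside one, else None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> is kept as well.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None
```

A caller would create one parser per page, pass the HTML to `feed`, and then read `title` and `paragraphs`.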
In an optional embodiment, taking the page shown in Fig. 1(b) as an example, only entity elements extracted from the text content of the body part can be used to characterize the information of the second page, while the text content of parts such as the navigation bar and link labels only interferes with the extraction result. Therefore, based on the above embodiment, after the text content of multiple tags in the second page is obtained, taking the text content of the tag with the highest text density as the page content of the second page makes it possible to obtain more accurately the entity elements that characterize the meaning of the second page.
Through the above embodiment, the text content that characterizes the semantic information of the page can be extracted and irrelevant text content discarded, which improves accuracy.
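Steps S7021 to S7027 can be sketched as follows. This is a simplified reading under stated assumptions: each tag node is credited only with the text that sits directly inside it (not text in its child tags), script and style content is ignored, and the Chinese-punctuation and link-information heuristics mentioned earlier are not modeled. The names `DensityParser`, `extract_main_text`, `VOID_TAGS`, and `SKIPPED_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no content
SKIPPED_TAGS = {"script", "style"}                        # non-visible text


class DensityParser(HTMLParser):
    """Records, for every tag node, the text found directly inside it."""

    def __init__(self):
        super().__init__()
        self.nodes = []    # every tag node seen, in document order
        self._stack = []   # currently open tag nodes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = {"tag": tag, "text": []}
        self.nodes.append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._stack and self._stack[-1]["tag"] not in SKIPPED_TAGS:
            self._stack[-1]["text"].append(data.strip())


def extract_main_text(html):
    """Return (text, density) of the tag node with the highest text density."""
    parser = DensityParser()
    parser.feed(html)
    texts = [" ".join(p for p in node["text"] if p) for node in parser.nodes]
    total = sum(len(t) for t in texts)
    if total == 0:
        return "", 0.0
    best = max(texts, key=len)
    # Density = characters in this node / characters in the whole document tree.
    return best, len(best) / total
```

Since every node's density shares the same denominator, picking the highest density is the same as picking the node with the most text characters; the ratio is still useful as a confidence signal when the best node holds only a small share of the page text.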
As a preferred embodiment, the scheme disclosed in the above embodiments of the present application may be implemented by three modules: a website content crawling module, a title and content entity extraction module, and an entity comparison module. The website content crawling module is responsible for obtaining all page content in the website to be checked and the link titles to which it belongs. The title and content entity extraction module is responsible for analyzing and processing the link titles and page content crawled by the website content crawling module, using natural language analysis to perform entity extraction on the page content and the link titles separately and obtain the main entity objects of each. The entity comparison module compares the entity objects of the link title and of the page content to finally determine whether the link is a false link.
With the scheme disclosed in the above embodiments, the entire false-link determination flow, from crawling to parsing to entity comparison, is a summarized simulation of the manual checking logic. False links in a website can thus be checked quickly by a program, which greatly reduces the cost of manual checking. The technology used in each module is also somewhat flexible, so each module can be replaced with better alternatives as technology advances.
Embodiment 2
According to an embodiment of the present invention, a device embodiment for checking links in a website is also provided. The method for checking links in a website in Embodiment 1 of the present invention may be performed in the device of Embodiment 2 of the present invention.
Fig. 8 is a schematic diagram of a device for checking links in a website according to an embodiment of the present invention. As shown in Fig. 8, the device includes: a first acquisition module 801, a second acquisition module 803, a comparing module 805 and a first determining module 807.
The first acquisition module 801 is configured to obtain a first page of the website to be checked and the linked object in the first page, where the linked object is used to jump to a second page; the second acquisition module 803 is configured to extract the first data set contained in the linked object and the second data set contained in the second page; the comparing module 805 is configured to compare the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result; the first determining module 807 is configured to determine, according to the comparison result, whether the linked object is a false link.
From the above, in the above embodiments of the present application, all page content in the website to be checked and the link titles to which it belongs are obtained, and entity extraction is performed on the page content and its link title. After the main entity objects of the page content and of its link title are obtained, they are compared, and whether the link title is a false link is determined according to the comparison result. It should be noted that the number of main entities extracted from the page content is generally greater than the number extracted from the link title; therefore, in an optional embodiment, whether the link title of a page is a false link may be determined by checking whether the entity set extracted from the page content contains the entities extracted from its link title. The scheme disclosed in the above embodiments checks for false links in a website by comparing whether what each page of the website to be checked describes is consistent with what its link title describes. This achieves the technical effect of improving the efficiency and accuracy of link checking in a website, and thereby solves the technical problem in the prior art that checking for false links manually is inefficient and not very accurate.
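The containment check mentioned in the preceding paragraph, in which the page's entity set is tested for whether it contains the link title's entities, reduces to a set difference. The sketch below assumes the strictest reading, in which every title entity must appear in the page; the function name `title_entities_covered` is hypothetical.

```python
def title_entities_covered(title_entities, page_entities):
    """Return (covered, missing).

    covered is True when every entity from the link title also appears
    among the entities extracted from the page content; missing lists
    the title entities that were not found.
    """
    missing = set(title_entities) - set(page_entities)
    return not missing, missing
```

Returning the missing entities alongside the verdict makes it easier for a reviewer to see why a link was flagged.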
In an optional embodiment, the comparing module 805 includes: a searching module, configured to search for the data elements that are identical in the first data set and the second data set; a statistics module, configured to count the number of identical data elements; and a first computing module, configured to calculate the ratio of the number of identical data elements to the number of data elements contained in the first data set.
In an optional embodiment, the first determining module 807 includes: a second determining module, configured to determine that the linked object is a normal link if the ratio is greater than or equal to the preset threshold; and a third determining module, configured to determine that the linked object is a false link if the ratio is less than the preset threshold.
In an optional embodiment, the first acquisition module 801 includes: a third acquisition module, configured to crawl the website to be checked with a crawler to obtain the first page of the website to be checked and the linked object in the first page.
In an optional embodiment, the second acquisition module 803 includes: a first extraction module, configured to extract the first text string contained in the linked object and the second text string contained in the second page; a first processing module, configured to perform word segmentation on the first text string and the second text string to obtain a third data set and a fourth data set; and a second extraction module, configured to extract, according to a preset algorithm model, first target data elements from the third data set into the first data set and second target data elements from the fourth data set into the second data set.
In an optional embodiment, the device further includes: a third extraction module, configured to extract the page content of the second page based on a text density extraction algorithm. The third extraction module includes: a fourth acquisition module, configured to obtain the document tree of the second page; a second processing module, configured to extract the text characters in each tag node of the document tree and count the number of text characters in each tag node; a second computing module, configured to calculate the text density of each tag node, where the text density is the ratio of the number of text characters in the tag node to the total number of text characters in the document tree; and a fourth extraction module, configured to extract the text content of the tag node with the highest text density as the page content of the second page.
The embodiment numbers above are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for checking links in a website, characterized by comprising:
obtaining a first page of a website to be checked and a linked object in the first page, wherein the linked object is used to jump to a second page;
obtaining a first data set contained in the linked object and a second data set contained in the second page;
comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result;
determining, according to the comparison result, whether the linked object is a false link.
2. The method according to claim 1, characterized in that comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result comprises:
searching for the data elements that are identical in the first data set and the second data set;
counting the number of the identical data elements;
calculating the ratio of the number of the identical data elements to the number of data elements contained in the first data set.
3. The method according to claim 2, characterized in that determining, according to the comparison result, whether the linked object is a false link comprises:
if the ratio is greater than or equal to a preset threshold, determining that the linked object is a normal link;
if the ratio is less than the preset threshold, determining that the linked object is a false link.
4. The method according to claim 1, characterized in that obtaining the first page of the website to be checked and the linked object in the first page comprises:
crawling the website to be checked with a crawler to obtain the first page of the website to be checked and the linked object in the first page.
5. The method according to claim 1, characterized in that obtaining the first data set contained in the linked object and the second data set contained in the second page comprises:
extracting a first text string contained in the linked object and a second text string contained in the second page;
performing word segmentation on the first text string and the second text string to obtain a third data set and a fourth data set;
according to a preset algorithm model, extracting first target data elements from the third data set and putting them into the first data set, and extracting second target data elements from the fourth data set and putting them into the second data set.
6. The method according to claim 5, characterized in that, before extracting the first text string contained in the linked object and the second text string contained in the second page, the method further comprises:
extracting the page content of the second page based on a text density extraction algorithm, which comprises:
obtaining the document tree of the second page;
extracting the text characters in each tag node of the document tree, and counting the number of text characters in each tag node;
calculating the text density of each tag node, wherein the text density is the ratio of the number of text characters in the tag node to the total number of text characters in the document tree;
extracting the text content of the tag node with the highest text density as the page content of the second page.
7. A device for checking links in a website, characterized by comprising:
a first acquisition module, configured to obtain a first page of a website to be checked and a linked object in the first page, wherein the linked object is used to jump to a second page;
a second acquisition module, configured to obtain a first data set contained in the linked object and a second data set contained in the second page;
a comparing module, configured to compare the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result;
a first determining module, configured to determine, according to the comparison result, whether the linked object is a false link.
8. The device according to claim 7, characterized in that the comparing module comprises:
a searching module, configured to search for the data elements that are identical in the first data set and the second data set;
a statistics module, configured to count the number of the identical data elements;
a first computing module, configured to calculate the ratio of the number of the identical data elements to the number of data elements contained in the first data set.
9. The device according to claim 8, characterized in that the first determining module comprises:
a second determining module, configured to determine that the linked object is a normal link if the ratio is greater than or equal to a preset threshold;
a third determining module, configured to determine that the linked object is a false link if the ratio is less than the preset threshold.
10. The device according to claim 7, characterized in that the first acquisition module comprises:
a third acquisition module, configured to crawl the website to be checked with a crawler to obtain the first page of the website to be checked and the linked object in the first page.
CN201611248655.6A 2016-12-29 2016-12-29 Method and device for checking links in website Active CN108255866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611248655.6A CN108255866B (en) 2016-12-29 2016-12-29 Method and device for checking links in website

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611248655.6A CN108255866B (en) 2016-12-29 2016-12-29 Method and device for checking links in website

Publications (2)

Publication Number Publication Date
CN108255866A true CN108255866A (en) 2018-07-06
CN108255866B CN108255866B (en) 2020-10-27

Family

ID=62721341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611248655.6A Active CN108255866B (en) 2016-12-29 2016-12-29 Method and device for checking links in website

Country Status (1)

Country Link
CN (1) CN108255866B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408760A (en) * 2018-09-30 2019-03-01 东软集团股份有限公司 The method and apparatus for obtaining the information of necrosis link
CN110889051A (en) * 2018-09-10 2020-03-17 阿里巴巴集团控股有限公司 Page hyperlink detection method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000628A (en) * 2006-01-13 2007-07-18 国际商业机器公司 Wrong hyperlink detection equipment and method
CN101510195A (en) * 2008-02-15 2009-08-19 刘峰 Website safety protection and test diagnosis system structure method based on crawler technology
CN102436564A (en) * 2011-12-30 2012-05-02 奇智软件(北京)有限公司 Method and device for identifying falsified webpage
KR101443071B1 (en) * 2013-12-10 2014-09-22 주식회사 브이시스템즈 Error Check System of Webpage

Also Published As

Publication number Publication date
CN108255866B (en) 2020-10-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant