CN108255866B - Method and device for checking links in website - Google Patents

Info

Publication number
CN108255866B
Authority
CN
China
Prior art keywords
page
data set
link
text
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611248655.6A
Other languages
Chinese (zh)
Other versions
CN108255866A (en)
Inventor
潘峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201611248655.6A
Publication of CN108255866A
Application granted
Publication of CN108255866B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3688 Test management for test execution, e.g. scheduling of test suites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for checking links in a website. The method comprises the following steps: acquiring a first page of a website to be checked and a link object in the first page, wherein the link object is used for jumping to a second page; acquiring a first data set contained in the link object and a second data set contained in the second page; comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result; and determining whether the link object is a wrong link according to the comparison result. The invention solves the technical problems of low efficiency and low accuracy caused by manually checking for wrong links in a website in the prior art.

Description

Method and device for checking links in website
Technical Field
The invention relates to the field of website testing, in particular to a method and a device for checking links in a website.
Background
With the development of internet technology, websites have become a main tool for people to acquire information from the internet, so the quality of a website directly affects the user experience. An important index in website quality detection is the number of wrong links in the website. Suppose a user opens a website and sees the web page shown in fig. 1(a): a column of link titles is displayed on the left side of the page, and clicking any link title jumps to the web page corresponding to that title. In actual operation, if the user clicks the link title "Dreaming, chasing dreams, realizing dreams: Tiangong-2 successfully continues its fantastic journey" in that column, and the page jumped to is instead the news about people across the regions celebrating the Mid-Autumn Festival shown in fig. 1(b), the user will conclude that the website's links are tricking users into clicking, which in severe cases causes the website to lose users. It is therefore very important to check for wrong links in a website.
At present, the prior art checks for wrong links mainly by manual operation: a person clicks each link title on a web page and checks whether the title is consistent with the content of the page actually opened, so as to determine whether the link is a wrong link. Manual inspection has significant limitations. A website usually comprises many web pages, so the labor cost is high and the efficiency is low. In addition, manual inspection depends heavily on the inspector's subjective judgment, which can be affected by various factors, so the accuracy is not high.
No effective solution has yet been proposed for the problems of low efficiency and low accuracy caused by manually checking for wrong links in a website in the prior art.
Disclosure of Invention
The embodiment of the invention provides a method and a device for checking links in a website, which at least solve the technical problems of low efficiency and low accuracy caused by checking wrong links in the website in a manual mode in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a method for checking a link in a website, including: acquiring a first page of a website to be checked and a link object in the first page, wherein the link object is used for jumping to a second page; extracting a first data set contained in the link object and a second data set contained in a second page; comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result; and determining whether the link object is a wrong link according to the comparison result.
Further, comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result, including: searching the same data elements in the first data set and the second data set; counting the number of the same data elements; a ratio of the number of identical data elements to the number of data elements contained in the first data set is calculated.
Further, determining whether the link object is a wrong link according to the comparison result, including: if the ratio is larger than or equal to a preset threshold value, determining that the link object is a normal link; and if the ratio is smaller than a preset threshold value, determining that the link object is an error link.
Further, acquiring a first page of the website to be checked and a link object in the first page includes: crawling the website to be checked with a crawler to obtain the first page of the website to be checked and the link object in the first page.
Further, acquiring a first data set contained in the link object and a second data set contained in the second page includes: extracting a first text character string contained in the link object and a second text character string contained in the second page; performing word segmentation processing on the first text character string and the second text character string to obtain a third data set and a fourth data set; and extracting a first target data element in the third data set and putting the first target data element in the first data set according to a preset algorithm model, and extracting a second target data element in the fourth data set and putting the second target data element in the second data set.
Further, before extracting the first text string included in the link object and the second text string included in the second page, the method further includes: extracting the page content of the second page based on a text density extraction algorithm, wherein the step comprises the following steps: acquiring a document tree of a second page; extracting text characters in each label node in the document tree, and counting the number of the text characters of each label node in the document tree; calculating the text density of each label node, wherein the text density is the proportion of the number of text characters of each label node to the total number of text characters of the document tree; and extracting the text content of the label node with the maximum text character density as the page content of the second page.
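The text density extraction described above can be sketched as follows. This is a minimal illustration only, not the patented implementation: it assumes that the "text characters of a label node" are the characters sitting directly inside that tag (so the root node does not trivially win), and the names `TextDensityParser` and `extract_main_text` are hypothetical. It uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TextDensityParser(HTMLParser):
    """Collects the text that sits directly inside each tag node."""

    SKIPPED = {"script", "style"}                        # not page content
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags without a body

    def __init__(self):
        super().__init__()
        self.nodes = []   # one dict per tag node: {"tag": ..., "text": ...}
        self.stack = []   # indices into self.nodes for the open tags

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.nodes.append({"tag": tag, "text": ""})
        self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Close the nearest open node with this tag name (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.nodes[self.stack[i]]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            node = self.nodes[self.stack[-1]]
            if node["tag"] not in self.SKIPPED:
                node["text"] += text


def extract_main_text(html):
    """Return the text of the tag node with the highest text density."""
    parser = TextDensityParser()
    parser.feed(html)
    parser.close()
    total = sum(len(node["text"]) for node in parser.nodes)
    if total == 0:
        return ""
    # Text density = characters in the node / characters in the whole tree.
    best = max(parser.nodes, key=lambda node: len(node["text"]) / total)
    return best["text"]


page = (
    "<html><body>"
    "<div>Home News</div>"
    "<div>People across the regions celebrated the Mid-Autumn Festival.</div>"
    "<div>Contact</div>"
    "</body></html>"
)
main_text = extract_main_text(page)
```

Because every density shares the same denominator, picking the maximum density is equivalent to picking the node with the most direct text; the division is kept only to mirror the wording of the step.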
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for checking a link in a website, including: a first acquisition module, used for acquiring a first page of a website to be checked and a link object in the first page, wherein the link object is used for jumping to a second page; a second acquisition module, used for acquiring a first data set contained in the link object and a second data set contained in the second page; a comparison module, used for comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result; and a first determining module, used for determining whether the link object is a wrong link according to the comparison result.
Further, the comparison module includes: a searching module, used for searching for the same data elements in the first data set and the second data set; a counting module, used for counting the number of the same data elements; and a first calculation module, used for calculating a ratio of the number of the same data elements to the number of data elements contained in the first data set.
Further, the first determining module includes: a second determining module, used for determining that the link object is a normal link if the ratio is greater than or equal to a preset threshold; and a third determining module, used for determining that the link object is a wrong link if the ratio is smaller than the preset threshold.
Further, the first acquisition module includes: a third acquisition module, used for crawling the website to be checked with a crawler to obtain the first page of the website to be checked and the link object in the first page.
Further, the second acquisition module includes: a first extraction module, used for extracting a first text character string contained in the link object and a second text character string contained in the second page; a first processing module, used for performing word segmentation processing on the first text character string and the second text character string to obtain a third data set and a fourth data set; and a second extraction module, used for, according to a preset algorithm model, extracting a first target data element in the third data set and putting it into the first data set, and extracting a second target data element in the fourth data set and putting it into the second data set.
Further, the apparatus further includes: a third extraction module, used for extracting the page content of the second page based on a text density extraction algorithm; wherein the third extraction module includes: a fourth acquisition module, used for acquiring the document tree of the second page; a second processing module, used for extracting the text characters in each label node in the document tree and counting the number of text characters of each label node in the document tree; a second calculation module, used for calculating the text density of each label node, wherein the text density is the proportion of the number of text characters of each label node to the total number of text characters of the document tree; and a fourth extraction module, used for extracting the text content of the label node with the maximum text density as the page content of the second page.
In the embodiment of the invention, a first page of a website to be checked and a link object in the first page are acquired, wherein the link object is used for jumping to a second page; a first data set contained in the link object and a second data set contained in the second page are extracted; the data elements contained in the first data set are compared with the data elements contained in the second data set to obtain a comparison result; and whether the link object is a wrong link is determined according to the comparison result. This achieves the purpose of checking for wrong links in the website by comparing whether each page of the website to be checked is consistent with what is described by the link title to which the page belongs, thereby achieving the technical effect of improving the efficiency and accuracy of link checking and solving the technical problems of low efficiency and low accuracy caused by manually checking for wrong links in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1(a) is a schematic diagram of a web page according to the prior art;
FIG. 1(b) is a schematic diagram of a website page according to the prior art;
FIG. 2 is a flow diagram of a method for checking links in a web site, according to an embodiment of the present invention;
FIG. 3 is a flow diagram of an alternative method for checking links in a web site, according to an embodiment of the present invention;
FIG. 4 is a flow diagram of an alternative method for checking links in a web site, according to an embodiment of the present invention;
FIG. 5 is a flow diagram of an alternative method for checking links in a web site, according to an embodiment of the present invention;
FIG. 6 is a flow diagram of an alternative method for checking links in a web site, according to an embodiment of the present invention;
FIG. 7 is a flowchart of an alternative method of checking links in a web site, according to an embodiment of the present invention; and
FIG. 8 is a diagram illustrating an apparatus for checking links in a website according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some nouns or terms appearing in the description of the embodiments of the present application are explained as follows:
Wrong link: a link whose title does not match the actual content of the page it points to. In the embodiments of the application, a wrong link is different from a broken link: a broken link is a link that cannot be accessed or whose access is interrupted, whereas a wrong link is a link whose title is inconsistent with the content described by the page the title points to.
Example 1
In accordance with an embodiment of the present invention, there is provided an embodiment of a method for checking links in a website. It is noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that described herein.
Fig. 2 is a flowchart of a method for checking links in a website according to an embodiment of the present invention, as shown in fig. 2, the method includes the following steps:
step S202, a first page of a website to be checked and a link object in the first page are obtained, wherein the link object is used for jumping to a second page.
Specifically, in the above step, the first page may be any one or more of the pages included in the website to be checked; the link object may be a segment of text or a picture, displayed on the first page, in which a link address is embedded, and clicking the text or picture jumps to another web page of the website to be checked, that is, the second page. To check for wrong links in a website, all pages contained in the website to be checked and all link objects contained in those pages must first be acquired.
In an alternative embodiment, taking the web pages shown in fig. 1(a) and fig. 1(b) as an example, the web page shown in fig. 1(a) may be the first page. The page includes 6 link objects, which are the link titles "National railways expected to carry 7.8 million passengers on the second day of the Mid-Autumn holiday", "Warm-up: following the Premier to the Americas, the highlights worth looking forward to", "Dreaming, chasing dreams, realizing dreams: Tiangong-2 successfully continues its fantastic journey", "China's space station will be able to operate in orbit for more than ten years in the future", "Admiring the same moon, sharing the same love of home and country: people across the regions celebrate the Mid-Autumn Festival", and "Urban and rural environments are cleaner and more beautiful". Each link title points to one page, namely a second page. The user can enter the corresponding second page by clicking any link title in the first page. For example, clicking the link title "Admiring the same moon, sharing the same love of home and country: people across the regions celebrate the Mid-Autumn Festival" opens the page shown in fig. 1(b).
It should be noted that a complete website usually includes multiple web pages, and each web page usually includes one or more link objects that point to a link target, where the link target may be another page or another position on the same page. In the process of checking for wrong links in a website, it is necessary to check whether the link objects on all pages included in the website are wrong links.
Step S204, a first data set contained in the link object and a second data set contained in the second page are obtained.
Specifically, in the above step, taking the link object as a segment of text as an example, the first data set may be an entity set extracted from a certain link title in the first page, and may be one or several words capable of representing the meaning of the link title; the second data set may be an entity set extracted from the second page pointed to by the link title, or may be one or more words, which may be used to represent information contained in the second page.
In an alternative embodiment, still taking the web pages shown in fig. 1(a) and 1(b) as an example, entities such as "Tiangong-2" can be extracted from the first-page link title "Dreaming, chasing dreams, realizing dreams: Tiangong-2 successfully continues its fantastic journey", so the first data set contained in that link title is "Tiangong-2". Entities such as "moon" and "Mid-Autumn Festival" can be extracted from the first-page link title "Admiring the same moon, sharing the same love of home and country: people across the regions celebrate the Mid-Autumn Festival", so the first data set contained in that link title is "moon, Mid-Autumn Festival". Entities such as "moon", "Mid-Autumn Festival", "university" and "school-badge mooncake" can be extracted from the page content of the second page shown in fig. 1(b), so the second data set contained in the second page is "moon, Mid-Autumn Festival, university, school-badge mooncake, ...".
It should be noted here that, since the link object may be a text or a picture, the entity types included in the first data set and the second data set may not be limited to the text or the picture.
Step S206, comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result.
Specifically, in the above step, the data element may be an entity contained in the first data set and the second data set; after a first data set contained in a certain link object in the first page and a second data set contained in a second page pointed by the link object are extracted, comparing data elements contained in the first data set with data elements contained in the second data set to obtain a corresponding comparison result.
It should be noted that the number of main entities extracted from a page is generally much larger than the number of main entities extracted from a link title. Thus, in an alternative embodiment, the comparison result may be whether all data elements in the first data set are included in the second data set.
Step S208, determining whether the link object is an error link according to the comparison result.
Specifically, in the above step, after the data elements contained in the first data set are compared with the data elements contained in the second data set, in an alternative embodiment, whether the link title is a wrong link may be determined by checking whether all the data elements in the first data set extracted from the link title are contained in the second data set extracted from the second page. If the second data set contains all the data elements in the first data set, that is, the entity set extracted from the page content of the second page contains the entities extracted from the link title, the link title is determined to be a normal link. If the second data set does not contain all the data elements in the first data set, that is, the entity set extracted from the page content of the second page does not contain the entities extracted from the link title, the link title is determined to be a wrong link.
In an alternative embodiment, still taking the web pages shown in fig. 1(a) and 1(b) as an example, the second data set contained in the second page shown in fig. 1(b) is "moon, Mid-Autumn Festival, university, school-badge mooncake, ...". This data set contains the first data set "moon, Mid-Autumn Festival" extracted from the first-page link title "Admiring the same moon, sharing the same love of home and country: people across the regions celebrate the Mid-Autumn Festival", but it does not contain the first data set "Tiangong-2" extracted from the first-page link title "Dreaming, chasing dreams, realizing dreams: Tiangong-2 successfully continues its fantastic journey". Therefore, if the former link title points to the second page shown in fig. 1(b), the link title and the page it points to describe the same thing, and the link title is a normal link. If the latter link title points to the second page shown in fig. 1(b), the link title and the page it points to do not describe the same thing, and the link title is a wrong link.
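The containment check in this example can be sketched with Python sets. This is an illustrative sketch only: the function name `is_wrong_link_strict` is hypothetical, and the entity strings are English renderings of the entities in the example above.

```python
def is_wrong_link_strict(title_entities, page_entities):
    """Return True when the page's entity set does not contain every title entity."""
    return not set(title_entities) <= set(page_entities)


# Second data set: entities extracted from the page shown in fig. 1(b).
page_entities = {"moon", "Mid-Autumn Festival", "university", "school-badge mooncake"}

# First data sets: entities extracted from two link titles on the first page.
festival_title = {"moon", "Mid-Autumn Festival"}
tiangong_title = {"Tiangong-2"}

festival_is_wrong = is_wrong_link_strict(festival_title, page_entities)
tiangong_is_wrong = is_wrong_link_strict(tiangong_title, page_entities)
```

Note that an empty first data set is a subset of any set, so this sketch would report such a link as normal; how to treat a title from which no entity can be extracted is not stated in the description.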
As can be seen from the above, in the above embodiments of the present application, all the page contents of the website to be checked and the link titles to which the page contents belong are obtained, and entity extraction is performed on the page contents and the link titles to which the page contents belong, after the main entity objects of the page contents and the link titles to which the page contents belong are obtained, the page contents and the main entity objects in the link titles to which the page contents belong are compared, and whether the link titles are linked incorrectly is determined according to the comparison result. Through the scheme disclosed by the embodiment, the purpose of checking the wrong links in the website by comparing whether all pages of the website to be checked are consistent with the things described in the link titles to which the pages belong is achieved, so that the technical effect of improving the efficiency and the accuracy of link check in the website is achieved, and the technical problems of low efficiency and low accuracy caused by the fact that the wrong links in the website are checked manually in the prior art are solved.
In an alternative embodiment, as shown in fig. 3, comparing the data elements included in the first data set with the data elements included in the second data set to obtain a comparison result, the method may include the following steps:
step S302, searching the same data elements in the first data set and the second data set;
step S304, counting the number of the same data elements;
step S306, calculate the ratio of the number of identical data elements to the number of data elements contained in the first data set.
Specifically, in the above embodiment, after a first data set is extracted from a link title in the first page and a second data set is extracted from the second page pointed to by the link title, the data elements contained in the first data set may be compared with those contained in the second data set to find the data elements common to both sets, that is, the entities shared by the link title and the second page it points to. The number of these shared entities is counted, and the ratio of that number to the number of entities contained in the first data set is calculated.
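Steps S302 to S306 amount to a set intersection followed by a division. A minimal sketch, assuming the data elements are hashable values such as strings; the name `match_ratio` is hypothetical, and raising an error for an empty first data set is a choice made here, since the description does not cover that case.

```python
def match_ratio(first_set, second_set):
    """Ratio of shared data elements to the size of the first data set."""
    first, second = set(first_set), set(second_set)
    if not first:
        # Not covered by the description; chosen here to avoid dividing by zero.
        raise ValueError("the first data set is empty")
    shared = first & second            # S302: find the same data elements
    shared_count = len(shared)         # S304: count them
    return shared_count / len(first)   # S306: ratio against the first data set


ratio = match_ratio(
    {"moon", "Mid-Autumn Festival", "railway"},
    {"moon", "Mid-Autumn Festival", "university", "school-badge mooncake"},
)
```

The denominator is the size of the first data set rather than the union of both sets, so a long page with many extra entities does not lower the ratio.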
In an alternative embodiment, as shown in fig. 4, determining whether the link object is an error link according to the comparison result includes:
in step S402, if the ratio is greater than or equal to the preset threshold, it is determined that the link object is a normal link.
In step S404, if the ratio is smaller than the preset threshold, it is determined that the link object is an error link.
Specifically, in the above embodiment, after the ratio between the number of the same entities and the number of entities included in the first data set is calculated, the ratio is compared with the preset threshold to determine whether the link object is a wrong link. If the ratio is greater than or equal to the preset threshold, the link title is determined to be consistent with what is described by the page it points to, and the link title is a normal link. If the ratio is smaller than the preset threshold, the link title is determined to be inconsistent with what is described by the page it points to, and the link title is a wrong link.
In an alternative embodiment, the preset threshold may be any value greater than 0.5 and no greater than 1.
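The threshold decision of steps S402 and S404 can be sketched as below. The function name `classify_link` and the default threshold of 0.6 are hypothetical, and reading the stated range as the interval (0.5, 1] is an interpretation of the text rather than something it states precisely.

```python
def classify_link(ratio, threshold=0.6):
    """Classify a link from its match ratio (S402 / S404)."""
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold is expected in the interval (0.5, 1]")
    # S402: ratio >= threshold -> normal link; S404: otherwise -> wrong link.
    return "normal" if ratio >= threshold else "wrong"


festival_result = classify_link(2 / 3)   # two of three title entities found
tiangong_result = classify_link(0.0)     # no title entity found on the page
```

With a threshold of exactly 1, this reduces to the stricter rule in which every entity of the link title must appear in the page.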
By the embodiment, whether the link title is consistent with the webpage pointed by the link title can be judged by a machine, so that the influence of subjective factors of manual judgment is avoided, and the judgment condition is more standardized.
In an alternative embodiment, as shown in fig. 5, the acquiring the first page of the website to be checked and the link object in the first page in step S202 may include:
step S502, crawling the website to be checked with a crawler to obtain a first page of the website to be checked and a link object in the first page.
Specifically, in the above embodiment, all pages of the website to be checked and the link objects on all pages are acquired by crawling.
It should be noted that during crawling, the source title of each page, that is, the link title to which the page belongs, needs to be recorded; clicking that link title jumps to the page.
With the above embodiment, since the crawler crawls the entire website content, the coverage of the inspection will be more comprehensive than manual.
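A crawl that records the source title of each page can be sketched as follows. This is a simplified illustration using only the Python standard library; the names `LinkCollector` and `crawl`, the `fetch` callback, and the `example.test` URLs are all hypothetical. To keep the sketch runnable without network access, `fetch` reads from an in-memory dictionary standing in for HTTP requests, and it returns `None` for pages outside the site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (href, link title) pairs from the <a> tags of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(start_url, fetch):
    """Breadth-first crawl returning (source page, link title, target page) records."""
    seen = {start_url}
    queue = deque([start_url])
    records = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:           # page unavailable or outside the site
            continue
        collector = LinkCollector()
        collector.feed(html)
        for href, title in collector.links:
            target = urljoin(url, href)
            records.append((url, title, target))   # label the source title
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return records


site = {
    "http://example.test/": '<a href="/a">Tiangong-2 news</a> <a href="/b">Mid-Autumn Festival</a>',
    "http://example.test/a": "<p>Article about Tiangong-2.</p>",
    "http://example.test/b": '<p>Festival article.</p> <a href="/">Home</a>',
}
records = crawl("http://example.test/", site.get)
```

Each record pairs a link title with the page it points to, which is the input the later comparison steps need.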
In an alternative embodiment, as shown in fig. 6, the obtaining a first data set contained in the link object and a second data set contained in the second page includes:
step S602, extracting a first text character string contained in the link object and a second text character string contained in the second page;
step S604, performing word segmentation processing on the first text character string and the second text character string to obtain a third data set and a fourth data set;
step S606, according to the preset algorithm model, extracting a first target data element in the third data set and putting the first target data element into the first data set, and extracting a second target data element in the fourth data set and putting the second target data element into the second data set.
Specifically, in the above embodiment, entity extraction is performed on the page content and on the link title respectively by using a natural language analysis technology, to obtain the main entity objects of the link title and of the page content. First, the text character string contained in the link object (the first text character string) and the text character string contained in the second page (the second text character string) are obtained. Word segmentation is then performed on the first and second text character strings, yielding a third data set containing all words in the link title and a fourth data set containing all words in the second page. Finally, according to a preset extraction algorithm model, the entity objects of the link title are extracted and placed in the first data set, and the entity objects of the second page are extracted and placed in the second data set. These entity objects represent the semantic information contained in the link title and in the second page.
In an alternative embodiment, natural language analysis techniques may be used to extract a first target data element in the third data set into the first data set and a second target data element in the fourth data set into the second data set. The preset algorithm model includes, but is not limited to, the following: KNN algorithm, naive Bayes algorithm, decision tree algorithm, neural network method, linear least square method, K-Means algorithm, cosine similarity algorithm and the like.
Through the natural language analysis technology, the information contained in the page content and the link title can be acquired more intelligently, so that it can be determined whether the things described by the page content are consistent with the things described by the link title to which the page belongs, and the checking efficiency and accuracy are improved.
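The flow from text strings to the third and fourth data sets, and then to the first and second data sets, can be sketched as below. This is a toy illustration under loud assumptions: the stop-word filter merely stands in for the preset algorithm model and is not one of the algorithms listed above, the regular-expression tokenizer only suits space-delimited text (Chinese text would need a dedicated word segmenter), and all names and sample strings are invented for the example.

```python
import re

# Assumed stop-word list; a real system would rely on a trained model instead.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "to", "is", "are"}


def segment(text):
    """Word segmentation: split a text string into lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_entities(tokens):
    """Placeholder for the preset algorithm model: keep non-stop-word tokens."""
    return {t for t in tokens if t not in STOP_WORDS and len(t) > 1}


link_title = "Weather forecast for Beijing"
page_text = "The Beijing weather forecast: sunny in the morning, rain at night."

third_set = segment(link_title)               # all words of the link title
fourth_set = segment(page_text)               # all words of the second page
first_set = extract_entities(third_set)       # entities of the link title
second_set = extract_entities(fourth_set)     # entities of the second page

print(sorted(first_set))
```

Here every entity of the link title also appears in the page's entity set, so a comparison of the two sets would treat this link as normal.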
In an alternative embodiment, as shown in fig. 7, before extracting the first text string included in the link object and the second text string included in the second page, the method may further include:
step S702, extracting the page content of the second page based on the text density extraction algorithm, wherein the step comprises the following steps:
step S7021, obtaining a document tree of a second page;
step S7023, extracting text characters in each label node in the document tree, and counting the number of the text characters of each label node in the document tree;
step S7025, calculating the text density of each label node, wherein the text density is the proportion of the number of text characters of each label node to the total number of text characters of the document tree;
step S7027, the text content of the tag node with the maximum text character density is extracted as the page content of the second page.
Specifically, in the above embodiment, before the entity objects contained in a web page are extracted, the page content of the web page is obtained first. In an alternative implementation, the page content may be extracted by a text density extraction algorithm. A tree structure conforming to the DOM (Document Object Model) standard issued by the W3C is built from the HTML content of the web page. Each label node of the DOM tree is traversed, the label in which the body text is located is identified by using Chinese punctuation marks and link information, and a secondary extraction is performed on the content of that label to obtain accurate text content. After the text content in each label node is extracted, the number of text characters contained in each label node is counted and the text density of each label node is calculated. The text content in the label node with the highest text density is most likely the body text of the page, so it is used as the page content of the second page.
It should be noted that most data on Web pages is in HTML form. An HTML document is composed of tags and elements, and most HTML tags are used in pairs as a start tag and an end tag. For example, the title of a Web page is usually marked by <TITLE></TITLE>, while the body content of the Web page is mainly marked by a number of <P> tags. Therefore, in the information extraction process, these characteristics of HTML documents can be used to extract the <TITLE> and <P> tags and the content between them.
In an optional embodiment, taking the page shown in fig. 1(b) as an example, only the entity elements extracted from the text content of the body part can be used to represent the information of the second page, and the text content of the parts such as the web page navigation bar and the link tag only interferes with the extraction result, so that, based on the above embodiment, after the text content of the multiple tags in the second page is obtained, the text content in the tag with the highest text density is used as the page content of the second page, and the entity elements used to represent the meaning of the second page can be more accurately obtained.
By the embodiment, the text content used for representing the page semantic information can be extracted, some irrelevant text content is omitted, and the accuracy is improved.
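Steps S7021 to S7027 can be sketched with the standard-library HTML parser as follows. This is a simplified illustration rather than the algorithm of the embodiment: it assumes that text is credited to the nearest enclosing container tag (otherwise the root node would always hold all of the text), it omits the punctuation and link heuristics mentioned above, and the `CONTAINERS` set, class name, function name and sample page are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: text is credited to the nearest enclosing container tag.
CONTAINERS = {"body", "div", "article", "section", "nav", "header", "footer", "td"}
SKIPPED = {"script", "style"}   # their content is not visible text


class DensityParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []                 # ids of currently open container nodes
        self.text = defaultdict(list)   # node id -> text fragments
        self.skip_depth = 0
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self.stack.append(self.next_id)
            self.next_id += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in CONTAINERS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and not self.skip_depth and data.strip():
            self.text[self.stack[-1]].append(data.strip())


def extract_main_text(html):
    """Return the text of the container node with the highest text density."""
    parser = DensityParser()
    parser.feed(html)
    counts = {node: sum(len(s) for s in parts)
              for node, parts in parser.text.items()}
    total = sum(counts.values())
    if total == 0:
        return ""
    # Text density: characters in the node divided by characters in the tree.
    density = {node: n / total for node, n in counts.items()}
    best = max(density, key=density.get)
    return " ".join(parser.text[best])


PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main"><p>The city council approved the new budget on Monday.</p>
<p>Spending on public transport will rise next year.</p></div>
</body></html>
"""
print(extract_main_text(PAGE))
```

In this sample the navigation block holds only a few characters, so the body block wins and the navigation text is left out of the extracted page content.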
As a preferred embodiment, the scheme disclosed in the above embodiment of the present application can be implemented by a website content crawling module, a title and content entity extraction module, and an entity comparison module. The website content crawling module is responsible for acquiring all page contents in the website to be checked and the link titles to which they belong. The title and content entity extraction module is responsible for analyzing and processing the link titles and page contents crawled by the website content crawling module, and performs entity extraction on the page content and the link title respectively by using a natural language analysis technology to obtain the main entity objects of the link title and the page content. The entity comparison module compares the entity objects of the link title with the entity objects of the page content to finally determine whether the link is a wrong link.
Through the scheme disclosed by the embodiment, the whole wrong-link judgment process, from crawling to analysis to entity comparison, is realized. By summarizing and simulating the logic of manual checking, wrong links in the website are checked quickly by a program, which greatly reduces the cost of manual checking. The technology used by each module is also flexible: as technology advances, any module can be replaced by a better alternative.
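One way to picture the replaceable modules is to pass the crawling and extraction steps in as functions, as in the sketch below. The `LinkChecker` class, its fields, the default threshold of 0.6 and the toy stand-in functions are assumptions made purely for illustration; they are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set, Tuple


@dataclass
class LinkChecker:
    """Three replaceable modules wired into one checking pipeline."""
    crawl: Callable[[], Iterable[Tuple[str, str]]]  # yields (link title, page content)
    extract: Callable[[str], Set[str]]              # text -> entity set
    threshold: float = 0.6                          # assumed value

    def compare(self, title_entities, page_entities):
        """Entity comparison module: True means the link looks wrong."""
        if not title_entities:
            return True  # assumption: nothing to match counts as wrong
        shared = len(title_entities & page_entities)
        return shared / len(title_entities) < self.threshold

    def run(self):
        return [
            (title, self.compare(self.extract(title), self.extract(content)))
            for title, content in self.crawl()
        ]


# Toy stand-ins for the crawling and extraction modules.
def fake_crawl():
    return [("Annual report", "Our annual report for 2016"),
            ("Contact us", "Product catalogue and prices")]


def fake_extract(text):
    return set(text.lower().split())


checker = LinkChecker(crawl=fake_crawl, extract=fake_extract)
print(checker.run())
```

Swapping in a different crawler or a different entity extractor only requires passing another function, which mirrors the flexibility described above.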
Example 2
According to an embodiment of the invention, an embodiment of an apparatus for checking links in a website is also provided. The method of checking links in a website in embodiment 1 of the present invention may be performed in the apparatus in embodiment 2 of the present invention.
Fig. 8 is a schematic diagram of an apparatus for checking links in a website according to an embodiment of the present invention, as shown in fig. 8, the apparatus includes: a first obtaining module 801, a second obtaining module 803, a comparing module 805 and a first determining module 807.
The first obtaining module 801 is configured to obtain a first page of a website to be checked and a link object in the first page, where the link object is used to jump to a second page; a second obtaining module 803, configured to extract a first data set included in the link object and a second data set included in the second page; a comparison module 805, configured to compare data elements included in the first data set with data elements included in the second data set to obtain a comparison result; a first determining module 807, configured to determine whether the link object is a wrong link according to the comparison result.
As can be seen from the above, in the above embodiments of the present application, all the page contents of the website to be checked and the link titles to which the page contents belong are obtained, and entity extraction is performed on the page contents and the link titles to which the page contents belong, after the main entity objects of the page contents and the link titles to which the page contents belong are obtained, the page contents and the main entity objects in the link titles to which the page contents belong are compared, and whether the link titles are linked incorrectly is determined according to the comparison result. Through the scheme disclosed by the embodiment, the purpose of checking the wrong links in the website by comparing whether all pages of the website to be checked are consistent with the things described in the link titles to which the pages belong is achieved, so that the technical effect of improving the efficiency and the accuracy of link check in the website is achieved, and the technical problems of low efficiency and low accuracy caused by the fact that the wrong links in the website are checked manually in the prior art are solved.
In an alternative embodiment, the comparison module 805 includes: a searching module, configured to search for the same data elements in the first data set and the second data set; a counting module, configured to count the number of the same data elements; and a first calculation module, configured to calculate a ratio of the number of the same data elements to the number of data elements contained in the first data set.
In an alternative embodiment, the first determining module 807 comprises: a second determining module, configured to determine that the link object is a normal link if the ratio is greater than or equal to a preset threshold; and a third determining module, configured to determine that the link object is a wrong link if the ratio is smaller than the preset threshold.
In an alternative embodiment, the first obtaining module 801 includes: and the third acquisition module is used for crawling the website to be checked in a crawler mode to obtain the first page of the website to be checked and the link object in the first page.
In an alternative embodiment, the second obtaining module 803 includes: the first extraction module is used for extracting a first text character string contained in the link object and a second text character string contained in the second page; the first processing module is used for carrying out word segmentation processing on the first text character string and the second text character string to obtain a third data set and a fourth data set; and the second extraction module is used for extracting a first target data element in the third data set and putting the first target data element into the first data set according to the preset algorithm model, and extracting a second target data element in the fourth data set and putting the second target data element into the second data set.
In an optional embodiment, the apparatus further comprises: the third extraction module is used for extracting the page content of the second page based on a text density extraction algorithm; wherein, the third extraction module comprises: the fourth acquisition module is used for acquiring the document tree of the second page; the second processing module is used for extracting text characters in each label node in the document tree and counting the number of the text characters of each label node in the document tree; the second calculation module is used for calculating the text density of each label node, wherein the text density is the proportion of the number of text characters of each label node to the total number of text characters of the document tree; and the fourth extraction module is used for extracting the text content of the label node with the maximum text character density as the page content of the second page.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A method for checking links in a website, comprising:
acquiring a first page of a website to be checked and a link object in the first page, wherein the link object is used for jumping to a second page;
acquiring a first data set contained in the link object and a second data set contained in the second page;
comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result;
determining whether the link object is a wrong link according to the comparison result;
wherein, comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result comprises: searching for the same data elements in the first data set and the second data set; counting the number of the same data elements; calculating the ratio of the number of the same data elements to the number of data elements contained in the first data set;
wherein, determining whether the link object is a wrong link according to the comparison result comprises: if the ratio is larger than or equal to a preset threshold value, determining that the link object is a normal link; if the ratio is smaller than the preset threshold value, determining that the link object is an error link;
wherein, acquiring the first data set contained in the link object and the second data set contained in the second page comprises: extracting a first text character string contained in the link object and a second text character string contained in the second page; performing word segmentation processing on the first text character string and the second text character string to obtain a third data set and a fourth data set; extracting a first target data element in the third data set and putting the first target data element into the first data set according to a preset algorithm model, and extracting a second target data element in the fourth data set and putting the second target data element into a second data set; and the preset algorithm model at least comprises at least one of the following: KNN algorithm, naive Bayes algorithm, decision tree algorithm, neural network method, linear least square method, K-Means algorithm, cosine similarity algorithm;
before extracting the first text string included in the link object and the second text string included in the second page, the method further includes: extracting the page content of the second page based on a text density extraction algorithm, wherein the step comprises the following steps: acquiring a document tree of the second page; extracting text characters in each label node in the document tree, and counting the number of the text characters in each label node; calculating the text density of each label node, wherein the text density is the proportion of the number of text characters in each label node to the total number of text characters in the document tree; and extracting the text content of the label node with the maximum text character density as the page content of the second page.
2. The method of claim 1, wherein obtaining a first page of a website to be checked and a link object in the first page comprises:
and crawling the website to be checked in a crawler mode to obtain a first page of the website to be checked and a link object in the first page.
3. An apparatus for checking links in a website, comprising:
a first obtaining module, configured to obtain a first page of a website to be checked and a link object in the first page, wherein the link object is used for jumping to a second page;
a second obtaining module, configured to obtain a first data set included in the link object and a second data set included in the second page;
the comparison module is used for comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result;
the first determining module is used for determining whether the link object is a wrong link according to the comparison result;
wherein, the comparison module comprises: a searching module for searching for the same data elements in the first data set and the second data set; the counting module is used for counting the number of the same data elements; a first calculation module, configured to calculate a ratio of the number of the same data elements to the number of data elements included in the first data set;
wherein the first determining module comprises: a second determining module, configured to determine that the link object is a normal link if the ratio is greater than or equal to a preset threshold; a third determining module, configured to determine that the link object is an error link if the ratio is smaller than the preset threshold;
wherein, the second acquisition module includes: the first extraction module is used for extracting a first text character string contained in the link object and a second text character string contained in the second page; the first processing module is used for carrying out word segmentation processing on the first text character string and the second text character string to obtain a third data set and a fourth data set; the second extraction module is used for extracting a first target data element in the third data set and putting the first target data element into the first data set according to a preset algorithm model, and extracting a second target data element in the fourth data set and putting the second target data element into the second data set; and the preset algorithm model at least comprises at least one of the following: KNN algorithm, naive Bayes algorithm, decision tree algorithm, neural network method, linear least square method, K-Means algorithm, cosine similarity algorithm;
wherein, the device still includes: the third extraction module is used for extracting the page content of the second page based on a text density extraction algorithm; wherein, the third extraction module comprises: the fourth acquisition module is used for acquiring the document tree of the second page; the second processing module is used for extracting text characters in each label node in the document tree and counting the number of the text characters of each label node in the document tree; the second calculation module is used for calculating the text density of each label node, wherein the text density is the proportion of the number of text characters of each label node to the total number of text characters of the document tree; and the fourth extraction module is used for extracting the text content of the label node with the maximum text character density as the page content of the second page.
4. The apparatus of claim 3, wherein the first obtaining module comprises:
and the third acquisition module is used for crawling the website to be checked in a crawler mode to obtain the first page of the website to be checked and the link object in the first page.
CN201611248655.6A 2016-12-29 2016-12-29 Method and device for checking links in website Active CN108255866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611248655.6A CN108255866B (en) 2016-12-29 2016-12-29 Method and device for checking links in website

Publications (2)

Publication Number Publication Date
CN108255866A CN108255866A (en) 2018-07-06
CN108255866B true CN108255866B (en) 2020-10-27

Family

ID=62721341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611248655.6A Active CN108255866B (en) 2016-12-29 2016-12-29 Method and device for checking links in website

Country Status (1)

Country Link
CN (1) CN108255866B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110889051A (en) * 2018-09-10 2020-03-17 阿里巴巴集团控股有限公司 Page hyperlink detection method, device and equipment
CN109408760A (en) * 2018-09-30 2019-03-01 东软集团股份有限公司 The method and apparatus for obtaining the information of necrosis link

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000628A (en) * 2006-01-13 2007-07-18 国际商业机器公司 Wrong hyperlink detection equipment and method
CN101510195A (en) * 2008-02-15 2009-08-19 刘峰 Website safety protection and test diagnosis system structure method based on crawler technology
CN102436564A (en) * 2011-12-30 2012-05-02 奇智软件(北京)有限公司 Method and device for identifying falsified webpage
KR101443071B1 (en) * 2013-12-10 2014-09-22 주식회사 브이시스템즈 Error Check System of Webpage

Also Published As

Publication number Publication date
CN108255866A (en) 2018-07-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant