CN108255866A - Method and apparatus for checking links in a website - Google Patents
Method and apparatus for checking links in a website
- Publication number
- CN108255866A CN201611248655.6A
- Authority
- CN
- China
- Prior art keywords
- page
- data set
- text
- linked object
- website
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/972—Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3688—Test management for test execution, e.g. scheduling of test suites
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and apparatus for checking links in a website. The method includes: obtaining a first page of a website to be checked and a link object in the first page, where the link object is used to jump to a second page; obtaining a first data set contained in the link object and a second data set contained in the second page; comparing the data elements in the first data set with the data elements in the second data set to obtain a comparison result; and determining, according to the comparison result, whether the link object is a false link. The invention solves the technical problem in the prior art that checking a website for false links manually is inefficient and inaccurate.
Description
Technical field
The present invention relates to the field of website testing, and in particular to a method and apparatus for checking links in a website.
Background
With the development of Internet technology, websites have become the main tool people use to obtain information from the Internet, so the quality of a website directly affects the user experience. In website quality inspection, one important indicator is the number of false links within the website, which directly affects the user's experience on the site. Suppose a user opens a website and sees a web page as shown in Fig. 1(a). A column of link titles is displayed on the left side of the page, and by clicking any link title in the column the user jumps to the web page corresponding to that title. In practice, if the user clicks the link title "Building the dream, chasing the dream, fulfilling the dream: Tiangong-2's journey of successfully connecting dreams" in that column, and the page that opens is, as shown in Fig. 1(b), a news story about people across the country celebrating the Mid-Autumn Festival, the user will conclude that the website's links are designed to trick users into clicking. In serious cases this can also cause the website to lose users. Checking a website for false links is therefore very important.
At present, the prior art relies on manual work to check a website for false links: a person clicks each link title on a web page and checks whether the title is consistent with the content of the page that actually opens, in order to judge whether the link title is a false link. The drawback is that manual checking is severely limited. A website today usually contains many web pages, so checking consumes a great deal of labor and is inefficient. In addition, manual checking depends heavily on subjective human judgment, which can be disturbed by many factors, so its accuracy is low.
No effective solution has yet been proposed for the above problem that checking a website for false links manually is inefficient and inaccurate.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for checking links in a website, to at least solve the technical problem in the prior art that checking a website for false links manually is inefficient and inaccurate.
According to one aspect of the embodiments of the present invention, a method for checking links in a website is provided, including: obtaining a first page of a website to be checked and a link object in the first page, where the link object is used to jump to a second page; extracting a first data set contained in the link object and a second data set contained in the second page; comparing the data elements in the first data set with the data elements in the second data set to obtain a comparison result; and determining, according to the comparison result, whether the link object is a false link.
Further, comparing the data elements in the first data set with the data elements in the second data set to obtain a comparison result includes: finding the data elements that the first data set and the second data set have in common; counting the number of common data elements; and calculating the ratio of the number of common data elements to the number of data elements contained in the first data set.
Further, determining whether the link object is a false link according to the comparison result includes: if the ratio is greater than or equal to a preset threshold, determining that the link object is a normal link; if the ratio is less than the preset threshold, determining that the link object is a false link.
Further, obtaining the first page of the website to be checked and the link object in the first page includes: crawling the website to be checked with a web crawler to obtain the first page of the website and the link object in the first page.
Further, obtaining the first data set contained in the link object and the second data set contained in the second page includes: extracting a first text string contained in the link object and a second text string contained in the second page; performing word segmentation on the first text string and the second text string to obtain a third data set and a fourth data set; and, according to a preset algorithm model, extracting first target data elements from the third data set into the first data set and extracting second target data elements from the fourth data set into the second data set.
Further, before extracting the first text string contained in the link object and the second text string contained in the second page, the method further includes: extracting the page content of the second page based on a text density extraction algorithm. This step includes: obtaining the document tree of the second page; extracting the text characters in each tag node of the document tree and counting the number of text characters of each tag node; calculating the text density of each tag node, where the text density is the ratio of the number of text characters of the tag node to the total number of text characters of the document tree; and extracting the text content of the tag node with the highest text density as the page content of the second page.
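The text density step can be illustrated with a minimal Python sketch built on the standard library's `html.parser`. The class and function names are hypothetical and are not taken from the patent. The sketch also assumes that the "text characters of a tag node" means the text held directly by that node (otherwise the root node would always have a density of 1) and that `script` and `style` text is ignored; the source does not specify either point.

```python
from html.parser import HTMLParser


class _DensityParser(HTMLParser):
    """Collects the text held directly by each tag node of the document tree."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag
    SKIP = {"script", "style"}                           # assumption: not page text

    def __init__(self):
        super().__init__()
        self.stack = []   # indices (into self.nodes) of currently open tags
        self.nodes = []   # one [tag, text_parts] entry per tag node

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.nodes.append([tag, []])
        self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Close the nearest open tag with this name (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.nodes[self.stack[i]][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag, parts = self.nodes[self.stack[-1]]
            if tag not in self.SKIP:
                parts.append(text)


def extract_main_text(html):
    """Return the text of the tag node with the highest text density."""
    parser = _DensityParser()
    parser.feed(html)
    parser.close()
    texts = [" ".join(parts) for _, parts in parser.nodes]
    total = sum(len(t) for t in texts)
    if total == 0:
        return ""
    # Text density of a node = its character count / total character count.
    return max(texts, key=lambda t: len(t) / total)
```

Because every density shares the same denominator, picking the highest density is equivalent to picking the node with the most text characters; the division is kept only to mirror the wording of the step.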
According to another aspect of the embodiments of the present invention, an apparatus for checking links in a website is further provided, including: an acquisition module, configured to obtain a first page of a website to be checked and a link object in the first page, where the link object is used to jump to a second page; an extraction module, configured to extract a first data set contained in the link object and a second data set contained in the second page; a comparison module, configured to compare the data elements in the first data set with the data elements in the second data set to obtain a comparison result; and a determining module, configured to determine, according to the comparison result, whether the link object is a false link.
Further, the comparison module includes: a searching module, configured to find the data elements that the first data set and the second data set have in common; a statistics module, configured to count the number of common data elements; and a first computing module, configured to calculate the ratio of the number of common data elements to the number of data elements contained in the first data set.
Further, the first determining module includes: a second determining module, configured to determine that the link object is a normal link if the ratio is greater than or equal to a preset threshold; and a third determining module, configured to determine that the link object is a false link if the ratio is less than the preset threshold.
Further, the first acquisition module includes: a third acquisition module, configured to crawl the website to be checked with a web crawler to obtain the first page of the website and the link object in the first page.
Further, the second acquisition module includes: a first extraction module, configured to extract a first text string contained in the link object and a second text string contained in the second page; a first processing module, configured to perform word segmentation on the first text string and the second text string to obtain a third data set and a fourth data set; and a second extraction module, configured to, according to a preset algorithm model, extract first target data elements from the third data set into the first data set and extract second target data elements from the fourth data set into the second data set.
Further, the apparatus also includes: a third extraction module, configured to extract the page content of the second page based on a text density extraction algorithm. The third extraction module includes: a fourth acquisition module, configured to obtain the document tree of the second page; a second processing module, configured to extract the text characters in each tag node of the document tree and count the number of text characters of each tag node; a second computing module, configured to calculate the text density of each tag node, where the text density is the ratio of the number of text characters of the tag node to the total number of text characters of the document tree; and a fourth extraction module, configured to extract the text content of the tag node with the highest text density as the page content of the second page.
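One way to picture how the four top-level modules fit together is the following Python sketch. It is only an illustration under stated assumptions: the `LinkChecker` class, its field names and the callable signatures are hypothetical, and each module is reduced to a plain function passed in by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass
class LinkChecker:
    # Acquisition module: site -> (link title, text of the page it points to).
    acquire: Callable[[str], Iterable[Tuple[str, str]]]
    # Extraction module: text -> set of data elements (entities).
    extract: Callable[[str], Set[str]]
    # Comparison module: (first data set, second data set) -> comparison result.
    compare: Callable[[Set[str], Set[str]], float]
    # Determining module: comparison result -> True if the link is a false link.
    determine: Callable[[float], bool]

    def check(self, site: str) -> List[Tuple[str, bool]]:
        """Run every link of the site through the four modules in order."""
        results = []
        for title, page_text in self.acquire(site):
            score = self.compare(self.extract(title), self.extract(page_text))
            results.append((title, self.determine(score)))
        return results
```

Keeping each module behind a callable means that, for example, a different entity extractor or threshold rule could be swapped in without touching the rest of the pipeline.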
In the embodiments of the present invention, a first page of a website to be checked and a link object in the first page are obtained, where the link object is used to jump to a second page; a first data set contained in the link object and a second data set contained in the second page are extracted; the data elements in the first data set are compared with the data elements in the second data set to obtain a comparison result; and whether the link object is a false link is determined according to the comparison result. This achieves the purpose of checking a website for false links by comparing whether every page of the website describes the same thing as the link title it belongs to, which improves the efficiency and accuracy of link checking in a website and thereby solves the technical problem in the prior art that checking a website for false links manually is inefficient and inaccurate.
Brief description of the drawings
The drawings described here are provided for a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1(a) is a schematic diagram of a website page according to the prior art;
Fig. 1(b) is a schematic diagram of a website page according to the prior art;
Fig. 2 is a flowchart of a method for checking links in a website according to an embodiment of the present invention;
Fig. 3 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention;
Fig. 4 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention;
Fig. 5 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention;
Fig. 6 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention;
Fig. 7 is a flowchart of an optional method for checking links in a website according to an embodiment of the present invention; and
Fig. 8 is a schematic diagram of an apparatus for checking links in a website according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and so on in the description, the claims and the drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
First, some of the nouns and terms that appear in the description of the embodiments of this application are explained as follows:
Wrong link: short for "false link". It mainly refers to the situation in which the actual page content that a link title within a website points to does not match that link title. In the embodiments of this application a false link is different from a broken link: a broken link is a link that cannot be accessed or that is interrupted when accessed, whereas a false link is a link whose title is inconsistent with the content described by the page it points to.
Embodiment 1
According to an embodiment of the present invention, a method embodiment for checking links in a website is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Fig. 2 is a flowchart of a method for checking links in a website according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step S202: obtain a first page of a website to be checked and a link object in the first page, where the link object is used to jump to a second page.
Specifically, in the above step, the first page may be any one or more of the pages contained in the website to be checked. The link object may be a piece of text or a picture shown on the first page that has a link address embedded in it; clicking the text or picture jumps to another web page of the website to be checked, namely the second page. To check a website for false links, it is first necessary to obtain all pages contained in the website and all link objects contained on those pages.
In an optional embodiment, take the web pages shown in Fig. 1(a) and Fig. 1(b) as an example. The web page shown in Fig. 1(a) may be the first page. It contains six link objects, each a link title: "China Railway expected to carry 7.8 million passengers on the second day of the Mid-Autumn holiday", "Warm-up for following the Premier to America: a three-nation trip with highlights worth looking forward to", "Building the dream, chasing the dream, fulfilling the dream: Tiangong-2's journey of successfully connecting dreams", "China's future space station can stay in orbit for more than ten years", "Sharing one bright moon, cherishing family and country: people across the country celebrate the Mid-Autumn Festival" and "Making the urban and rural environment cleaner and more beautiful". Each link title points to one page, namely a second page. By clicking any link title on the first page the user enters the corresponding second page. For example, clicking the link title "Sharing one bright moon, cherishing family and country: people across the country celebrate the Mid-Autumn Festival" opens the page shown in Fig. 1(b).
It should be noted here that a complete website generally contains multiple web pages, and each web page usually contains one or more link objects, each pointing to a hyperlink target. The target may be a page or another position on the same page. Checking a website for false links requires checking whether the link objects on all pages of the website are false links.
Step S204: obtain a first data set contained in the link object and a second data set contained in the second page.
Specifically, in the above step, taking a link object that is a piece of text as an example, the first data set may be the set of entities extracted from a link title in the first page. It may be one or several words that represent the meaning of the link title. The second data set may be the set of entities extracted from the second page that the link title points to. It may likewise be one or several words that characterize the information contained in the second page.
In an optional embodiment, still taking the web pages shown in Fig. 1(a) and Fig. 1(b) as an example, entities such as "Tiangong-2" can be extracted from the first-page link title "Building the dream, chasing the dream, fulfilling the dream: Tiangong-2's journey of successfully connecting dreams", so the first data set contained in that link title is "Tiangong-2". Entities such as "moon" and "Mid-Autumn Festival" can be extracted from the first-page link title "Sharing one bright moon, cherishing family and country: people across the country celebrate the Mid-Autumn Festival", so the first data set contained in that link title is "moon, Mid-Autumn Festival". Entities such as "moon", "Mid-Autumn Festival", "university" and "school-badge mooncake" can be extracted from the page content of the second page shown in Fig. 1(b), so the second data set contained in the second page is "moon, Mid-Autumn Festival, university, school-badge mooncake, ...".
It should be noted here that, because a link object may be text or a picture, the entity types contained in the first data set and the second data set are not limited to words and may also take other forms such as pictures.
Step S206: compare the data elements in the first data set with the data elements in the second data set to obtain a comparison result.
Specifically, in the above step, the data elements may be the entities contained in the first data set and the second data set. After the first data set contained in a link object on the first page and the second data set contained in the second page that the link object points to have been extracted, the data elements in the first data set are compared with the data elements in the second data set to obtain the corresponding comparison result.
It should be noted here that the number of main entities extracted from a page is generally far greater than the number of main entities extracted from a link title. In an optional embodiment, the comparison result may therefore be whether the second data set contains all of the data elements of the first data set.
Step S208: determine, according to the comparison result, whether the link object is a false link.
Specifically, in the above step, after the data elements in the first data set have been compared with the data elements in the second data set, in an optional embodiment whether the link title is a false link can be determined by checking whether the second data set extracted from the second page contains all of the data elements of the first data set extracted from the link title. If the second data set contains all of the data elements of the first data set, that is, the set of entities extracted from the page content of the second page contains the entities extracted from the link title, the link title is determined to be a normal link. If the second data set does not contain the data elements of the first data set, that is, the set of entities extracted from the page content of the second page does not contain the entities extracted from the link title, the link title is determined to be a false link.
In an optional embodiment, still taking the web pages shown in Fig. 1(a) and Fig. 1(b) as an example, the second data set contained in the second page shown in Fig. 1(b) is "moon, Mid-Autumn Festival, university, school-badge mooncake, ...". This data set contains the first data set "moon, Mid-Autumn Festival" extracted from the first-page link title "Sharing one bright moon, cherishing family and country: people across the country celebrate the Mid-Autumn Festival". It does not contain the first data set "Tiangong-2" extracted from the first-page link title "Building the dream, chasing the dream, fulfilling the dream: Tiangong-2's journey of successfully connecting dreams". Therefore, if the link title "Sharing one bright moon, cherishing family and country: people across the country celebrate the Mid-Autumn Festival" points to the second page shown in Fig. 1(b), the link title and the page it points to describe the same thing, which indicates that the link title is a normal link. If the link title "Building the dream, chasing the dream, fulfilling the dream: Tiangong-2's journey of successfully connecting dreams" points to the second page shown in Fig. 1(b), the link title and the page it points to do not describe the same thing, which indicates that the link title is a false link.
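The containment check in this example can be written as a short Python sketch. The function name is hypothetical, and the sketch assumes that anything short of full containment counts as a false link, which is one reading of the optional embodiment above; it also treats an empty title entity set as trivially contained.

```python
def is_false_link(title_entities, page_entities):
    """Return True when the page lacks at least one entity from the link title."""
    missing = set(title_entities) - set(page_entities)
    return bool(missing)


# Entity sets taken from the Fig. 1(a) / Fig. 1(b) example in the text.
page = {"moon", "Mid-Autumn Festival", "university", "school-badge mooncake"}
mid_autumn_title = {"moon", "Mid-Autumn Festival"}
tiangong_title = {"Tiangong-2"}
```

With these sets, `is_false_link(mid_autumn_title, page)` evaluates to `False` (normal link) and `is_false_link(tiangong_title, page)` evaluates to `True` (false link), matching the conclusion drawn in the example.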
As can be seen from the above, in the above embodiments of this application, all page content of the website to be checked and the link titles that the pages belong to are obtained, and entity extraction is performed on the page content and on those link titles. After the main entity objects of the page content and of the link title are obtained, the main entity objects in the page content are compared with those in the link title it belongs to, and whether the link title is a false link is determined according to the comparison result. It should be noted that, because the number of main entities extracted from page content generally exceeds the number of main entities extracted from a link title, in an optional embodiment whether the link title that a page belongs to is a false link can be determined by checking whether the set of entities extracted from the page content contains the entities extracted from that link title. The solution disclosed in the above embodiments achieves the purpose of checking a website for false links by comparing whether every page of the website describes the same thing as the link title it belongs to, which improves the efficiency and accuracy of link checking in a website and thereby solves the technical problem in the prior art that checking a website for false links manually is inefficient and inaccurate.
In an optional embodiment, as shown in Fig. 3, comparing the data elements in the first data set with the data elements in the second data set to obtain a comparison result may include the following steps:
Step S302: find the data elements that the first data set and the second data set have in common;
Step S304: count the number of common data elements;
Step S306: calculate the ratio of the number of common data elements to the number of data elements contained in the first data set.
Specifically, in the above embodiment, after the first data set has been extracted from a link title in the first page and the second data set has been extracted from the second page that the link title points to, the data elements in the first data set can be compared with the data elements in the second data set to find the data elements the two sets have in common, that is, the entities shared by the link title and the second page it points to, and the number of shared entities can be counted. In an optional embodiment, the ratio of the number of shared entities to the number of entities contained in the first data set can then be calculated.
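Steps S302 to S306 amount to a set intersection followed by a division. The Python sketch below shows one way to compute the ratio; the function name is hypothetical, and returning 0.0 for an empty first data set is an assumption, since the source does not say how that case is handled.

```python
def overlap_ratio(first_data_set, second_data_set):
    """Ratio of shared data elements to the size of the first data set."""
    first, second = set(first_data_set), set(second_data_set)
    if not first:
        return 0.0  # assumption: no title entities means nothing can match
    shared = first & second          # S302: find the common data elements
    count = len(shared)              # S304: count them
    return count / len(first)        # S306: divide by the first set's size
```

For instance, a title with the entities "moon" and "Tiangong-2" compared against a page whose entities include only "moon" shares one of two elements, giving a ratio of 0.5.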
In an optional embodiment, as shown in Fig. 4, determining whether the link object is a false link according to the comparison result includes:
Step S402: if the ratio is greater than or equal to a preset threshold, determine that the link object is a normal link.
Step S404: if the ratio is less than the preset threshold, determine that the link object is a false link.
Specifically, in the above embodiment, after the ratio of the number of shared entities to the number of entities contained in the first data set has been calculated, the ratio is compared with the preset threshold to determine whether the link object is a false link. If the ratio is greater than or equal to the preset threshold, the link title is determined to be consistent with what the page it points to describes, and the link title is a normal link. If the ratio is less than the preset threshold, the link title is determined to be inconsistent with what the page it points to describes, and the link title is a false link.
In an optional embodiment, the preset threshold may be any value greater than 0.5 and no greater than 1.
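The threshold decision of steps S402 and S404 can be sketched as follows. The names `LinkStatus` and `classify_link` are hypothetical, and the default threshold of 0.6 is only an illustrative value chosen from the range mentioned above, not a value given in the source.

```python
from enum import Enum


class LinkStatus(Enum):
    NORMAL = "normal link"
    FALSE = "false link"


def classify_link(ratio, threshold=0.6):
    """S402/S404: a ratio at or above the threshold means a normal link."""
    return LinkStatus.NORMAL if ratio >= threshold else LinkStatus.FALSE
```

Note that the comparison is inclusive: a ratio exactly equal to the threshold is classified as a normal link, as step S402 states.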
With the above embodiment, a machine can judge whether a link title is consistent with the web page it points to, which avoids the influence of subjective factors in manual judgment and makes the judgment criteria more standardized.
In an optional embodiment, as shown in Fig. 5, obtaining the first page of the website to be checked and the link object in the first page in step S202 may include:
Step S502: crawl the website to be checked with a web crawler to obtain the first page of the website and the link object in the first page.
Specifically, in the above embodiment, crawling is used to obtain all pages of the website to be checked and the link objects on all of those pages.
It should be noted that during crawling the source title of each page needs to be recorded, that is, the link title the page belongs to; clicking that link title jumps to the page.
With the above embodiment, because the crawler crawls the content of the entire website, the coverage of the check is more comprehensive than that of a manual check.
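A crawl that records the source title of every target page could look like the following Python sketch, which uses only the standard library. It is an assumption-laden illustration: `LinkExtractor`, `crawl`, the `fetch` callable (any function that returns the HTML of a URL) and the `max_pages` limit are hypothetical names, and the sketch stays within the start URL's host, which the source does not require explicitly.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects (absolute target URL, link title text) pairs from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())
            self.links.append((self._href, title))
            self._href = None


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns (source page, link title, target page)."""
    site = urlparse(start_url).netloc
    queue, seen, records = deque([start_url]), {start_url}, []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        extractor = LinkExtractor(url)
        extractor.feed(fetch(url))
        extractor.close()
        fetched += 1
        for target, title in extractor.links:
            records.append((url, title, target))  # remember the source title
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return records
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client and lets it be exercised against an in-memory dictionary of pages.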
In a kind of optional embodiment, as shown in fig. 6, obtaining the first data set and second page that linked object is included
The second data set that face is included, including:
Step S602, the second text that the first text-string and second page that extraction linked object is included are included
Character string;
First text-string and the second text-string are carried out word segmentation processing, obtain third data set by step S604
With the 4th data set;
Step S606 according to preset algorithm model, extracts first object data element in third data set and is put into the first number
According to collection, and the second target data element extracted in the 4th data set is put into the second data set.
Specifically, in the above embodiment, natural language analysis is used to perform entity extraction on the page content and on the link title separately, yielding the main entity objects of each. First, the text string contained in the linked object and the text string contained in the second page are obtained. Word segmentation is then performed on the first text string and the second text string, giving a third data set that contains all words in the link title and a fourth data set that contains all words in the second page. Finally, according to a preset extraction algorithm model, the entity objects of the link title are extracted and put into the first data set, and the entity objects of the second page are put into the second data set. These entity objects characterize the semantic information contained in the link title and in the second page.
In an optional embodiment, natural language analysis may be used to extract the first target data elements from the third data set into the first data set and the second target data elements from the fourth data set into the second data set. The preset algorithm model includes, but is not limited to, the following: the KNN algorithm, the naive Bayes algorithm, decision tree algorithms, neural network algorithms, linear least squares, the K-Means algorithm, and cosine similarity.
Natural language analysis makes it possible to obtain the information contained in the page content and the link title more intelligently, and thus to determine whether the thing described by the page content is consistent with the thing described by the link title the page belongs to, which improves the efficiency and accuracy of the check.
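Steps S602 to S606 can be sketched in Python as follows. The text does not fix a particular segmenter or extraction model, so everything here is a stand-in chosen for illustration: `segment` splits on non-alphanumeric characters (a Chinese-language system would need a real word segmenter), and `extract_entities` replaces the "preset algorithm model" with a simple stop-word filter plus frequency ranking. The `STOP_WORDS` list and the `top_k` parameter are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "our"}


def segment(text):
    """Step S604 stand-in: split a text string into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_entities(tokens, top_k=5):
    """Step S606 stand-in: keep the top_k most frequent non-stop-word tokens."""
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return {word for word, _ in counts.most_common(top_k)}


link_title = "Annual report 2016"
page_text = ("The annual report for 2016 is available. "
             "The report covers revenue and the annual growth of our products.")

third_set = segment(link_title)            # all words of the link title
fourth_set = segment(page_text)            # all words of the second page
first_set = extract_entities(third_set)    # entities of the link title
second_set = extract_entities(fourth_set)  # entities of the second page
```

The two resulting sets play the role of the first and second data sets. Swapping in a trained model, such as one of the algorithm families listed above, would only change the body of `extract_entities`.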
In an optional embodiment, as shown in Fig. 7, before extracting the first text string contained in the linked object and the second text string contained in the second page, the method may further include:
Step S702: extracting the page content of the second page based on a text density extraction algorithm, which includes:
Step S7021: obtaining the document tree of the second page;
Step S7023: extracting the text characters in each tag node of the document tree, and counting the number of text characters in each tag node of the document tree;
Step S7025: calculating the text density of each tag node, where the text density is the ratio of the number of text characters in the tag node to the total number of text characters in the document tree;
Step S7027: extracting the text content of the tag node with the highest text density as the page content of the second page.
Specifically, in the above embodiment, the page content of a page must be obtained before the entity objects contained in the web page are extracted. In an optional embodiment, the page content can be extracted with a text density extraction algorithm. First, a tree structure conforming to the DOM (Document Object Model) standard published by the W3C is built from the HTML content of the web page. Each tag node of the page's DOM tree is then traversed, the tags containing body text are located using Chinese punctuation and link information, and a second extraction is performed on the content of those tags to obtain the accurate body content. After the text content of each tag node has been extracted, the number of text characters in each tag node is counted and the text density of each tag node is calculated. The text content of the tag node with the highest text density is most likely the body content of the page, so it is taken as the page content of the second page.
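The text density steps S7021 to S7027 can be sketched in Python as below. This is one possible reading, not the patent's code. In particular, the sketch assumes that a tag node's character count covers only the text placed directly inside that tag; if descendant text were included, the root node would always have density 1. The names `DensityParser` and `extract_main_text`, and the `VOID_TAGS` and `SKIPPED_TAGS` sets, are introduced here for illustration, and the punctuation-based second extraction mentioned above is left out.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag
SKIPPED_TAGS = {"script", "style"}                         # not visible text


class DensityParser(HTMLParser):
    """Records the text found directly inside each tag node of a page."""

    def __init__(self):
        super().__init__()
        self.nodes = []    # one entry per tag node: [tag name, text fragments]
        self._open = []    # stack of indices into self.nodes for open tags

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.nodes.append([tag, []])
        self._open.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Pop back to the most recent open tag with this name, if any.
        for depth in range(len(self._open) - 1, -1, -1):
            if self.nodes[self._open[depth]][0] == tag:
                del self._open[depth:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            node = self.nodes[self._open[-1]]
            if node[0] not in SKIPPED_TAGS:
                node[1].append(text)


def extract_main_text(html):
    """Return the text of the tag node with the highest text density."""
    parser = DensityParser()
    parser.feed(html)
    texts = [" ".join(parts) for _, parts in parser.nodes]
    total = sum(len(t) for t in texts)
    if total == 0:
        return ""
    # Density = characters in this node / characters in the whole tree.
    densities = [len(t) / total for t in texts]
    return texts[densities.index(max(densities))]


html = """<html><head><title>News</title></head><body>
<div id="nav"><a href="/">Home</a><a href="/about">About</a></div>
<div id="main">The company released its annual report today, covering revenue and growth.</div>
<div id="footer">Contact us</div></body></html>"""
main_text = extract_main_text(html)
```

For this sample page, the navigation links and footer each hold only a few characters, so the `main` block has the highest density and is returned as the page content.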
It should be noted that data on a web page appears in the form of HTML. An HTML document consists of tags and elements, and most HTML tags occur in pairs that serve as the start tag and the end tag. For example, the title of the content a web page displays is usually marked by <TITLE></TITLE>, and the main content of the page is mostly marked by several <P></P> pairs. Therefore, during information extraction, this characteristic of how HTML documents are written can be used to extract the <TITLE></TITLE> and <P></P> tags and the content between them.
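As a rough illustration of extracting the content between paired title and paragraph tags, the following Python sketch uses the standard-library `html.parser` module. The class name `TitleParagraphParser` and the sample document are assumptions made for this example, and the sketch does not handle malformed or deeply nested markup.

```python
from html.parser import HTMLParser


class TitleParagraphParser(HTMLParser):
    """Collects the text between <title>...</title> and <p>...</p> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None   # "title" or "p" while inside such a tag
        self._buffer = []      # text fragments of the tag being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag in ("title", "p"):
            self._flush()      # an unclosed <p> ends where the next one starts
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def _flush(self):
        """Store the buffered text under the tag that was being read."""
        if self._current is not None:
            text = "".join(self._buffer).strip()
            if self._current == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
        self._current = None
        self._buffer = []


doc = ("<HTML><HEAD><TITLE>Annual report</TITLE></HEAD><BODY>"
       "<P>First paragraph.</P><P>Second <B>bold</B> paragraph.</P>"
       "</BODY></HTML>")
parser = TitleParagraphParser()
parser.feed(doc)
parser.close()
```

After parsing, `parser.title` holds the page title and `parser.paragraphs` holds the text of each paragraph, with inline markup such as the bold tag stripped.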
In an optional embodiment, taking the page shown in Fig. 1(b) as an example, only entity elements extracted from the text content of the body part can be used to characterize the information of the second page, whereas the text content of parts such as the navigation bar and link labels only interferes with the extraction result. Therefore, based on the above embodiment, after the text content of the multiple tags in the second page has been obtained, taking the text content of the tag with the highest text density as the page content of the second page allows the entity elements that characterize the meaning of the second page to be obtained more accurately.
With the above embodiment, the text content that characterizes the semantic information of the page can be extracted and irrelevant text content discarded, which improves accuracy.
As a preferred embodiment, the solution disclosed in the above embodiments of the present application may be implemented by three modules: a website content crawling module, a title and content entity extraction module, and an entity comparison module. The website content crawling module is responsible for obtaining all page content in the website to be checked and the link title each page belongs to. The title and content entity extraction module is responsible for analyzing and processing the link titles and page content crawled by the website content crawling module; it uses natural language analysis to perform entity extraction on the page content and the link titles, obtaining the main entity objects of each. The entity comparison module compares the entity objects of the link title with those of the page content to finally determine whether the link is a false link.
With the solution disclosed in the above embodiment, the entire false-link judgment flow, from crawling to parsing to entity comparison, simulates the summarized logic of a manual check. A program can therefore check a website for false links quickly, which greatly reduces the cost of manual checking. The technology used in each module also has a certain flexibility, so as technology advances, each module has good alternative implementations.
Embodiment 2
According to an embodiment of the present invention, an embodiment of an apparatus for checking links in a website is further provided. The method for checking links in a website of Embodiment 1 of the present invention may be performed in the apparatus of Embodiment 2 of the present invention.
Fig. 8 is a schematic diagram of an apparatus for checking links in a website according to an embodiment of the present invention. As shown in Fig. 8, the apparatus includes a first acquisition module 801, a second acquisition module 803, a comparison module 805, and a first determining module 807.
The first acquisition module 801 is configured to obtain the first page of the website to be checked and the linked object in the first page, where the linked object is used to jump to a second page. The second acquisition module 803 is configured to extract the first data set contained in the linked object and the second data set contained in the second page. The comparison module 805 is configured to compare the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result. The first determining module 807 is configured to determine, according to the comparison result, whether the linked object is a false link.
As can be seen from the above, in the above embodiments of the present application, all page content in the website to be checked and the link titles the pages belong to are obtained, and entity extraction is performed on each page's content and its link title. After the main entity objects of the page content and of its link title have been obtained, they are compared, and whether the link title is a false link is determined according to the comparison result. It should be noted that the number of main entities extracted from the page content is generally greater than the number extracted from the link title. In an optional embodiment, whether the link title a page belongs to is a false link can therefore be determined by checking whether the entity set extracted from the page content contains the entities extracted from that link title. The solution disclosed in the above embodiment checks for false links in a website by comparing whether the things described by every page of the website to be checked and by the link title it belongs to are consistent. This achieves the technical effect of improving the efficiency and accuracy of link checking in a website, and thereby solves the technical problem that the prior art, which checks for false links manually, is inefficient and not very accurate.
In an optional embodiment, the comparison module 805 includes: a searching module, configured to search for the data elements that are identical in the first data set and the second data set; a statistics module, configured to count the number of identical data elements; and a first calculation module, configured to calculate the ratio of the number of identical data elements to the number of data elements contained in the first data set.
In an optional embodiment, the first determining module 807 includes: a second determining module, configured to determine that the linked object is a normal link if the ratio is greater than or equal to a preset threshold; and a third determining module, configured to determine that the linked object is a false link if the ratio is less than the preset threshold.
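The comparison and determination logic described above reduces to a set intersection, a ratio, and a threshold test. The following Python sketch shows one way to write it. The function name `is_false_link`, the default threshold of 0.5, and the treatment of an empty first data set are assumptions, since the text does not specify a threshold value or that edge case.

```python
def is_false_link(first_set, second_set, threshold=0.5):
    """Return True if the linked object is judged to be a false link.

    first_set:  entities extracted from the link title
    second_set: entities extracted from the page the link jumps to
    threshold:  hypothetical preset threshold
    """
    if not first_set:
        # Assumption: a link title with no entities cannot be confirmed.
        return True
    identical = first_set & second_set          # searching module
    ratio = len(identical) / len(first_set)     # statistics + calculation
    return ratio < threshold                    # determining modules


title_entities = {"annual", "report", "2016"}
matching_page = {"annual", "report", "revenue", "growth"}
unrelated_page = {"careers", "jobs", "positions"}

ok_result = is_false_link(title_entities, matching_page)     # ratio 2/3
bad_result = is_false_link(title_entities, unrelated_page)   # ratio 0
```

The ratio is taken over the first data set because a page usually yields more entities than its link title, so dividing by the size of the page's entity set would penalize long pages.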
In an optional embodiment, the first acquisition module 801 includes: a third acquisition module, configured to crawl the website to be checked with a crawler to obtain the first page of the website to be checked and the linked object in the first page.
In an optional embodiment, the second acquisition module 803 includes: a first extraction module, configured to extract the first text string contained in the linked object and the second text string contained in the second page; a first processing module, configured to perform word segmentation on the first text string and the second text string to obtain a third data set and a fourth data set; and a second extraction module, configured to extract, according to a preset algorithm model, first target data elements from the third data set and put them into the first data set, and to extract second target data elements from the fourth data set and put them into the second data set.
In an optional embodiment, the apparatus further includes: a third extraction module, configured to extract the page content of the second page based on a text density extraction algorithm. The third extraction module includes: a fourth acquisition module, configured to obtain the document tree of the second page; a second processing module, configured to extract the text characters in each tag node of the document tree and count the number of text characters in each tag node of the document tree; a second calculation module, configured to calculate the text density of each tag node, where the text density is the ratio of the number of text characters in the tag node to the total number of text characters in the document tree; and a fourth extraction module, configured to extract the text content of the tag node with the highest text density as the page content of the second page.
The above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, refer to the related description of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A method for checking links in a website, characterized by comprising:
obtaining a first page of a website to be checked and a linked object in the first page, wherein the linked object is used to jump to a second page;
obtaining a first data set contained in the linked object and a second data set contained in the second page;
comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result;
determining, according to the comparison result, whether the linked object is a false link.
2. The method according to claim 1, characterized in that comparing the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result comprises:
searching for the data elements that are identical in the first data set and the second data set;
counting the number of the identical data elements;
calculating the ratio of the number of the identical data elements to the number of data elements contained in the first data set.
3. The method according to claim 2, characterized in that determining, according to the comparison result, whether the linked object is a false link comprises:
if the ratio is greater than or equal to a preset threshold, determining that the linked object is a normal link;
if the ratio is less than the preset threshold, determining that the linked object is a false link.
4. The method according to claim 1, characterized in that obtaining the first page of the website to be checked and the linked object in the first page comprises:
crawling the website to be checked with a crawler to obtain the first page of the website to be checked and the linked object in the first page.
5. The method according to claim 1, characterized in that obtaining the first data set contained in the linked object and the second data set contained in the second page comprises:
extracting a first text string contained in the linked object and a second text string contained in the second page;
performing word segmentation on the first text string and the second text string to obtain a third data set and a fourth data set;
according to a preset algorithm model, extracting first target data elements from the third data set and putting them into the first data set, and extracting second target data elements from the fourth data set and putting them into the second data set.
6. The method according to claim 5, characterized in that, before extracting the first text string contained in the linked object and the second text string contained in the second page, the method further comprises:
extracting the page content of the second page based on a text density extraction algorithm, which comprises:
obtaining the document tree of the second page;
extracting the text characters in each tag node of the document tree, and counting the number of text characters in each tag node;
calculating the text density of each tag node, wherein the text density is the ratio of the number of text characters in the tag node to the total number of text characters in the document tree;
extracting the text content of the tag node with the highest text density as the page content of the second page.
7. An apparatus for checking links in a website, characterized by comprising:
a first acquisition module, configured to obtain a first page of a website to be checked and a linked object in the first page, wherein the linked object is used to jump to a second page;
a second acquisition module, configured to obtain a first data set contained in the linked object and a second data set contained in the second page;
a comparison module, configured to compare the data elements contained in the first data set with the data elements contained in the second data set to obtain a comparison result;
a first determining module, configured to determine, according to the comparison result, whether the linked object is a false link.
8. The apparatus according to claim 7, characterized in that the comparison module comprises:
a searching module, configured to search for the data elements that are identical in the first data set and the second data set;
a statistics module, configured to count the number of the identical data elements;
a first calculation module, configured to calculate the ratio of the number of the identical data elements to the number of data elements contained in the first data set.
9. The apparatus according to claim 8, characterized in that the first determining module comprises:
a second determining module, configured to determine that the linked object is a normal link if the ratio is greater than or equal to a preset threshold;
a third determining module, configured to determine that the linked object is a false link if the ratio is less than the preset threshold.
10. The apparatus according to claim 7, characterized in that the first acquisition module comprises:
a third acquisition module, configured to crawl the website to be checked with a crawler to obtain the first page of the website to be checked and the linked object in the first page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611248655.6A CN108255866B (en) | 2016-12-29 | 2016-12-29 | Method and device for checking links in website |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611248655.6A CN108255866B (en) | 2016-12-29 | 2016-12-29 | Method and device for checking links in website |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108255866A (en) | 2018-07-06 |
CN108255866B (en) | 2020-10-27 |
Family
ID=62721341
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611248655.6A Active CN108255866B (en) | 2016-12-29 | 2016-12-29 | Method and device for checking links in website |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108255866B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408760A (en) * | 2018-09-30 | 2019-03-01 | 东软集团股份有限公司 | The method and apparatus for obtaining the information of necrosis link |
CN110889051A (en) * | 2018-09-10 | 2020-03-17 | 阿里巴巴集团控股有限公司 | Page hyperlink detection method, device and equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101000628A (en) * | 2006-01-13 | 2007-07-18 | 国际商业机器公司 | Wrong hyperlink detection equipment and method |
CN101510195A (en) * | 2008-02-15 | 2009-08-19 | 刘峰 | Website safety protection and test diagnosis system structure method based on crawler technology |
CN102436564A (en) * | 2011-12-30 | 2012-05-02 | 奇智软件(北京)有限公司 | Method and device for identifying falsified webpage |
KR101443071B1 (en) * | 2013-12-10 | 2014-09-22 | 주식회사 브이시스템즈 | Error Check System of Webpage |
Also Published As
Publication number | Publication date |
---|---|
CN108255866B (en) | 2020-10-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101464905B (en) | Web page information extraction system and method | |
CN104408093B (en) | A kind of media event key element abstracting method and device | |
CN102841920B (en) | Method and device for extracting webpage frame information | |
CN113051500B (en) | Phishing website identification method and system fusing multi-source data | |
CN103942340A (en) | Microblog user interest recognizing method based on text mining | |
JP2006004417A (en) | Method and device for recognizing specific type of information file | |
CN103136358B (en) | A kind of method of Automatic Extraction forum data | |
CN102663023A (en) | Implementation method for extracting web content | |
CN103336766A (en) | Short text garbage identification and modeling method and device | |
CN106934275A (en) | A kind of password intensity evaluating method based on personal information | |
CN106951571A (en) | A kind of method and apparatus for giving application mark label | |
CN109271627A (en) | Text analyzing method, apparatus, computer equipment and storage medium | |
CN109194677A (en) | A kind of SQL injection attack detection, device and equipment | |
CN106126502A (en) | A kind of emotional semantic classification system and method based on support vector machine | |
CN104537028B (en) | A kind of Web information processing method and device | |
CN108170678A (en) | A kind of text entities abstracting method and system | |
CN104199838B (en) | A kind of user model constructing method based on label disambiguation | |
CN112633431A (en) | Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC | |
US20160283582A1 (en) | Device and method for detecting similar text, and application | |
CN107145591A (en) | Title-based webpage effective metadata content extraction method | |
CN108255866A (en) | Check the method and apparatus linked in website | |
CN106485525A (en) | Information processing method and device | |
CN114780709A (en) | Text matching method and device and electronic equipment | |
CN107239520A (en) | A kind of universal forum context extraction method | |
CN109347873A (en) | A kind of detection method, device and the computer equipment of order injection attacks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | ||