CN104123363B - Webpage master map extracting method and device - Google Patents
- Publication number
- CN104123363B (application CN201410346226.7A)
- Authority
- CN
- China
- Prior art keywords
- picture
- webpage
- text
- master map
- html
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a webpage master map extracting method and device. The method includes: obtaining the HTML text of a webpage, performing a simulated typesetting display of the HTML text, and obtaining the visual information of each HTML element in the webpage; cutting the HTML text into units of block messages; obtaining the text information in each block message, and obtaining pictorial information from the block message according to the visual information; obtaining, according to the pictorial information, the pictures that meet a predetermined vision requirement, and, according to the text information and the pictorial information, further selecting from those pictures the picture that satisfies a screening rule and using that picture as the master map of the webpage. With this technical solution, master map selection can reach high accuracy and efficiency.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to a webpage master map extracting method and device.
Background technology
With the development of Internet technology and of the Hypertext Markup Language (HTML), webpages take increasingly diverse forms, and one of the trends is the large number of pictures appearing in webpages. Compared with traditional text, pictures have unique advantages in attracting attention and conveying meaning. Many search engines therefore now provide, in addition to a title and an abstract, a master map extracted from the webpage in each search result.
As shown in Fig. 1, in the prior art, search engine results contain more and more pictures, which helps users identify the information they are looking for and improves the click-through rate. Likewise, in Internet advertising, display advertisements have a greater advantage than pure text-link advertisements, because users can see the product information at a glance. Technology for extracting a master map from a webpage is therefore very important for improving the user search experience and the click-through rate, and a webpage master map extracting method is urgently needed.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a webpage master map extracting method and device that overcome the above problems or at least partly solve them.
The present invention provides a webpage master map extracting method, including: obtaining the HTML text of a webpage, performing a simulated typesetting display of the HTML text, and obtaining the visual information of each HTML element in the webpage; cutting the HTML text into units of block messages; obtaining the text information in each block message, and obtaining pictorial information from the block message according to the visual information; obtaining, according to the pictorial information, the pictures that meet a predetermined vision requirement, and, according to the text information and the pictorial information, further selecting from those pictures the picture that satisfies a screening rule and using that picture as the master map of the webpage.
Preferably, obtaining the HTML text of the webpage specifically includes: obtaining the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
Preferably, the visual information includes: the position information and size information of each HTML element of the webpage in the simulated typesetting display.
Preferably, the text information includes: non-hyperlink text length, hyperlink text length, number of hyperlinks, a hyperlink array and a picture array.
Preferably, the pictorial information includes: the URL of the picture link, the explanatory text of the picture, the length of the picture, the width of the picture, and the ordinate and abscissa of the picture in the simulated typesetting display.
Preferably, obtaining the pictorial information specifically includes: extracting the URL of the picture link and the explanatory text of the picture from the block message; calculating the length and width of the picture according to a preset algorithm priority; and obtaining, according to the visual information, the ordinate and abscissa of the picture in the simulated typesetting display.
Preferably, calculating the length and width of the picture according to the preset algorithm priority specifically includes at least one of the following: the highest-priority algorithm obtains the length and width of the picture from the HTML markup in the image tag; the second-priority algorithm fetches the picture and obtains its length and width with image-processing software; the third-priority algorithm obtains the length and width of the picture from the document object model (DOM) information in a browser display engine.
Preferably, the predetermined vision requirement includes: the position of the picture lies within a predetermined region, and the length, width and aspect ratio of the picture meet predetermined requirements.
Preferably, the screening rule specifically includes at least one of the following: taking a picture located between the webpage navigation bar or menu and the long text as the master map; in a group of pictures of identical size, selecting the first picture as the master map; for a webpage of the search-results-page type, choosing the first picture as the master map; taking the largest picture in the visible area as the master map; calculating the correlation between the explanatory text of each picture and the webpage topic, and taking the picture with the highest correlation as the master map; and, when the webpage is a website homepage or a special-topic page, choosing the website logo as the master map.
The present invention also provides a webpage master map extraction device, including: a webpage capture module, configured to obtain the HTML text of a webpage, perform a simulated typesetting display of the HTML text, and obtain the visual information of each HTML element in the webpage; an HTML parsing module, configured to cut the HTML text into units of block messages; a data obtaining module, configured to obtain the text information in each block message and to obtain pictorial information from the block message according to the visual information; and a screening module, configured to obtain, according to the pictorial information, the pictures that meet a predetermined vision requirement, and, according to the text information and the pictorial information, to further select from those pictures the picture that satisfies a screening rule and use that picture as the master map of the webpage.
Preferably, the webpage capture module is specifically configured to: obtain the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
Preferably, the visual information includes: the position information and size information of each HTML element of the webpage in the simulated typesetting display.
Preferably, the text information includes: non-hyperlink text length, hyperlink text length, number of hyperlinks, a hyperlink array and a picture array.
Preferably, the pictorial information includes: the URL of the picture link, the explanatory text of the picture, the length of the picture, the width of the picture, and the ordinate and abscissa of the picture in the simulated typesetting display.
Preferably, the data obtaining module is specifically configured to: extract the URL of the picture link and the explanatory text of the picture from the block message; calculate the length and width of the picture according to a preset algorithm priority; and obtain, according to the visual information, the ordinate and abscissa of the picture in the simulated typesetting display.
Preferably, the highest-priority algorithm obtains the length and width of the picture from the HTML markup in the image tag; the second-priority algorithm fetches the picture and obtains its length and width with image-processing software; the third-priority algorithm obtains the length and width of the picture from the document object model (DOM) information in a browser display engine.
Preferably, the predetermined vision requirement includes: the position of the picture lies within a predetermined region, and the length, width and aspect ratio of the picture meet predetermined requirements.
Preferably, the screening rule specifically includes at least one of the following: taking a picture located between the webpage navigation bar or menu and the long text as the master map; in a group of pictures of identical size, selecting the first picture as the master map; for a webpage of the search-results-page type, choosing the first picture as the master map; taking the largest picture in the visible area as the master map; calculating the correlation between the explanatory text of each picture and the webpage topic, and taking the picture with the highest correlation as the master map; and, when the webpage is a website homepage or a special-topic page, choosing the website logo as the master map.
The beneficial effects of the present invention are as follows:
Candidate master maps of the webpage are selected according to the pictorial information, and the candidates are then refined according to the screening rules, so that master map selection reaches high accuracy. In addition, because the technical solution of the embodiments positions pictures by visual region, the number of candidate pictures that need to be computed is greatly reduced, which greatly improves the extraction speed of the master map.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention can be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a schematic diagram of a search engine results page showing webpage master maps in the prior art;
Fig. 2 is a flow chart of the webpage master map extracting method of the embodiment of the present invention;
Fig. 3 is a processing schematic diagram of the webpage master map extracting method of the embodiment of the present invention;
Fig. 4 is a schematic diagram of master map screening example 1 of the embodiment of the present invention;
Fig. 5 is a schematic diagram of master map screening example 2 of the embodiment of the present invention;
Fig. 6 is a schematic diagram of master map screening example 3 of the embodiment of the present invention;
Fig. 7 is a schematic diagram of master map screening example 4 of the embodiment of the present invention;
Fig. 8 is a schematic diagram of master map screening example 5 of the embodiment of the present invention;
Fig. 9 is a schematic diagram of master map screening example 6 of the embodiment of the present invention;
Fig. 10 is a structural schematic diagram of the webpage master map extraction device of the embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Methods for extracting a webpage master map may include the following two modes:
Mode one: statistics based on user behavior. This method rests on the assumption that the more users click a picture in a webpage, the more important the picture is. The specific technical solution is as follows: first, count the user clicks on every picture in each webpage; then select the picture with the most user clicks as the webpage master map. This technical solution has the following problems:
1. Low recall: not every picture has user click behavior, and some pictures have no link at all.
2. Poor timeliness: for newly created webpages there is no user behavior information, so no picture can be extracted.
3. Confidence: when pictures have few clicks, deviations arise easily, and many small companies cannot obtain user behavior data as abundant as that of large companies.
4. User behavior bias: for example, if a webpage contains a picture of a sexy woman, that picture attracts more attention and therefore obtains more clicks.
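As a rough illustration of mode one, the sketch below picks the most-clicked picture from a click log. The log format (one picture URL per click) and the name `most_clicked_picture` are assumptions made for illustration and are not specified in the patent. The empty-log case shows the cold-start weakness listed above.

```python
from collections import Counter
from typing import Iterable, Optional


def most_clicked_picture(clicked_urls: Iterable[str]) -> Optional[str]:
    """Return the picture URL with the most clicks, or None if nothing was clicked."""
    counts = Counter(clicked_urls)
    if not counts:
        return None  # new page without user behaviour data: no master map can be chosen
    url, _ = counts.most_common(1)[0]
    return url


# Hypothetical click log for one webpage.
clicks = ["a.jpg", "b.jpg", "a.jpg", "c.jpg", "a.jpg"]
print(most_clicked_picture(clicks))  # a.jpg
print(most_clicked_picture([]))      # None
```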
Mode two: a method based on machine learning classification. The specific technical solution is as follows: Step 1, extract the features of the pictures in a webpage, for example picture size, position in the HTML, and the description information of the picture; Step 2, prepare a labeled set: choose a certain number of webpages and label each picture in them as a master map or not; Step 3, train a classification model (for example logistic regression, SVM, decision forest or GBDT) to obtain a model; Step 4, use the trained model to predict whether each picture in a webpage is a master map. This technical solution has the following problems:
1. Labeling requires a great deal of manpower, because many different types of webpages must be covered and each webpage contains many pictures.
2. A large number of features must be selected, and bad cases cannot be fixed immediately.
3. All pictures must be computed, so the amount of calculation is large.
To solve the above problems in the prior art, the present invention provides a webpage master map extracting method and device that support extracting the master map in both online and offline modes. In online mode only the webpage URL needs to be passed in: the HTML text is fetched and laid out and displayed by a browser display engine, the HTML text is parsed and organized into the data structures and organizational form required by subsequent processing, and finally the visual information and the screening rules are analyzed to obtain the webpage master map. The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only illustrate the present invention and do not limit it.
Method embodiment
According to an embodiment of the present invention, a webpage master map extracting method is provided. Fig. 2 is a flow chart of the webpage master map extracting method of the embodiment of the present invention. As shown in Fig. 2, the method includes the following processing:
S210: obtain the HTML text of the webpage, perform a simulated typesetting display of the HTML text, and obtain the visual information of each HTML element in the webpage. In the embodiment of the present invention, the visual information includes the position information and size information of each HTML element of the webpage in the simulated typesetting display.
The embodiment of the present invention supports extracting the master map in both online and offline modes: in offline mode the HTML text of the webpage must already be available, whereas in online mode the HTML text can be fetched online according to the URL of the webpage.
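The two acquisition modes can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `get_html_text` and its parameters are assumptions, and the sketch returns the raw HTML only, without the simulated typesetting display that S210 also requires.

```python
from pathlib import Path
from urllib.request import urlopen


def get_html_text(source: str, online: bool, timeout: float = 10.0) -> str:
    """Return HTML text from a URL (online mode) or from a local file (offline mode)."""
    if online:
        # Online mode: `source` is the webpage URL and the page is fetched over HTTP.
        with urlopen(source, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    # Offline mode: `source` is the path of an HTML file obtained beforehand.
    return Path(source).read_text(encoding="utf-8")
```

In offline mode a call might look like `get_html_text("page.html", online=False)`, where `page.html` is a hypothetical file saved earlier.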
S220: cut the HTML text into units of block messages. Note that a block message here refers to an HTML fragment formed by tags such as <DIV> and <TABLE>.
S230: obtain the text information in each block message, and obtain pictorial information from the block message according to the visual information. The text information may include: non-hyperlink text length, hyperlink text length, number of hyperlinks, a hyperlink array and a picture array. The pictorial information includes: the URL of the picture link, the explanatory text of the picture, the length of the picture, the width of the picture, and the ordinate and abscissa of the picture in the simulated typesetting display.
That is, in S230 the pictorial information obtained from the block message according to the visual information can be regarded as a more detailed, processed form of visual information.
In S230, obtaining the pictorial information specifically includes:
Step 1: extract the URL of the picture link and the explanatory text of the picture from the block message.
Step 2: calculate the length and width of the picture according to a preset algorithm priority. Specifically, this includes at least one of the following: the highest-priority algorithm obtains the length and width of the picture from the HTML markup in the image tag; the second-priority algorithm fetches the picture and obtains its length and width with image-processing software; the third-priority algorithm obtains the length and width of the picture from the document object model (DOM) information in a browser display engine.
Step 3: obtain, according to the visual information, the ordinate and abscissa of the picture in the simulated typesetting display.
S240: obtain, according to the pictorial information, the pictures that meet the predetermined vision requirement (for example, pictures whose length is within 60~760, whose width is within 60~760, and whose aspect ratio is within 0.5~2.5), and, according to the text information and the pictorial information, further select from those pictures the picture that satisfies a screening rule and use that picture as the master map of the webpage.
In S240, the predetermined vision requirement includes: the position of the picture lies within a predetermined region, and the length, width and aspect ratio of the picture meet predetermined requirements.
The screening rule specifically includes at least one of the following: taking a picture located between the webpage navigation bar or menu and the long text as the master map; in a group of pictures of identical size, selecting the first picture as the master map; for a webpage of the search-results-page type, choosing the first picture as the master map; taking the largest picture in the visible area as the master map; calculating the correlation between the explanatory text of each picture and the webpage topic, and taking the picture with the highest correlation as the master map; and, when the webpage is a website homepage or a special-topic page, choosing the website logo as the master map.
The above technical solution of the embodiment of the present invention is described in further detail below with reference to examples and the drawings.
Fig. 3 is a processing schematic diagram of the webpage master map extracting method of the embodiment of the present invention. As shown in Fig. 3, in online mode only the webpage URL needs to be passed in. The webpage capture module fetches the page and lays it out with a browser display engine; the HTML parsing module then parses the result and organizes it into the data structures and organizational form required by the downstream modules; finally the visual information and rule base analysis module analyzes it to obtain the webpage master map. Each processing stage of the webpage master map extracting method is described in detail below:
Webpage capture module: unlike traditional capture modules based on CURL, WGET or the HTTP protocol, this module does not simply obtain the HTML text. It needs to obtain two kinds of information: first, the HTML text; second, the result of laying out and displaying the HTML text, simulating the behavior of a browser with JavaScript support, so as to obtain the display position and size of each HTML element in the browser (that is, the visual information).
In the embodiment of the present invention, the typesetting display of the webpage capture module can be implemented with PhantomJS. PhantomJS is a browser display engine based on the WebKit kernel; it provides full JavaScript parsing and page rendering and can be used to simulate the various events a modern browser performs when loading a webpage.
In addition, in the embodiment of the present invention, the visual information obtained by the webpage capture module can be obtained by accessing the DOM structure of the HTML with JavaScript:
// Sum the offsets along the offsetParent chain to get the picture's position on the page.
var actualLeft = images[i].offsetLeft;
var actualTop = images[i].offsetTop;
var current = images[i].offsetParent;
while (current !== null) {
    actualLeft += current.offsetLeft;
    actualTop += current.offsetTop;
    current = current.offsetParent;
}
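The accumulation performed by the loop above can be modelled outside a browser. The sketch below is only an illustration: the `Element` class is a hypothetical stand-in for a DOM element, with field names chosen to mirror `offsetLeft`, `offsetTop` and `offsetParent`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element with offset properties."""
    offset_left: int
    offset_top: int
    offset_parent: Optional["Element"] = None


def absolute_position(element: Element) -> Tuple[int, int]:
    """Sum the offsets along the offset-parent chain to get page coordinates."""
    left, top = element.offset_left, element.offset_top
    current = element.offset_parent
    while current is not None:
        left += current.offset_left
        top += current.offset_top
        current = current.offset_parent
    return left, top


body = Element(0, 0)
container = Element(100, 50, body)
image = Element(20, 30, container)
print(absolute_position(image))  # (120, 80)
```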
HTML parsing module: the HTML is parsed with a finite state machine, and the HTML text is cut into block messages. The main purpose of doing so is to organize the webpage in a structured way, which is the cornerstone of subsequent processing.
For example, the following HTML fragment
becomes the following data structure after parsing:
In the above example, a block message consists mainly of the text and hyperlinks in the block.
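To make the idea of block cutting concrete, here is a minimal sketch that cuts HTML at `<div>` and `<table>` tags and collects per-block text and hyperlink statistics. It is not the patent's finite state machine: it uses the event-driven `html.parser` module from the Python standard library instead, and the class name, dictionary keys and sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # tags assumed to open a block message


class BlockSplitter(HTMLParser):
    """Cut HTML into blocks and collect text and hyperlink statistics per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # all blocks, in document order
        self._open = []        # stack of blocks that are currently open
        self._in_link = False  # True while inside <a href=...>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in BLOCK_TAGS:
            block = {"tag": tag, "text_len": 0, "link_text_len": 0,
                     "links": [], "images": []}
            self.blocks.append(block)
            self._open.append(block)
        elif self._open and tag == "a" and attrs.get("href"):
            self._in_link = True
            self._open[-1]["links"].append(attrs["href"])
        elif self._open and tag == "img" and attrs.get("src"):
            self._open[-1]["images"].append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag in BLOCK_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            # Text is credited to the innermost open block.
            key = "link_text_len" if self._in_link else "text_len"
            self._open[-1][key] += len(text)


html = (
    '<div><a href="/home">Home</a><a href="/news">News</a></div>'
    '<div><img src="cat.jpg" alt="A cat">A long paragraph of text.</div>'
)
splitter = BlockSplitter()
splitter.feed(html)
splitter.close()
for block in splitter.blocks:
    print(block)
```

In this sample the first block contains only hyperlink text, which is typical of a navigation bar, while the second contains a picture followed by non-hyperlink text.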
Visual information and rule base analysis module: the block messages passed in by the HTML parsing module are first processed to obtain the following two data structures (that is, the text information and pictorial information obtained from the block messages, as described above):
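class TextBlock:
    def __init__(self):
        pass  # field assignments are not reproduced in the source text

class ImageBlock:
    pass  # class body is not reproduced in the source text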
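The bodies of the two classes are not reproduced in the source text. Based on the fields that the description lists for the text information and the pictorial information, they might look like the sketch below. All attribute names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageBlock:
    url: str = ""       # URL of the picture link
    alt_text: str = ""  # explanatory text of the picture
    length: int = 0     # length of the picture
    width: int = 0      # width of the picture
    x: int = 0          # abscissa in the simulated typesetting display
    y: int = 0          # ordinate in the simulated typesetting display


@dataclass
class TextBlock:
    non_link_text_length: int = 0  # non-hyperlink text length
    link_text_length: int = 0      # hyperlink text length
    link_count: int = 0            # number of hyperlinks
    links: List[str] = field(default_factory=list)          # hyperlink array
    images: List[ImageBlock] = field(default_factory=list)  # picture array


block = TextBlock(non_link_text_length=120)
block.images.append(ImageBlock(url="a.jpg", length=300, width=200, x=40, y=150))
print(block)
```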
In the embodiment of the present invention, the length and width of a picture are calculated by three methods in order of priority (that is, if one method fails to produce a result, the next one is tried):
1. Obtain the picture length and width from the HTML markup in the image tag;
2. Fetch the picture and obtain its length and width with ImageMagick;
3. Obtain the picture length and width from the DOM information in PhantomJS.
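The priority order above amounts to a fallback chain, which can be sketched as follows. The sketch makes several assumptions: "length" is read from the `height` attribute and "width" from the `width` attribute of the image tag, and the second and third methods are passed in as callables (`size_from_file`, `size_from_dom`) because the real calls to ImageMagick and PhantomJS are outside the scope of this illustration.

```python
from typing import Callable, Dict, Optional, Tuple

Size = Optional[Tuple[int, int]]  # (length, width), or None when the method fails


def size_from_html(attrs: Dict[str, str]) -> Size:
    """Method 1: read the size from the image tag (assumed: height/width attributes)."""
    try:
        return int(attrs["height"]), int(attrs["width"])
    except (KeyError, ValueError):
        return None  # attribute missing, or not a plain integer such as "100%"


def picture_size(
    attrs: Dict[str, str],
    size_from_file: Callable[[str], Size],
    size_from_dom: Callable[[str], Size],
) -> Size:
    """Try the three methods in priority order and return the first result."""
    url = attrs.get("src", "")
    methods = (
        lambda: size_from_html(attrs),
        lambda: size_from_file(url),  # e.g. fetch the picture and inspect the file
        lambda: size_from_dom(url),   # e.g. ask the rendering engine
    )
    for method in methods:
        size = method()
        if size is not None:
            return size
    return None


def unavailable(url: str) -> Size:
    """Stub for a method that cannot determine the size."""
    return None


print(picture_size({"src": "a.jpg", "height": "300", "width": "200"},
                   unavailable, unavailable))           # (300, 200)
print(picture_size({"src": "b.jpg", "width": "100%"},
                   unavailable, lambda url: (120, 90)))  # (120, 90)
```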
The visual information and rule base analysis module also needs to select pictures based on vision. Specifically, as shown in Fig. 4, common sense suggests that a webpage author usually places the master map in the visual foreground of the webpage, that is, where people see it most easily. The bottom of the webpage, positions that can only be seen by scrolling with the mouse, and the corner areas of the webpage are seldom used for placing a master map.
Statistics on the visual regions of the master maps of a large number of webpages show that most master maps lie in the rectangular region whose abscissa is 0 to 700 and whose ordinate is 20 to 1000 (unit: pixel). Using this visual information, all pictures in that region that meet certain size conditions can be put into the candidate set; for example, pictures whose length is within 60~760, whose width is within 60~760, and whose aspect ratio is within 0.5~2.5. Pictures that do not meet these conditions mostly have shapes and sizes unsuitable for a master map; they are usually icons, ad banners and the like.
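A candidate filter using the example thresholds above could look like the following sketch. Two points are assumptions, since the text does not define them: the picture's position is taken to be the coordinates of its top-left corner, and the aspect ratio is computed as length divided by width. The `Picture` class and the sample values are likewise illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Picture:
    url: str
    x: int       # abscissa in the simulated typesetting display
    y: int       # ordinate in the simulated typesetting display
    length: int
    width: int


def meets_vision_requirement(p: Picture) -> bool:
    """Check region, size and aspect ratio against the example thresholds."""
    in_region = 0 <= p.x <= 700 and 20 <= p.y <= 1000
    size_ok = 60 <= p.length <= 760 and 60 <= p.width <= 760
    if not (in_region and size_ok):
        return False
    ratio = p.length / p.width  # assumed definition of the aspect ratio
    return 0.5 <= ratio <= 2.5


def candidate_set(pictures: List[Picture]) -> List[Picture]:
    """Keep only the pictures that meet the predetermined vision requirement."""
    return [p for p in pictures if meets_vision_requirement(p)]


pictures = [
    Picture("hero.jpg", x=100, y=200, length=300, width=400),
    Picture("icon.png", x=10, y=30, length=16, width=16),      # too small
    Picture("banner.gif", x=0, y=40, length=90, width=728),    # ratio too low
    Picture("footer.jpg", x=100, y=3000, length=300, width=300),  # outside region
]
print([p.url for p in candidate_set(pictures)])  # ['hero.jpg']
```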
Finally, the visual information and rule base analysis module further screens the pictures based on rules:
Rule one: as shown in Fig. 5, a picture located between the webpage navigation bar (or menu) and the long text is generally the master map;
Rule two: as shown in Fig. 6, in a group of pictures of identical size, the first one is chosen as the master map;
Rule three: as shown in Fig. 7, for a webpage of the search-results-page type, the first picture is chosen as the master map;
Rule four: as shown in Fig. 8, the largest picture in the visible area is chosen as the master map;
Rule five: calculate the correlation between the picture description information and the webpage TITLE, and choose the picture with the higher correlation as the master map. The master map is generally highly related to the content of the webpage, and the TITLE of the webpage expresses that content in concentrated form, so if the description information of a picture is highly correlated with the webpage TITLE, the picture can be considered the master map;
Rule six: as shown in Fig. 9, if no qualifying master map has been found by the above rules and the webpage is a website homepage or a special-topic page, the website LOGO is chosen as the master map.
It should be noted that the rules are optional: one or more of them may be applied, and in any order. Preferably, in a specific embodiment, all six rules are applied, one after another in the order given above.
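Applying rules one after another until one of them yields a picture can be sketched as a simple cascade. The sketch implements only rule two and rule four, under assumptions of its own: "largest" is measured by area, candidates are given in document order, and the names `Picture`, `rule_same_size_group`, `rule_largest` and `select_master_map` are illustrative rather than taken from the patent.

```python
from collections import defaultdict
from typing import Callable, List, NamedTuple, Optional


class Picture(NamedTuple):
    url: str
    length: int
    width: int


Rule = Callable[[List[Picture]], Optional[Picture]]


def rule_same_size_group(candidates: List[Picture]) -> Optional[Picture]:
    """Rule two: in a group of pictures of identical size, choose the first one."""
    groups = defaultdict(list)
    for p in candidates:
        groups[(p.length, p.width)].append(p)
    for p in candidates:  # candidates are assumed to be in document order
        if len(groups[(p.length, p.width)]) > 1:
            return p
    return None


def rule_largest(candidates: List[Picture]) -> Optional[Picture]:
    """Rule four: choose the largest picture (assumed: largest area)."""
    return max(candidates, key=lambda p: p.length * p.width, default=None)


def select_master_map(candidates: List[Picture], rules: List[Rule]) -> Optional[Picture]:
    """Apply the rules in order and return the first picture any rule selects."""
    for rule in rules:
        chosen = rule(candidates)
        if chosen is not None:
            return chosen
    return None


rules = [rule_same_size_group, rule_largest]
gallery = [Picture("logo.png", 80, 200), Picture("p1.jpg", 300, 300),
           Picture("p2.jpg", 300, 300)]
mixed = [Picture("small.jpg", 100, 100), Picture("big.jpg", 400, 600)]
print(select_master_map(gallery, rules).url)  # p1.jpg
print(select_master_map(mixed, rules).url)    # big.jpg
print(select_master_map([], rules))           # None
```

Because each rule is a plain function, the order of the list passed to `select_master_map` decides which rule wins, which matches the statement that the rules may be applied in any order.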
In conclusion by means of the technical solution of the embodiment of the present invention, user behavior is needed not rely on, for the single page
Master map extraction is carried out, cold start-up problem is not present, there is stronger adaptability;In addition, using multiple rules such as visual informations, mould
Personification has higher accuracy rate to the cognitive behavior of master map;Also, due to being positioned using visual zone so that candidate
Calculative picture greatly reduces, and greatly improves the extraction speed of master map.The technical solution of the embodiment of the present invention solves
Webpage master map extracts problem in the prior art, is allowed to be applied to search display result page, be formed together with title, the abstract of webpage
The more rich form of expression.Also, abundant advertising creative shows form, changes single word chain displaying, additionally it is possible to improve
The clicking rate of advertisement.
Device embodiment
According to an embodiment of the invention, a kind of webpage master map extraction element is provided, Figure 10 is the net of the embodiment of the present invention
The structural schematic diagram of page master map extraction element, as shown in Figure 10, webpage master map extraction element according to the ... of the embodiment of the present invention includes:
Webpage capture module 100, HTML parsing modules 102, data obtaining module 104 and screening module 106, below to the present invention
The modules of embodiment are described in detail.
Webpage capture module 100, the html text for obtaining webpage carry out simulation typesetting displaying to html text, and
Obtain the visual information of each HTML element in webpage;Wherein, in embodiments of the present invention, visual information includes:It is every in webpage
Location information and size information of a HTML element in simulation typesetting displaying.
The embodiment of the present invention supports online and offline two ways to extract master map;Webpage capture module 100 can be with when offline
Be directly obtained the html text of webpage, and it is online when webpage capture module 100 can be captured according to the URL of webpage,
Line obtains the html text of webpage.
HTML parsing modules 102, for cutting html text as unit of block message;On it should be noted that
State block message refer to<DIV>,<TABLE>The HTML fragment of this kind of label composition.
Data obtaining module 104 is obtained for obtaining the text message in block message, and according to visual information from block message
Take pictorial information;Wherein, above-mentioned text message may include:Non- hyperlink text length, hyperlink text length, hyperlink
Number, hyperlink array and picture array.Pictorial information includes:The URL of image link, the length for illustrating text, picture of picture
The abscissa of ordinate and picture in simulation typesetting displaying of degree, the width of picture, picture in simulation typesetting displaying.
The data obtaining module 104 is specifically configured to: extract the URL of the image link and the explanatory text of the picture from the block message; calculate the length and width of the picture according to a preset algorithm priority; and obtain, according to the visual information, the ordinate and abscissa of the picture in the simulated typesetting display. The algorithm with the highest priority obtains the length and width of the picture from the HTML markup in the image tag. The algorithm with the second priority fetches the picture and obtains its length and width with drawing software. The algorithm with the third priority obtains the length and width of the picture from the Document Object Model (DOM) information in the browser display engine.
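The priority order described above amounts to a fallback chain. The sketch below shows that control flow only. It assumes that "length" corresponds to the `height` attribute of the image tag, and the two lower-priority algorithms are passed in as hypothetical callables (`fetch_and_measure`, `size_from_dom`) because the text does not name the drawing software or the browser engine.

```python
from typing import Callable, Optional, Tuple

Size = Tuple[int, int]  # (length, width) in pixels; length is assumed to mean height


def size_from_html_attributes(img_attrs: dict) -> Optional[Size]:
    """Priority 1: read the size from the markup of the image tag."""
    try:
        length = int(img_attrs["height"])
        width = int(img_attrs["width"])
    except (KeyError, TypeError, ValueError):
        return None  # attribute missing, empty, or not a plain integer
    return (length, width) if length > 0 and width > 0 else None


def resolve_picture_size(
    img_attrs: dict,
    fetch_and_measure: Callable[[str], Optional[Size]],
    size_from_dom: Callable[[str], Optional[Size]],
) -> Optional[Size]:
    """Try the three algorithms in priority order; return the first result."""
    size = size_from_html_attributes(img_attrs)
    if size is not None:
        return size
    url = img_attrs.get("src")
    if not url:
        return None
    # Priority 2 (fetch the picture and measure it), then priority 3 (DOM info).
    for algorithm in (fetch_and_measure, size_from_dom):
        size = algorithm(url)
        if size is not None:
            return size
    return None
```

Ordering the algorithms this way keeps the cheapest source first: reading markup needs no network access, while fetching the picture or querying a rendering engine is only attempted when the markup gives no usable size.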
The screening module 106 is configured to obtain, according to the pictorial information, the pictures that meet a predetermined visual requirement (for example, pictures whose length is between 60 and 760, whose width is between 60 and 760, and whose aspect ratio is between 0.5 and 2.5), to further select, according to the text message and the pictorial information, a picture that meets the screening rule from the pictures meeting the predetermined visual requirement, and to use that picture as the master map of the webpage.
The predetermined visual requirement includes: the position of the picture lies within a predetermined region, and the length, width and aspect ratio of the picture meet predetermined requirements.
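The predetermined visual requirement can be expressed as a simple predicate. The sketch below uses the example thresholds given for the screening module (sides between 60 and 760, aspect ratio between 0.5 and 2.5). Several points are assumptions: the aspect ratio is taken as length divided by width, the predetermined region is modeled as a rectangle `(left, top, right, bottom)`, and the `PictureInfo` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PictureInfo:
    url: str      # URL of the image link
    caption: str  # explanatory text of the picture
    length: int   # picture length in pixels
    width: int    # picture width in pixels
    x: int        # abscissa in the simulated typesetting display
    y: int        # ordinate in the simulated typesetting display


def meets_visual_requirement(pic, region, min_side=60, max_side=760,
                             min_ratio=0.5, max_ratio=2.5):
    """Return True if the picture lies in the region and has an acceptable size."""
    left, top, right, bottom = region
    in_region = left <= pic.x <= right and top <= pic.y <= bottom
    size_ok = (min_side <= pic.length <= max_side
               and min_side <= pic.width <= max_side)
    if not (in_region and size_ok):
        return False
    # width is at least min_side here, so the division is safe.
    ratio = pic.length / pic.width
    return min_ratio <= ratio <= max_ratio
```

Filtering on position and size first is what reduces the number of candidate pictures before the more specific screening rules are applied.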
The screening rule specifically includes at least one of the following:
using the picture located between the webpage navigation bar or menu and the long text as the master map;
in a group of pictures of the same size, selecting the first picture as the master map;
for a webpage of the search results page type, choosing the first picture as the master map;
using the largest picture in the visible area as the master map;
calculating the correlation between the explanatory text of each picture and the webpage topic, and using the picture with the highest correlation as the master map;
when the webpage is a website home page or a special topic page, choosing the website logo as the master map.
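Two of the screening rules above lend themselves to a short sketch: choosing the picture whose explanatory text is most related to the page topic, and choosing the largest picture. The text does not say how the correlation is calculated, so the word-overlap (Jaccard) score used here is purely an assumption, as are the function names and the tuple layouts of the candidates.

```python
import re


def tokenize(text):
    """Lower-case word tokens; real systems would use proper word segmentation."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(caption, topic):
    """Jaccard overlap between caption and topic tokens (assumed measure)."""
    a, b = tokenize(caption), tokenize(topic)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_most_relevant(candidates, topic):
    """candidates: list of (url, caption). Ties go to the earliest picture."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: relevance(c[1], topic))[0]


def pick_largest(candidates):
    """candidates: list of (url, length, width). Ties go to the earliest picture."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1] * c[2])[0]
```

Because `max` returns the first maximal element, both helpers fall back to document order on ties, which is in line with the rules that prefer the first picture in a group.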
For the specific processing of each module in the webpage master map extraction device of the embodiment of the present invention, refer to the description in the method embodiment above; details are not repeated here. The data obtaining module 104 and the screening module 106 in the embodiment of the present invention correspond to the visual information and rule base analysis module in the method embodiment.
In conclusion by means of the technical solution of the embodiment of the present invention, user behavior is needed not rely on, for the single page
Master map extraction is carried out, cold start-up problem is not present, there is stronger adaptability;In addition, using multiple rules such as visual informations, mould
Personification has higher accuracy rate to the cognitive behavior of master map;Also, due to being positioned using visual zone so that candidate
Calculative picture greatly reduces, and greatly improves the extraction speed of master map.The technical solution of the embodiment of the present invention solves
Webpage master map extracts problem in the prior art, is allowed to be applied to search display result page, be formed together with title, the abstract of webpage
The more rich form of expression.Also, abundant advertising creative shows form, changes single word chain displaying, additionally it is possible to improve
The clicking rate of advertisement.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the description given above for a specific language is intended to disclose the best mode of the present invention.
In the specification provided here, numerous specific details are set forth. It is to be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the present invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will appreciate that the modules in the client of an embodiment can be adaptively changed and arranged in one or more clients different from that embodiment. The modules in an embodiment may be combined into one module, and they may furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or client so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the client according to the embodiment of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order. These words may be interpreted as names.
Claims (16)
1. A webpage master map extraction method, characterized by comprising:
obtaining the HTML text of a webpage, performing a simulated typesetting display of the HTML text, and obtaining the visual information of each HTML element in the webpage;
cutting the HTML text in units of block messages;
obtaining the text message in the block message, and obtaining pictorial information from the block message according to the visual information; the pictorial information includes: the URL of the image link, the explanatory text of the picture, the length of the picture, the width of the picture, the ordinate of the picture in the simulated typesetting display, and the abscissa of the picture in the simulated typesetting display;
obtaining, according to the pictorial information, the pictures that meet a predetermined visual requirement, further selecting, according to the text message and the pictorial information, a picture that meets a screening rule from the pictures meeting the predetermined visual requirement, and using that picture as the master map of the webpage; the screening rule includes: using the picture located between the webpage navigation bar or menu and the long text as the master map.
2. The method according to claim 1, characterized in that obtaining the HTML text of the webpage specifically comprises: obtaining the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
3. The method according to claim 1, characterized in that the visual information includes: the position information and size information of each HTML element of the webpage in the simulated typesetting display.
4. The method according to claim 1, characterized in that the text message includes: the non-hyperlink text length, the hyperlink text length, the number of hyperlinks, a hyperlink array and a picture array.
5. The method according to claim 1, characterized in that obtaining the pictorial information specifically comprises:
extracting the URL of the image link and the explanatory text of the picture from the block message;
calculating the length and width of the picture according to a preset algorithm priority;
obtaining, according to the visual information, the ordinate of the picture in the simulated typesetting display and the abscissa of the picture in the simulated typesetting display.
6. The method according to claim 5, characterized in that calculating the length and width of the picture according to the preset algorithm priority specifically comprises at least one of the following:
the algorithm with the highest priority: obtaining the length and width of the picture from the HTML markup in the image tag;
the algorithm with the second priority: fetching the picture and obtaining its length and width with drawing software;
the algorithm with the third priority: obtaining the length and width of the picture from the Document Object Model (DOM) information in the browser display engine.
7. The method according to claim 1, characterized in that the predetermined visual requirement includes: the position of the picture lies within a predetermined region, and the length, width and aspect ratio of the picture meet predetermined requirements.
8. The method according to claim 1, characterized in that the screening rule further includes at least one of the following:
in a group of pictures of the same size, selecting the first picture as the master map;
for a webpage of the search results page type, choosing the first picture as the master map;
using the largest picture in the visible area as the master map;
calculating the correlation between the explanatory text of each picture and the webpage topic, and using the picture with the highest correlation as the master map;
when the webpage is a website home page or a special topic page, choosing the website logo as the master map.
9. A webpage master map extraction device, characterized by comprising:
a webpage capture module, configured to obtain the HTML text of a webpage, perform a simulated typesetting display of the HTML text, and obtain the visual information of each HTML element in the webpage;
an HTML parsing module, configured to cut the HTML text in units of block messages;
a data obtaining module, configured to obtain the text message in the block message, and to obtain pictorial information from the block message according to the visual information; the pictorial information includes: the URL of the image link, the explanatory text of the picture, the length of the picture, the width of the picture, the ordinate of the picture in the simulated typesetting display, and the abscissa of the picture in the simulated typesetting display;
a screening module, configured to obtain, according to the pictorial information, the pictures that meet a predetermined visual requirement, to further select, according to the text message and the pictorial information, a picture that meets a screening rule from the pictures meeting the predetermined visual requirement, and to use that picture as the master map of the webpage; the screening rule includes: using the picture located between the webpage navigation bar or menu and the long text as the master map.
10. The device according to claim 9, characterized in that the webpage capture module is specifically configured to: obtain the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
11. The device according to claim 9, characterized in that the visual information includes: the position information and size information of each HTML element of the webpage in the simulated typesetting display.
12. The device according to claim 9, characterized in that the text message includes: the non-hyperlink text length, the hyperlink text length, the number of hyperlinks, a hyperlink array and a picture array.
13. The device according to claim 9, characterized in that the data obtaining module is specifically configured to:
extract the URL of the image link and the explanatory text of the picture from the block message;
calculate the length and width of the picture according to a preset algorithm priority;
obtain, according to the visual information, the ordinate of the picture in the simulated typesetting display and the abscissa of the picture in the simulated typesetting display.
14. The device according to claim 13, characterized in that the algorithm with the highest priority is: obtaining the length and width of the picture from the HTML markup in the image tag; the algorithm with the second priority is: fetching the picture and obtaining its length and width with drawing software; the algorithm with the third priority is: obtaining the length and width of the picture from the Document Object Model (DOM) information in the browser display engine.
15. The device according to claim 9, characterized in that the predetermined visual requirement includes: the position of the picture lies within a predetermined region, and the length, width and aspect ratio of the picture meet predetermined requirements.
16. The device according to claim 9, characterized in that the screening rule further includes at least one of the following:
in a group of pictures of the same size, selecting the first picture as the master map;
for a webpage of the search results page type, choosing the first picture as the master map;
using the largest picture in the visible area as the master map;
calculating the correlation between the explanatory text of each picture and the webpage topic, and using the picture with the highest correlation as the master map;
when the webpage is a website home page or a special topic page, choosing the website logo as the master map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410346226.7A CN104123363B (en) | 2014-07-21 | 2014-07-21 | Webpage master map extracting method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410346226.7A CN104123363B (en) | 2014-07-21 | 2014-07-21 | Webpage master map extracting method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104123363A CN104123363A (en) | 2014-10-29 |
CN104123363B true CN104123363B (en) | 2018-07-13 |
Family
ID=51768774
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410346226.7A Active CN104123363B (en) | 2014-07-21 | 2014-07-21 | Webpage master map extracting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104123363B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376114B (en) * | 2014-12-01 | 2018-01-30 | 百度在线网络技术(北京)有限公司 | A kind of search result methods of exhibiting and device |
CN104699837B (en) * | 2015-03-31 | 2017-04-12 | 北京奇虎科技有限公司 | Method, device and server for selecting illustrated pictures of web pages |
CN104881428B (en) * | 2015-04-02 | 2019-03-29 | 广州神马移动信息科技有限公司 | A kind of hum pattern extraction, search method and the device of hum pattern webpage |
CN106445997B (en) * | 2016-07-20 | 2021-02-05 | 腾讯科技(北京)有限公司 | Information processing method and server |
CN106503059B (en) * | 2016-09-27 | 2019-07-23 | 北京小米移动软件有限公司 | Displayed page method for pushing and device |
CN106547540A (en) * | 2016-10-12 | 2017-03-29 | 惠州市德赛西威汽车电子股份有限公司 | A kind of method for drafting of text button |
CN106484913B (en) * | 2016-10-26 | 2021-09-07 | 腾讯科技(深圳)有限公司 | Target picture determining method and server |
CN108268488B (en) * | 2016-12-30 | 2022-04-19 | 百度在线网络技术(北京)有限公司 | Webpage main graph identification method and device |
CN108399167B (en) * | 2017-02-04 | 2022-04-29 | 百度在线网络技术(北京)有限公司 | Webpage information extraction method and device |
CN107066596A (en) * | 2017-04-19 | 2017-08-18 | 北京小米移动软件有限公司 | The method and apparatus for generating link information |
CN107766475A (en) * | 2017-10-09 | 2018-03-06 | 李亚强 | A kind of system of selection of info web master map and device |
CN109685085B (en) * | 2017-10-18 | 2023-09-26 | 阿里巴巴集团控股有限公司 | Main graph extraction method and device |
CN112084451B (en) * | 2020-09-16 | 2022-09-30 | 哈尔滨工业大学 | Webpage LOGO extraction system and method based on visual blocking |
CN112597765A (en) * | 2020-12-25 | 2021-04-02 | 四川长虹电器股份有限公司 | Automatic movie and television topic generation method based on multi-mode features |
CN116578763B (en) * | 2023-07-11 | 2023-09-15 | 卓谨信息科技(常州)有限公司 | Multisource information exhibition system based on generated AI cognitive model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101944109A (en) * | 2010-09-06 | 2011-01-12 | 华南理工大学 | System and method for extracting picture abstract based on page partitioning |
CN103425644A (en) * | 2012-05-14 | 2013-12-04 | 腾讯科技(深圳)有限公司 | Method and device for extracting pictures in webpage content |
CN103885959A (en) * | 2012-12-20 | 2014-06-25 | 腾讯科技(深圳)有限公司 | Webpage bookmark generating method and webpage bookmark generating device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8875007B2 (en) * | 2010-11-08 | 2014-10-28 | Microsoft Corporation | Creating and modifying an image wiki page |
Also Published As
Publication number | Publication date |
---|---|
CN104123363A (en) | 2014-10-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20220726 Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015 Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park) Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Patentee before: Qizhi software (Beijing) Co.,Ltd. |