CN104123363B - Webpage master map extracting method and device - Google Patents

Info

Publication number: CN104123363B
Authority: CN (China)
Prior art keywords: picture, webpage, text, master map, html
Legal status: Active (as listed by Google Patents; an assumption, not a legal conclusion)
Application number: CN201410346226.7A
Other languages: Chinese (zh)
Other versions: CN104123363A (en)
Inventors: 陈华清, 许晟
Current assignee: Beijing Qihoo Technology Co Ltd (the listed assignees may be inaccurate)
Original assignees: Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Application filed by: Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority: CN201410346226.7A
Earlier publication: CN104123363A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986: Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science
  • Databases & Information Systems
  • Theoretical Computer Science
  • Data Mining & Analysis
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Information Transfer Between Computers

Abstract

The invention discloses a webpage master map extracting method and device, where the master map is the main, representative picture of a webpage. The method includes: obtaining the HTML text of a webpage, performing a simulated typesetting display of the HTML text, and obtaining the visual information of each HTML element in the webpage; cutting the HTML text in units of block information; obtaining the text information in the block information, and obtaining picture information from the block information according to the visual information; obtaining, according to the picture information, the pictures that meet a predetermined visual requirement, then, according to the text information and the picture information, further selecting from those pictures a picture that satisfies the screening rules, and using that picture as the master map of the webpage. With this technical solution, master map selection can achieve high accuracy and efficiency.

Description

Webpage master map extracting method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a webpage master map extracting method and device.
Background art
With the development of Internet technology, the presentation of Hypertext Markup Language (HTML) webpages has become increasingly diverse, and one trend is the large number of pictures that now appear in webpages. Compared with traditional text, pictures have unique advantages in attracting attention and conveying meaning. Many search engines therefore now provide, in addition to the title and abstract, a master map extracted from the webpage in each search result.
As shown in Fig. 1, in the prior art the results of search engines contain more and more pictures, which helps users identify the information they are looking for and improves the click-through rate. In Internet advertising as well, compared with advertisements that deliver only a text link, display advertisements have a greater advantage because they let users see product information at a glance. Technology for extracting the master map from a webpage is therefore very important for improving the search experience and the click-through rate, and a webpage master map extracting method is urgently needed.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a webpage master map extracting method and device that overcome the above problems or at least partially solve them.
The present invention provides a webpage master map extracting method, including: obtaining the HTML text of a webpage, performing a simulated typesetting display of the HTML text, and obtaining the visual information of each HTML element in the webpage; cutting the HTML text in units of block information; obtaining the text information in the block information, and obtaining picture information from the block information according to the visual information; obtaining, according to the picture information, the pictures that meet a predetermined visual requirement, then, according to the text information and the picture information, further selecting from those pictures a picture that satisfies the screening rules, and using that picture as the master map of the webpage.
Preferably, obtaining the HTML text of the webpage specifically includes: obtaining the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
Preferably, the visual information includes: the position information and size information of each HTML element of the webpage in the simulated typesetting display.
Preferably, the text information includes: non-hyperlink text length, hyperlink text length, number of hyperlinks, hyperlink array, and picture array.
Preferably, the picture information includes: the URL of the picture link, the explanatory text of the picture, the length of the picture, the width of the picture, and the ordinate and abscissa of the picture in the simulated typesetting display.
Preferably, obtaining the picture information specifically includes: extracting the URL of the picture link and the explanatory text of the picture from the block information; calculating the length and width of the picture according to preset algorithm priorities; and obtaining, according to the visual information, the ordinate and abscissa of the picture in the simulated typesetting display.
Preferably, calculating the length and width of the picture according to the preset algorithm priorities specifically includes at least one of the following. The highest-priority algorithm obtains the length and width of the picture from the HTML attributes in the image tag. The second-priority algorithm fetches the picture and obtains its length and width with image-processing software. The third-priority algorithm obtains the length and width of the picture from the Document Object Model (DOM) information in the browser rendering engine.
Preferably, the predetermined visual requirement includes: the position of the picture lies within a predetermined region, and the length, width, and aspect ratio of the picture meet predetermined requirements.
Preferably, the screening rules specifically include at least one of the following: using the picture located between the webpage navigation bar or menu and the long text as the master map; selecting the first picture in a group of pictures of the same size as the master map; for a webpage of the search results page type, choosing the first picture as the master map; using the largest picture in the visible area as the master map; calculating the correlation between the explanatory text of each picture and the webpage topic, and using the picture with the highest correlation as the master map; and, when the webpage is a website home page or a special topic page, choosing the website logo as the master map.
The present invention also provides a webpage master map extraction device, including: a webpage capture module, for obtaining the HTML text of a webpage, performing a simulated typesetting display of the HTML text, and obtaining the visual information of each HTML element in the webpage; an HTML parsing module, for cutting the HTML text in units of block information; an information obtaining module, for obtaining the text information in the block information and obtaining picture information from the block information according to the visual information; and a screening module, for obtaining, according to the picture information, the pictures that meet a predetermined visual requirement, then, according to the text information and the picture information, further selecting from those pictures a picture that satisfies the screening rules, and using that picture as the master map of the webpage.
Preferably, the webpage capture module is specifically used for: obtaining the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
Preferably, the visual information includes: the position information and size information of each HTML element of the webpage in the simulated typesetting display.
Preferably, the text information includes: non-hyperlink text length, hyperlink text length, number of hyperlinks, hyperlink array, and picture array.
Preferably, the picture information includes: the URL of the picture link, the explanatory text of the picture, the length of the picture, the width of the picture, and the ordinate and abscissa of the picture in the simulated typesetting display.
Preferably, the information obtaining module is specifically used for: extracting the URL of the picture link and the explanatory text of the picture from the block information; calculating the length and width of the picture according to preset algorithm priorities; and obtaining, according to the visual information, the ordinate and abscissa of the picture in the simulated typesetting display.
Preferably, the highest-priority algorithm obtains the length and width of the picture from the HTML attributes in the image tag; the second-priority algorithm fetches the picture and obtains its length and width with image-processing software; the third-priority algorithm obtains the length and width of the picture from the Document Object Model (DOM) information in the browser rendering engine.
Preferably, the predetermined visual requirement includes: the position of the picture lies within a predetermined region, and the length, width, and aspect ratio of the picture meet predetermined requirements.
Preferably, the screening rules specifically include at least one of the following: using the picture located between the webpage navigation bar or menu and the long text as the master map; selecting the first picture in a group of pictures of the same size as the master map; for a webpage of the search results page type, choosing the first picture as the master map; using the largest picture in the visible area as the master map; calculating the correlation between the explanatory text of each picture and the webpage topic, and using the picture with the highest correlation as the master map; and, when the webpage is a website home page or a special topic page, choosing the website logo as the master map.
The beneficial effects of the present invention are as follows:
Candidate master maps of the webpage are first selected using the picture information, and the candidates are then refined according to the screening rules, so that master map selection achieves high accuracy. In addition, because the technical solution of the embodiments of the present invention locates pictures by visual region, the number of candidate pictures that must be evaluated is greatly reduced, which greatly increases the speed of master map extraction.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the present invention are more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numbers denote the same parts. In the drawings:
Fig. 1 is a schematic diagram of a prior-art search engine results page showing webpage master maps;
Fig. 2 is the flow chart of the webpage master map extracting method of the embodiment of the present invention;
Fig. 3 is the processing schematic diagram of the webpage master map extracting method of the embodiment of the present invention;
Fig. 4 is the schematic diagram of master map screening example 1 of the embodiment of the present invention;
Fig. 5 is the schematic diagram of master map screening example 2 of the embodiment of the present invention;
Fig. 6 is the schematic diagram of master map screening example 3 of the embodiment of the present invention;
Fig. 7 is the schematic diagram of master map screening example 4 of the embodiment of the present invention;
Fig. 8 is the schematic diagram of master map screening example 5 of the embodiment of the present invention;
Fig. 9 is the schematic diagram of master map screening example 6 of the embodiment of the present invention;
Fig. 10 is the structural schematic diagram of the webpage master map extraction device of the embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Existing methods for extracting a webpage master map include the following two approaches:
Approach one: statistics based on user behavior. This approach rests on the assumption that the more users click a picture in a webpage, the more important the picture is. The specific technical solution is as follows: first, count the user clicks on every picture in each webpage; then, select the picture with the most user clicks as the webpage master map. This solution has the following problems: 1. Low recall: not all pictures receive user clicks, and some pictures have no link at all. 2. Poor timeliness: for a newly published webpage there is no user behavior information, so no picture can be extracted. 3. Confidence: when pictures have few clicks, the result is prone to bias, and many small companies cannot obtain user behavior data as rich as that of large companies. 4. User behavior bias: for example, if a webpage contains pictures of sexy women, those pictures attract more attention and therefore receive more clicks.
Approach two: machine learning classification. The specific technical solution is as follows: Step 1, extract features of the pictures in the webpage, for example picture size, position in the HTML, and the description information of the picture; Step 2, prepare a labeled set by choosing a certain number of webpages and labeling each picture in them as master map or not; Step 3, train a classification model (for example logistic regression, SVM, decision forest, or GBDT) to obtain a model; Step 4, use the trained model to predict whether each picture in a webpage is the master map. This solution has the following problems: 1. Labeling requires a great deal of manpower, because different types of webpages must be covered and each webpage contains many pictures. 2. A large number of features must be selected, and bad cases cannot be fixed immediately. 3. Every picture must be evaluated, so the amount of computation is large.
To solve the above problems in the prior art, the present invention provides a webpage master map extracting method and device that support extracting the master map in both online and offline modes. In online mode only the webpage URL needs to be passed in: the HTML text is fetched and laid out by a browser rendering engine, the HTML text is parsed and organized into the data structures and organization required for subsequent processing, and finally the visual information and the screening rules are analyzed to obtain the webpage master map. The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
Method embodiment
According to an embodiment of the present invention, a webpage master map extracting method is provided. Fig. 2 is the flow chart of the webpage master map extracting method of the embodiment of the present invention. As shown in Fig. 2, the method includes the following processing:
S210: obtain the HTML text of the webpage, perform a simulated typesetting display of the HTML text, and obtain the visual information of each HTML element in the webpage. In the embodiment of the present invention, the visual information includes the position information and size information of each HTML element of the webpage in the simulated typesetting display.
The embodiment of the present invention supports extracting the master map in both online and offline modes. In offline mode the HTML text of the webpage must already have been obtained; in online mode the webpage can be fetched according to its URL so that the HTML text is obtained online.
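The two acquisition modes can be illustrated with a minimal Python sketch. It covers only how the HTML text is obtained; it does not perform the simulated typesetting display, which requires a rendering engine. The function name `load_html`, the timeout value, and the UTF-8 fallback are assumptions made for illustration and are not part of the patent.

```python
from pathlib import Path
from urllib.request import urlopen


def load_html(source: str, timeout: float = 10.0) -> str:
    """Return HTML text from a URL (online mode) or a local file (offline mode)."""
    if source.startswith(("http://", "https://")):
        # Online mode: fetch the page by its URL.
        with urlopen(source, timeout=timeout) as response:
            # Assumption: fall back to UTF-8 when the server declares no charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    # Offline mode: the HTML text was obtained beforehand and stored in a file.
    return Path(source).read_text(encoding="utf-8", errors="replace")
```

Treating any non-URL argument as a file path keeps the sketch short; a real system would probably make the mode explicit.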
S220: cut the HTML text in units of block information. It should be noted that block information refers to an HTML fragment formed by tags such as <DIV> and <TABLE>.
S230: obtain the text information in the block information, and obtain picture information from the block information according to the visual information. The text information may include: non-hyperlink text length, hyperlink text length, number of hyperlinks, hyperlink array, and picture array. The picture information includes: the URL of the picture link, the explanatory text of the picture, the length of the picture, the width of the picture, and the ordinate and abscissa of the picture in the simulated typesetting display.
That is, in S230 the picture information obtained from the block information according to the visual information can be regarded as a more detailed, processed form of visual information.
In S230, obtaining the picture information specifically includes:
Step 1: extract the URL of the picture link and the explanatory text of the picture from the block information;
Step 2: calculate the length and width of the picture according to preset algorithm priorities. Specifically, this includes at least one of the following. The highest-priority algorithm obtains the length and width of the picture from the HTML attributes in the image tag. The second-priority algorithm fetches the picture and obtains its length and width with image-processing software. The third-priority algorithm obtains the length and width of the picture from the Document Object Model (DOM) information in the browser rendering engine.
Step 3: obtain, according to the visual information, the ordinate and abscissa of the picture in the simulated typesetting display.
S240: obtain, according to the picture information, the pictures that meet the predetermined visual requirement (for example, pictures whose size satisfies: length 60 to 760, width 60 to 760, and aspect ratio between 0.5 and 2.5), then, according to the text information and the picture information, further select from those pictures a picture that satisfies the screening rules, and use that picture as the master map of the webpage.
In S240, the predetermined visual requirement includes: the position of the picture lies within a predetermined region, and the length, width, and aspect ratio of the picture meet predetermined requirements.
The screening rules specifically include at least one of the following: using the picture located between the webpage navigation bar or menu and the long text as the master map; selecting the first picture in a group of pictures of the same size as the master map; for a webpage of the search results page type, choosing the first picture as the master map; using the largest picture in the visible area as the master map; calculating the correlation between the explanatory text of each picture and the webpage topic, and using the picture with the highest correlation as the master map; and, when the webpage is a website home page or a special topic page, choosing the website logo as the master map.
The above technical solution of the embodiment of the present invention is described in further detail below with reference to examples and the drawings.
Fig. 3 is the processing schematic diagram of the webpage master map extracting method of the embodiment of the present invention. As shown in Fig. 3, in online mode only the webpage URL needs to be passed in. The webpage capture module fetches the page and lays it out with a browser rendering engine; the HTML parsing module then parses the page and organizes it into the data structures and organization required by downstream modules; finally, the visual information and rule base analysis module analyzes the result to obtain the webpage master map. Each process involved in the webpage master map extracting method is described in detail below:
Webpage capture module: unlike traditional crawling modules based on cURL, Wget, or the HTTP protocol, this module does not simply obtain the HTML text. It needs to obtain two kinds of information: first, the HTML text; second, the result of laying out the HTML text, simulating the behavior of a browser with JavaScript support, so as to obtain the display position and size of each HTML element in the browser (that is, the visual information).
In the embodiment of the present invention, the typesetting display of the webpage capture module can be implemented with PhantomJS. PhantomJS is a browser rendering engine based on the WebKit kernel with complete JavaScript parsing and page rendering functions, and it can be used to simulate the various events that a modern browser performs when loading a webpage.
In addition, in the embodiment of the present invention, the visual information obtained by the webpage capture module can be obtained by using JavaScript to access the DOM structure of the HTML:
    // Sum the offsets up the offsetParent chain to get the image's position on the page.
    var actualLeft = images[i].offsetLeft;
    var actualTop = images[i].offsetTop;
    var current = images[i].offsetParent;
    while (current !== null) {
        actualLeft += current.offsetLeft;
        actualTop += current.offsetTop;
        current = current.offsetParent;
    }
HTML parsing module: the HTML is parsed with a finite state machine, and the HTML text is cut according to block information. The main purpose is to organize the webpage in a structured way, which is the foundation of subsequent processing.
For example, the following HTML fragment
becomes the following data structure after parsing:
In the above example, the block information consists mainly of the text and hyperlinks in the block.
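The block-cutting step can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the standard-library `html.parser` with a stack of open blocks in place of the hand-written finite state machine the text mentions, it attributes text only to the innermost open block, and the dictionary keys (`text_len`, `link_text_len`, `links`, `images`) are hypothetical names chosen to mirror the text information listed under S230.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # block-forming tags named in the description


class BlockSplitter(HTMLParser):
    """Cut HTML into blocks and collect text statistics, links and pictures."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks, in the order they close
        self._stack = []   # currently open blocks, innermost last
        self._in_link = 0  # number of <a> tags currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in BLOCK_TAGS:
            self._stack.append(
                {"text_len": 0, "link_text_len": 0, "links": [], "images": []})
        elif tag == "a":
            self._in_link += 1
            if self._stack:
                self._stack[-1]["links"].append(attrs.get("href"))
        elif tag == "img" and self._stack:
            self._stack[-1]["images"].append(
                {"src": attrs.get("src"), "alt": attrs.get("alt", "")})

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS and self._stack:
            self.blocks.append(self._stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            # Text inside <a> counts as hyperlink text, everything else as plain text.
            key = "link_text_len" if self._in_link else "text_len"
            self._stack[-1][key] += len(text)


def split_blocks(html):
    """Return the list of blocks found in the given HTML text."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

The number of hyperlinks in a block is simply `len(block["links"])`, so it is not stored separately in this sketch.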
Visual information and rule base analysis module: the block information passed in by the HTML parsing module is first processed to obtain the following two data structures (that is, the text information and picture information obtained from the block information, described above):
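    class TextBlock:
        def __init__(self):
            pass  # field definitions are not reproduced in this text

    class ImageBlock:
        pass  # field definitions are not reproduced in this text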
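The fields these two structures would carry can be sketched from the text information and picture information listed under S230. The following Python dataclasses are an illustrative assumption, not the patent's actual definitions: the attribute names and types are hypothetical, and only the set of fields comes from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageBlock:
    url: str = ""       # URL of the picture link
    alt_text: str = ""  # explanatory text of the picture
    length: int = 0     # picture length
    width: int = 0      # picture width
    x: int = 0          # abscissa in the simulated typesetting display
    y: int = 0          # ordinate in the simulated typesetting display


@dataclass
class TextBlock:
    text_len: int = 0       # non-hyperlink text length
    link_text_len: int = 0  # hyperlink text length
    link_count: int = 0     # number of hyperlinks
    links: List[str] = field(default_factory=list)         # hyperlink array
    images: List[ImageBlock] = field(default_factory=list)  # picture array
```

Using `default_factory` gives every block its own lists, so pictures appended to one block never leak into another.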
In the embodiment of the present invention, the length and width of a picture are calculated by three methods in priority order (if one method yields no result, the next is tried):
1. Obtain the picture length and width from the HTML attributes in the image tag;
2. Fetch the picture and obtain its length and width with ImageMagick;
3. Obtain the picture length and width from the DOM information in PhantomJS.
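The fallback order above can be sketched in Python as a chain of methods tried one after another. Only the first method is implemented here; the second and third (ImageMagick and PhantomJS) are represented by caller-supplied callables, because invoking those tools is outside the scope of this sketch. All names are hypothetical, and mapping the patent's "length and width" onto the HTML `width` and `height` attributes is an assumption.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Size = Tuple[int, int]  # (width, height), taken from the HTML attribute names


def size_from_attributes(attrs: Dict[str, str]) -> Optional[Size]:
    """Priority 1: read the size from the image tag's own attributes."""
    try:
        w, h = int(attrs["width"]), int(attrs["height"])
    except (KeyError, ValueError):
        return None  # attribute missing or not a plain integer, e.g. "50%"
    return (w, h) if w > 0 and h > 0 else None


def first_available_size(
    attrs: Dict[str, str],
    fallbacks: Iterable[Callable[[], Optional[Size]]] = (),
) -> Optional[Size]:
    """Try each method in priority order and return the first result found."""
    methods = [lambda: size_from_attributes(attrs), *fallbacks]
    for method in methods:
        size = method()
        if size is not None:
            return size
    return None
```

Passing the lower-priority methods as callables means the expensive steps (downloading the picture, querying the rendered DOM) run only when the cheaper ones fail.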
The visual information and rule base analysis module also needs to select pictures based on vision. Specifically, as shown in Fig. 4, common sense suggests that a webpage author usually places the master map in the visual foreground of the webpage, that is, where people see it most easily. The bottom of the webpage, positions that can only be seen by scrolling with the mouse, and the corner areas of the webpage are rarely used for the master map.
Statistics on the visual regions of a large number of webpage master maps show that most master maps lie in the rectangular region whose abscissa is 0 to 700 and whose ordinate is 20 to 1000 (unit: pixels). Using the visual information, all pictures in this region that meet certain size conditions can be put into the candidate set. For example, the picture size satisfies: length 60 to 760, width 60 to 760, and aspect ratio between 0.5 and 2.5. Pictures that do not meet these conditions mostly have shapes and sizes unsuitable for a master map; they are usually icons, advertising banners, and the like.
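The candidate test described above can be written as a short Python predicate. The numeric ranges come from the text; everything else is an assumption made for illustration: the `Picture` class and its field names are hypothetical, the position is tested at the picture's reported coordinates only, the bounds are treated as inclusive, and the aspect ratio is taken as length divided by width.

```python
from dataclasses import dataclass


@dataclass
class Picture:
    url: str
    length: float
    width: float
    x: float  # abscissa in the simulated typesetting display
    y: float  # ordinate in the simulated typesetting display


X_RANGE = (0, 700)        # abscissa range of the foreground region
Y_RANGE = (20, 1000)      # ordinate range of the foreground region
SIDE_RANGE = (60, 760)    # allowed length and width
RATIO_RANGE = (0.5, 2.5)  # allowed aspect ratio


def is_candidate(p: Picture) -> bool:
    """Return True if the picture meets the predetermined visual requirement."""
    in_region = (X_RANGE[0] <= p.x <= X_RANGE[1]
                 and Y_RANGE[0] <= p.y <= Y_RANGE[1])
    size_ok = all(SIDE_RANGE[0] <= s <= SIDE_RANGE[1] for s in (p.length, p.width))
    if not (in_region and size_ok):
        return False
    # size_ok guarantees width >= 60, so the division is safe.
    return RATIO_RANGE[0] <= p.length / p.width <= RATIO_RANGE[1]
```

Because this filter uses only numbers already present in the picture information, it can discard icons and banners before any of the more expensive screening rules run.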
Finally, the visual information and rule base analysis module also needs to screen pictures further based on rules:
Rule one: as shown in Fig. 5, the picture located between the webpage navigation bar (or menu) and the long text is generally the master map;
Rule two: as shown in Fig. 6, in a group of pictures of the same size, the first picture is chosen as the master map;
Rule three: as shown in Fig. 7, for a webpage of the search results page type, the first picture is chosen as the master map;
Rule four: as shown in Fig. 8, the largest picture in the visible area is chosen as the master map;
Rule five: calculate the correlation between the picture description information and the webpage TITLE, and choose the picture with the higher correlation as the master map. The master map is generally highly related to the content of the webpage, and the TITLE of the webpage expresses that content in concentrated form, so if the correlation between a picture's description information and the webpage TITLE is very high, the picture can be considered the master map.
Rule six: as shown in Fig. 9, if no qualifying master map is found by the above rules and the webpage is a website home page or special topic page, the website LOGO is chosen as the master map.
It should be noted that the rules are in an "or" relationship: one or more of them may be applied, in any order. Preferably, in a specific embodiment, all six rules are applied, one after another in the order given above.
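Applying the rules one after another can be sketched in Python as an ordered list of rule functions, where the first rule that selects a picture wins. Only rules two, four, and five are sketched, because the others need layout or page-type information not modelled here. The dictionary keys, function names, and the token-overlap (Jaccard) score used for rule five are assumptions; the description does not say how the correlation is calculated, and word-level tokenization would need to be replaced for Chinese text.

```python
import re
from typing import Callable, List, Optional

Picture = dict  # keys assumed here: "url", "alt", "length", "width"
Rule = Callable[[List[Picture], str], Optional[Picture]]


def first_of_same_size_group(pics, title):
    """Rule two: in a group of equally sized pictures, take the first."""
    groups = {}
    for p in pics:
        groups.setdefault((p["length"], p["width"]), []).append(p)
    for group in groups.values():
        if len(group) > 1:
            return group[0]
    return None


def largest(pics, title):
    """Rule four: take the picture with the largest area."""
    return max(pics, key=lambda p: p["length"] * p["width"], default=None)


def _tokens(text):
    return set(re.findall(r"\w+", text.lower()))


def most_relevant(pics, title):
    """Rule five: take the picture whose description best matches the TITLE."""
    def score(p):
        a, b = _tokens(p.get("alt", "")), _tokens(title)
        return len(a & b) / len(a | b) if a and b else 0.0
    best = max(pics, key=score, default=None)
    return best if best is not None and score(best) > 0 else None


def select_master_map(pics: List[Picture], title: str, rules: List[Rule]):
    """Apply the rules in order and return the first picture any rule selects."""
    for rule in rules:
        chosen = rule(pics, title)
        if chosen is not None:
            return chosen
    return None
```

Keeping each rule as a separate function with the same signature makes it easy to reorder the rules or apply only a subset, which matches the "one or more, in any order" wording above.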
In conclusion by means of the technical solution of the embodiment of the present invention, user behavior is needed not rely on, for the single page Master map extraction is carried out, cold start-up problem is not present, there is stronger adaptability;In addition, using multiple rules such as visual informations, mould Personification has higher accuracy rate to the cognitive behavior of master map;Also, due to being positioned using visual zone so that candidate Calculative picture greatly reduces, and greatly improves the extraction speed of master map.The technical solution of the embodiment of the present invention solves Webpage master map extracts problem in the prior art, is allowed to be applied to search display result page, be formed together with title, the abstract of webpage The more rich form of expression.Also, abundant advertising creative shows form, changes single word chain displaying, additionally it is possible to improve The clicking rate of advertisement.
Device embodiment
According to an embodiment of the present invention, a webpage master map extraction device is provided. Fig. 10 is the structural schematic diagram of the webpage master map extraction device of the embodiment of the present invention. As shown in Fig. 10, the device includes: a webpage capture module 100, an HTML parsing module 102, an information obtaining module 104, and a screening module 106. Each module of the embodiment of the present invention is described in detail below.
Webpage capture module 100: for obtaining the HTML text of the webpage, performing a simulated typesetting display of the HTML text, and obtaining the visual information of each HTML element in the webpage. In the embodiment of the present invention, the visual information includes the position information and size information of each HTML element of the webpage in the simulated typesetting display.
The embodiment of the present invention supports extracting the master map in both online and offline modes. In offline mode the webpage capture module 100 can directly obtain the HTML text of the webpage; in online mode the webpage capture module 100 can fetch the webpage according to its URL so that the HTML text is obtained online.
HTML parsing module 102: for cutting the HTML text in units of block information. It should be noted that block information refers to an HTML fragment formed by tags such as <DIV> and <TABLE>.
Information obtaining module 104: for obtaining the text information in the block information, and obtaining picture information from the block information according to the visual information. The text information may include: non-hyperlink text length, hyperlink text length, number of hyperlinks, hyperlink array, and picture array. The picture information includes: the URL of the picture link, the explanatory text of the picture, the length of the picture, the width of the picture, and the ordinate and abscissa of the picture in the simulated typesetting display.
The information obtaining module 104 is specifically used for: extracting the URL of the picture link and the explanatory text of the picture from the block information; calculating the length and width of the picture according to preset algorithm priorities; and obtaining, according to the visual information, the ordinate and abscissa of the picture in the simulated typesetting display. The highest-priority algorithm obtains the length and width of the picture from the HTML attributes in the image tag; the second-priority algorithm fetches the picture and obtains its length and width with image-processing software; the third-priority algorithm obtains the length and width of the picture from the Document Object Model (DOM) information in the browser rendering engine.
The screening module 106 is configured to obtain, according to the pictorial information, the pictures that meet a predetermined vision requirement (for example, a picture whose length is between 60 and 760, whose width is between 60 and 760, and whose aspect ratio is between 0.5 and 2.5), to further select, according to the text message and the pictorial information, a picture that meets a screening rule from the pictures meeting the predetermined vision requirement, and to use that picture as the master map of the webpage.
The predetermined vision requirement includes: the position of the picture is within a predetermined region, and the length, width, and aspect ratio of the picture meet predetermined requirements.
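A minimal sketch of such a check follows, using the example thresholds quoted above (length and width between 60 and 760, aspect ratio between 0.5 and 2.5). Several points are assumptions made for illustration: the aspect ratio is taken as length divided by width, the bounds are treated as inclusive, and a picture counts as inside the predetermined region when its top-left corner lies in it. The `Picture` and `Region` types are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Picture:
    url: str
    length: float
    width: float
    x: float  # abscissa in the simulated typesetting display
    y: float  # ordinate in the simulated typesetting display


@dataclass
class Region:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def meets_vision_requirement(
    pic: Picture,
    region: Region,
    min_side: float = 60,
    max_side: float = 760,
    min_ratio: float = 0.5,
    max_ratio: float = 2.5,
) -> bool:
    """Check size, aspect ratio, and position against the preset limits."""
    if not (min_side <= pic.length <= max_side):
        return False
    if not (min_side <= pic.width <= max_side):
        return False
    ratio = pic.length / pic.width  # width >= min_side here, so no division by zero
    if not (min_ratio <= ratio <= max_ratio):
        return False
    return region.contains(pic.x, pic.y)
```

Running this filter before any screening rule keeps icons, banners, and off-region pictures out of the candidate set.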
The screening rule specifically includes at least one of the following: using the picture located between the webpage navigation bar or menu and the long text as the master map; within a group of pictures of identical size, selecting the first picture as the master map; for a webpage of the search-results-page type, choosing the first picture as the master map; using the largest picture in the visible area as the master map; calculating the correlation between the descriptive text of each picture and the webpage topic, and using the picture with the highest correlation as the master map; and, when the webpage is a website homepage or a special-topic page, choosing the website logo as the master map.
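As an illustration of the correlation rule only, the sketch below scores each candidate by the word overlap between its descriptive text and the webpage topic. The text does not say how correlation is computed; the Jaccard overlap used here, the function names, and the `(url, description)` candidate format are all assumptions. The simple `\w+` tokenizer suits whitespace-delimited languages; Chinese text would need a word segmenter instead.

```python
import re
from typing import List, Optional, Tuple


def tokens(text: str) -> set:
    """Lower-cased word set of a text."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(description: str, topic: str) -> float:
    """Jaccard overlap between two texts (an assumed correlation measure)."""
    a, b = tokens(description), tokens(topic)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_most_relevant(candidates: List[Tuple[str, str]], topic: str) -> Optional[str]:
    """candidates: (url, descriptive text) pairs in page order.

    Returns the URL with the highest relevance; on a tie, the earlier
    picture wins because max() keeps the first maximum it sees.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: relevance(c[1], topic))[0]
```

The other rules in the list (first picture in a same-size group, largest picture in the visible area, and so on) could be written as similar small functions and tried in whatever order an implementation prefers.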
For the specific processing of each module in the webpage master map extraction device of the embodiment of the present invention, refer to the description in the method embodiment above; the details are not repeated here. The data obtaining module 104 and the screening module 106 in this embodiment correspond to the visual information and rule base analysis module in the method embodiment.
In conclusion by means of the technical solution of the embodiment of the present invention, user behavior is needed not rely on, for the single page Master map extraction is carried out, cold start-up problem is not present, there is stronger adaptability;In addition, using multiple rules such as visual informations, mould Personification has higher accuracy rate to the cognitive behavior of master map;Also, due to being positioned using visual zone so that candidate Calculative picture greatly reduces, and greatly improves the extraction speed of master map.The technical solution of the embodiment of the present invention solves Webpage master map extracts problem in the prior art, is allowed to be applied to search display result page, be formed together with title, the abstract of webpage The more rich form of expression.Also, abundant advertising creative shows form, changes single word chain displaying, additionally it is possible to improve The clicking rate of advertisement.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the best mode of the present invention.
In the description provided here, numerous specific details are set forth. It should be understood, however, that embodiments of the present invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the present invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will appreciate that the modules in the client of an embodiment can be adaptively changed and arranged in one or more clients different from that embodiment. The modules in an embodiment can be combined into one module, and can furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or client so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the client loaded with sequence network addresses according to the embodiment of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims (16)

1. A webpage master map extracting method, characterized by comprising:
obtaining the HTML text of a webpage, performing a simulated typesetting display of the HTML text, and obtaining the visual information of each HTML element in the webpage;
cutting the HTML text into units of block messages;
obtaining the text message in the block message, and obtaining pictorial information from the block message according to the visual information, wherein the pictorial information includes: the URL of the image link, the descriptive text of the picture, the length of the picture, the width of the picture, the ordinate of the picture in the simulated typesetting display, and the abscissa of the picture in the simulated typesetting display;
obtaining, according to the pictorial information, the pictures that meet a predetermined vision requirement, further selecting, according to the text message and the pictorial information, a picture that meets a screening rule from the pictures meeting the predetermined vision requirement, and using that picture as the master map of the webpage, wherein the screening rule includes: using the picture located between the webpage navigation bar or menu and the long text as the master map.
2. The method according to claim 1, characterized in that obtaining the HTML text of the webpage specifically includes: obtaining the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
3. The method according to claim 1, characterized in that the visual information includes: the position information and the size information of each HTML element of the webpage in the simulated typesetting display.
4. The method according to claim 1, characterized in that the text message includes: the non-hyperlink text length, the hyperlink text length, the number of hyperlinks, a hyperlink array, and a picture array.
5. The method according to claim 1, characterized in that obtaining the pictorial information specifically includes:
extracting the URL of the image link and the descriptive text of the picture from the block message;
calculating the length and the width of the picture according to a preset algorithm priority;
obtaining, according to the visual information, the ordinate of the picture in the simulated typesetting display and the abscissa of the picture in the simulated typesetting display.
6. The method according to claim 5, characterized in that calculating the length and the width of the picture according to the preset algorithm priority specifically includes at least one of the following:
the highest-priority algorithm: obtaining the length and the width of the picture from the HTML markup of the image tag;
the second-priority algorithm: fetching the picture and obtaining the length and the width of the picture with drawing software;
the third-priority algorithm: obtaining the length and the width of the picture from the Document Object Model (DOM) information in the browser rendering engine.
7. The method according to claim 1, characterized in that the predetermined vision requirement includes: the position of the picture is within a predetermined region, and the length, width, and aspect ratio of the picture meet predetermined requirements.
8. The method according to claim 1, characterized in that the screening rule further includes at least one of the following:
within a group of pictures of identical size, selecting the first picture as the master map;
for a webpage of the search-results-page type, choosing the first picture as the master map;
using the largest picture in the visible area as the master map;
calculating the correlation between the descriptive text of each picture and the webpage topic, and using the picture with the highest correlation as the master map;
when the webpage is a website homepage or a special-topic page, choosing the website logo as the master map.
9. A webpage master map extraction device, characterized by comprising:
a webpage capture module, configured to obtain the HTML text of a webpage, perform a simulated typesetting display of the HTML text, and obtain the visual information of each HTML element in the webpage;
an HTML parsing module, configured to cut the HTML text into units of block messages;
a data obtaining module, configured to obtain the text message in the block message and to obtain pictorial information from the block message according to the visual information, wherein the pictorial information includes: the URL of the image link, the descriptive text of the picture, the length of the picture, the width of the picture, the ordinate of the picture in the simulated typesetting display, and the abscissa of the picture in the simulated typesetting display;
a screening module, configured to obtain, according to the pictorial information, the pictures that meet a predetermined vision requirement, to further select, according to the text message and the pictorial information, a picture that meets a screening rule from the pictures meeting the predetermined vision requirement, and to use that picture as the master map of the webpage, wherein the screening rule includes: using the picture located between the webpage navigation bar or menu and the long text as the master map.
10. The device according to claim 9, characterized in that the webpage capture module is specifically configured to: obtain the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
11. The device according to claim 9, characterized in that the visual information includes: the position information and the size information of each HTML element of the webpage in the simulated typesetting display.
12. The device according to claim 9, characterized in that the text message includes: the non-hyperlink text length, the hyperlink text length, the number of hyperlinks, a hyperlink array, and a picture array.
13. The device according to claim 9, characterized in that the data obtaining module is specifically configured to:
extract the URL of the image link and the descriptive text of the picture from the block message;
calculate the length and the width of the picture according to a preset algorithm priority;
obtain, according to the visual information, the ordinate of the picture in the simulated typesetting display and the abscissa of the picture in the simulated typesetting display.
14. The device according to claim 13, characterized in that the highest-priority algorithm is: obtaining the length and the width of the picture from the HTML markup of the image tag; the second-priority algorithm is: fetching the picture and obtaining the length and the width of the picture with drawing software; the third-priority algorithm is: obtaining the length and the width of the picture from the Document Object Model (DOM) information in the browser rendering engine.
15. The device according to claim 9, characterized in that the predetermined vision requirement includes: the position of the picture is within a predetermined region, and the length, width, and aspect ratio of the picture meet predetermined requirements.
16. The device according to claim 9, characterized in that the screening rule further includes at least one of the following:
within a group of pictures of identical size, selecting the first picture as the master map;
for a webpage of the search-results-page type, choosing the first picture as the master map;
using the largest picture in the visible area as the master map;
calculating the correlation between the descriptive text of each picture and the webpage topic, and using the picture with the highest correlation as the master map;
when the webpage is a website homepage or a special-topic page, choosing the website logo as the master map.
CN201410346226.7A 2014-07-21 2014-07-21 Webpage master map extracting method and device Active CN104123363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410346226.7A CN104123363B (en) 2014-07-21 2014-07-21 Webpage master map extracting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410346226.7A CN104123363B (en) 2014-07-21 2014-07-21 Webpage master map extracting method and device

Publications (2)

Publication Number Publication Date
CN104123363A CN104123363A (en) 2014-10-29
CN104123363B true CN104123363B (en) 2018-07-13

Family

ID=51768774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410346226.7A Active CN104123363B (en) 2014-07-21 2014-07-21 Webpage master map extracting method and device

Country Status (1)

Country Link
CN (1) CN104123363B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376114B (en) * 2014-12-01 2018-01-30 百度在线网络技术(北京)有限公司 A kind of search result methods of exhibiting and device
CN104699837B (en) * 2015-03-31 2017-04-12 北京奇虎科技有限公司 Method, device and server for selecting illustrated pictures of web pages
CN104881428B (en) * 2015-04-02 2019-03-29 广州神马移动信息科技有限公司 A kind of hum pattern extraction, search method and the device of hum pattern webpage
CN106445997B (en) * 2016-07-20 2021-02-05 腾讯科技(北京)有限公司 Information processing method and server
CN106503059B (en) * 2016-09-27 2019-07-23 北京小米移动软件有限公司 Displayed page method for pushing and device
CN106547540A (en) * 2016-10-12 2017-03-29 惠州市德赛西威汽车电子股份有限公司 A kind of method for drafting of text button
CN106484913B (en) * 2016-10-26 2021-09-07 腾讯科技(深圳)有限公司 Target picture determining method and server
CN108268488B (en) * 2016-12-30 2022-04-19 百度在线网络技术(北京)有限公司 Webpage main graph identification method and device
CN108399167B (en) * 2017-02-04 2022-04-29 百度在线网络技术(北京)有限公司 Webpage information extraction method and device
CN107066596A (en) * 2017-04-19 2017-08-18 北京小米移动软件有限公司 The method and apparatus for generating link information
CN107766475A (en) * 2017-10-09 2018-03-06 李亚强 A kind of system of selection of info web master map and device
CN109685085B (en) * 2017-10-18 2023-09-26 阿里巴巴集团控股有限公司 Main graph extraction method and device
CN112084451B (en) * 2020-09-16 2022-09-30 哈尔滨工业大学 Webpage LOGO extraction system and method based on visual blocking
CN112597765A (en) * 2020-12-25 2021-04-02 四川长虹电器股份有限公司 Automatic movie and television topic generation method based on multi-mode features
CN116578763B (en) * 2023-07-11 2023-09-15 卓谨信息科技(常州)有限公司 Multisource information exhibition system based on generated AI cognitive model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944109A (en) * 2010-09-06 2011-01-12 华南理工大学 System and method for extracting picture abstract based on page partitioning
CN103425644A (en) * 2012-05-14 2013-12-04 腾讯科技(深圳)有限公司 Method and device for extracting pictures in webpage content
CN103885959A (en) * 2012-12-20 2014-06-25 腾讯科技(深圳)有限公司 Webpage bookmark generating method and webpage bookmark generating device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8875007B2 (en) * 2010-11-08 2014-10-28 Microsoft Corporation Creating and modifying an image wiki page

Also Published As

Publication number Publication date
CN104123363A (en) 2014-10-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220726

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.