CN104391978A - Method and device for storing and processing web pages of browsers

Info

Publication number: CN104391978A
Application number: CN201410742954.XA
Authority: CN
Inventors: 伯诺克
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2014-12-05
Filing date: 2014-12-05
Publication date: 2015-03-04
Anticipated expiration: 2034-12-05
Also published as: CN104391978B

Abstract

The invention discloses a method and a device for storing and processing web pages of browsers. The method for storing and processing the web pages of the browsers includes receiving search keywords for searching the required-to-be-browsed web pages from the stored web pages of the browsers; matching the search keywords with the stored web pages of the browsers to obtain addresses of the matched stored web pages; outputting the addresses of the matched stored web pages. The method and the device have the advantages that the problem of low efficiency when target web pages are searched from stored web pages of existing browsers can be solved, and accordingly an effect of improving the efficiency when the target web pages are searched from the stored web pages of the browsers can be realized.

Description

Method and device for processing bookmarked web pages of a browser

Technical field

The present invention relates to the Internet field, and in particular to a method and device for processing bookmarked web pages of a browser.

Background art

Existing browsers provide a function for bookmarking web pages. The bookmarks folder records the URL and the title of each web page the user has saved. When the user needs to revisit a bookmarked web page, the page can be found by its URL or its title in the bookmarks. Although this lets the user find bookmarked web pages, when there are many bookmark records the needed page can only be identified by the titles in the bookmarks. However, the title of a web page often does not represent its content, or the keywords of the content the user cares about are not included in the title of the bookmarked page, which makes it difficult for the user to quickly find the page to visit among a large number of bookmarked web pages.

For the problem in the related art of low efficiency when searching for a target web page among the bookmarked web pages of a browser, no effective solution has yet been proposed.

Summary of the invention

The main purpose of the present invention is to provide a method and device for processing bookmarked web pages of a browser, so as to solve the problem of low efficiency when searching for a target web page among the bookmarked web pages of a browser.

To achieve this purpose, according to one aspect of the present invention, a method for processing bookmarked web pages of a browser is provided.

The method for processing bookmarked web pages of a browser according to the present invention comprises: receiving a search keyword, wherein the search keyword is used to search the bookmarked web pages of the browser for a web page to be browsed; matching the search keyword against the bookmarked web pages of the browser to obtain the address of the matching bookmarked web page; and outputting the address of the matching bookmarked web page.

Further, matching the search keyword against the bookmarked web pages of the browser comprises: obtaining the title and text content of the bookmarked web pages of the browser; and matching the title and text content of the bookmarked web pages against the search keyword, wherein if the title and text content of a bookmarked web page match the search keyword, it is determined that the search keyword matches that bookmarked web page, and if the title and text content of a bookmarked web page do not match the search keyword, it is determined that the search keyword does not match that bookmarked web page.

Further, before the title and text content of the bookmarked web pages of the browser are matched against the search keyword, the method also comprises: obtaining the text content of the bookmarked web pages of the browser; obtaining the URL and title of the bookmarked web pages of the browser; and storing the text content, URL, and title of the bookmarked web pages of the browser.

Further, obtaining the text content of the bookmarked web pages of the browser comprises: obtaining the address of a bookmarked web page of the browser; accessing the bookmarked web page according to its address; and crawling text content from the bookmarked web page while it is being accessed, to obtain the text content of the bookmarked web page.

Further, crawling text content from the bookmarked web page while it is being accessed, to obtain the text content of the bookmarked web page, comprises: filtering out the HTML tags of the bookmarked web page; and crawling text content from the bookmarked web page from which the HTML tags have been filtered, to obtain the text content of the bookmarked web page.

Further, after text content is crawled from the bookmarked web page while it is being accessed and the text content of the bookmarked web page is obtained, the method also comprises: extracting keywords from the text content of the bookmarked web page to obtain the keywords of the bookmarked web page; and storing the keywords, URL, and title of the bookmarked web page. In this case, matching the title and text content of the bookmarked web pages against the search keyword comprises: matching the keywords and title of the bookmarked web pages against the search keyword.

To achieve the above purpose, according to another aspect of the present invention, a device for processing bookmarked web pages of a browser is provided.

The device for processing bookmarked web pages of a browser according to the present invention comprises: a receiving unit, configured to receive a search keyword, wherein the search keyword is used to search the bookmarked web pages of the browser for a web page to be browsed; a matching unit, configured to match the search keyword against the bookmarked web pages of the browser to obtain the address of the matching bookmarked web page; and an output unit, configured to output the address of the matching bookmarked web page.

Further, the matching unit comprises: a first acquisition module, configured to obtain the title and text content of the bookmarked web pages of the browser; and a matching module, configured to match the title and text content of the bookmarked web pages against the search keyword, wherein if the title and text content of a bookmarked web page match the search keyword, it is determined that the search keyword matches that bookmarked web page, and if the title and text content of a bookmarked web page do not match the search keyword, it is determined that the search keyword does not match that bookmarked web page.

Further, the device also comprises: a first acquiring unit, configured to obtain the text content of the bookmarked web pages of the browser; a second acquiring unit, configured to obtain the URL and title of the bookmarked web pages of the browser; and a storage unit, configured to store the text content, URL, and title of the bookmarked web pages of the browser.

Further, the first acquiring unit comprises: a second acquisition module, configured to obtain the address of a bookmarked web page of the browser; an access module, configured to access the bookmarked web page according to its address; and a crawling module, configured to crawl text content from the bookmarked web page while it is being accessed, to obtain the text content of the bookmarked web page.

With the present invention, the bookmarked web page to be visited is found by searching the bookmarked web pages of the browser. This solves the problem of low efficiency when searching for a target web page among the bookmarked web pages of a browser, and thereby improves the efficiency of finding the target web page among them.

Brief description of the drawings

The accompanying drawings, which form part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a flowchart of the method for processing bookmarked web pages of a browser according to an embodiment of the present invention; and

Fig. 2 is a schematic diagram of the device for processing bookmarked web pages of a browser according to an embodiment of the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with each other. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.

To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort shall fall within the scope of protection of this application.

It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

An embodiment of the present invention provides a method for processing bookmarked web pages of a browser. Fig. 1 is a flowchart of the method for processing bookmarked web pages of a browser according to an embodiment of the present invention.

As shown in Fig. 1, the method comprises the following steps S102 to S106:

Step S102: receive a search keyword, wherein the search keyword is used to search the bookmarked web pages of the browser for a web page to be browsed.

The search keyword may be any keyword used to search the bookmarked web pages of the browser for the web page to be browsed; it may be a single keyword or multiple keywords. Specifically, a search box may be provided in the bookmarks area of the browser, and the search keyword entered by the user is received through this search box.

Step S104: match the search keyword against the bookmarked web pages of the browser to obtain the address of the matching bookmarked web page.

The bookmarked web pages of a browser are usually located in the browser's bookmarks folder, and the bookmarks folder of existing browsers stores the address and title of each bookmarked web page. Matching the search keyword against the bookmarked web pages may be done by matching the search keyword against the titles of the bookmarked web pages: if the search keyword appears in the title of a bookmarked web page, that page is relevant to the page the user needs to visit. The bookmarked web pages that match the search keyword are recorded. Preferably, to improve the accuracy of finding the bookmarked web page to be visited by means of the search keyword, matching the search keyword against the bookmarked web pages of the browser comprises: obtaining the title and text content of the bookmarked web pages of the browser; and matching the title and text content of the bookmarked web pages against the search keyword, wherein if the title and text content of a bookmarked web page match the search keyword, it is determined that the search keyword matches that bookmarked web page, and if the title and text content of a bookmarked web page do not match the search keyword, it is determined that the search keyword does not match that bookmarked web page.

The content of a bookmarked web page may be obtained by accessing the page. Alternatively, the text content of each bookmarked web page of the browser may be stored in advance in a local database or another storage area, and then read from that database or storage area. The text content of a bookmarked web page may be its full text, or keywords extracted from its full text. The title of a bookmarked web page sometimes does not represent its content, or the keywords of the content the user cares about may not be included in the title. In that case, matching the search keyword against the title alone may fail to retrieve the bookmarked web page to be visited, and the user may try several different search keywords and still fail to retrieve it. Matching both the title and the text content of the bookmarked web pages against the search keyword avoids this problem. Specifically, the title of a bookmarked web page may first be matched against the search keyword. If the title matches, the content of the page need not be matched; if the title does not match, the content of the page is then matched against the search keyword. This raises the probability that a bookmarked web page matches the search keyword and further improves the accuracy of finding the bookmarked web page to be visited.
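The title-first matching order described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Bookmark` record, the function names, the case-insensitive substring test, and the rule that all keywords must appear in the same field are assumptions made for the example, since the description does not fix these details.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bookmark:
    """Hypothetical record for one bookmarked page."""
    url: str
    title: str
    text: str  # text content of the page, assumed to be stored in advance


def _contains_all(haystack: str, keywords: List[str]) -> bool:
    # Assumption: a field matches when every keyword occurs in it,
    # compared case-insensitively as a plain substring.
    lowered = haystack.lower()
    return all(k.lower() in lowered for k in keywords)


def matches(bookmark: Bookmark, keywords: List[str]) -> bool:
    """Try the title first; fall back to the text content only if needed."""
    if not keywords:
        return False
    if _contains_all(bookmark.title, keywords):
        return True  # title matched, so the content is not examined
    return _contains_all(bookmark.text, keywords)


def search(bookmarks: List[Bookmark], keywords: List[str]) -> List[str]:
    """Return the addresses of the bookmarked pages that match."""
    return [b.url for b in bookmarks if matches(b, keywords)]
```

Checking the title before the text keeps the common case cheap, because the longer text content is scanned only for pages whose title did not already match.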

Preferably, to improve the efficiency of obtaining the title and text content of the bookmarked web pages of the browser, before the title and text content of the bookmarked web pages are matched against the search keyword, the method also comprises: obtaining the text content of the bookmarked web pages of the browser; obtaining the URL and title of the bookmarked web pages of the browser; and storing the text content, URL, and title of the bookmarked web pages of the browser.

Before the bookmarked web pages of the browser are searched, the text content, URL, and title of each bookmarked web page are obtained in advance and stored in a local storage area, such as a local database. Specifically, when the text content, URL, and title of the bookmarked web pages are stored, they may be associated with one another, that is, a correspondence is established among the text content, URL, and title that belong to the same bookmarked web page. With this approach, when the user searches the bookmarked web pages of the browser, the text content and title of the bookmarked web pages can be retrieved quickly and matched against the search keyword, and if a bookmarked web page matches the search keyword its address can be obtained quickly, which improves retrieval efficiency.
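One way to keep the text content, URL, and title of the same page associated in a local database is a single table keyed by URL. The sketch below uses Python's built-in `sqlite3` module; the table name, column names, and helper functions are hypothetical, and the choice of SQLite with a `LIKE` query is an assumption, since the description only says "local database".

```python
import sqlite3
from typing import List


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local store; one row ties URL, title and text together."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookmarks ("
        " url TEXT PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " text_content TEXT NOT NULL)"
    )
    return conn


def save_bookmark(conn: sqlite3.Connection, url: str, title: str, text_content: str) -> None:
    """Store or refresh the record for one bookmarked page."""
    conn.execute(
        "INSERT OR REPLACE INTO bookmarks (url, title, text_content) VALUES (?, ?, ?)",
        (url, title, text_content),
    )
    conn.commit()


def find_urls(conn: sqlite3.Connection, keyword: str) -> List[str]:
    """Return addresses whose title or text contains the keyword.

    Note: '%' and '_' inside the keyword act as LIKE wildcards in this sketch.
    """
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT url FROM bookmarks"
        " WHERE title LIKE ? OR text_content LIKE ?"
        " ORDER BY url",
        (pattern, pattern),
    )
    return [row[0] for row in rows]
```

Because the text is fetched once and stored, a search only queries the local table and does not need to access any page over the network.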

Optionally, obtaining the text content of the bookmarked web pages of the browser comprises: obtaining the address of a bookmarked web page of the browser; accessing the bookmarked web page according to its address; and crawling text content from the bookmarked web page while it is being accessed, to obtain the text content of the bookmarked web page.

The URL and title of each bookmarked web page are already stored in the browser's bookmarks folder. Specifically, the address of a bookmarked web page, that is, its Uniform Resource Locator (URL), may be obtained by calling the Application Programming Interface (API) that the browser provides for obtaining the addresses of bookmarked web pages. The bookmarked web page can be accessed through its address, and text content is crawled from the page while it is being accessed, to obtain the text content of the bookmarked web page. Specifically, the text content may be crawled from the bookmarked web page by a web crawler. A web crawler is a program or script that automatically crawls information on the network according to set rules. For example, a web crawler may be configured to crawl only the text content of a web page, or only the pictures on a web page, and so on. In the embodiments of the present invention, the web crawler crawls only the text content of the bookmarked web page. Preferably, to improve the efficiency of crawling the text content of a bookmarked web page, crawling text content from the bookmarked web page while it is being accessed, to obtain the text content of the bookmarked web page, comprises: filtering out the HTML tags of the bookmarked web page; and crawling text content from the bookmarked web page from which the HTML tags have been filtered, to obtain the text content of the bookmarked web page.

A Hyper Text Markup Language (HTML) tag is the smallest unit in HTML. The display format of a web page can be set through HTML tags; for example, HTML tags set the display position of the title, keywords, and content of a web page. Specifically, after the web page is requested from the server through the address of the bookmarked web page, the content returned by the server may be matched against a preset regular expression to filter out the HTML tags of the bookmarked web page. A regular expression is a single character string used to describe and match a series of character strings that conform to a certain syntactic rule. For example, a regular expression for matching Chinese postal codes is "[1-9]\d{5}(?!\d)". If the character string to be matched is "Chinabeijing100081haidian", this regular expression quickly matches the characters "100081" that represent the postal code in the string, and the other characters are filtered out.
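The regular-expression filtering described above can be tried out with Python's `re` module. The first part reproduces the postal-code example from the text. The second part is a deliberately simple tag filter; the pattern `<[^>]+>` and the function names are assumptions for illustration, as the description does not give the actual preset expression, and this simple pattern does not handle script or style bodies, comments, or a ">" inside an attribute value.

```python
import re
from typing import Optional

# Six digits starting with 1-9 and not followed by another digit,
# as in the postal-code example in the description.
POSTAL_CODE = re.compile(r"[1-9]\d{5}(?!\d)")

# Assumed, simplified pattern for an HTML tag: '<', anything but '>', then '>'.
HTML_TAG = re.compile(r"<[^>]+>")


def find_postal_code(text: str) -> Optional[str]:
    """Return the first postal-code-like run of digits, or None."""
    match = POSTAL_CODE.search(text)
    return match.group(0) if match else None


def strip_tags(html: str) -> str:
    """Replace every tag with a space, then collapse runs of whitespace."""
    without_tags = HTML_TAG.sub(" ", html)
    return " ".join(without_tags.split())


print(find_postal_code("Chinabeijing100081haidian"))  # prints 100081
print(strip_tags("<p>Hello <b>world</b></p>"))        # prints Hello world
```

For real pages, a proper HTML parser is generally more robust than a regular expression; the regex form is shown here only because it is the approach the description names.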

Preferably, after text content is crawled from the bookmarked web page while it is being accessed and the text content of the bookmarked web page is obtained, the method also comprises: extracting keywords from the text content of the bookmarked web page to obtain the keywords of the bookmarked web page; and storing the keywords, URL, and title of the bookmarked web page. In this case, matching the title and text content of the bookmarked web pages against the search keyword comprises: matching the keywords and title of the bookmarked web pages against the search keyword.

The keywords of a bookmarked web page may be words that occur frequently in its text content, or text located near the beginning of its text content, such as the summary of the text content. Specifically, this embodiment is described using frequently occurring words in the text content as the keywords of the bookmarked web page. After the text content of the bookmarked web page is obtained, it may be segmented into words, that is, divided into independent words. Stop words, meaning words without substantive meaning such as modal particles and conjunctions, may first be filtered out, and the words remaining after filtering form a word set. The repeated words in this word set and the number of times each occurs are then counted; if the number of occurrences of a repeated word is greater than a preset threshold, that word is taken as a keyword of the bookmarked web page. After the keywords of the bookmarked web page are obtained, the correspondence among the keywords, URL, and title of the bookmarked web page may likewise be established when they are stored. Because the text content of a bookmarked web page may be large, matching the search keyword against the full text is time-consuming. Moreover, too many incorrect matches may occur, that is, the bookmarked web page that matches the search keyword is not the one the user needs to visit. Matching the search keyword against keywords extracted from the text content improves both the efficiency of matching and the accuracy of the matching results.
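The frequency-based keyword extraction described above (segment into words, drop stop words, keep words whose count exceeds a threshold) can be sketched as follows. The stop-word list, the default threshold, and the regex tokenizer are assumptions chosen for short English text; Chinese text would need a dedicated word segmenter, which is not shown here.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list (an assumption, not from the source).
STOP_WORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in", "is"})


def extract_keywords(text: str, threshold: int = 1) -> List[str]:
    """Return words that occur more than `threshold` times, sorted alphabetically."""
    # Segment the text into lowercase alphanumeric words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Filter out stop words; what remains forms the word set to count.
    kept = [w for w in words if w not in STOP_WORDS]
    counts = Counter(kept)
    # Keep only words whose occurrence count is greater than the threshold.
    return sorted(w for w, n in counts.items() if n > threshold)


sample = ("The browser stores the bookmark. A bookmark is a link, "
          "and the browser opens the link in a tab.")
print(extract_keywords(sample))  # prints ['bookmark', 'browser', 'link']
```

Storing this short keyword list instead of the full text keeps the stored record small and gives the matcher far less text to scan.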

Step S106: output the address of the matching bookmarked web page.

Through the above steps, the address of the bookmarked web page that matches the search keyword can be obtained from the bookmarked web pages of the browser, and this address is output for the user to view.

As can be seen from the above description, the present invention achieves the following technical effects:

In the embodiments of the present invention, a search keyword is received, the search keyword is matched against the bookmarked web pages of the browser to obtain the address of the matching bookmarked web page, and that address is output. The bookmarked web page to be visited is thus found by searching the bookmarked web pages of the browser. Compared with the prior art, in which the user opens the bookmarked web pages of the browser one by one to look for the page, this improves the efficiency of finding a target web page among the bookmarked web pages of the browser and solves the problem in the related art of low efficiency when searching for a target web page among them.

It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order.

According to another aspect of the embodiments of the present invention, a device for processing bookmarked web pages of a browser is provided. This device may be used to execute the method for processing bookmarked web pages of a browser of the embodiments of the present invention, and the method of the embodiments of the present invention may also be executed by this device.

Fig. 2 is a schematic diagram of the device for processing bookmarked web pages of a browser according to an embodiment of the present invention. As shown in Fig. 2, the device comprises: a receiving unit 10, a matching unit 20, and an output unit 30.

The receiving unit 10 is configured to receive a search keyword, wherein the search keyword is used to search the bookmarked web pages of the browser for a web page to be browsed.

The matching unit 20 is configured to match the search keyword against the bookmarked web pages of the browser to obtain the address of the matching bookmarked web page.

The bookmarked web pages of a browser are usually located in the browser's bookmarks folder, and the bookmarks folder of existing browsers stores the address and title of each bookmarked web page. Matching the search keyword against the bookmarked web pages may be done by matching the search keyword against the titles of the bookmarked web pages: if the search keyword appears in the title of a bookmarked web page, that page is relevant to the page the user needs to visit.

The output unit 30 is configured to output the address of the matching bookmarked web page.

After the address of the bookmarked web page that matches the search keyword is obtained from the bookmarked web pages of the browser, this address is output for the user to view.

In the embodiments of the present invention, the receiving unit 10 receives the search keyword, the matching unit 20 matches the search keyword against the bookmarked web pages of the browser to obtain the address of the matching bookmarked web page, and the output unit 30 outputs that address. The bookmarked web page to be visited is thus found by searching the bookmarked web pages of the browser. Compared with the prior art, in which the user opens the bookmarked web pages of the browser one by one to look for the page, this improves the efficiency of finding a target web page among the bookmarked web pages of the browser and solves the problem in the related art of low efficiency when searching for a target web page among them.
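The division into receiving unit 10, matching unit 20, and output unit 30 maps naturally onto three small classes wired together by a device object. The sketch below is only one possible software arrangement: all class and method names are hypothetical, bookmarks are reduced to (URL, title) pairs, and matching is limited to a case-insensitive title check for brevity.

```python
from typing import Iterable, List, Tuple

BookmarkEntry = Tuple[str, str]  # (url, title), an assumed minimal shape


class ReceivingUnit:
    """Receives the raw query and splits it into search keywords."""

    def receive(self, raw_query: str) -> List[str]:
        return raw_query.split()


class MatchingUnit:
    """Matches keywords against bookmark titles and yields matching addresses."""

    def __init__(self, bookmarks: Iterable[BookmarkEntry]) -> None:
        self._bookmarks = list(bookmarks)

    def match(self, keywords: List[str]) -> List[str]:
        if not keywords:
            return []
        return [
            url
            for url, title in self._bookmarks
            if all(k.lower() in title.lower() for k in keywords)
        ]


class OutputUnit:
    """Outputs the matching addresses for the user to view."""

    def output(self, urls: List[str]) -> List[str]:
        for url in urls:
            print(url)
        return urls


class BookmarkSearchDevice:
    """Chains the three units: receive, then match, then output."""

    def __init__(self, bookmarks: Iterable[BookmarkEntry]) -> None:
        self.receiving_unit = ReceivingUnit()
        self.matching_unit = MatchingUnit(bookmarks)
        self.output_unit = OutputUnit()

    def search(self, raw_query: str) -> List[str]:
        keywords = self.receiving_unit.receive(raw_query)
        return self.output_unit.output(self.matching_unit.match(keywords))
```

Keeping the units separate means the matching unit could later be swapped for one that also checks stored text content or extracted keywords without changing how queries are received or results are shown.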

Preferably, the matching unit 20 comprises: a first acquisition module, configured to obtain the title and text content of the bookmarked web pages of the browser; and a matching module, configured to match the title and text content of the bookmarked web pages against the search keyword, wherein if the title and text content of a bookmarked web page match the search keyword, it is determined that the search keyword matches that bookmarked web page, and if the title and text content of a bookmarked web page do not match the search keyword, it is determined that the search keyword does not match that bookmarked web page.

Preferably, the device also comprises: a first acquiring unit, configured to obtain the text content of the bookmarked web pages of the browser; a second acquiring unit, configured to obtain the URL and title of the bookmarked web pages of the browser; and a storage unit, configured to store the text content, URL, and title of the bookmarked web pages of the browser.

Preferably, the first acquiring unit comprises: a second acquisition module, configured to obtain the address of a bookmarked web page of the browser; an access module, configured to access the bookmarked web page according to its address; and a crawling module, configured to crawl text content from the bookmarked web page while it is being accessed, to obtain the text content of the bookmarked web page.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented with a general-purpose computing device. They may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be made into an individual integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The foregoing is only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for processing bookmarked web pages of a browser, characterized by comprising:

receiving a search keyword, wherein the search keyword is used to search the bookmarked web pages of a browser for a web page to be browsed;

matching the search keyword against the bookmarked web pages of the browser to obtain the address of the matching bookmarked web page; and outputting the address of the matching bookmarked web page.

2. The method for processing bookmarked web pages of a browser according to claim 1, characterized in that matching the search keyword against the bookmarked web pages of the browser comprises:

obtaining the title and text content of the bookmarked web pages of the browser; and matching the title and text content of the bookmarked web pages of the browser against the search keyword,

wherein, if the title and text content of a bookmarked web page of the browser match the search keyword, it is determined that the search keyword matches the bookmarked web page of the browser, and if the title and text content of a bookmarked web page of the browser do not match the search keyword, it is determined that the search keyword does not match the bookmarked web page of the browser.

3. The method for processing bookmarked web pages of a browser according to claim 1, characterized in that, before the title and text content of the bookmarked web pages of the browser are matched against the search keyword, the method also comprises:

obtaining the text content of the bookmarked web pages of the browser;

obtaining the URL and title of the bookmarked web pages of the browser; and storing the text content, URL, and title of the bookmarked web pages of the browser.

4. The method for processing bookmarked web pages of a browser according to claim 3, characterized in that obtaining the text content of the bookmarked web pages of the browser comprises:

obtaining the address of the bookmarked web page of the browser;

accessing the bookmarked web page according to the address of the bookmarked web page of the browser; and crawling text content from the bookmarked web page while the bookmarked web page is being accessed, to obtain the text content of the bookmarked web page of the browser.

5. The method for processing bookmarked web pages of a browser according to claim 4, characterized in that crawling text content from the bookmarked web page while the bookmarked web page is being accessed, to obtain the text content of the bookmarked web page of the browser, comprises:

filtering out the HTML tags of the bookmarked web page of the browser; and crawling text content from the bookmarked web page of the browser from which the HTML tags have been filtered, to obtain the text content of the bookmarked web page of the browser.

6. The method for processing bookmarked web pages of a browser according to claim 4, characterized in that,

after text content is crawled from the bookmarked web page while the bookmarked web page is being accessed and the text content of the bookmarked web page of the browser is obtained, the method also comprises: extracting keywords from the text content of the bookmarked web page of the browser to obtain the keywords of the bookmarked web page of the browser; and storing the keywords, URL, and title of the bookmarked web page of the browser; and matching the title and text content of the bookmarked web pages of the browser against the search keyword comprises: matching the keywords and title of the bookmarked web pages of the browser against the search keyword.

7. A device for processing bookmarked web pages of a browser, characterized by comprising:

a receiving unit, configured to receive a search keyword, wherein the search keyword is used to search the bookmarked web pages of a browser for a web page to be browsed;

a matching unit, configured to match the search keyword against the bookmarked web pages of the browser to obtain the address of the matching bookmarked web page; and an output unit, configured to output the address of the matching bookmarked web page.

8. The device for processing bookmarked web pages of a browser according to claim 7, characterized in that the matching unit comprises:

a first acquisition module, configured to obtain the title and text content of the bookmarked web pages of the browser; and a matching module, configured to match the title and text content of the bookmarked web pages of the browser against the search keyword, wherein, if the title and text content of a bookmarked web page of the browser match the search keyword, it is determined that the search keyword matches the bookmarked web page of the browser, and if the title and text content of a bookmarked web page of the browser do not match the search keyword, it is determined that the search keyword does not match the bookmarked web page of the browser.

9. The device for processing bookmarked web pages of a browser according to claim 7, characterized in that the device also comprises:

a first acquiring unit, configured to obtain the text content of the bookmarked web pages of the browser;

a second acquiring unit, configured to obtain the URL and title of the bookmarked web pages of the browser; and a storage unit, configured to store the text content, URL, and title of the bookmarked web pages of the browser.

10. The device for processing bookmarked web pages of a browser according to claim 9, characterized in that the first acquiring unit comprises:

a second acquisition module, configured to obtain the address of the bookmarked web page of the browser;

an access module, configured to access the bookmarked web page according to the address of the bookmarked web page of the browser; and a crawling module, configured to crawl text content from the bookmarked web page while the bookmarked web page is being accessed, to obtain the text content of the bookmarked web page of the browser.