Summary of the invention
The main purpose of the present invention is to provide a webpage bookmark processing method and device for a browser, so as to solve the problem that finding a target webpage among the bookmarked webpages of a browser is inefficient.
To achieve the above purpose, according to one aspect of the present invention, a webpage bookmark processing method for a browser is provided.
The webpage bookmark processing method for a browser according to the present invention comprises: receiving a search keyword, wherein the search keyword is used to look up, among the bookmarked webpages of the browser, the webpage that needs to be browsed; matching the search keyword against the bookmarked webpages of the browser to obtain the address of the matching bookmarked webpage; and outputting the address of the matching bookmarked webpage.
Further, matching the search keyword against the bookmarked webpages of the browser comprises: obtaining the title and text content of a bookmarked webpage of the browser; and matching the title and text content of the bookmarked webpage against the search keyword, wherein if the title and text content of the bookmarked webpage match the search keyword, it is determined that the search keyword matches the bookmarked webpage, and if the title and text content of the bookmarked webpage do not match the search keyword, it is determined that the search keyword does not match the bookmarked webpage.
Further, before matching the title and text content of the bookmarked webpage against the search keyword, the method further comprises: obtaining the text content of the bookmarked webpage of the browser; obtaining the web address and title of the bookmarked webpage; and storing the text content, web address, and title of the bookmarked webpage.
Further, obtaining the text content of the bookmarked webpage comprises: obtaining the address of the bookmarked webpage; accessing the bookmarked webpage according to its address; and crawling text content from the bookmarked webpage while accessing it, to obtain the text content of the bookmarked webpage.
Further, crawling text content from the bookmarked webpage while accessing it, to obtain the text content of the bookmarked webpage, comprises: filtering out the Hypertext Markup Language (HTML) tags of the bookmarked webpage; and crawling text content from the bookmarked webpage whose HTML tags have been filtered out, to obtain the text content of the bookmarked webpage.
Further, after crawling text content from the bookmarked webpage while accessing it to obtain the text content of the bookmarked webpage, the method further comprises: extracting keywords from the text content of the bookmarked webpage to obtain the keywords of the bookmarked webpage; and storing the keywords, web address, and title of the bookmarked webpage, wherein matching the title and text content of the bookmarked webpage against the search keyword comprises: matching the keywords and title of the bookmarked webpage against the search keyword.
To achieve the above purpose, according to another aspect of the present invention, a webpage bookmark processing device for a browser is provided.
The webpage bookmark processing device for a browser according to the present invention comprises: a receiving unit, configured to receive a search keyword, wherein the search keyword is used to look up, among the bookmarked webpages of the browser, the webpage that needs to be browsed; a matching unit, configured to match the search keyword against the bookmarked webpages of the browser to obtain the address of the matching bookmarked webpage; and an output unit, configured to output the address of the matching bookmarked webpage.
Further, the matching unit comprises: a first acquisition module, configured to obtain the title and text content of a bookmarked webpage of the browser; and a matching module, configured to match the title and text content of the bookmarked webpage against the search keyword, wherein if the title and text content of the bookmarked webpage match the search keyword, it is determined that the search keyword matches the bookmarked webpage, and if the title and text content of the bookmarked webpage do not match the search keyword, it is determined that the search keyword does not match the bookmarked webpage.
Further, the device further comprises: a first acquisition unit, configured to obtain the text content of the bookmarked webpage of the browser; a second acquisition unit, configured to obtain the web address and title of the bookmarked webpage; and a storage unit, configured to store the text content, web address, and title of the bookmarked webpage.
Further, the first acquisition unit comprises: a second acquisition module, configured to obtain the address of the bookmarked webpage of the browser; an access module, configured to access the bookmarked webpage according to its address; and a crawling module, configured to crawl text content from the bookmarked webpage while it is being accessed, to obtain the text content of the bookmarked webpage.
With the present invention, the bookmarked webpage that needs to be accessed is found by searching among the bookmarked webpages of the browser. This solves the problem that finding a target webpage among the bookmarked webpages of a browser is inefficient, and thereby improves the efficiency of finding a target webpage among the bookmarked webpages of the browser.
Detailed description of the embodiments
It should be noted that, provided there is no conflict, the embodiments in the present application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in the present application without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first", "second", and the like in the description and claims of the present application and in the above accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
An embodiment of the present invention provides a webpage bookmark processing method for a browser. Fig. 1 is a flowchart of the webpage bookmark processing method for a browser according to the embodiment of the present invention.
As shown in Fig. 1, the method comprises the following steps S102 to S106:
Step S102: receive a search keyword, wherein the search keyword is used to look up, among the bookmarked webpages of the browser, the webpage that needs to be browsed.
The search keyword may be any keyword used to look up the webpage that needs to be browsed among the bookmarked webpages of the browser; it may be a single keyword or multiple keywords. Specifically, a search box may be provided in the bookmark area of the browser, and the search keyword entered by the user is received through this search box.
Step S104: match the search keyword against the bookmarked webpages of the browser to obtain the address of the matching bookmarked webpage.
The bookmarked webpages of a browser are usually located in the browser's bookmarks, and the bookmarks of existing browsers store the address and title of each bookmarked webpage. Matching the search keyword against the bookmarked webpages of the browser may consist of matching the search keyword against the titles of the bookmarked webpages: if the search keyword appears in the title of a bookmarked webpage, that bookmarked webpage is related to the webpage the user needs to access. The bookmarked webpages that match the search keyword are recorded as the matching bookmarked webpages. Preferably, in order to improve the accuracy of finding the bookmarked webpage to be accessed by means of the search keyword, matching the search keyword against the bookmarked webpages of the browser comprises: obtaining the title and text content of a bookmarked webpage of the browser; and matching the title and text content of the bookmarked webpage against the search keyword, wherein if the title and text content of the bookmarked webpage match the search keyword, it is determined that the search keyword matches the bookmarked webpage, and if the title and text content of the bookmarked webpage do not match the search keyword, it is determined that the search keyword does not match the bookmarked webpage.
The content of a bookmarked webpage may be obtained by accessing the bookmarked webpage. Alternatively, the text content of each bookmarked webpage of the browser may be stored in advance in a local database or another storage area and then read from that database or storage area. The text content of a bookmarked webpage may be its full text, or keywords extracted from its full text. The title of a bookmarked webpage sometimes does not represent its content, or the keywords of the content the user cares about may not appear in the title. In that case, if the search keyword is matched only against the titles of the bookmarked webpages, the bookmarked webpage to be accessed may not be retrieved, and the user may still fail to retrieve it even after searching repeatedly with different search keywords. Matching both the title and the text content of the bookmarked webpage against the search keyword avoids this problem. Specifically, the title of the bookmarked webpage may be matched against the search keyword first: if the title matches, the content of the bookmarked webpage need not be matched; if the title does not match, the content of the bookmarked webpage is then matched against the search keyword. This increases the probability that a bookmarked webpage matches the search keyword and further improves the accuracy of finding the bookmarked webpage to be accessed by means of the search keyword.
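The title-first matching described above can be sketched as follows. This is a minimal illustration only: the `Bookmark` record, its field names, and the use of case-insensitive substring matching are assumptions made for the example and are not specified by the embodiment.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Bookmark:
    # Hypothetical record for one bookmarked webpage; field names are illustrative.
    url: str
    title: str
    text: str


def matches(bookmark: Bookmark, keywords: list[str]) -> bool:
    """Match the title first; consult the text content only if the title misses."""
    needles = [k.lower() for k in keywords if k]
    title = bookmark.title.lower()
    if any(k in title for k in needles):
        return True  # title matched, so the content is not examined
    text = bookmark.text.lower()
    return any(k in text for k in needles)


def search(bookmarks: list[Bookmark], keywords: list[str]) -> list[str]:
    """Return the addresses of all bookmarked webpages that match."""
    return [b.url for b in bookmarks if matches(b, keywords)]
```

In this sketch a bookmarked webpage matches when any one of the supplied keywords is found. Whether multiple keywords should all be required is a design choice the embodiment leaves open.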
Preferably, in order to improve the efficiency of obtaining the title and text content of the bookmarked webpage as described above, before matching the title and text content of the bookmarked webpage against the search keyword, the method further comprises: obtaining the text content of the bookmarked webpage of the browser; obtaining the web address and title of the bookmarked webpage; and storing the text content, web address, and title of the bookmarked webpage.
Before the bookmarked webpages of the browser are searched, the text content, web address, and title of each bookmarked webpage are obtained in advance and stored in a local storage area, such as a local database. Specifically, when the text content, web address, and title of a bookmarked webpage are stored, they may be associated with one another, that is, a correspondence is established among the text content, web address, and title that belong to the same bookmarked webpage. In this way, when the user searches the bookmarked webpages of the browser, the text content and title of each bookmarked webpage can be obtained quickly and matched against the search keyword, and if a bookmarked webpage matches the search keyword, its address can be obtained quickly, which improves retrieval efficiency.
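One possible way to keep the text content, web address, and title of each bookmarked webpage associated in a local database is sketched below using Python's standard `sqlite3` module. The table name, column names, and helper functions are hypothetical and chosen only for illustration; the embodiment does not prescribe a particular database or schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local store; one row ties url, title and text together."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookmarks ("
        " url TEXT PRIMARY KEY, title TEXT NOT NULL, text TEXT NOT NULL)"
    )
    return conn


def save_bookmark(conn: sqlite3.Connection, url: str, title: str, text: str) -> None:
    """Store or refresh the record of one bookmarked webpage."""
    conn.execute(
        "INSERT OR REPLACE INTO bookmarks (url, title, text) VALUES (?, ?, ?)",
        (url, title, text),
    )
    conn.commit()


def find_urls(conn: sqlite3.Connection, keyword: str) -> list:
    """Return addresses whose title or text contains the keyword."""
    # Note: '%' and '_' inside the keyword act as LIKE wildcards in this sketch.
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT url FROM bookmarks WHERE title LIKE ? OR text LIKE ? ORDER BY url",
        (pattern, pattern),
    )
    return [row[0] for row in rows]
```

Because all three fields sit in the same row, a match on the title or text immediately yields the address of the corresponding bookmarked webpage without accessing the webpage again.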
Optionally, obtaining the text content of the bookmarked webpage of the browser comprises: obtaining the address of the bookmarked webpage; accessing the bookmarked webpage according to its address; and crawling text content from the bookmarked webpage while accessing it, to obtain the text content of the bookmarked webpage.
The web address and title of each bookmarked webpage are already stored in the browser's bookmarks. Specifically, the address of a bookmarked webpage, that is, its Uniform Resource Locator (URL), may be obtained by calling the application programming interface (API) that the browser provides for obtaining the addresses of bookmarked webpages. The bookmarked webpage can be accessed through its address, and text content is crawled from the bookmarked webpage while it is being accessed, to obtain the text content of the bookmarked webpage. Specifically, the text content may be crawled from the bookmarked webpage by a web crawler. A web crawler is a program or script that automatically crawls information on the network according to set rules. For example, a web crawler may be configured to crawl only the text content of a webpage, or only the pictures on a webpage, and so on. In the embodiment of the present invention, the web crawler crawls only the text content of the bookmarked webpage. Preferably, in order to improve the efficiency of crawling the text content of the bookmarked webpage, crawling text content from the bookmarked webpage while accessing it, to obtain the text content of the bookmarked webpage, comprises: filtering out the Hypertext Markup Language (HTML) tags of the bookmarked webpage; and crawling text content from the bookmarked webpage whose HTML tags have been filtered out, to obtain the text content of the bookmarked webpage.
A Hypertext Markup Language (HTML) tag is the smallest unit in HTML. The display format of a webpage can be set through HTML tags; for example, the title and keywords of the webpage and the display position of the webpage content can be set through HTML tags. Specifically, after the webpage is requested from the server through the address of the bookmarked webpage, the content returned by the server may be matched against a preset regular expression to filter out the HTML tags of the bookmarked webpage. A regular expression uses a single string to describe and match a set of strings that conform to a certain syntactic rule. For example, a regular expression for matching Chinese postal codes is "[1-9]\d{5}(?!\d)". If the string to be matched is "Chinabeijing100081haidian", this regular expression quickly matches the characters "100081", which represent the postal code, and the other characters are filtered out.
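The regular-expression filtering described above can be illustrated with Python's standard `re` module. The postal-code pattern is the one given in the text; the tag pattern `<[^>]+>` and the helper name `strip_tags` are assumptions made for this sketch, since the embodiment does not disclose its preset expression.

```python
import re

# Assumed tag pattern: anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")

# Postal-code pattern from the text: six digits, the first not 0,
# not followed by a further digit.
POSTCODE_RE = re.compile(r"[1-9]\d{5}(?!\d)")


def strip_tags(html: str) -> str:
    """Replace every tag with a space, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", html)
    return " ".join(text.split())


page = "<html><body><p>Hello <b>world</b></p></body></html>"
print(strip_tags(page))  # Hello world
print(POSTCODE_RE.search("Chinabeijing100081haidian").group())  # 100081
```

A simple pattern like this does not remove the bodies of script or style elements and can be confused by a ">" inside an attribute value or comment, so a production crawler would likely need a real HTML parser.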
Preferably, after crawling text content from the bookmarked webpage while accessing it to obtain the text content of the bookmarked webpage, the method further comprises: extracting keywords from the text content of the bookmarked webpage to obtain the keywords of the bookmarked webpage; and storing the keywords, web address, and title of the bookmarked webpage, wherein matching the title and text content of the bookmarked webpage against the search keyword comprises: matching the keywords and title of the bookmarked webpage against the search keyword.
The keywords of a bookmarked webpage may be words that occur frequently in its text content, or text that appears near the beginning of its text content, such as a summary of the text content. Specifically, the embodiment of the present invention is described taking the words that occur frequently in the text content of a bookmarked webpage as the keywords of that webpage. After the text content of the bookmarked webpage is obtained, word segmentation may be performed on it, that is, the text content is divided into individual words. Stop words, which are words without substantive meaning such as modal particles and conjunctions, may first be filtered out. The words remaining after filtering form a word set, and the words that repeat in this set are counted together with their numbers of occurrences. If the number of occurrences of a repeated word is greater than a preset threshold, that word is taken as a keyword of the bookmarked webpage. After the keywords of the bookmarked webpage are obtained, a correspondence among the keywords, web address, and title of the bookmarked webpage may likewise be established when they are stored. The text content of a bookmarked webpage may be long, so matching the search keyword against it is relatively time-consuming. In addition, too many false matches may occur, that is, a bookmarked webpage that matches the search keyword may not be the one the user needs to access. Matching the search keyword against keywords extracted from the text content of the bookmarked webpage improves both the efficiency of matching and the accuracy of the matching result.
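The frequency-based keyword extraction described above can be sketched as follows. The stop-word list, the default threshold, and the regular-expression tokenizer are all assumptions made for this example: the tokenizer only handles space-delimited text such as English, whereas Chinese text would require a dedicated word segmenter, which the embodiment does not name.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}


def extract_keywords(text: str, threshold: int = 1) -> list:
    """Return words whose occurrence count is greater than the threshold."""
    # Segment the text into lowercase words (assumed tokenizer).
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Filter out stop words, then count the remaining words.
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return sorted(w for w, n in counts.items() if n > threshold)


sample = "The browser stores bookmarks. Bookmarks in the browser are searched by the browser."
print(extract_keywords(sample))               # ['bookmarks', 'browser']
print(extract_keywords(sample, threshold=2))  # ['browser']
```

Raising the threshold shrinks the keyword set, which speeds up matching but increases the chance that a relevant bookmarked webpage is missed.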
Step S106: output the address of the matching bookmarked webpage.
Through the above steps, the address of the bookmarked webpage that matches the search keyword can be obtained from among the bookmarked webpages of the browser, and the address of the matching bookmarked webpage is output for the user to view.
As can be seen from the above description, the present invention achieves the following technical effects:
In the embodiment of the present invention, a search keyword is received, the search keyword is matched against the bookmarked webpages of the browser to obtain the address of the matching bookmarked webpage, and the address of the matching bookmarked webpage is output, so that the bookmarked webpage to be accessed is found among the bookmarked webpages of the browser by searching. Compared with the prior art, in which the user opens the bookmarked webpages of the browser one by one to find the target, this improves the efficiency of finding a target webpage among the bookmarked webpages of the browser and solves the problem in the related art that finding a target webpage among the bookmarked webpages of a browser is inefficient.
It should be noted that the steps shown in the flowchart of the accompanying drawing may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.
According to another aspect of the embodiments of the present invention, a webpage bookmark processing device for a browser is provided. The device may be used to execute the webpage bookmark processing method for a browser of the embodiment of the present invention, and the method of the embodiment of the present invention may likewise be executed by the webpage bookmark processing device for a browser of the embodiment of the present invention.
Fig. 2 is a schematic diagram of the webpage bookmark processing device for a browser according to the embodiment of the present invention. As shown in Fig. 2, the webpage bookmark processing device for a browser comprises: a receiving unit 10, a matching unit 20, and an output unit 30.
The receiving unit 10 is configured to receive a search keyword, wherein the search keyword is used to look up, among the bookmarked webpages of the browser, the webpage that needs to be browsed.
The search keyword may be any keyword used to look up the webpage that needs to be browsed among the bookmarked webpages of the browser; it may be a single keyword or multiple keywords. Specifically, a search box may be provided in the bookmark area of the browser, and the search keyword entered by the user is received through this search box.
The matching unit 20 is configured to match the search keyword against the bookmarked webpages of the browser to obtain the address of the matching bookmarked webpage.
The bookmarked webpages of a browser are usually located in the browser's bookmarks, and the bookmarks of existing browsers store the address and title of each bookmarked webpage. Matching the search keyword against the bookmarked webpages of the browser may consist of matching the search keyword against the titles of the bookmarked webpages: if the search keyword appears in the title of a bookmarked webpage, that bookmarked webpage is related to the webpage the user needs to access.
The output unit 30 is configured to output the address of the matching bookmarked webpage.
After the address of the bookmarked webpage that matches the search keyword is obtained from among the bookmarked webpages of the browser, the address of the matching bookmarked webpage is output for the user to view.
In the embodiment of the present invention, the receiving unit 10 receives the search keyword, the matching unit 20 matches the search keyword against the bookmarked webpages of the browser to obtain the address of the matching bookmarked webpage, and the output unit 30 outputs the address of the matching bookmarked webpage. The embodiment of the present invention finds the bookmarked webpage to be accessed among the bookmarked webpages of the browser by searching. Compared with the prior art, in which the user opens the bookmarked webpages of the browser one by one to find the target, this improves the efficiency of finding a target webpage among the bookmarked webpages of the browser and solves the problem in the related art that finding a target webpage among the bookmarked webpages of a browser is inefficient.
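The cooperation of the three units can be sketched in software as follows. The class names mirror the receiving unit 10, matching unit 20, and output unit 30, but the method names, the whitespace-based keyword splitting, the title-only substring matching, and the newline-joined output format are all assumptions made for illustration; the device may equally be realized in hardware.

```python
from typing import Dict, List


class ReceivingUnit:
    """Counterpart of receiving unit 10: turns raw input into keywords."""

    def receive(self, raw_query: str) -> List[str]:
        return raw_query.split()  # one or more whitespace-separated keywords


class MatchingUnit:
    """Counterpart of matching unit 20: matches keywords against titles."""

    def __init__(self, titles_by_url: Dict[str, str]) -> None:
        self.titles_by_url = titles_by_url  # address -> title, as kept in the bookmarks

    def match(self, keywords: List[str]) -> List[str]:
        needles = [k.lower() for k in keywords]
        return [
            url
            for url, title in self.titles_by_url.items()
            if any(k in title.lower() for k in needles)
        ]


class OutputUnit:
    """Counterpart of output unit 30: formats the matching addresses."""

    def output(self, urls: List[str]) -> str:
        return "\n".join(urls)


class BookmarkSearchDevice:
    """Wires the three units together in the order receive, match, output."""

    def __init__(self, titles_by_url: Dict[str, str]) -> None:
        self.receiving_unit = ReceivingUnit()
        self.matching_unit = MatchingUnit(titles_by_url)
        self.output_unit = OutputUnit()

    def search(self, raw_query: str) -> str:
        keywords = self.receiving_unit.receive(raw_query)
        return self.output_unit.output(self.matching_unit.match(keywords))
```

Keeping the units separate means the matching unit can later be extended with text-content or keyword matching without changing how input is received or results are output.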
Preferably, the matching unit 20 comprises: a first acquisition module, configured to obtain the title and text content of a bookmarked webpage of the browser; and a matching module, configured to match the title and text content of the bookmarked webpage against the search keyword, wherein if the title and text content of the bookmarked webpage match the search keyword, it is determined that the search keyword matches the bookmarked webpage, and if the title and text content of the bookmarked webpage do not match the search keyword, it is determined that the search keyword does not match the bookmarked webpage.
Preferably, the device further comprises: a first acquisition unit, configured to obtain the text content of the bookmarked webpage of the browser; a second acquisition unit, configured to obtain the web address and title of the bookmarked webpage; and a storage unit, configured to store the text content, web address, and title of the bookmarked webpage.
Preferably, the first acquisition unit comprises: a second acquisition module, configured to obtain the address of the bookmarked webpage of the browser; an access module, configured to access the bookmarked webpage according to its address; and a crawling module, configured to crawl text content from the bookmarked webpage while it is being accessed, to obtain the text content of the bookmarked webpage.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented with a general-purpose computing device. They may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.