A method for efficient information collection
Technical field
The present invention relates to information collection technology, and in particular to a method for efficient information collection.
Background art
With the development and spread of the Internet, the volume of information has expanded rapidly and its forms have diversified accordingly; news information is one such form.
Internet search engine technology gives users a convenient, fast channel that helps them obtain information more quickly, more accurately and more comprehensively. Information search emerged on the basis of search engine technology: when a user searches for a keyword, all relevant information is retrieved, and the user can pick out what he or she wants to read according to factors such as trust in and preference for particular websites. In information search, the collection of information is the key step, and the accuracy and timeliness of collection directly affect the quality of the search and the user experience.
The collection of information has always been the key to information integration. Traditionally, information collection was the main task of website editors and was mostly done by manual searching, which is not only repetitive work but also a waste of labor, with very low efficiency. With the emergence of systematic collection frameworks, the collection and organization of information are carried out by the system, which greatly improves efficiency and saves labor.
The shortcomings of current information collection are that sources are numerous and mixed, noise data is abundant, important data is not highlighted, and collection is not timely.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a method for efficient information collection. The method obtains the data relating to data-source entries from a database, preferentially schedules entries whose information data is of high quality and is updated quickly, obtains the detail-page information data of each entry, downloads, transcodes and extracts it, and sends out the complete information data.
To achieve the above object, the invention adopts the following technical solution:
A method for efficient information collection, comprising two parts: entry scanning and news download.
In a preferred technical solution provided by the invention, the entry scanning comprises the following steps:
A. Schedule the entry information according to a priority policy, and put the entries scheduled for scanning into the entry downloader;
B. Obtain the download results from the entry downloader, and push the successfully downloaded entries together with their associated template information to the extractor;
C. Obtain the extraction results from the extractor;
C-1. For entries that were extracted successfully, perform re-scheduling and analysis on the results to obtain the links to the corresponding information detail pages and other related information, and put them into the to-be-downloaded task queue to await subsequent processing by news download;
C-2. Entries whose extraction failed wait, according to the scheduling policy, for the scheduled scan of the next cycle.
In a second preferred technical solution provided by the invention, the priority policy in step A comprises the update volume of the entry and the weight of the website to which the entry belongs.
In a third preferred technical solution provided by the invention, in step C-2 the scheduling policy extends the scheduled scan cycle to a configured extended time.
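The priority policy of step A and the extended cycle of step C-2 might be sketched as follows. This is only an illustration under stated assumptions: the names `ScheduledEntry`, `priority_score`, `build_schedule` and `extend_cycle` are hypothetical, and multiplying update volume by site weight is an assumed way of combining the two factors, since the text says only that both are part of the policy.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledEntry:
    # heapq pops the smallest item first, so the score is stored negated.
    sort_key: float
    url: str = field(compare=False)
    scan_interval: float = field(compare=False)  # seconds between scans


def priority_score(update_volume: int, site_weight: float) -> float:
    """Assumed scoring rule: more updates on a heavier site rank higher."""
    return update_volume * site_weight


def build_schedule(entries):
    """entries: iterable of (url, update_volume, site_weight, scan_interval)."""
    heap = []
    for url, update_volume, site_weight, interval in entries:
        score = priority_score(update_volume, site_weight)
        heapq.heappush(heap, ScheduledEntry(-score, url, interval))
    return heap


def extend_cycle(entry: ScheduledEntry, extended_interval: float) -> ScheduledEntry:
    """Step C-2: an entry whose extraction failed waits for the configured
    extended time (never shorter than its current interval, by assumption)."""
    new_interval = max(entry.scan_interval, extended_interval)
    return ScheduledEntry(entry.sort_key, entry.url, new_interval)
```

Popping from the heap returned by `build_schedule` yields the entries in descending order of score, which is the order in which they would be handed to the entry downloader.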
In a fourth preferred technical solution provided by the invention, the news download comprises the following steps:
A. Push the data in the to-be-downloaded task queue to the news downloader;
B. Obtain the download results from the news downloader, and push the successfully downloaded detail pages and related information to the extractor for data extraction;
C. Obtain the extraction results of the detail pages from the extractor; the extraction results of paginated pages need to be merged;
D. Analyze the extraction results;
D-1. If the extraction result is complete information data, send it to the database through the sender;
D-2. If the extraction result contains a pagination link, put the pagination link into the to-be-downloaded task queue and return to step A; this pagination queue is processed with priority;
D-3. If the extraction result contains an image link, perform the picture processing operation and, when it is finished, check the information data: if the data is complete, send it to the database through the sender; if there is a pagination link, return to step A. Picture processing operations take priority over pagination queue tasks.
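The branching in steps D-1 to D-3 can be illustrated with a small sketch. It assumes an extraction result is a dictionary with optional `image_links` and `next_page` keys, and that the to-be-downloaded task queue is a priority heap; the names `push`, `handle_result`, `PAGINATION` and `NORMAL` are hypothetical and not taken from the text.

```python
import heapq
import itertools

# Lower numbers are served first: pagination tasks jump ahead of ordinary
# detail-page tasks in the to-be-downloaded queue (step D-2).
PAGINATION, NORMAL = 0, 1
_sequence = itertools.count()  # tie-breaker that keeps insertion order


def push(queue, priority, payload):
    """Add a task to the heap-based queue."""
    heapq.heappush(queue, (priority, next(_sequence), payload))


def handle_result(result, queue, send, process_picture):
    """Dispatch one extraction result.

    send and process_picture are caller-supplied callables standing in for
    the sender and the picture processing operation.
    """
    # D-3: picture processing runs before any pagination task is queued.
    for link in result.get("image_links", []):
        process_picture(link)
    # D-2: a pagination link goes back to step A through the queue.
    next_page = result.get("next_page")
    if next_page:
        push(queue, PAGINATION, next_page)
        return "paginate"
    # D-1: nothing left to fetch, so the data is complete and is sent.
    send(result)
    return "sent"
```

The counter in each heap item means two tasks with the same priority are never compared by payload, so the payload can be any object.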
In a fifth preferred technical solution provided by the invention, the picture processing comprises the following steps:
(1) Push the image link to the picture downloader;
(2) After the picture downloader has downloaded the picture, analyze the picture, then compress and upload it.
In a sixth preferred technical solution provided by the invention, the manually specified entry information comprises the following:
(1) an entry library, comprising the entry link, the template used for entry extraction, the website to which the entry belongs, and the corresponding labels;
(2) a website library, comprising the website link, the website PR rank, and the corresponding labels;
(3) template data.
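One possible in-memory shape for these three kinds of data is sketched below. The class and field names (`Site`, `Template`, `Entry`, `rules`, `template_for`) are hypothetical, and representing template data as a mapping from field name to extraction pattern is an assumption; the text does not describe the template format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Site:
    """One row of the website library."""
    url: str
    pr_rank: int
    labels: List[str] = field(default_factory=list)


@dataclass
class Template:
    """Template data; the rule format here is purely illustrative."""
    template_id: str
    rules: Dict[str, str]  # assumed: field name -> extraction pattern


@dataclass
class Entry:
    """One row of the entry library."""
    url: str
    template_id: str  # template used to extract this entry
    site_url: str     # website the entry belongs to
    labels: List[str] = field(default_factory=list)


def template_for(entry: Entry, templates: Dict[str, Template]) -> Template:
    """Look up the extraction template associated with an entry."""
    return templates[entry.template_id]
```

Keeping the template as a separate record referenced by identifier lets several entries of the same website share one template.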
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides an efficient information collection framework and method. The collected results are accurate, a large amount of noise data in the collected information is eliminated, the framework is simple to build, and important data is collected in a timely manner.
Brief description of the drawings
Fig. 1 is a flow chart of an information collection method.
Fig. 2 is a flow chart of entry scanning.
Fig. 3 is a flow chart of news download.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the efficient information collection method comprises two parts: entry scanning and news download.
As shown in Fig. 2, the specific steps of entry scanning are as follows:
A. Put the entries scheduled for scanning into the entry downloader;
B. Obtain the download results from the entry downloader, and push the successfully downloaded entries together with their associated template information to the extractor;
C. The extractor obtains the extraction results, and the results are analyzed;
C-1. For entries that were extracted successfully, perform re-scheduling on the results to obtain the links to the corresponding information detail pages and other related information, and put them into the to-be-downloaded task queue to await subsequent processing by news download;
C-2. Entries whose extraction failed wait, according to the scheduling policy, for the scheduled scan of the next cycle.
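A single pass over steps A to C-2 could look roughly like the sketch below. The downloader and extractor are passed in as plain callables, which is an assumption made to keep the example self-contained; the name `scan_entries` and the dictionary keys are hypothetical. Two further assumptions are labeled in the comments: an entry whose download fails is retried like one whose extraction fails, and a link that is already queued is skipped.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Set


def scan_entries(
    entries: List[Dict[str, str]],
    download: Callable[[str], Optional[str]],
    extract: Callable[[str, str], Optional[List[str]]],
    task_queue: Deque[str],
    seen: Set[str],
) -> List[Dict[str, str]]:
    """Run one scan pass and return the entries to retry next cycle."""
    failed = []
    for entry in entries:
        # Steps A and B: download the entry page; None means failure.
        page = download(entry["url"])
        if page is None:
            failed.append(entry)  # assumption: retried like a failed extraction
            continue
        # Step C: extract detail-page links using the entry's template.
        links = extract(page, entry["template"])
        if links is None:
            failed.append(entry)  # step C-2: wait for the next cycle
            continue
        # Step C-1: queue the detail-page links for news download.
        for link in links:
            if link not in seen:  # assumption: already queued links are skipped
                seen.add(link)
                task_queue.append(link)
    return failed
```

The returned list is what a scheduler would hold back until the next scan cycle, while `task_queue` feeds the news download part.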
As shown in Fig. 3, the specific steps of news download are as follows:
A. Obtain the queue to be downloaded from the news downloader;
B. Obtain the download results from the news downloader, and push the successfully downloaded detail pages and related information to the extractor for data extraction;
C. Obtain the extraction results of the detail pages from the extractor; the extraction results of paginated pages need to be merged;
D. Transcode and analyze the extraction results;
D-1. If the extraction result is complete information data, send it to the database through the sender;
D-2. If the extraction result contains a pagination link, put the pagination link into the to-be-downloaded task queue 2 and return to step A; this pagination queue is processed with priority;
D-3. If the extraction result contains an image link, push the image link to the picture downloader; after the picture downloader has downloaded the picture, analyze the picture, then compress and upload it. When processing is finished, check the information data: if the data is complete, send it to the database through the sender; if there is a pagination link, return to step A. Picture processing operations take priority over pagination queue tasks.
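The merging in step C and the transcoding in step D might be sketched as follows. The text does not define transcoding, so this example assumes it means converting downloaded bytes from the page's character encoding into Unicode text; the function names `transcode` and `merge_pages`, the record keys, and the fallback encodings are all hypothetical.

```python
from typing import Dict, List, Sequence


def transcode(raw: bytes, declared: str = "",
              fallbacks: Sequence[str] = ("utf-8", "gbk")) -> str:
    """Decode page bytes, trying the declared charset before the fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # Last resort: keep what can be read and mark the rest as replaced.
    return raw.decode("utf-8", errors="replace")


def merge_pages(pages: List[Dict]) -> Dict[str, str]:
    """Merge the extraction results of one paginated article.

    Each page is a dict with 'page_no', 'title' and 'body' keys.
    """
    if not pages:
        raise ValueError("no pages to merge")
    ordered = sorted(pages, key=lambda p: p["page_no"])
    return {
        "title": ordered[0]["title"],
        "body": "\n".join(p["body"] for p in ordered),
    }
```

Sorting by page number before joining means the pages may arrive from the downloader in any order, which matters when pagination tasks are interleaved with other queue tasks.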
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall be covered by the claims of the invention.