CN109033195A - Method, device, and computer-readable medium for acquiring webpage information - Google Patents

Info

Publication number
CN109033195A
Authority
CN
China
Prior art keywords
url
crawled
queue
webpage
website
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810688855.6A
Other languages
Chinese (zh)
Inventor
孟祥祥
陈冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sheng Electronic Payment Services Ltd
Original Assignee
Shanghai Sheng Electronic Payment Services Ltd
Application filed by Shanghai Sheng Electronic Payment Services Ltd
Priority to CN201810688855.6A
Publication of CN109033195A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The application provides a method, a computer-readable medium, and a device for acquiring webpage information. Before webpages are collected, a web crawler queue containing the uniform resource locators (URLs) to be crawled is placed in an in-memory database. This avoids the problem that URLs held in ordinary memory are lost when the web crawler system has to restart, and it ensures that after a restart the URLs to be crawled can be read quickly from the web crawler queue in the in-memory database, so that the web crawler system keeps running normally. A content parsing tool extracts webpage content information from each acquired webpage, which cleans the webpage content, and the extracted webpage content information is then stored. Together these steps improve the efficiency and reliability of acquiring webpage content information.

Description

Method, device, and computer-readable medium for acquiring webpage information
Technical field
The application relates to the computer field, and in particular to a method, a device, and a computer-readable medium for acquiring webpage information.
Background art
At present, when a web crawler system crawls webpage information, it usually stores the uniform resource locators (URLs) to be crawled in memory. When the web crawler system has to restart, the URLs to be crawled that were stored in memory are lost. To continue crawling webpage information after the restart, the system must collect the URLs to be crawled again and reload them into memory, which lowers the efficiency of acquiring webpage information.
Summary of the invention
The purpose of the application is to provide a method, a device, and a computer-readable medium for acquiring webpage information.
According to one aspect of the application, a method for acquiring webpage information is provided. The method comprises: placing a web crawler queue containing URLs to be crawled into an in-memory database; taking a URL to be crawled out of the web crawler queue in the in-memory database; sending an acquisition request to the website corresponding to the URL, the acquisition request requesting the webpage corresponding to the URL to be crawled; if the webpage is obtained from the website, extracting webpage content information from the webpage using a content parsing tool; and storing the webpage content information.
Further, in the above method, before the web crawler queue containing the URLs to be crawled is placed into the in-memory database, the method further comprises: sorting the URLs to be crawled according to a preset priority rule; and placing the sorted URLs to be crawled into the web crawler queue.
Further, in the above method, after the acquisition request is sent to the website corresponding to the URL, the method further comprises: if the webpage is not obtained from the website, putting the URL to be crawled back into the web crawler queue in the in-memory database.
Further, in the above method, putting the URL to be crawled back into the web crawler queue in the in-memory database comprises: if the priority of the URL to be crawled is greater than or equal to a preset threshold, putting the URL to be crawled back at the head of the web crawler queue; or, if the priority of the URL to be crawled is less than the preset threshold, putting the URL to be crawled back at the tail of the web crawler queue.
Further, in the above method, after the URL to be crawled is taken out of the web crawler queue in the in-memory database, the method further comprises: starting a thread pool and placing the URL to be crawled into the thread pool. Sending the acquisition request to the website corresponding to the URL then comprises: sending the acquisition request to the website through the thread pool.
Further, in the above method, sending the acquisition request to the website corresponding to the URL comprises: extracting an IP address from a preset proxy Internet Protocol (IP) queue; and sending the acquisition request to the website through the extracted IP address.
Further, in the above method, before or after the acquisition request is sent to the website corresponding to the URL, the method further comprises: obtaining a verification code image from the website; recognizing the verification code in the verification code image by text recognition; and sending the verification code to the website.
Further, in the above method, the acquisition request includes the cookie from logging in to the website, and before the acquisition request is sent to the website corresponding to the URL, the method further comprises: obtaining the cookie from the browser used to log in to the website.
Further, in the above method, storing the webpage content information comprises: packaging the webpage content information into JSON format and then storing it.
According to another aspect of the application, a device for acquiring webpage information is also provided. The device comprises a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform any of the methods described above.
According to yet another aspect of the application, a computer-readable medium is also provided, on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement any of the methods described above.
Compared with the prior art, the application places the web crawler queue containing the uniform resource locators (URLs) to be crawled into an in-memory database before webpages are collected. This avoids the problem that URLs held in ordinary memory are lost when the web crawler system has to restart, and it ensures that after a restart the URLs to be crawled can be read quickly from the web crawler queue in the in-memory database, so that the web crawler system keeps running normally. A content parsing tool extracts webpage content information from each acquired webpage, which cleans the webpage content, and the extracted webpage content information is then stored. Together these steps improve the efficiency and reliability of acquiring webpage content information.
Brief description of the drawings
Other features, objects, and advantages of the application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 shows a flow chart of a method for acquiring webpage information according to one embodiment of the application;
Fig. 2 shows a flow chart of acquiring a webpage through a proxy IP address according to one embodiment of the application;
Fig. 3 shows a flow chart of acquiring a webpage using a cookie according to one embodiment of the application;
Fig. 4 shows a flow chart of the method for acquiring webpage information according to one specific embodiment of the application;
Fig. 5 shows a structural schematic diagram of a device for acquiring webpage information according to one embodiment of the application.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description of the embodiments
The application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of the application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), an input/output interface, a network interface, and memory.
Memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As shown in Fig. 1, the application provides a method for acquiring webpage information. The method can be applied on the network device side; for example, it can be executed by a web crawler running on a network device. A web crawler is also known as a web spider or web robot, and in the FOAF community (Friend-of-a-Friend, an XML/RDF vocabulary) it is more often called a web chaser. A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. It fetches targets according to the task URLs in the web crawler queue, visits the corresponding webpages and related links, and obtains the required information. Each step of this embodiment can be implemented by the web crawler. As shown in Fig. 1, the method comprises:
Step S101: place the web crawler queue containing the URLs to be crawled into an in-memory database.
Here, a URL is a concise representation of the location of a resource available on the internet and of the method for accessing it; it is the address of a standard resource on the internet. Every file on the internet has a unique URL, which indicates where the file is located and how a browser should handle it. In general, a basic URL consists of a scheme (or protocol), a server name or Internet Protocol (IP) address (the corresponding website), a path, and a file name.
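The URL components listed above can be illustrated with a minimal sketch. It uses Python's standard library; the helper name describe_url and the example address are assumptions made only for illustration and do not come from the application.

```python
from posixpath import basename
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the parts named in the text."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,            # scheme (protocol)
        "host": parts.hostname,            # server name or IP address
        "path": parts.path,                # path on the server
        "filename": basename(parts.path),  # last path segment, may be empty
    }


# Hypothetical address used only as sample input.
info = describe_url("https://example.com/docs/page.html")
```

The host part is what identifies the corresponding website, while the path and file name identify the webpage within it.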
The web crawler queue may also be called the web crawler task queue.
The in-memory database is a database that keeps its data in memory and operates on it there directly. Compared with disk, memory reads and writes data several orders of magnitude faster, so keeping data in memory rather than accessing it from disk can greatly improve application performance.
Step S102: take the URL to be crawled out of the web crawler queue in the in-memory database.
Here, depending on the actual data processing capacity of the device, one or more URLs to be crawled can be taken out of the web crawler queue in the in-memory database at a time.
Step S103: send an acquisition request to the website corresponding to the URL, the acquisition request requesting the webpage corresponding to the URL to be crawled.
Here, a URL contains the address information of the corresponding website and webpage. Based on the address information in the one or more URLs taken out, the web crawler sends to the corresponding website a request to obtain the corresponding webpage on that website.
Step S104: if the webpage is obtained from the website, extract webpage content information from the acquired webpage using a content parsing tool.
Here, once the webpage requested by the acquisition request has been obtained successfully, the content parsing tool can extract all or part of the webpage content information from the acquired webpage according to preset rules.
Step S105: store the webpage content information.
Here, after all or part of the webpage content information has been extracted from the acquired webpage, it can be stored so that the stored webpage content information can later be counted or analyzed.
In this embodiment, the web crawler queue containing the uniform resource locators (URLs) to be crawled is placed into an in-memory database before webpages are collected. This avoids the problem that URLs held in ordinary memory are lost when the web crawler system has to restart, and it ensures that after a restart the URLs to be crawled can be read quickly from the web crawler queue in the in-memory database, so that the web crawler system keeps running normally. A content parsing tool extracts webpage content information from each acquired webpage, which cleans the webpage content, and the extracted webpage content information is then stored. Together these steps improve the efficiency and reliability of acquiring webpage content information.
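Steps S101 to S105 can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the applicant's implementation: a Python deque stands in for the queue held in the in-memory database, the fetch and parse functions are passed in by the caller, and the names crawl, fake_site, and results are hypothetical.

```python
from collections import deque
from typing import Callable, Optional


def crawl(queue: deque, fetch: Callable[[str], Optional[str]],
          parse: Callable[[str], dict], store: list) -> None:
    """Run steps S102 to S105 until the queue is empty."""
    while queue:
        url = queue.popleft()                  # S102: take a URL to be crawled
        page = fetch(url)                      # S103: request the webpage
        if page is None:                       # webpage not obtained, skip parsing
            continue
        content = parse(page)                  # S104: extract content information
        store.append({"url": url, **content})  # S105: store the information


# S101: the queue would live in an in-memory database; a deque stands in here.
pending = deque(["http://a.example/1", "http://a.example/2"])
fake_site = {"http://a.example/1": "<p>hello</p>"}  # the second URL "fails"
results: list = []
crawl(pending, fake_site.get, lambda html: {"length": len(html)}, results)
```

Because the deque lives in the process's own memory, this sketch does not survive a restart; that gap is exactly what the in-memory database in step S101 is meant to close.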
The application can be implemented on the webmagic framework or on other frameworks.
In one embodiment of the method for acquiring webpage information of the application, before step S101, in which the web crawler queue containing the URLs to be crawled is placed into the in-memory database, the method further comprises: sorting the URLs to be crawled according to a preset priority rule; and placing the sorted URLs to be crawled into the web crawler queue.
Here, the preset priority rule may be based on the website category of a URL or on the importance of the URL. The URLs to be crawled are ranked by priority according to the rule; for example, URLs with higher priority are placed toward the front and URLs with lower priority toward the back, and the sorted URLs are then placed into the web crawler queue. URLs are subsequently taken out of the web crawler queue in first-in, first-out order and the corresponding webpages are crawled, so URLs with higher priority, being nearer the front, are taken out and crawled first.
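The sort-then-enqueue step can be sketched as follows. The numeric priority values and the helper name build_queue are assumptions for illustration; the application does not specify how priorities are represented.

```python
from collections import deque


def build_queue(urls_with_priority: list) -> deque:
    """Sort (url, priority) pairs so higher priority comes first, then enqueue."""
    ordered = sorted(urls_with_priority, key=lambda item: item[1], reverse=True)
    return deque(url for url, _ in ordered)


# Hypothetical URLs with hypothetical priority scores.
queue = build_queue([("http://x.example/low", 1),
                     ("http://x.example/high", 9),
                     ("http://x.example/mid", 5)])
first = queue.popleft()  # first in, first out: the highest priority leaves first
```

Python's sort is stable, so URLs with equal priority keep the order in which they were supplied.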
In one embodiment of the method for acquiring webpage information of the application, after step S103, in which the acquisition request is sent to the website corresponding to the URL, the method further comprises: if the webpage requested by the acquisition request is not obtained from the corresponding website, putting the URL to be crawled back into the web crawler queue in the in-memory database.
Here, because the web crawler system can be unstable, some URLs fail to be crawled. If the webpage requested by the acquisition request is not obtained from the corresponding website, the URL that failed can be stored back into the web crawler queue in the in-memory database. Once the web crawler system is stable again, the URL is taken out of the web crawler queue and the webpage is crawled anew according to it.
In one embodiment of the method for acquiring webpage information of the application, putting the URL to be crawled back into the web crawler queue in the in-memory database comprises: if the priority of the URL to be put back is greater than or equal to a preset threshold, putting the URL back at the head of the web crawler queue; or, if the priority of the URL to be put back is less than the preset threshold, putting the URL back at the tail of the web crawler queue.
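The head-or-tail rule for a failed URL can be sketched in a few lines. A deque again stands in for the queue in the in-memory database, and the function name put_back, the priority values, and the threshold are hypothetical.

```python
from collections import deque


def put_back(queue: deque, url: str, priority: int, threshold: int) -> None:
    """Return a failed URL to the head or the tail of the queue."""
    if priority >= threshold:
        queue.appendleft(url)   # head: retried on the very next take
    else:
        queue.append(url)       # tail: retried after everything already queued


queue = deque(["http://q.example/b", "http://q.example/c"])
put_back(queue, "http://q.example/urgent", priority=8, threshold=5)
put_back(queue, "http://q.example/minor", priority=2, threshold=5)
```

After both calls the urgent URL sits at the head and the minor URL at the tail, so a take from the head returns the urgent URL first.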
Here, continuing the previous embodiment, when the webpage requested by the acquisition request is not obtained from the corresponding website and the URL to be crawled has to be put back into the web crawler queue in the in-memory database, it is first determined whether the priority of the URL to be put back is greater than or equal to the preset threshold. If so, the URL is put back at the head of the web crawler queue. The next URL taken from the head of the queue is then still this URL, so a higher-priority URL is crawled again promptly after a failed attempt. Otherwise, if the priority of the URL to be put back is less than the preset threshold, the URL is put back at the tail of the web crawler queue. Because URLs are always taken from the head of the queue, a URL put back at the tail is taken out again later. A lower-priority URL is therefore crawled later than the higher-priority URLs, which improves the crawling efficiency of those higher-priority URLs.
In one embodiment of the method for acquiring webpage information of the application, after step S102, in which the URL to be crawled is taken out of the web crawler queue in the in-memory database, the method further comprises: starting a thread pool and placing the URL to be crawled into the thread pool. Step S103, sending the acquisition request to the website corresponding to the URL, then comprises: sending the acquisition request to the website through the thread pool.
Here, a thread pool is a form of multithreaded processing in which tasks are added to a queue and then started automatically once threads have been created. In this embodiment, the web crawler sends the acquisition requests to the corresponding websites through the thread pool, so the webpages corresponding to multiple URLs to be crawled can be acquired in parallel, which improves crawling efficiency.
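Parallel acquisition through a thread pool can be sketched with Python's concurrent.futures module. The fetch function is a stand-in that performs no network access, and the names fetch_all and fake_fetch are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def fetch_all(urls: list, fetch: Callable[[str], Optional[str]],
              workers: int = 4) -> dict:
    """Fetch every URL through a thread pool and map each URL to its page."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch, urls))  # results come back in input order
    return dict(zip(urls, pages))


def fake_fetch(url: str) -> str:
    return "page for " + url  # stand-in for a real HTTP request


pages = fetch_all(["http://t.example/1", "http://t.example/2"], fake_fetch)
```

The pool size bounds how many requests are in flight at once, which is one way to match the number of URLs taken per round to the device's processing capacity.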
In one embodiment of the method for acquiring webpage information of the application, step S103, sending the acquisition request to the website corresponding to the URL, comprises: extracting an IP address from a preset proxy Internet Protocol (IP) queue; and sending the acquisition request to the website through the extracted IP address.
Here, the IP address can be extracted from the preset proxy IP queue either at random or by cycling through the queue in order, so that successive extractions yield different IP addresses.
If a website limits the number of acquisition requests sent from the same IP address, that is, for a website that restricts by IP, different IP addresses can be randomly selected from the preset proxy IP queue. The web crawler then sends each acquisition request to the corresponding website through a different randomly selected IP address, which works around the website's per-IP limit on acquisition requests and improves the success rate of acquiring webpages.
Specifically, for example, as shown in Fig. 2, in step S201 it is determined whether the website requires a proxy IP address. If so, in step S202 an IP address is randomly selected from the preset configuration-file queue, and in step S203 the web crawler sends the acquisition request to the corresponding website through the randomly selected IP address. Otherwise, in step S203 the web crawler sends the acquisition request to the corresponding website directly through its current IP address.
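The two extraction strategies, random selection and sequential cycling, can be sketched as follows. The proxy addresses are placeholder values, and the names random_proxy and cycling_proxies are hypothetical; actually routing a request through the chosen proxy is left out.

```python
import itertools
import random


def random_proxy(proxies: list, rng: random.Random) -> str:
    """Pick one proxy address at random."""
    return rng.choice(proxies)


def cycling_proxies(proxies: list):
    """Yield proxy addresses in order, starting over at the end of the list."""
    return itertools.cycle(proxies)


PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]  # placeholder values
rotation = cycling_proxies(PROXIES)
picked = [next(rotation) for _ in range(4)]
one_random = random_proxy(PROXIES, random.Random(0))
```

Note that random selection can return the same address twice in a row, whereas cycling through a queue of two or more addresses always changes the address between consecutive requests.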
In one embodiment of the method for acquiring webpage information of the application, before or after step S103, in which the acquisition request is sent to the website corresponding to the URL, the method further comprises: obtaining a verification code image from the corresponding website; recognizing the verification code in the verification code image by text recognition; and sending the verification code to the corresponding website.
Here, some websites return a webpage only after a graphical verification code has been entered correctly. For such websites, before or after the web crawler sends the acquisition request to the corresponding website, the verification code can be recognized from the verification code image by text recognition and sent to the corresponding website. This solves the verification code image automatically and improves the success rate and efficiency of acquiring webpages.
In one embodiment of the method for acquiring webpage information of the application, the text recognition is performed with Tesseract.
Here, the text recognition can use any of various OCR methods; for example, it can use Tesseract.
Tesseract is an open-source OCR (Optical Character Recognition) engine developed by HP Labs and maintained by Google. Its library can be trained continually, which steadily strengthens its ability to convert images into text. If a team has deeper requirements, it can also use Tesseract as a template to develop an OCR engine that meets its own needs.
By recognizing the verification code image with Tesseract, this embodiment can improve the recognition accuracy for verification code images and thereby the success rate and efficiency of acquiring webpages.
In one embodiment of the method for acquiring webpage information of the application, the acquisition request includes the cookie from logging in to the website, and before the acquisition request is sent to the website corresponding to the URL, the method further comprises: obtaining the cookie from the browser used to log in to the website.
Here, a cookie is data (usually encrypted) that certain websites store on the user's local terminal in order to identify the user and track the session.
For a website that returns webpages only after the user's identity information has been entered and a manual login has been completed, the cookie corresponding to that manual login can be obtained from the browser after the user first logs in to the website successfully through the browser. On each later login to the website or each later webpage acquisition, an acquisition request containing the cookie can be sent to the website, which avoids the tedious manual login process and improves the success rate and efficiency of acquiring webpages.
Specifically, for example, as shown in Fig. 3, in step S301 it is first determined whether the website requires a login. If so, in step S202 the web crawler obtains from the browser the cookie corresponding to the first manual login to the website, and then sends an acquisition request containing the cookie to the website and proceeds to step S203, crawling the webpage. Otherwise, the web crawler sends an acquisition request without a cookie directly to the website and proceeds to step S203, crawling the webpage.
In one embodiment of the method for acquiring webpage information of the application, step S105, storing the webpage content information, comprises: packaging the webpage content information into JSON format and then storing it.
Here, JSON (JavaScript Object Notation) is a lightweight data interchange format. It is based on a subset of ECMAScript (the JavaScript specification established by the European Computer Manufacturers Association) and stores and represents data in a text format that is completely independent of any programming language. Its concise and clear hierarchical structure makes JSON an ideal data interchange language. It is easy for people to read and write and easy for machines to parse and generate, and it effectively improves network transmission efficiency.
By packaging the webpage content information into JSON format before storing it, this embodiment can improve the speed and success rate of later queries on the webpage content information.
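Packaging extracted content as JSON can be sketched with Python's json module. The field names url, title, and text are assumptions for illustration; the application does not fix a record layout.

```python
import json


def to_json(url: str, title: str, text: str) -> str:
    """Package one page's extracted content as a JSON string."""
    record = {"url": url, "title": title, "text": text}
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return json.dumps(record, ensure_ascii=False)


packed = to_json("http://j.example/p", "示例", "body text")
restored = json.loads(packed)  # round trip back to a dictionary
```

Because the format is language-independent text, a record written this way can be read back by whichever storage or analysis component consumes it later.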
In one embodiment of the method for acquiring webpage information of the application, step S105, storing the webpage content information, comprises: storing the webpage content information into an ElasticSearch cluster.
Here, ElasticSearch is a search server based on Lucene. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Elasticsearch is developed in Java and released as open source under the terms of the Apache License. It is designed for use in cloud computing, where it offers real-time search and is stable, reliable, fast, and easy to install and use; data can be indexed simply by sending JSON over HTTP.
By storing the webpage content information into an ElasticSearch cluster, this embodiment can improve the speed and success rate of later queries on the webpage content information.
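Indexing JSON over HTTP can be sketched by preparing a request body in the newline-delimited JSON layout that Elasticsearch's bulk endpoint accepts, where each document is preceded by an action line. This sketch only builds the body and sends nothing; the index name webpages and the helper name bulk_body are hypothetical, and the layout should be checked against the documentation for the cluster version in use.

```python
import json


def bulk_body(index: str, records: list) -> str:
    """Build a newline-delimited JSON body: an action line, then the document."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(record, ensure_ascii=False))    # document line
    return "\n".join(lines) + "\n"  # the body ends with a newline


body = bulk_body("webpages", [{"url": "http://e.example/1", "title": "One"}])
```

Sending many documents in one body reduces the number of HTTP round trips compared with indexing each page separately.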
In one embodiment of the method for acquiring webpage information of the application, the in-memory database is a redis in-memory database.
Here, redis is a key-value storage system. It supports a relatively large number of value types, including string, list (linked list), set, zset (sorted set), and hash. These data types support push/pop, add/remove, intersection, union, difference, and richer operations, all of which are atomic. On this basis, redis supports sorting in various ways. To ensure efficiency, the data is cached in memory.
In this embodiment, the web crawler queue containing the uniform resource locators (URLs) to be crawled can be placed into the redis in-memory database. In addition, if the web crawler does not obtain the webpage requested by the acquisition request from the corresponding website, the URL to be crawled is put back into the web crawler queue in the redis in-memory database so that it can be crawled again next time.
By storing the web crawler queue containing the URLs to be crawled in the redis in-memory database, this embodiment can further improve the efficiency with which the web crawler system obtains the URLs to be crawled.
In one embodiment of the method for acquiring webpage information of the application, the content parsing tool is the jsoup parsing tool.
Here, jsoup is a Java HTML parser that can directly parse a URL address or HTML text content. It provides a very convenient API for extracting and manipulating data through DOM, CSS, and jQuery-like operation methods.
By using the jsoup parsing tool to parse the webpage content information in a webpage, this embodiment can further improve the accuracy and efficiency of extracting webpage content information.
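The extraction step can be illustrated with a minimal sketch. It deliberately swaps jsoup for Python's built-in html.parser module so that the example needs no third-party library; the class name PageExtractor, the choice to extract the title and link targets, and the sample HTML are all assumptions for illustration.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the page title and every link target."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links: list = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:                      # skip anchors without a target
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(html: str) -> dict:
    """Parse HTML text and return the extracted content information."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


content = extract("<html><head><title> Demo </title></head>"
                  "<body><a href='/a'>A</a><a>no href</a></body></html>")
```

Keeping only selected fields is what the text calls cleaning the webpage content: markup and irrelevant parts are dropped, and the extracted links could be added to the crawl queue as new URLs.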
As shown in Fig. 4, one specific embodiment of the method for acquiring webpage information of the application includes the following steps:
Step S401: store the URLs to be crawled, together with the URLs newly added by the web crawler during crawling, in the redis in-memory database, so that the crawler system can run normally after a restart;
Step S402: when the web crawler collects webpages, create a spider (crawler program) thread pool, read the URLs to be crawled from redis, and start crawling the webpages at the websites corresponding to the URLs read;
Step S403: determine whether the corresponding webpage was crawled successfully;
if crawling failed, return to step S401: store the failed URL back into redis and retry;
Step S404: if crawling succeeded, extract webpage content information from the acquired webpage using the content parsing tool, and package the webpage content information into JSON format;
Step S405: store the webpage content information packaged in JSON format into the ElasticSearch cluster.
According to the another side of the application, a kind of acquisition equipment of webpage information is also provided, which can execute above-mentioned Fig. 1 To method shown in Fig. 4.The equipment can be realized by way of software, hardware or soft or hard combination, for example, the equipment can wrap Include the corresponding module or unit for executing each step in method shown in above-mentioned Fig. 1 to Fig. 4.
For example, as shown in figure 5, the equipment includes:
URL storage device 501, for will include that the web crawlers queue of uniform resource locator to be crawled (URL) is put Enter in memory database;
Here, uniform resource locator (Uniform Resource Locator, URL) is to can obtain from internet The position of the resource arrived and a kind of succinct expression of access method, are the addresses of standard resource on internet.On internet Each file has a unique URL, and the information that it includes points out that the position of the file in webpage and browser should be why It is handled, basic URL includes mode (or agreement), server name or IP address (corresponding website), path and filename;
In web crawlers queue (web crawlers task queue), (be otherwise known as web crawlers webpage spider, net machine It is people, more frequent to be known as webpage follower among the community FOAF), be it is a kind of according to certain rules, automatically grab ten thousand dimensions The program or script of net information, web crawlers can grab target according to the task URL in task queue, access corresponding net Page is linked to relevant, information required for obtaining;
Data are exactly put the database directly operated in memory by memory database.Relative to disk, the data of memory Read or write speed will be several orders of magnitude higher, and application can be greatlyd improve by saving the data in memory to compare to access from disk Performance;
URL extraction element 502, it is described wait climb for being taken out from the web crawlers queue in the memory database The URL taken;
Here, depending on the data processing capability of the device, one or more URLs to be crawled can be taken out of the web crawler queue in the in-memory database at a time;
Web crawler device 503, for sending an acquisition request to the corresponding website, where the acquisition request is used to request the webpage corresponding to the URL to be crawled;
Here, each URL includes the address information of the corresponding website and webpage. According to the address information in the one or more URLs taken out, the web crawler sends to the corresponding website a request to obtain the webpage on that website;
Content parsing device 504, for extracting webpage content information from the obtained webpage using a content parsing tool, if the webpage requested by the acquisition request is obtained from the corresponding website;
Here, once the webpage requested by the acquisition request has been obtained successfully, the content parsing tool can be used to extract all or part of the webpage content information from the obtained webpage according to preset rules;
Storage device 505, for storing the webpage content information.
Here, after all or part of the webpage content information has been extracted from the obtained webpage, the webpage content information can be stored so that it can later be counted or analyzed.
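The cooperation of devices 501 to 505 can be sketched roughly as follows. This is only an illustrative Python sketch under stated assumptions, not the implementation of the application: a `collections.deque` stands in for the queue held in the in-memory database, a dictionary lookup (`PAGES.get`) stands in for the real HTTP request, the preset extraction rule is assumed to be "take the page title", and the JSON record format is borrowed from claim 9. The names `TitleParser`, `crawl`, `PAGES`, and the `example.com` URLs are hypothetical.

```python
import json
from collections import deque
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Stand-in content parsing tool: collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(queue, fetch, batch_size=2):
    """Drain the queue once and return the stored JSON records."""
    stored = []
    while queue:
        # 502: take out one or more URLs at a time, depending on capacity.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for url in batch:
            html = fetch(url)  # 503: request the webpage for this URL
            if html is None:   # webpage not obtained, so nothing to parse
                continue
            parser = TitleParser()  # 504: extract content by the preset rule
            parser.feed(html)
            parser.close()
            record = {"url": url, "title": parser.title.strip()}
            stored.append(json.dumps(record))  # 505: store the extracted content
    return stored


# Hypothetical pages standing in for real websites.
PAGES = {
    "http://example.com/a": "<html><head><title> Page A </title></head></html>",
    "http://example.com/b": "<html><head><title>Page B</title></head></html>",
}

# 501: the queue of URLs to be crawled (a deque stands in for the database).
queue = deque([
    "http://example.com/a",
    "http://example.com/missing",
    "http://example.com/b",
])
records = crawl(queue, PAGES.get)
```

In this sketch the URL that has no page is simply skipped; the claims describe putting such a URL back into the queue instead.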
In this embodiment, before webpages are collected, the web crawler queue containing the uniform resource locators (URLs) to be crawled is put into an in-memory database. This avoids the problem that URLs held in the crawler's own memory disappear when the web crawler system needs to restart, and ensures that after a restart the URLs to be crawled can be read quickly from the web crawler queue in the in-memory database, so that the web crawler system continues to run normally. Webpage content information is extracted from the obtained webpage using a content parsing tool, which cleans the webpage content; finally, the webpage content information is stored. Together these steps improve the efficiency and reliability of acquiring webpage content information.
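The restart behaviour described above depends on the queue living outside the crawler itself. The following Python sketch illustrates that design choice only; it assumes the in-memory database is a separate service that keeps running while the crawler restarts, and it models that service with a plain object. `InMemoryStore`, `Crawler`, the key name `crawl:queue`, and the URLs are all hypothetical.

```python
from collections import deque


class InMemoryStore:
    """Stand-in for an external in-memory database that outlives the crawler."""

    def __init__(self):
        self._queues = {}

    def push(self, name, url):
        self._queues.setdefault(name, deque()).append(url)

    def pop(self, name):
        q = self._queues.get(name)
        return q.popleft() if q else None

    def size(self, name):
        return len(self._queues.get(name, ()))


class Crawler:
    """Crawler process: every attribute on `self` is lost on restart."""

    QUEUE = "crawl:queue"  # hypothetical key name

    def __init__(self, store):
        self.store = store
        self.processed = 0  # local state, reset whenever the crawler restarts

    def crawl_one(self):
        url = self.store.pop(self.QUEUE)
        if url is not None:
            self.processed += 1
        return url


store = InMemoryStore()
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/c"]:
    store.push(Crawler.QUEUE, url)

crawler = Crawler(store)
first = crawler.crawl_one()

# Simulated restart: the old crawler object and its local state are discarded,
# but the pending URLs are still in the store.
crawler = Crawler(store)
remaining = []
while (url := crawler.crawl_one()) is not None:
    remaining.append(url)
```

The local counter `processed` starts again from zero after the simulated restart, while the two URLs that were still pending are read back from the store.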
The application can be implemented based on the WebMagic framework or on other frameworks.
According to another aspect of the application, a device for acquiring webpage information is also provided. The device includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to execute the method of any of the above embodiments.
According to another aspect of the application, a computer-readable medium is also provided, on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement the method of any of the above embodiments.
For details of the computer-readable medium and device embodiments of the application, refer to the corresponding content of the method embodiments; they are not repeated here.
Obviously, those skilled in the art can make various modifications and variations to the application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the application and their technical equivalents, the application is also intended to include them.
It should be noted that the application can be implemented in software and/or in a combination of software and hardware, for example, using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the application can be executed by a processor to implement the steps or functions described above. Similarly, the software program of the application (including relevant data structures) can be stored in a computer-readable recording medium, for example, RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the application can be implemented in hardware, for example, as a circuit that cooperates with the processor to execute each step or function.
In addition, part of the application can be applied as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the application through the operation of that computer. The program instructions that invoke the method of the application may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the application includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions based on the foregoing embodiments of the application.
It is obvious to those skilled in the art that the application is not limited to the details of the above exemplary embodiments, and that the application can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive; the scope of the application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced in the application. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.

Claims (11)

1. A method for acquiring webpage information, characterized in that the method comprises:
putting a web crawler queue containing URLs to be crawled into an in-memory database;
taking the URL to be crawled out of the web crawler queue in the in-memory database;
sending an acquisition request to the website corresponding to the URL, the acquisition request being used to request the webpage corresponding to the URL to be crawled;
if the webpage is obtained from the website, extracting webpage content information from the webpage using a content parsing tool;
storing the webpage content information.
2. The method according to claim 1, characterized in that, before the web crawler queue containing URLs to be crawled is put into the in-memory database, the method further comprises:
sorting the URLs to be crawled according to a preset priority rule;
putting the sorted URLs to be crawled into the web crawler queue.
3. The method according to claim 1 or 2, characterized in that, after the acquisition request is sent to the website corresponding to the URL, the method further comprises:
if the webpage is not obtained from the website, putting the URL to be crawled back into the web crawler queue in the in-memory database.
4. The method according to claim 3, characterized in that putting the URL to be crawled back into the web crawler queue in the in-memory database comprises:
if the priority of the URL to be crawled is greater than or equal to a preset threshold, putting the URL to be crawled back at the head of the web crawler queue; or,
if the priority of the URL to be crawled is less than the preset threshold, putting the URL to be crawled back at the tail of the web crawler queue.
5. The method according to any one of claims 1 to 4, characterized in that, after the URL to be crawled is taken out of the web crawler queue in the in-memory database, the method further comprises:
starting a thread pool, and putting the URL to be crawled into the thread pool;
wherein sending the acquisition request to the website corresponding to the URL comprises:
sending the acquisition request to the website through the thread pool.
6. The method according to any one of claims 1 to 4, characterized in that sending the acquisition request to the website corresponding to the URL comprises:
extracting an IP address from a preset proxy Internet Protocol (IP) queue;
sending the acquisition request to the website through the extracted IP address.
7. The method according to any one of claims 1 to 6, characterized in that, before or after the acquisition request is sent to the website corresponding to the URL, the method further comprises:
obtaining a verification code image from the website;
recognizing the verification code from the verification code image by means of text recognition, and sending the verification code to the website.
8. The method according to any one of claims 1 to 7, characterized in that the acquisition request includes the cookie generated when logging in to the website, and, before the acquisition request is sent to the website corresponding to the URL, the method further comprises:
obtaining the cookie from the browser used to log in to the website.
9. The method according to any one of claims 1 to 8, characterized in that storing the webpage content information comprises:
encapsulating the webpage content information in JSON format and then storing it.
10. A device for acquiring webpage information, the device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to execute the method according to any one of claims 1 to 9.
11. A computer-readable medium on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement the method according to any one of claims 1 to 9.
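The queue handling recited in claims 2 to 4 can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the claimed implementation: a `collections.deque` stands in for the queue in the in-memory database, a larger number is assumed to mean a higher priority, and the `max_attempts` cap is an addition made here so that the example always terminates; the claims do not mention a retry limit. The names `build_queue`, `requeue_failed`, `crawl`, `flaky_fetch`, and the URLs are hypothetical.

```python
from collections import deque


def build_queue(urls_with_priority):
    """Sort (url, priority) pairs by descending priority, then enqueue (claim 2)."""
    ordered = sorted(urls_with_priority, key=lambda item: item[1], reverse=True)
    return deque(ordered)


def requeue_failed(queue, item, threshold):
    """Put a failed URL back at the head or the tail of the queue (claims 3 and 4)."""
    _url, priority = item
    if priority >= threshold:
        queue.appendleft(item)  # high priority: retry next
    else:
        queue.append(item)      # low priority: retry after everything else


def crawl(queue, fetch, threshold, max_attempts=3):
    """Fetch every URL, re-queueing failures; max_attempts is an assumed cap."""
    attempts = {}
    fetched = []
    while queue:
        item = queue.popleft()
        url = item[0]
        attempts[url] = attempts.get(url, 0) + 1
        if fetch(url):
            fetched.append(url)
        elif attempts[url] < max_attempts:
            requeue_failed(queue, item, threshold)
    return fetched


# Hypothetical URLs and priorities.
queue = build_queue([
    ("http://example.com/low", 1),
    ("http://example.com/high", 9),
    ("http://example.com/mid", 2),
])

# These URLs fail on their first attempt only.
pending_failures = {"http://example.com/high", "http://example.com/mid"}


def flaky_fetch(url):
    """Stand-in fetcher that returns False once for each URL in pending_failures."""
    if url in pending_failures:
        pending_failures.discard(url)
        return False
    return True


order = crawl(queue, flaky_fetch, threshold=5)
```

With a threshold of 5, the failed high-priority URL goes back to the head of the queue and is retried immediately, while the failed priority-2 URL goes to the tail and is retried only after the priority-1 URL, so the pages are fetched in the order high, low, mid.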
CN201810688855.6A 2018-06-28 2018-06-28 The acquisition methods of webpage information obtain equipment and computer-readable medium Pending CN109033195A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810688855.6A CN109033195A (en) 2018-06-28 2018-06-28 The acquisition methods of webpage information obtain equipment and computer-readable medium

Publications (1)

Publication Number Publication Date
CN109033195A true CN109033195A (en) 2018-12-18

Family

ID=65520811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810688855.6A Pending CN109033195A (en) 2018-06-28 2018-06-28 The acquisition methods of webpage information obtain equipment and computer-readable medium

Country Status (1)

Country Link
CN (1) CN109033195A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676553B1 (en) * 2003-12-31 2010-03-09 Microsoft Corporation Incremental web crawler using chunks
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
CN104408194A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Acquisition method and device of web crawler request
CN105207852A (en) * 2015-10-09 2015-12-30 西安未来国际信息股份有限公司 Method for directionally acquiring network data based on distributed mode

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
夏征农: "《大辞海信息科学卷》", 31 December 2015, 上海辞书出版社 *
李小平: "《网络影视课程编导论》", 30 April 2016, 北京理工大学出版社 *
郑铁男: "《数字编辑实训教程》", 30 September 2017, 知识产权出版社 *
韦鹏程: "《大数据巨量分析与机器学习的整合与开发》", 31 May 2017, 电子科技大学出版社 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008393A (en) * 2018-12-29 2019-07-12 义语智能科技(上海)有限公司 It is a kind of for obtaining the method and apparatus of site information
CN110008393B (en) * 2018-12-29 2023-03-07 义语智能科技(上海)有限公司 Method and equipment for acquiring website information
CN109885744B (en) * 2019-01-07 2024-05-10 平安科技(深圳)有限公司 Webpage data crawling method, device, system, computer equipment and storage medium
CN109885744A (en) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Web data crawling method, device, system, computer equipment and storage medium
CN109831491A (en) * 2019-01-15 2019-05-31 科大国创软件股份有限公司 Intrusive social data acquisition method based on agency
CN109831491B (en) * 2019-01-15 2022-03-15 科大国创软件股份有限公司 Invasive social data acquisition method based on agent
CN112905866B (en) * 2019-03-14 2022-06-07 福建省天奕网络科技有限公司 Historical data tracing and crawling method and terminal without manual participation
CN112905867A (en) * 2019-03-14 2021-06-04 福建省天奕网络科技有限公司 Efficient historical data tracing and crawling method and terminal
CN112905867B (en) * 2019-03-14 2022-06-07 福建省天奕网络科技有限公司 Efficient historical data tracing and crawling method and terminal
CN110062025A (en) * 2019-03-14 2019-07-26 深圳绿米联创科技有限公司 Method, apparatus, server and the storage medium of data acquisition
CN112905866A (en) * 2019-03-14 2021-06-04 福建省天奕网络科技有限公司 Historical data tracing and crawling method and terminal without manual participation
CN110069686A (en) * 2019-03-15 2019-07-30 平安科技(深圳)有限公司 User behavior analysis method, apparatus, computer installation and storage medium
CN109992707A (en) * 2019-03-18 2019-07-09 广州视源电子科技股份有限公司 Data crawling method and device, storage medium and server
CN110134858A (en) * 2019-03-26 2019-08-16 国网重庆市电力公司 Method for transformation, system, storage medium and the electronic equipment of unstructured data
CN110262888B (en) * 2019-06-26 2020-11-20 京东数字科技控股有限公司 Task scheduling method and device and method and device for computing node to execute task
CN110262888A (en) * 2019-06-26 2019-09-20 京东数字科技控股有限公司 The method and apparatus that method for scheduling task and device and calculate node execute task
CN112579853A (en) * 2019-09-30 2021-03-30 顺丰科技有限公司 Method and device for sequencing crawling links and storage medium
CN112948654A (en) * 2019-11-26 2021-06-11 上海哔哩哔哩科技有限公司 Webpage crawling method and device and computer equipment
CN111324797A (en) * 2020-02-20 2020-06-23 民生科技有限责任公司 Method and device for acquiring data accurately at high speed
CN111324797B (en) * 2020-02-20 2023-08-11 民生科技有限责任公司 Method and device for precisely acquiring data at high speed
CN111538883A (en) * 2020-03-25 2020-08-14 北京市科学技术情报研究所 Data crawling method, system and equipment
CN111538883B (en) * 2020-03-25 2023-11-17 北京市科学技术情报研究所 Data crawling method, system and equipment
CN111538550A (en) * 2020-04-17 2020-08-14 姜海强 Webpage information screening method based on image detection algorithm
CN111753162A (en) * 2020-06-29 2020-10-09 平安国际智慧城市科技股份有限公司 Data crawling method, device, server and storage medium
CN112199567A (en) * 2020-09-27 2021-01-08 深圳市伊欧乐科技有限公司 Distributed data acquisition method, system, server and storage medium
CN112508362A (en) * 2020-11-24 2021-03-16 江苏省质量和标准化研究院 Product export information processing method and device, electronic equipment and storage medium
CN112508362B (en) * 2020-11-24 2024-04-23 江苏省质量和标准化研究院 Product outlet information processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218