CN109033195A - Method for acquiring webpage information, acquisition device, and computer-readable medium
- Publication number
- CN109033195A (CN201810688855.6A)
- Authority
- CN
- China
- Prior art keywords
- url
- crawled
- queue
- webpage
- website
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
This application provides a method for acquiring webpage information, a computer-readable medium, and a device. Before webpages are collected, the web crawler queue containing the uniform resource locators (URLs) to be crawled is put into an in-memory database. This avoids the problem that URLs stored in memory are lost when the web crawler system has to restart, and ensures that after a restart the system can quickly read the URLs to be crawled from the web crawler queue in the in-memory database and continue to run normally. Webpage content information is extracted from the obtained webpage with a content parsing tool, which cleans the webpage content, and the webpage content information is then stored. This improves the efficiency and reliability of acquiring webpage content information.
Description
Technical field
This application relates to the computer field, and in particular to a method for acquiring webpage information, an acquisition device, and a computer-readable medium.
Background
At present, when a web crawler system crawls webpage information, it usually stores the uniform resource locators (URLs) to be crawled in memory. When the web crawler system has to restart, the URLs to be crawled that were stored in memory are lost. To continue crawling webpage information after the restart, the system must collect the URLs to be crawled again and reload them into memory, which lowers the efficiency of webpage information acquisition.
Summary of the invention
The purpose of this application is to provide a method for acquiring webpage information, an acquisition device, and a computer-readable medium.
According to one aspect of this application, a method for acquiring webpage information is provided. The method comprises: putting a web crawler queue containing URLs to be crawled into an in-memory database; taking a URL to be crawled out of the web crawler queue in the in-memory database; sending an acquisition request to the website corresponding to the URL, the acquisition request being used to request the webpage corresponding to the URL to be crawled; if the webpage is obtained from the website, extracting webpage content information from the webpage using a content parsing tool; and storing the webpage content information.
Further, in the above method, before putting the web crawler queue containing the URLs to be crawled into the in-memory database, the method further comprises: sorting the URLs to be crawled according to a preset priority rule; and putting the sorted URLs to be crawled into the web crawler queue.
Further, in the above method, after sending the acquisition request to the website corresponding to the URL, the method further comprises: if the webpage is not obtained from the website, putting the URL to be crawled back into the web crawler queue in the in-memory database.
Further, in the above method, putting the URL to be crawled back into the web crawler queue in the in-memory database comprises: if the priority of the URL to be crawled is greater than or equal to a preset threshold, putting the URL back at the head of the web crawler queue; or, if the priority of the URL to be crawled is less than the preset threshold, putting the URL back at the tail of the web crawler queue.
Further, in the above method, after taking the URL to be crawled out of the web crawler queue in the in-memory database, the method further comprises: starting a thread pool and putting the URL to be crawled into the thread pool. Sending the acquisition request to the website corresponding to the URL then comprises: sending the acquisition request to the website through the thread pool.
Further, in the above method, sending the acquisition request to the website corresponding to the URL comprises: drawing an IP address from a preset proxy Internet Protocol (IP) queue; and sending the acquisition request to the website through the IP address drawn.
Further, in the above method, before or after sending the acquisition request to the website corresponding to the URL, the method further comprises: obtaining a verification code image from the website; and recognizing the verification code in the image by means of text recognition and sending the verification code to the website.
Further, in the above method, the acquisition request includes the cookie issued when logging in to the website, and before sending the acquisition request to the website corresponding to the URL, the method further comprises: obtaining the cookie from the browser used to log in to the website.
Further, in the above method, storing the webpage content information comprises: packaging the webpage content information into JSON format and then storing it.
According to another aspect of this application, a device for acquiring webpage information is also provided. The device includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the method of any of the above.
According to another aspect of this application, a computer-readable medium is also provided, on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement the method of any of the above.
Compared with the prior art, this application puts the web crawler queue containing the uniform resource locators (URLs) to be crawled into an in-memory database before webpages are collected. This avoids the problem that URLs stored in memory are lost when the web crawler system has to restart, and ensures that after a restart the system can quickly read the URLs to be crawled from the web crawler queue in the in-memory database and continue to run normally. Webpage content information is extracted from the obtained webpage with a content parsing tool, which cleans the webpage content, and the webpage content information is then stored. This improves the efficiency and reliability of acquiring webpage content information.
Brief description of the drawings
Other features, objects and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 shows a flowchart of a method for acquiring webpage information according to one embodiment of this application;
Fig. 2 shows a flowchart of obtaining a webpage through a proxy IP address according to one embodiment of this application;
Fig. 3 shows a flowchart of obtaining a webpage using a cookie according to one embodiment of this application;
Fig. 4 shows a flowchart of the method for acquiring webpage information according to one specific embodiment of this application;
Fig. 5 shows a structural schematic diagram of a device for acquiring webpage information according to one embodiment of this application.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description
This application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of this application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As shown in Fig. 1, this application provides a method for acquiring webpage information. The method can be applied on the network device side; for example, it can be executed by a web crawler running on a network device. A web crawler is also known as a web spider or web robot, and in the FOAF (Friend-of-a-Friend, an XML/RDF vocabulary) community it is more often called a web chaser. A web crawler is a program or script that automatically fetches World Wide Web information according to certain rules. It takes the task URLs in the web crawler queue as its crawl targets and visits the corresponding webpages and related links to obtain the required information. Each step of this embodiment can be carried out by the web crawler. As shown in Fig. 1, the method comprises:
Step S101: put the web crawler queue containing the URLs to be crawled into an in-memory database.
Here, a URL is a concise representation of the location and access method of a resource available on the Internet; it is the address of a standard resource on the Internet. Each file on the Internet has a unique URL, which indicates where the file is located and how a browser should handle it. In general, a basic URL includes a scheme (or protocol), a server name or Internet Protocol (IP) address (the corresponding website), a path, and a file name.
The web crawler queue may also be called the web crawler task queue.
An in-memory database is a database that keeps its data in memory and operates on it there directly. Memory reads and writes data several orders of magnitude faster than disk, so keeping data in memory can greatly improve application performance compared with accessing it from disk.
Step S102: take the URL to be crawled out of the web crawler queue in the in-memory database.
Here, depending on the actual data processing capability of the device, one or more URLs to be crawled can be taken out of the web crawler queue in the in-memory database at a time.
Step S103: send an acquisition request to the website corresponding to the URL, the acquisition request being used to request the webpage corresponding to the URL to be crawled.
Here, a URL contains the address information of the corresponding website and webpage. Based on the address information in the one or more URLs taken out, the web crawler sends the corresponding website a request for the corresponding webpage on that website.
Step S104: if the webpage is obtained from the website, extract webpage content information from the obtained webpage using a content parsing tool.
Here, once the webpage requested by the acquisition request has been obtained, the content parsing tool can be used to extract all or part of the webpage content information from the obtained webpage according to preset rules.
Step S105: store the webpage content information.
Here, after all or part of the webpage content information has been extracted from the obtained webpage, it can be stored so that it can later be counted or analyzed.
In this embodiment, the web crawler queue containing the uniform resource locators (URLs) to be crawled is put into an in-memory database before webpages are collected. This avoids the problem that URLs held only in the crawler's memory are lost when the web crawler system has to restart, and ensures that after a restart the system can quickly read the URLs to be crawled from the web crawler queue in the in-memory database and continue to run normally. Webpage content information is extracted from the obtained webpage with a content parsing tool, which cleans the webpage content, and the webpage content information is then stored. Together, these measures improve the efficiency and reliability of acquiring webpage content information.
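Steps S101 to S105 can be illustrated with a minimal Python sketch. It is not the implementation of this application: a `collections.deque` stands in for the web crawler queue in the in-memory database, and the `crawl`, `fetch`, `parse` and `store` names, as well as the example URLs, are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Optional


def crawl(queue: deque,
          fetch: Callable[[str], Optional[str]],
          parse: Callable[[str], dict],
          store: Callable[[dict], None]) -> None:
    """Run steps S102 to S105 until the queue is empty."""
    while queue:
        url = queue.popleft()      # S102: take the next URL to be crawled
        page = fetch(url)          # S103: request the webpage from its website
        if page is None:           # webpage not obtained; simply skipped here
            continue
        info = parse(page)         # S104: extract webpage content information
        info["url"] = url
        store(info)                # S105: store the extracted information


# Demo with stand-ins: a dict plays the role of the websites.
pages = {
    "http://example.com/a": "<p>alpha</p>",
    "http://example.com/b": "<p>beta</p>",
}
queue = deque(pages)               # S101: queue holding the URLs to be crawled
stored = []
crawl(queue, pages.get, lambda html: {"length": len(html)}, stored.append)
```

Passing the fetch, parse and store steps in as functions keeps the sketch independent of any particular HTTP client, parser or storage back end.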
This application can be implemented on the basis of the WebMagic framework or of other frameworks.
In one embodiment of the method for acquiring webpage information of this application, before step S101 puts the web crawler queue containing the URLs to be crawled into the in-memory database, the method further comprises: sorting the URLs to be crawled according to a preset priority rule; and putting the sorted URLs to be crawled into the web crawler queue.
Here, the preset priority rule can be based on the category of the website to which a URL belongs or on the importance of the URL. The URLs to be crawled can be sorted by priority according to this rule, for example with higher-priority URLs placed first and lower-priority URLs placed after them, and the sorted URLs are then put into the web crawler queue. Later, URLs are taken out of the web crawler queue one after another on a first-in, first-out basis and the corresponding webpages are crawled, so the higher-priority URLs at the front are taken out and crawled first.
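The sort-then-enqueue behaviour described above might look as follows in Python. This is only a sketch: the numeric priorities, the `build_queue` helper and the example URLs are assumed for illustration, and the actual priority rule is left open by this application.

```python
from collections import deque


def build_queue(urls_with_priority):
    """Sort (url, priority) pairs so higher priority comes first, then enqueue."""
    ordered = sorted(urls_with_priority, key=lambda item: item[1], reverse=True)
    return deque(url for url, _ in ordered)


queue = build_queue([
    ("http://example.com/low", 1),
    ("http://example.com/high", 9),
    ("http://example.com/mid", 5),
])
first = queue.popleft()  # first in, first out: the highest-priority URL leaves first
```

Because Python's `sorted` is stable, URLs with equal priority keep their original relative order.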
In one embodiment of the method for acquiring webpage information of this application, after step S103 sends the acquisition request to the website corresponding to the URL, the method further comprises: if the webpage requested by the acquisition request is not obtained from the corresponding website, putting the URL to be crawled back into the web crawler queue in the in-memory database.
Here, because the web crawler system is not always stable, crawling fails for some URLs. If the requested webpage is not obtained from the corresponding website, the URL whose crawl failed can be stored back into the web crawler queue in the in-memory database. Once the web crawler system is stable again, the URL is taken out of the web crawler queue in the in-memory database once more and the webpage is crawled again according to the URL taken out.
In one embodiment of the method for acquiring webpage information of this application, putting the URL to be crawled back into the web crawler queue in the in-memory database comprises: if the priority of the URL to be put back is greater than or equal to a preset threshold, putting the URL back at the head of the web crawler queue; or, if the priority of the URL to be put back is less than the preset threshold, putting the URL back at the tail of the web crawler queue.
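The put-back rule of this embodiment can be sketched in Python as below. The `put_back` helper, the numeric priorities and the threshold value are hypothetical; a `collections.deque` again stands in for the web crawler queue.

```python
from collections import deque


def put_back(queue, url, priority, threshold):
    """Re-queue a URL whose crawl failed, at the head or the tail of the queue."""
    if priority >= threshold:
        queue.appendleft(url)   # head of the queue: retried on the next take
    else:
        queue.append(url)       # tail of the queue: retried after the others


queue = deque(["http://example.com/b", "http://example.com/c"])
put_back(queue, "http://example.com/urgent", priority=8, threshold=5)
put_back(queue, "http://example.com/minor", priority=2, threshold=5)
```

After these two calls the urgent URL is at the head of the queue and the minor URL is at the tail.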
Here, continuing the previous embodiment: when the webpage requested by the acquisition request has not been obtained from the corresponding website and the URL to be crawled has to be put back into the web crawler queue in the in-memory database, it is first determined whether the priority of the URL to be put back is greater than or equal to the preset threshold. If so, the URL is put back at the head of the web crawler queue. Because URLs are taken from the head of the queue, the next URL taken out is still this one, so a higher-priority URL is crawled again promptly after a failed attempt. If the priority of the URL to be put back is less than the preset threshold, the URL can be put back at the tail of the web crawler queue. Because URLs are always taken starting from the head, a URL put back at the tail is taken out again later, so a lower-priority URL is crawled after the higher-priority URLs, which improves the crawling efficiency of those higher-priority URLs.
In one embodiment of the method for acquiring webpage information of this application, after step S102 takes the URL to be crawled out of the web crawler queue in the in-memory database, the method further comprises: starting a thread pool and putting the URL to be crawled into the thread pool. Step S103, sending the acquisition request to the website corresponding to the URL, then comprises: sending the acquisition request to the website through the thread pool.
Here, a thread pool is a form of multithreaded processing: tasks are added to a queue during processing and are started automatically once threads have been created. In this embodiment, the web crawler sends acquisition requests to the corresponding websites through the thread pool, so that the webpages corresponding to multiple URLs to be crawled can be obtained in parallel, which improves crawling efficiency.
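As a rough illustration of sending acquisition requests through a thread pool, the sketch below uses `ThreadPoolExecutor` from Python's standard library. The `fetch_all` helper and the `fake_fetch` stand-in (used instead of a real HTTP request) are assumptions for this example only.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Send the acquisition requests for several URLs in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    """Stand-in for a real HTTP request."""
    return "page for " + url


results = fetch_all(["http://example.com/1", "http://example.com/2"], fake_fetch)
```

Threads suit this workload because a crawler spends most of its time waiting on network responses rather than computing.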
In one embodiment of the method for acquiring webpage information of this application, step S103, sending the acquisition request to the website corresponding to the URL, comprises: drawing an IP address from a preset proxy Internet Protocol (IP) queue; and sending the acquisition request to the website through the IP address drawn.
Here, an IP address can be drawn from the preset proxy IP queue either at random or by cycling through the queue in order, so that the IP address drawn differs from one request to the next.
If a website limits the number of acquisition requests sent from the same IP address, that is, for a website that restricts by IP, different IP addresses can be drawn at random from the preset proxy IP queue. The web crawler then sends each acquisition request to the corresponding website through a different, randomly drawn IP address, which addresses the website's per-IP limit on acquisition requests and improves the success rate of webpage acquisition.
Specifically, for example, as shown in Fig. 2: in step S201, it is determined whether the website requires a proxy IP address. If so, in step S202 an IP address is drawn at random from the preset configuration file queue, and in step S203 the web crawler sends the acquisition request to the corresponding website through the randomly drawn IP address. Otherwise, in step S203 the web crawler sends the acquisition request directly to the corresponding website through its current IP address.
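The two ways of drawing a proxy IP address mentioned above, random selection and cycling in order, can be sketched as follows. The proxy addresses and the helper names are made up for illustration.

```python
import itertools
import random


def random_proxy(proxies, rng=random):
    """Draw one proxy address at random from the preset proxy queue."""
    return rng.choice(proxies)


def rotating_proxies(proxies):
    """Cycle through the preset proxy queue in order, wrapping around at the end."""
    return itertools.cycle(proxies)


proxies = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
rotation = rotating_proxies(proxies)
picked = [next(rotation) for _ in range(4)]
```

Note that purely random selection can return the same address twice in a row, whereas cycling in order always yields a different address on consecutive draws as long as the queue holds more than one distinct address.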
In one embodiment of the method for acquiring webpage information of this application, before or after step S103 sends the acquisition request to the website corresponding to the URL, the method further comprises: obtaining a verification code image from the corresponding website; and recognizing the verification code in the image by means of text recognition and sending the verification code to the corresponding website.
Here, some websites return a webpage only after a graphical verification code has been entered correctly. For such websites, before or after the web crawler sends the acquisition request to the corresponding website, the verification code can be recognized in the verification code image by means of text recognition and sent to the corresponding website. The verification code image is thus handled automatically, which improves the success rate and efficiency of webpage acquisition.
In one embodiment of the method for acquiring webpage information of this application, the text recognition is performed with Tesseract.
Here, the text recognition can use any of various OCR methods, for example Tesseract.
Tesseract is an open-source OCR (Optical Character Recognition) engine developed by HP Labs and maintained by Google. It can be trained continuously so that its ability to convert images to text keeps improving. If a team has deeper needs, it can also use Tesseract as a template to develop an OCR engine that meets its own requirements.
In this embodiment, recognizing the verification code image with Tesseract can improve the recognition accuracy for verification code images and thus the success rate and efficiency of webpage acquisition.
In one embodiment of the method for acquiring webpage information of this application, the acquisition request includes the cookie issued when logging in to the website, and before sending the acquisition request to the website corresponding to the URL, the method further comprises: obtaining the cookie from the browser used to log in to the website.
Here, a cookie is data (usually encrypted) that certain websites store on the user's local terminal in order to identify the user and track the session.
For a website that returns webpages only after the user has entered identity information and logged in manually, the cookie for that manual login can be obtained from the browser after the user first logs in to the website successfully through the browser. On each later visit to log in to the website or fetch a webpage, the acquisition request containing the cookie is sent to the website, which avoids the tedious manual login process and improves the success rate and efficiency of webpage acquisition.
Specifically, for example, as shown in Fig. 3: in step S301, it is first determined whether the website requires login. If so, in step S202 the web crawler obtains from the browser the cookie from the first manual login to the website, sends the acquisition request containing the cookie to the website, and proceeds to step S203 to crawl the webpage. Otherwise, the web crawler sends an acquisition request without a cookie directly to the website and proceeds to step S203 to crawl the webpage.
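The branch described above, attaching a cookie only when the website requires login, can be sketched with Python's standard `urllib.request`. The cookie value, the URLs and the `build_request` helper are placeholders; how the cookie is exported from the browser is outside this sketch, and no request is actually sent.

```python
import urllib.request


def build_request(url, cookie=None):
    """Build an acquisition request, adding a Cookie header only when one is given."""
    headers = {"Cookie": cookie} if cookie else {}
    return urllib.request.Request(url, headers=headers)


# Website that requires login: reuse the cookie saved from the manual login.
with_login = build_request("http://example.com/private", cookie="sessionid=abc123")

# Website that does not require login: send the request without a cookie.
without_login = build_request("http://example.com/public")
```

The request would then be sent with `urllib.request.urlopen`, which is omitted here so that the sketch runs without network access.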
In one embodiment of the method for acquiring webpage information of this application, step S105, storing the webpage content information, comprises: packaging the webpage content information into JSON format and then storing it.
Here, JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of ECMAScript (the JavaScript specification defined by the European Computer Manufacturers Association) and uses a text format that is completely independent of any programming language to store and represent data. Its concise, clear hierarchical structure makes JSON an ideal data-interchange language. It is easy for people to read and write and easy for machines to parse and generate, and it effectively improves network transmission efficiency.
In this embodiment, packaging the webpage content information into JSON format before storing it can improve the speed and success rate of later queries on the webpage content information.
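Packaging extracted content into JSON might look like the sketch below. The field names `url`, `title` and `text` are assumptions, since this application does not specify which fields are stored.

```python
import json


def to_json(url, title, text):
    """Package extracted webpage content information as a JSON string."""
    record = {"url": url, "title": title, "text": text}
    # ensure_ascii=False keeps non-ASCII page text readable in the output
    return json.dumps(record, ensure_ascii=False)


packed = to_json("http://example.com/a", "Example", "Hello, world")
restored = json.loads(packed)
```

Round-tripping through `json.loads` shows that the stored string can be turned back into the same fields when it is queried later.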
In one embodiment of the method for acquiring webpage information of this application, step S105, storing the webpage content information, comprises: storing the webpage content information in an ElasticSearch cluster.
Here, ElasticSearch is a search server based on Lucene. It provides a distributed, multi-user-capable full-text search engine with a RESTful web interface. ElasticSearch is developed in Java and released as open source under the terms of the Apache license. It is designed for cloud computing and offers real-time search that is stable, reliable, and fast. It is easy to install and use, and data can be indexed simply by sending JSON over HTTP.
In this embodiment, storing the webpage content information in an ElasticSearch cluster can improve the speed and success rate of later queries on the webpage content information.
In one embodiment of the method for acquiring webpage information of this application, the in-memory database is a redis in-memory database.
Here, redis is a key-value storage system. It supports a relatively large number of value types, including string, list (linked list), set, zset (sorted set), and hash. These data types support push/pop, add/remove, intersection, union, difference, and richer operations, and all of these operations are atomic. On this basis, redis supports sorting in various ways. For efficiency, the data is cached in memory.
In this embodiment, the web crawler queue containing the uniform resource locators (URLs) to be crawled can be put into the redis in-memory database. In addition, if the web crawler does not obtain the webpage requested by the acquisition request from the corresponding website, the URL to be crawled is put back into the web crawler queue in the redis in-memory database so that it can be crawled again next time.
In this embodiment, storing the web crawler queue containing the uniform resource locators (URLs) to be crawled in the redis in-memory database can further improve the efficiency with which the web crawler system obtains the URLs to be crawled.
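One way to keep the web crawler queue in a redis list is sketched below. The `UrlQueue` wrapper and the key name `crawler:queue` are hypothetical. The wrapper only calls `rpush`, `lpush` and `lpop`, which the redis-py client (`redis.Redis`) provides; a small in-process `FakeClient` is used here so that the sketch runs without a redis server.

```python
class UrlQueue:
    """Web crawler queue kept in a redis list.

    `client` is any object with rpush/lpush/lpop, such as redis.Redis from redis-py.
    """

    def __init__(self, client, key="crawler:queue"):
        self.client = client
        self.key = key

    def put(self, url):
        self.client.rpush(self.key, url)    # append at the tail of the queue

    def put_front(self, url):
        self.client.lpush(self.key, url)    # insert at the head of the queue

    def take(self):
        return self.client.lpop(self.key)   # None when the queue is empty


class FakeClient:
    """In-process stand-in so the sketch runs without a redis server."""

    def __init__(self):
        self.lists = {}

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)

    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)

    def lpop(self, key):
        items = self.lists.get(key)
        return items.pop(0) if items else None


queue = UrlQueue(FakeClient())
queue.put("http://example.com/a")
queue.put("http://example.com/b")
first = queue.take()        # crawl attempt for the first URL
queue.put_front(first)      # crawl failed: put the URL back at the head
```

With a real server, `UrlQueue(redis.Redis(decode_responses=True))` would be used instead; without `decode_responses=True`, redis-py returns the stored URLs as bytes. Because the list lives in the redis server rather than in the crawler process, it is still there when the crawler restarts.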
In one embodiment of the method for acquiring webpage information of this application, the content parsing tool is the jsoup parsing tool.
Here, jsoup is a Java HTML parser that can directly parse a URL address or HTML text content. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like operations.
In this embodiment, parsing the webpage content information in a webpage with the jsoup parsing tool can further improve the accuracy and efficiency of extracting webpage content information.
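To illustrate the kind of extraction a content parsing tool performs, the sketch below uses `html.parser` from Python's standard library in place of jsoup, which is a Java library. The rule applied here, keeping the page title and the text of `<p>` elements, is only an assumed example of a preset extraction rule.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect the page title and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None            # tag currently being collected, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data
        elif self._tag == "p":
            self.paragraphs[-1] += data


def extract(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(),
            "paragraphs": [p.strip() for p in parser.paragraphs]}


info = extract("<html><head><title>News</title></head>"
               "<body><p>First item</p><div>ad</div><p>Second item</p></body></html>")
```

Text outside the selected elements, such as the `<div>` in the example, is dropped, which corresponds to the cleaning of webpage content mentioned in this application.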
As shown in Fig. 4, one specific embodiment of the method for acquiring webpage information of this application includes the following steps:
Step S401: store the URLs to be crawled, together with the URLs newly added by the web crawler during crawling, in the redis in-memory database, so that the crawler system can run normally after a restart;
Step S402: when the web crawler collects webpages, create a spider (crawler program) thread pool, read the URLs to be crawled from redis, and start crawling the webpages at the websites corresponding to the URLs read;
Step S403: determine whether the corresponding webpage was crawled successfully;
if crawling failed, return to step S401 and store the failed URL in redis again for a retry;
Step S404: if crawling succeeded, extract webpage content information from the obtained webpage using the content parsing tool and package the webpage content information into JSON format;
Step S405: store the JSON-packaged webpage content information in the ElasticSearch cluster.
According to another aspect of this application, a device for acquiring webpage information is also provided, which can perform the methods shown in Fig. 1 to Fig. 4 above. The device can be implemented in software, in hardware, or in a combination of the two; for example, the device may include modules or units corresponding to each step of the methods shown in Fig. 1 to Fig. 4.
For example, as shown in Fig. 5, the device includes:
URL storage device 501, for will include that the web crawlers queue of uniform resource locator to be crawled (URL) is put
Enter in memory database;
Here, uniform resource locator (Uniform Resource Locator, URL) is to can obtain from internet
The position of the resource arrived and a kind of succinct expression of access method, are the addresses of standard resource on internet.On internet
Each file has a unique URL, and the information that it includes points out that the position of the file in webpage and browser should be why
It is handled, basic URL includes mode (or agreement), server name or IP address (corresponding website), path and filename;
The web crawler queue is the web crawler's task queue. A web crawler (also known as a web spider or web robot, and within the FOAF community more often called a web chaser) is a program or script that automatically fetches World Wide Web information according to certain rules. Based on the task URLs in the task queue, the web crawler can locate its targets, access the corresponding webpages and related links, and obtain the required information.
An in-memory database is a database that keeps its data in memory and operates on it there directly. Memory reads and writes are several orders of magnitude faster than disk reads and writes, so keeping data in memory instead of accessing it from disk can greatly improve application performance.
A URL extraction device 502, configured to take the URLs to be crawled out of the web crawler queue in the in-memory database;
Here, depending on the data processing capability of the device, one or more URLs to be crawled can be taken from the web crawler queue in the in-memory database at a time.
A web crawler device 503, configured to send an acquisition request to the corresponding website, the acquisition request being used to request the webpage corresponding to the URL to be crawled;
Here, each URL contains the address information of the corresponding website and webpage. Based on the address information in the one or more URLs taken out, the web crawler sends the corresponding website a request for the webpage on that website.
A content parsing device 504, configured to, if the webpage requested by the acquisition request is obtained from the corresponding website, extract webpage content information from the obtained webpage using a content parsing tool;
Here, once the webpage requested by the acquisition request has been obtained successfully, the content parsing tool can be used to extract all or part of the webpage content information from it according to preset rules.
A storage device 505, configured to store the webpage content information.
Here, after all or part of the webpage content information has been extracted from the obtained webpage, it can be stored so that it can later be counted or analyzed.
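The five devices above can be sketched as one small pipeline. The Python below is only an illustration under assumptions: a `deque` stands in for the queue held in the in-memory database, `fetch` is an injected callable standing in for the HTTP request, and `TitleParser`, `take_batch`, and `crawl_batch` are hypothetical names. The parsing rule (keep the page title) is a placeholder for whatever preset rules a deployment would use.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TitleParser(HTMLParser):
    """Hypothetical parsing rule: collect the text of the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def take_batch(queue, batch_size):
    """Device 502: take up to batch_size URLs from the head of the queue."""
    return [queue.popleft() for _ in range(min(batch_size, len(queue)))]


def crawl_batch(queue, batch_size, fetch, storage):
    """Devices 503 to 505: request, parse, and store each URL in one batch."""
    for url in take_batch(queue, batch_size):
        html = fetch(url)              # device 503: request the webpage
        if html is None:
            continue                   # failed URLs are simply skipped in this sketch
        parser = TitleParser()         # device 504: extract content
        parser.feed(html)
        parts = urlsplit(url)          # scheme, server name, and path of the URL
        storage.append({               # device 505: store the extracted content
            "url": url,
            "scheme": parts.scheme,
            "host": parts.netloc,
            "path": parts.path,
            "title": parser.title,
        })
```

The batch size plays the role of the device's processing capacity: a larger value takes more URLs from the queue per round.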
In this embodiment, before webpages are collected, the web crawler queue containing the uniform resource locators (URLs) to be crawled is put into an in-memory database. This avoids the problem of URLs held in the crawler's own memory being lost when the web crawler system has to restart, and ensures that after a restart the system can quickly read the URLs to be crawled from the web crawler queue in the in-memory database and continue to run normally. Extracting webpage content information from the obtained webpage with a content parsing tool cleans the webpage content, and the extracted information is then stored. Together these measures improve the efficiency and reliability of acquiring webpage content information.
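The restart argument can be illustrated with a small sketch. It assumes a hypothetical `InMemoryStore` class that stands in for an in-memory database running outside the crawler process; its `rpush` and `lpop` methods are named after the Redis list commands RPUSH and LPOP but are plain Python here. `Crawler` and the key name are likewise assumptions made for illustration.

```python
class InMemoryStore:
    """Stand-in for an in-memory database that runs outside the crawler process."""

    def __init__(self):
        self._lists = {}

    def rpush(self, key, *values):
        # Append values to the tail of the list stored under key.
        self._lists.setdefault(key, []).extend(values)

    def lpop(self, key):
        # Remove and return the head of the list, or None when it is empty.
        items = self._lists.get(key)
        return items.pop(0) if items else None


class Crawler:
    """Hypothetical crawler that keeps no queue state of its own."""

    QUEUE_KEY = "crawler:queue"  # assumed key name

    def __init__(self, store):
        self.store = store

    def next_url(self):
        return self.store.lpop(self.QUEUE_KEY)


store = InMemoryStore()
store.rpush(Crawler.QUEUE_KEY, "http://example.com/1", "http://example.com/2")

first_run = Crawler(store)
first_url = first_run.next_url()
del first_run                   # simulate the crawler stopping

second_run = Crawler(store)     # simulate the restart
second_url = second_run.next_url()
```

Because the queue lives in the store rather than in the crawler object, the second crawler instance continues with the next pending URL instead of starting from an empty queue.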
The application can be implemented on the webmagic framework or on other frameworks.
According to another aspect of the application, a device for acquiring webpage information is also provided. The device includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the method of any of the above embodiments.
According to another aspect of the application, a computer-readable medium is also provided, on which computer-readable instructions are stored; the computer-readable instructions can be executed by a processor to implement the method of any of the above embodiments.
For details of the computer-readable medium and device embodiments of the application, refer to the corresponding parts of the method embodiments; they are not repeated here.
Obviously, those skilled in the art can make various modifications and variations to the application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the application and their technical equivalents, the application is intended to include them as well.
It should be noted that the application can be implemented in software and/or in a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the application can be executed by a processor to implement the steps or functions described above. Likewise, the software program of the application (including related data structures) can be stored in a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the application may be implemented in hardware, for example as a circuit that cooperates with a processor to perform each step or function.
In addition, part of the application can be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution of the application through the operation of that computer. The program instructions that invoke the method of the application may be stored in a fixed or removable recording medium, and/or transmitted by broadcast or as a data stream in another signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the application includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions of the foregoing embodiments of the application.
It is obvious to those skilled in the art that the application is not limited to the details of the above exemplary embodiments, and that it can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be regarded in every respect as illustrative and not restrictive. The scope of the application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be included in the application. No reference sign in the claims should be construed as limiting the claim concerned. Moreover, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" denote names and do not indicate any particular order.
Claims (11)
1. A method for acquiring webpage information, characterized in that the method comprises:
putting a web crawler queue containing URLs to be crawled into an in-memory database;
taking the URL to be crawled out of the web crawler queue in the in-memory database;
sending an acquisition request to the website corresponding to the URL, the acquisition request being used to request the webpage corresponding to the URL to be crawled;
if the webpage is obtained from the website, extracting webpage content information from the webpage using a content parsing tool;
storing the webpage content information.
2. The method according to claim 1, characterized in that, before the web crawler queue containing the URLs to be crawled is put into the in-memory database, the method further comprises:
sorting the URLs to be crawled according to a preset priority rule;
putting the sorted URLs to be crawled into the web crawler queue.
3. The method according to claim 1 or 2, characterized in that, after the acquisition request is sent to the website corresponding to the URL, the method further comprises:
if the webpage is not obtained from the website, putting the URL to be crawled back into the web crawler queue in the in-memory database.
4. The method according to claim 3, characterized in that putting the URL to be crawled back into the web crawler queue in the in-memory database comprises:
if the priority of the URL to be crawled is greater than or equal to a preset threshold, putting the URL to be crawled back at the head of the web crawler queue; or,
if the priority of the URL to be crawled is less than the preset threshold, putting the URL to be crawled back at the tail of the web crawler queue.
5. The method according to any one of claims 1 to 4, characterized in that, after the URL to be crawled is taken out of the web crawler queue in the in-memory database, the method further comprises:
starting a thread pool, and putting the URL to be crawled into the thread pool;
and sending the acquisition request to the website corresponding to the URL comprises:
sending the acquisition request to the website through the thread pool.
6. The method according to any one of claims 1 to 4, characterized in that sending the acquisition request to the website corresponding to the URL comprises:
extracting an IP address from a preset proxy Internet Protocol (IP) queue;
sending the acquisition request to the website through the extracted IP address.
7. The method according to any one of claims 1 to 6, characterized in that, before or after the acquisition request is sent to the website corresponding to the URL, the method further comprises:
obtaining a verification code image from the website;
recognizing the verification code in the verification code image by means of text recognition, and sending the verification code to the website.
8. The method according to any one of claims 1 to 7, characterized in that the acquisition request includes the cookie generated when logging in to the website, and before the acquisition request is sent to the website corresponding to the URL, the method further comprises:
obtaining the cookie from the browser used to log in to the website.
9. The method according to any one of claims 1 to 8, characterized in that storing the webpage content information comprises:
packaging the webpage content information in JSON format and then storing it.
10. A device for acquiring webpage information, the device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the method according to any one of claims 1 to 9.
11. A computer-readable medium on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement the method according to any one of claims 1 to 9.
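The queue handling recited in claims 2 to 5 can be sketched as follows. This is a minimal illustration under assumptions, not the claimed implementation: a `deque` stands in for the queue in the in-memory database, `fetch` is an injected callable that returns the page or `None` on failure, and `PRIORITY_THRESHOLD`, `build_queue`, `requeue`, and `crawl_round` are hypothetical names. A higher number is assumed to mean a higher priority.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

PRIORITY_THRESHOLD = 5  # hypothetical preset threshold


def build_queue(urls_with_priority):
    """Claim 2: order (url, priority) pairs by priority, highest first."""
    ordered = sorted(urls_with_priority, key=lambda item: item[1], reverse=True)
    return deque(ordered)


def requeue(queue, url, priority):
    """Claim 4: failed high-priority URLs go to the head, others to the tail."""
    if priority >= PRIORITY_THRESHOLD:
        queue.appendleft((url, priority))
    else:
        queue.append((url, priority))


def crawl_round(queue, fetch, max_workers=4):
    """Claims 3 and 5: fetch one batch through a thread pool, requeue failures."""
    batch = list(queue)
    queue.clear()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        results = list(pool.map(lambda item: fetch(item[0]), batch))
    pages = {}
    for (url, priority), html in zip(batch, results):
        if html is None:
            requeue(queue, url, priority)
        else:
            pages[url] = html
    return pages
```

Putting high-priority failures back at the head means they are retried before any lower-priority work, while low-priority failures wait behind everything already queued.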
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810688855.6A CN109033195A (en) | 2018-06-28 | 2018-06-28 | The acquisition methods of webpage information obtain equipment and computer-readable medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810688855.6A CN109033195A (en) | 2018-06-28 | 2018-06-28 | The acquisition methods of webpage information obtain equipment and computer-readable medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109033195A (en) | 2018-12-18 |
Family
ID=65520811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810688855.6A Pending CN109033195A (en) | 2018-06-28 | 2018-06-28 | The acquisition methods of webpage information obtain equipment and computer-readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033195A (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109831491A (en) * | 2019-01-15 | 2019-05-31 | 科大国创软件股份有限公司 | Intrusive social data acquisition method based on agency |
CN109885744A (en) * | 2019-01-07 | 2019-06-14 | 平安科技(深圳)有限公司 | Web data crawling method, device, system, computer equipment and storage medium |
CN109992707A (en) * | 2019-03-18 | 2019-07-09 | 广州视源电子科技股份有限公司 | Data crawling method and device, storage medium and server |
CN110008393A (en) * | 2018-12-29 | 2019-07-12 | 义语智能科技(上海)有限公司 | It is a kind of for obtaining the method and apparatus of site information |
CN110062025A (en) * | 2019-03-14 | 2019-07-26 | 深圳绿米联创科技有限公司 | Method, apparatus, server and the storage medium of data acquisition |
CN110069686A (en) * | 2019-03-15 | 2019-07-30 | 平安科技(深圳)有限公司 | User behavior analysis method, apparatus, computer installation and storage medium |
CN110134858A (en) * | 2019-03-26 | 2019-08-16 | 国网重庆市电力公司 | Method for transformation, system, storage medium and the electronic equipment of unstructured data |
CN110262888A (en) * | 2019-06-26 | 2019-09-20 | 京东数字科技控股有限公司 | The method and apparatus that method for scheduling task and device and calculate node execute task |
CN111324797A (en) * | 2020-02-20 | 2020-06-23 | 民生科技有限责任公司 | Method and device for acquiring data accurately at high speed |
CN111538883A (en) * | 2020-03-25 | 2020-08-14 | 北京市科学技术情报研究所 | Data crawling method, system and equipment |
CN111538550A (en) * | 2020-04-17 | 2020-08-14 | 姜海强 | Webpage information screening method based on image detection algorithm |
CN111753162A (en) * | 2020-06-29 | 2020-10-09 | 平安国际智慧城市科技股份有限公司 | Data crawling method, device, server and storage medium |
CN112199567A (en) * | 2020-09-27 | 2021-01-08 | 深圳市伊欧乐科技有限公司 | Distributed data acquisition method, system, server and storage medium |
CN112508362A (en) * | 2020-11-24 | 2021-03-16 | 江苏省质量和标准化研究院 | Product export information processing method and device, electronic equipment and storage medium |
CN112579853A (en) * | 2019-09-30 | 2021-03-30 | 顺丰科技有限公司 | Method and device for sequencing crawling links and storage medium |
CN112905867A (en) * | 2019-03-14 | 2021-06-04 | 福建省天奕网络科技有限公司 | Efficient historical data tracing and crawling method and terminal |
CN112948654A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Webpage crawling method and device and computer equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101635718A (en) * | 2009-08-26 | 2010-01-27 | 中兴通讯股份有限公司 | Network crawler system and method for acquiring resource as well as network resource gripping device |
US7676553B1 (en) * | 2003-12-31 | 2010-03-09 | Microsoft Corporation | Incremental web crawler using chunks |
CN104408194A (en) * | 2014-12-15 | 2015-03-11 | 北京国双科技有限公司 | Acquisition method and device of web crawler request |
CN105207852A (en) * | 2015-10-09 | 2015-12-30 | 西安未来国际信息股份有限公司 | Method for directionally acquiring network data based on distributed mode |
Non-Patent Citations (4)
Title |
---|
夏征农: "《大辞海信息科学卷》", 31 December 2015, 上海辞书出版社 * |
李小平: "《网络影视课程编导论》", 30 April 2016, 北京理工大学出版社 * |
郑铁男: "《数字编辑实训教程》", 30 September 2017, 知识产权出版社 * |
韦鹏程: "《大数据巨量分析与机器学习的整合与开发》", 31 May 2017, 电子科技大学出版社 * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110008393A (en) * | 2018-12-29 | 2019-07-12 | 义语智能科技(上海)有限公司 | It is a kind of for obtaining the method and apparatus of site information |
CN110008393B (en) * | 2018-12-29 | 2023-03-07 | 义语智能科技(上海)有限公司 | Method and equipment for acquiring website information |
CN109885744B (en) * | 2019-01-07 | 2024-05-10 | 平安科技(深圳)有限公司 | Webpage data crawling method, device, system, computer equipment and storage medium |
CN109885744A (en) * | 2019-01-07 | 2019-06-14 | 平安科技(深圳)有限公司 | Web data crawling method, device, system, computer equipment and storage medium |
CN109831491A (en) * | 2019-01-15 | 2019-05-31 | 科大国创软件股份有限公司 | Intrusive social data acquisition method based on agency |
CN109831491B (en) * | 2019-01-15 | 2022-03-15 | 科大国创软件股份有限公司 | Invasive social data acquisition method based on agent |
CN112905866B (en) * | 2019-03-14 | 2022-06-07 | 福建省天奕网络科技有限公司 | Historical data tracing and crawling method and terminal without manual participation |
CN112905867A (en) * | 2019-03-14 | 2021-06-04 | 福建省天奕网络科技有限公司 | Efficient historical data tracing and crawling method and terminal |
CN112905867B (en) * | 2019-03-14 | 2022-06-07 | 福建省天奕网络科技有限公司 | Efficient historical data tracing and crawling method and terminal |
CN110062025A (en) * | 2019-03-14 | 2019-07-26 | 深圳绿米联创科技有限公司 | Method, apparatus, server and the storage medium of data acquisition |
CN112905866A (en) * | 2019-03-14 | 2021-06-04 | 福建省天奕网络科技有限公司 | Historical data tracing and crawling method and terminal without manual participation |
CN110069686A (en) * | 2019-03-15 | 2019-07-30 | 平安科技(深圳)有限公司 | User behavior analysis method, apparatus, computer installation and storage medium |
CN109992707A (en) * | 2019-03-18 | 2019-07-09 | 广州视源电子科技股份有限公司 | Data crawling method and device, storage medium and server |
CN110134858A (en) * | 2019-03-26 | 2019-08-16 | 国网重庆市电力公司 | Method for transformation, system, storage medium and the electronic equipment of unstructured data |
CN110262888B (en) * | 2019-06-26 | 2020-11-20 | 京东数字科技控股有限公司 | Task scheduling method and device and method and device for computing node to execute task |
CN110262888A (en) * | 2019-06-26 | 2019-09-20 | 京东数字科技控股有限公司 | The method and apparatus that method for scheduling task and device and calculate node execute task |
CN112579853A (en) * | 2019-09-30 | 2021-03-30 | 顺丰科技有限公司 | Method and device for sequencing crawling links and storage medium |
CN112948654A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Webpage crawling method and device and computer equipment |
CN111324797A (en) * | 2020-02-20 | 2020-06-23 | 民生科技有限责任公司 | Method and device for acquiring data accurately at high speed |
CN111324797B (en) * | 2020-02-20 | 2023-08-11 | 民生科技有限责任公司 | Method and device for precisely acquiring data at high speed |
CN111538883A (en) * | 2020-03-25 | 2020-08-14 | 北京市科学技术情报研究所 | Data crawling method, system and equipment |
CN111538883B (en) * | 2020-03-25 | 2023-11-17 | 北京市科学技术情报研究所 | Data crawling method, system and equipment |
CN111538550A (en) * | 2020-04-17 | 2020-08-14 | 姜海强 | Webpage information screening method based on image detection algorithm |
CN111753162A (en) * | 2020-06-29 | 2020-10-09 | 平安国际智慧城市科技股份有限公司 | Data crawling method, device, server and storage medium |
CN112199567A (en) * | 2020-09-27 | 2021-01-08 | 深圳市伊欧乐科技有限公司 | Distributed data acquisition method, system, server and storage medium |
CN112508362A (en) * | 2020-11-24 | 2021-03-16 | 江苏省质量和标准化研究院 | Product export information processing method and device, electronic equipment and storage medium |
CN112508362B (en) * | 2020-11-24 | 2024-04-23 | 江苏省质量和标准化研究院 | Product outlet information processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109033195A (en) | The acquisition methods of webpage information obtain equipment and computer-readable medium | |
US9614862B2 (en) | System and method for webpage analysis | |
US9203720B2 (en) | Monitoring the health of web page analytics code | |
WO2016173200A1 (en) | Malicious website detection method and system | |
US9954886B2 (en) | Method and apparatus for detecting website security | |
JP5695027B2 (en) | Method and system for acquiring AJAX web page content | |
CN109376291B (en) | Website fingerprint information scanning method and device based on web crawler | |
CN104331369B (en) | Page detection method and device, server based on browser | |
US8359317B2 (en) | Method and device for indexing resource content in computer networks | |
JP6103325B2 (en) | Method, apparatus and system for acquiring user behavior | |
CN104063401B (en) | The method and apparatus that a kind of webpage pattern address merges | |
CN111552854A (en) | Webpage data capturing method and device, storage medium and equipment | |
CN104021154B (en) | A kind of method and apparatus scanned in a browser | |
CN102262635A (en) | Page crawler system and page crawler method | |
CN106599270B (en) | Network data capturing method and crawler | |
CN108632219A (en) | A kind of website vulnerability detection method, detection service device and system | |
CN107590236B (en) | Big data acquisition method and system for building construction enterprises | |
Gheorghe et al. | Modern techniques of web scraping for data scientists | |
CN112395485A (en) | Policy big data mining method and device, computer equipment and storage medium | |
CN103455492B (en) | A kind of method and apparatus of search and webpage | |
US11023590B2 (en) | Security testing tool using crowd-sourced data | |
CN109246069B (en) | Webpage login method and device and readable storage medium | |
CN109062803B (en) | Method and device for automatically generating test case based on crawler | |
US20120215757A1 (en) | Web crawling using static analysis | |
CN114491210A (en) | Data acquisition method and device based on web crawler |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |