CN109255063A - A method and apparatus for crawling web page contents


Publication number
CN109255063A
Authority
CN
China
Legal status
Pending
Application number
CN201810864353.4A
Other languages
Chinese (zh)
Inventor
唐明东
覃柏瑜
Current Assignee
Pu Xin Heng Ye Technology Development (Beijing) Co Ltd
Pleasant Sunny Technology Development (Beijing) Co Ltd
Original Assignee
Pu Xin Heng Ye Technology Development (Beijing) Co Ltd
Pleasant Sunny Technology Development (Beijing) Co Ltd
Application filed by Pu Xin Heng Ye Technology Development (Beijing) Co Ltd and Pleasant Sunny Technology Development (Beijing) Co Ltd
Priority to CN201810864353.4A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)
  • Stored Programmes (AREA)

Abstract

Embodiments of the present invention provide a method for crawling web page contents. The method comprises: calling a corresponding page flow configuration according to task parameters; downloading target pages according to the page flow configuration; and extracting structured information from the target pages according to a page configuration. The invention solves the problem that a system must be restarted for maintenance reasons such as code releases, and gives the crawler system a hot deployment capability. Embodiments of the present invention also provide an apparatus for crawling web page contents, a device, and a computer-readable storage medium.

Description

A method and apparatus for crawling web page contents
Technical field
Embodiments of the present invention relate to the field of data mining technology. More specifically, embodiments of the present invention relate to a method for crawling web page contents, an apparatus for crawling web page contents, a device, and a computer-readable storage medium.
Background
This section is intended to provide background or context for the embodiments of the invention recited in the claims. The description here is not admitted to be prior art merely because it is included in this section.
A web crawler (also known as a "web spider") is a computer program that issues HTTP requests to a server to obtain the server's web pages, and parses those pages to obtain the required information.
Depending on whether the page flow is defined in advance, web crawlers fall into two classes: directed crawlers and non-directed crawlers. A directed crawler crawls a few specific pages of certain specific websites and extracts structured information according to a business model, for example flight and fare information between cities. A non-directed crawler starts from several seed links of a website: it first crawls the seed pages, extracts all the hyperlinks in them, then crawls the newly obtained hyperlinks, and repeats this cycle until all pages have been crawled or a specified link depth is reached. The crawlers of search engines such as Baidu and Google are typical non-directed crawlers.
Whereas a non-directed crawler must crawl an entire website or even the whole web, a directed crawler crawls far fewer pages. It does, however, demand high timeliness and accurate information extraction, sometimes has to interact with the user to obtain the user's authorization, and must adapt quickly to changes in the website.
Many open-source crawler frameworks exist, such as Nutch, Crawler4j, WebMagic, WebCollector and Scrapy. These frameworks are designed for non-directed crawling: they handle thread scheduling, page downloading and link traversal, and store the crawled pages in a Hadoop cluster or a local file system.
Companies that need directed crawling, however, generally have to develop their own systems. Such systems use a separate computer program for each website to download and parse its pages. Because these crawler systems are implemented as code in a particular programming language (for example Java, C++ or Python), they have the following significant drawbacks. First, maintenance such as upgrading, extending or adjusting the crawler system requires shutting it down and releasing the code again; that is, hot deployment is not possible. Second, the code for building page request parameters, extracting information and controlling the page flow is tangled together, and because of differences in programming language and coding style the code is unintuitive, hard to understand and hard to maintain.
Summary of the invention
To effectively solve the problem that a system must be restarted for maintenance reasons such as code releases, and to achieve hot deployment, embodiments of the present invention aim to provide a method for crawling web page contents, an apparatus for crawling web page contents, a device, and a computer-readable storage medium. With these, the crawler system has a hot deployment capability: maintenance such as upgrading, extending and adjusting the crawler system does not require restarting the service, which greatly improves system availability.
In a first aspect of embodiments of the present invention, a method for crawling web page contents is provided, comprising: calling a corresponding page flow configuration according to task parameters; downloading target pages according to the page flow configuration; and extracting structured information from the target pages according to a page configuration.
In one embodiment of the invention, the page configuration resides under the page flow configuration.
In another embodiment of the invention, the page flow configuration resides in a configuration rule base.
In yet another embodiment of the invention, the configuration rule base contains a custom function configuration.
In yet another embodiment of the invention, the data format of the custom function configuration, the page flow configuration and the page configuration is any one or more of XML, YML and JSON.
In yet another embodiment of the invention, the method further comprises: after extracting the structured information from the target pages according to the page configuration, persisting the structured information.
In yet another embodiment of the invention, persisting means placing the structured information in any one or more of a database, a cache and a file system.
In a second aspect of embodiments of the present invention, an apparatus for crawling web page contents is provided, comprising: a matching module for calling a corresponding page flow configuration according to task parameters; a download module for downloading target pages according to the page flow configuration; and an extraction module for extracting structured information from the target pages according to a page configuration.
In one embodiment of the invention, the page configuration resides under the page flow configuration.
In another embodiment of the invention, the page flow configuration resides in a configuration rule base.
In yet another embodiment of the invention, the configuration rule base contains a custom function configuration.
In yet another embodiment of the invention, the data format of the custom function configuration, the page flow configuration and the page configuration is any one or more of XML, YML and JSON.
In yet another embodiment of the invention, the apparatus further comprises: a persistence module for persisting the structured information after it has been extracted from the target pages according to the page configuration.
In yet another embodiment of the invention, persisting means placing the structured information in any one or more of a database, a cache and a file system.
In a third aspect of embodiments of the present invention, a device is provided, comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, wherein any one of the methods described above is implemented when the computer program is executed.
In a fourth aspect of embodiments of the present invention, a computer-readable storage medium is provided on which a computer program is stored; when the computer program is executed by a processor, any one of the methods described above can be implemented.
The method for crawling web page contents, the apparatus for crawling web page contents, the device and the computer-readable storage medium according to embodiments of the present invention effectively solve the problem that a system must be restarted for maintenance reasons such as code releases, and achieve hot deployment.
In addition, compared with a crawler implemented in program code, the configuration-file-based crawler solution of the present invention is standardized, concise, intuitive and easy to understand. Anyone who understands XML, JSON and YML files can develop and maintain a crawler, so there is no longer a strong dependence on a programming language, which lowers the barrier to crawler development.
Brief description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become easier to understand by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the invention are shown by way of example and not limitation, in which:
Fig. 1 schematically shows a flowchart of a method for crawling web page contents according to an embodiment of the present invention;
Fig. 2 schematically shows a structural diagram of an apparatus for crawling web page contents according to an embodiment of the present invention;
Fig. 3 schematically shows a structural diagram of a device according to an embodiment of the present invention; and
Fig. 4 schematically shows a diagram of a computer-readable storage medium according to an embodiment of the present invention.
In the drawings, identical or corresponding reference numerals denote identical or corresponding parts.
Detailed description of the embodiments
The principle and spirit of the present invention are described below with reference to several exemplary embodiments. It should be understood that these embodiments are given only to enable those skilled in the art to better understand and thereby implement the invention, and not to limit its scope in any way. Rather, they are provided so that this disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present invention may be implemented as a system, apparatus, device, method or computer program product. Accordingly, the present disclosure may take the form of entirely hardware, entirely software (including firmware, resident software, microcode and the like), or a combination of hardware and software.
According to embodiments of the present invention, a method for crawling web page contents, an apparatus for crawling web page contents, a device, and a computer-readable storage medium are proposed.
It should be understood that hot deployment means upgrading software or a system while the application or system is running, without restarting it. This understanding is consistent with the general understanding of those skilled in the art and has no special meaning here. In addition, any number of elements in the drawings is given by way of example and not limitation, and any naming is used only for distinction and carries no limiting meaning.
The principle and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Overview of the invention
The inventors found that existing solutions for crawling web page contents are implemented as code in a particular programming language (for example Java, C++ or Python) and have the following significant drawbacks. First, maintenance such as upgrading, extending or adjusting the crawler system requires shutting it down and releasing the code again; that is, hot deployment is not possible. Second, the code for building page request parameters, extracting information and controlling the page flow is tangled together, and because of differences in programming language and coding style the code is unintuitive, hard to understand and hard to maintain.
The technical solution provided by the present invention describes the crawler logic in configuration files of a general standard format (for example XML, JSON or YML), and a crawler engine completes the crawling task by parsing and executing these configuration files. This effectively solves the problem that a system must be restarted for maintenance reasons such as code releases, and gives the crawler solution a hot deployment capability. In addition, compared with a crawler implemented in program code, the crawler solution of the present invention is standardized, concise, intuitive and easy to understand.
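To make the idea of a configuration-driven crawler engine concrete, the following is a minimal sketch in Python and not the patented implementation. The configuration keys (`pages`, `url`, `fields`), the `task_type` parameter, the example URL and the use of regular expressions as extraction rules are all assumptions introduced for illustration. The fetch function is passed in so that the sketch runs without network access.

```python
import re

# Hypothetical configuration: every key and URL below is an assumption.
CONFIG = {
    "hotel": {
        "pages": [
            {
                "id": "listing",
                "url": "https://example.com/hotels?city={city}",
                "fields": {
                    "name": r"<h2>(.*?)</h2>",
                    "price": r'<span class="price">(.*?)</span>',
                },
            }
        ]
    }
}


def run_task(task_params, config, fetch):
    """Interpret the configuration: pick a flow, download each page, extract fields."""
    flow = config[task_params["task_type"]]              # choose the page flow
    results = {}
    for page in flow["pages"]:
        raw = fetch(page["url"].format(**task_params))   # download the target page
        record = {}
        for name, pattern in page["fields"].items():     # extract structured fields
            match = re.search(pattern, raw, re.S)
            record[name] = match.group(1) if match else None
        results[page["id"]] = record
    return results


def fake_fetch(url):
    """Stand-in for a real downloader so the sketch is self-contained."""
    return '<h2>Sunny Inn</h2><span class="price">320</span>'


result = run_task({"task_type": "hotel", "city": "Beijing"}, CONFIG, fake_fetch)
print(result)
```

Because all site-specific knowledge lives in `CONFIG`, supporting a new website in this sketch means adding data rather than changing `run_task`.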
Having introduced the basic principle of the present invention, various non-limiting embodiments of the invention are described in detail below.
Overview of application scenarios
According to embodiments of the present invention, the application scenarios in which the invention can be implemented fall within the broad field of data mining; more specifically, the application scenario of the invention is crawling web page contents.
Exemplary method
A method for crawling web page contents according to an exemplary embodiment of the present invention is described below with reference to Fig. 1. It should be noted that the above application scenario is shown only to facilitate understanding of the spirit and principle of the invention, and embodiments of the invention are not limited in this respect. Rather, embodiments of the invention can be applied to any applicable scenario.
Fig. 1 schematically shows a flowchart of a method for crawling web page contents according to an embodiment of the present invention. The method is usually carried out by a computer, an intelligent terminal or a similar device. Specifically, the method may include:
S110: call the corresponding page flow configuration according to the task parameters.
In general, when a task of crawling web page contents is executed, the caller submits the task to a task executor. Inside the system a task is expressed as parameters, and different tasks have different task parameters. For example, crawling "train ticket information" and crawling "hotel information" have different task parameters because, in business practice, the two kinds of data are structured differently. Crawling information with different data structures naturally requires different approaches, which in the present invention shows up as different task parameters. The first step is therefore to match the corresponding page flow configuration to the specific task parameters and call it.
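The matching in step S110 can be pictured as a lookup keyed by the task parameters. The sketch below is illustrative only: the source does not say which parameter selects the configuration, so the `task_type` key, the registry contents and the function name `match_page_flow` are assumptions.

```python
# Hypothetical in-memory registry: task type -> page flow configuration.
PAGE_FLOW_CONFIGS = {
    "train_ticket": {"flow": "train_ticket", "pages": ["search", "result"]},
    "hotel": {"flow": "hotel", "pages": ["city", "listing", "detail"]},
}


def match_page_flow(task_params, configs=PAGE_FLOW_CONFIGS):
    """Return the page flow configuration that corresponds to the task parameters."""
    task_type = task_params.get("task_type")
    try:
        return configs[task_type]
    except KeyError:
        raise LookupError(
            f"no page flow configuration for task type {task_type!r}"
        ) from None


task = {"task_type": "hotel", "city": "Beijing"}
print(match_page_flow(task)["pages"])
```

Failing loudly when no configuration matches keeps a mistyped task from silently crawling nothing.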
As a configuration file, a page flow configuration is preset according to the characteristics of different web pages. It defines the topology of the pages, that is, the order of the pages and sub-pages, their prerequisites, and so on. For the avoidance of doubt, in the present invention the crawling of multiple web pages within one crawling task is referred to as a page flow. A page flow configuration can be expressed in a data format such as XML, JSON or YML.
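As an illustration of what such a topology might look like, the sketch below expresses a page flow as JSON and derives a download order that respects each page's prerequisites. The schema (`pages`, `id`, `url`, `requires`) and the URLs are hypothetical; the source does not publish a schema.

```python
import json

# Hypothetical page flow configuration; every key name here is an assumption.
FLOW_JSON = """
{
  "flow": "train_ticket",
  "pages": [
    {"id": "detail", "url": "https://example.com/detail", "requires": ["list"]},
    {"id": "login",  "url": "https://example.com/login",  "requires": []},
    {"id": "list",   "url": "https://example.com/list",   "requires": ["login"]}
  ]
}
"""


def execution_order(flow):
    """Return page ids so that every page comes after its prerequisites."""
    pages = {page["id"]: page for page in flow["pages"]}
    ordered, visiting = [], set()

    def visit(page_id):
        if page_id in ordered:
            return
        if page_id in visiting:
            raise ValueError(f"cyclic prerequisite at {page_id!r}")
        visiting.add(page_id)
        for dependency in pages[page_id]["requires"]:
            visit(dependency)
        visiting.discard(page_id)
        ordered.append(page_id)

    for page_id in pages:
        visit(page_id)
    return ordered


flow = json.loads(FLOW_JSON)
print(execution_order(flow))
```

The depth-first traversal is one simple way to turn declared prerequisites into an order; a real engine could equally store the order explicitly in the configuration.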
S120: download the target pages according to the page flow configuration.
In this step, the task executor follows the description in the page flow configuration and invokes a page downloader to download the required pages (the target pages). At this point the target pages are still raw pages that have not been structured.
After being downloaded, the target pages can be stored in a database, a cache, a file system or a similar location; the present invention places no limitation on this.
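A page downloader driven by the page flow configuration might be sketched as follows. The flow schema and the function names are assumptions; `urllib.request.urlopen` from the Python standard library is used as the default fetcher, and a fake fetcher is injected in the usage lines so that the example does not touch the network.

```python
from urllib.request import urlopen


def fetch_over_http(url, timeout=10):
    """Default fetcher: return the raw page body as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def download_pages(flow, fetch=fetch_over_http):
    """Download every page listed in the flow, in order; return {page id: raw page}."""
    raw_pages = {}
    for page in flow["pages"]:
        raw_pages[page["id"]] = fetch(page["url"])
    return raw_pages


# Usage with a fake fetcher (hypothetical URL, no network access needed).
flow = {"pages": [{"id": "list", "url": "https://example.com/list"}]}
fake_fetch = lambda url: f"<html><!-- {url} --></html>"
raw_pages = download_pages(flow, fetch=fake_fetch)
print(raw_pages)
```

The returned dictionary holds raw, unstructured pages; where they are stored afterwards is left open, as in the text above.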
S130: extract the structured information from the target pages according to the page configuration.
In this step, the task executor invokes a page parser to extract the structured information from the target pages according to the page configuration; that is, the downloaded raw pages are processed further to extract the useful, structured information.
As a configuration file, a page configuration is preset according to different page types. It defines how to parse the downloaded raw pages, as well as the content and types of the information to extract.
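The following sketch shows one way a page configuration could drive extraction. It is an illustration under stated assumptions: the `fields`, `pattern` and `type` keys are hypothetical, and regular expressions stand in for whatever extraction rules (for example XPath or CSS selectors) a real system would use.

```python
import re

# Hypothetical page configuration: what to extract and which type each value has.
PAGE_CONFIG = {
    "page": "train_list",
    "fields": [
        {"name": "train_no", "pattern": r'<td class="no">(.*?)</td>', "type": "str"},
        {"name": "price", "pattern": r'<td class="price">(.*?)</td>', "type": "float"},
    ],
}

# Supported value types (an assumption made for this sketch).
CASTS = {"str": str, "int": int, "float": float}


def extract(raw_html, page_config):
    """Apply each configured rule to the raw page and build one structured record."""
    record = {}
    for field in page_config["fields"]:
        match = re.search(field["pattern"], raw_html, re.S)
        if match is None:
            record[field["name"]] = None
        else:
            record[field["name"]] = CASTS[field["type"]](match.group(1).strip())
    return record


html = '<tr><td class="no">G101</td><td class="price"> 553.0 </td></tr>'
record = extract(html, PAGE_CONFIG)
print(record)
```

Changing which fields are extracted, or how, only requires editing `PAGE_CONFIG`; the `extract` function stays the same.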
Preferably, the page configuration resides under the page flow configuration; that is, one specific page flow configuration may contain one or several different page configurations. This improves working efficiency: the logic from page flow configuration to page configuration is clear, and the task executor can read the configuration information conveniently.
Preferably, the page flow configuration resides in a configuration rule base. In the present invention a separate configuration rule base can be set up and one or more page flow configurations placed in it. This improves working efficiency: with the different page flow configurations kept in the same logical or physical location, the task executor can read them faster and work more efficiently.
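One simple reading of a configuration rule base is a directory of configuration files loaded into a single lookup table. The sketch below assumes one JSON file per page flow and a `flow` key naming it, with page configurations nested under `pages`; none of these details come from the source.

```python
import json
import tempfile
from pathlib import Path


def load_rule_library(directory):
    """Load every JSON page flow configuration in a directory, keyed by flow name."""
    library = {}
    for path in sorted(Path(directory).glob("*.json")):
        config = json.loads(path.read_text(encoding="utf-8"))
        library[config["flow"]] = config
    return library


# Usage: build a throwaway rule base with two hypothetical flows.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "hotel.json").write_text(
        '{"flow": "hotel", "pages": [{"id": "listing", "fields": []}]}',
        encoding="utf-8",
    )
    Path(tmp, "train.json").write_text(
        '{"flow": "train_ticket", "pages": []}', encoding="utf-8"
    )
    library = load_rule_library(tmp)

print(sorted(library))
```

Keeping all flows in one place means the task executor needs only one location and one loading routine, which is the efficiency argument made above.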
More preferably, the configuration rule base contains a custom function configuration. Functions are the basis for crawling web page information, and the configuration files of the present invention support custom functions to extend the expressive power of the configuration files themselves. As an example, the custom function configuration can consist of one function registration file and multiple function files. A finished function is registered in the function registration file, so that its function file is placed under the function registration file. When the task executor reads the configuration rule base, it can call a function file by its file name. As an example, functions can be written in JavaScript; because JavaScript needs neither compilation nor a restart, hot deployment of the invention can be achieved.
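The registration-file idea can be sketched as follows. The source names JavaScript for the function files; Python is substituted here purely so that the example is self-contained, and the registry layout (`functions.json` mapping a function name to a file name) is an assumption. Note that loading functions from files executes whatever code those files contain, so the rule base must be trusted.

```python
import json
import tempfile
from pathlib import Path


def load_custom_functions(registry_path):
    """Read the registration file and load each registered function from its file."""
    registry_path = Path(registry_path)
    registry = json.loads(registry_path.read_text(encoding="utf-8"))
    functions = {}
    for name, filename in registry["functions"].items():
        source = (registry_path.parent / filename).read_text(encoding="utf-8")
        namespace = {}
        # Interpreted at load time: no separate build step and no process restart.
        exec(compile(source, filename, "exec"), namespace)
        functions[name] = namespace[name]
    return functions


# Usage: one hypothetical function file plus its registration file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "strip_currency.py").write_text(
        "def strip_currency(text):\n"
        "    return float(text.replace('CNY', '').strip())\n",
        encoding="utf-8",
    )
    Path(tmp, "functions.json").write_text(
        json.dumps({"functions": {"strip_currency": "strip_currency.py"}}),
        encoding="utf-8",
    )
    functions = load_custom_functions(Path(tmp, "functions.json"))

print(functions["strip_currency"]("CNY 553.0"))
```

A page configuration could then refer to `strip_currency` by name, which is how custom functions extend what the configuration files can express.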
Preferably, in the present invention the data format of configuration files such as the custom function configuration, the page flow configuration and the page configuration is any one or more of XML, YML and JSON. The invention completes a crawling task by parsing configuration files in these formats, and upgrading, extending and maintaining the crawler system is done by updating the configuration files, which achieves hot deployment.
Because the configuration files of the present invention completely describe the logic of crawling and parsing web page contents and the topological relationships between pages, no code has to be written to control the order, containment or preconditions of pages, which further supports hot deployment.
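Hot deployment through configuration updates can be approximated by re-reading a configuration file whenever it changes on disk. The sketch below is one possible mechanism and not the one described in the source: the `HotConfig` class is hypothetical, and it detects changes by comparing the file's modification time and size.

```python
import json
import tempfile
from pathlib import Path


class HotConfig:
    """Re-read a JSON configuration file whenever it changes on disk."""

    def __init__(self, path):
        self.path = Path(path)
        self._stamp = None
        self._config = None

    def get(self):
        stat = self.path.stat()
        stamp = (stat.st_mtime_ns, stat.st_size)
        if stamp != self._stamp:          # file is new or has been modified
            self._config = json.loads(self.path.read_text(encoding="utf-8"))
            self._stamp = stamp
        return self._config


# Usage: the running process picks up the edit without being restarted.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp, "flow.json")
    path.write_text('{"version": 1}', encoding="utf-8")
    hot = HotConfig(path)
    first = hot.get()["version"]
    path.write_text('{"version": 22}', encoding="utf-8")
    second = hot.get()["version"]

print(first, second)
```

A task executor that calls `get()` at the start of every task would see updated rules on the next task, while tasks already running keep the configuration they started with.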
Preferably, the method may further include:
S140: after extracting the structured information from the target pages according to the page configuration, persist the structured information.
In this step, the task executor persists the parsing result (the structured information) so that the caller can obtain it through an interactive interface.
As an example, persisting means placing the structured information in any one or more of a database, a cache and a file system so that the caller can read it.
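As one example of the database option, the sketch below persists a structured record in SQLite and reads it back by task identifier. The table layout, the `task_id` key and the function names are assumptions; an in-memory database is used only to keep the example self-contained.

```python
import json
import sqlite3


def persist(connection, task_id, record):
    """Store one structured record as JSON text, keyed by task id."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS results (task_id TEXT PRIMARY KEY, payload TEXT)"
    )
    connection.execute(
        "INSERT OR REPLACE INTO results (task_id, payload) VALUES (?, ?)",
        (task_id, json.dumps(record, ensure_ascii=False)),
    )
    connection.commit()


def load(connection, task_id):
    """Return the stored record for a task, or None if nothing was persisted."""
    row = connection.execute(
        "SELECT payload FROM results WHERE task_id = ?", (task_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")
persist(conn, "task-1", {"train_no": "G101", "price": 553.0})
print(load(conn, "task-1"))
```

The caller only needs the task identifier to retrieve the result, which matches the role of the interactive interface mentioned above.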
Exemplary apparatus
Having described the method of an exemplary embodiment of the present invention, an apparatus for crawling web page contents according to an exemplary embodiment is described next with reference to Fig. 2.
Fig. 2 schematically shows a structural diagram of an apparatus for crawling web page contents according to an embodiment of the present invention. In general, the apparatus can be provided as an independent unit; of course, embodiments of the invention do not exclude the apparatus, or a part of it, being provided in a server or in another device, and the invention places no limitation on this. The apparatus for crawling web page contents may include a matching module 210, a download module 220 and an extraction module 230. Specifically:
Matching module 210, for calling the corresponding page flow configuration according to the task parameters.
In general, when a task of crawling web page contents is executed, the caller submits the task to a task executor. Inside the system a task is expressed as parameters, and different tasks have different task parameters. For example, crawling "train ticket information" and crawling "hotel information" have different task parameters because, in business practice, the two kinds of data are structured differently. Crawling information with different data structures naturally requires different approaches, which in the present invention shows up as different task parameters. The first step is therefore to match the corresponding page flow configuration to the specific task parameters and call it.
As a configuration file, a page flow configuration is preset according to the characteristics of different web pages. It defines the topology of the pages, that is, the order of the pages and sub-pages, their prerequisites, and so on. For the avoidance of doubt, in the present invention the crawling of multiple web pages within one crawling task is referred to as a page flow. A page flow configuration can be expressed in a data format such as XML, JSON or YML.
Download module 220, for downloading the target pages according to the page flow configuration.
In this module, the task executor follows the description in the page flow configuration and invokes a page downloader to download the required pages (the target pages). At this point the target pages are still raw pages that have not been structured.
After being downloaded, the target pages can be stored in a database, a cache, a file system or a similar location; the present invention places no limitation on this.
Extraction module 230, for extracting the structured information from the target pages according to the page configuration.
In this module, the task executor invokes a page parser to extract the structured information from the target pages according to the page configuration; that is, the downloaded raw pages are processed further to extract the useful, structured information.
As a configuration file, a page configuration is preset according to different page types. It defines how to parse the downloaded raw pages, as well as the content and types of the information to extract.
Preferably, the page configuration resides under the page flow configuration; that is, one specific page flow configuration may contain one or several different page configurations. This improves working efficiency: the logic from page flow configuration to page configuration is clear, and the task executor can read the configuration information conveniently.
Preferably, the page flow configuration resides in a configuration rule base. In the present invention a separate configuration rule base can be set up and one or more page flow configurations placed in it. This improves working efficiency: with the different page flow configurations kept in the same logical or physical location, the task executor can read them faster and work more efficiently.
More preferably, the configuration rule base contains a custom function configuration. Functions are the basis for crawling web page information, and the configuration files of the present invention support custom functions to extend the expressive power of the configuration files themselves. As an example, the custom function configuration can consist of one function registration file and multiple function files. A finished function is registered in the function registration file, so that its function file is placed under the function registration file. When the task executor reads the configuration rule base, it can call a function file by its file name. As an example, functions can be written in JavaScript; because JavaScript needs neither compilation nor a restart, hot deployment of the invention can be achieved.
Preferably, in the present invention the data format of configuration files such as the custom function configuration, the page flow configuration and the page configuration is any one or more of XML, YML and JSON. The invention completes a crawling task by parsing configuration files in these formats, and upgrading, extending and maintaining the crawler system is done by updating the configuration files, which achieves hot deployment.
Because the configuration files of the present invention completely describe the logic of crawling and parsing web page contents and the topological relationships between pages, no code has to be written to control the order, containment or preconditions of pages, which further supports hot deployment.
Preferably, the apparatus may further include:
Persistence module 240, for persisting the structured information after it has been extracted from the target pages according to the page configuration.
In this module, the task executor persists the parsing result (the structured information) so that the caller can obtain it through an interactive interface.
As an example, persisting means placing the structured information in any one or more of a database, a cache and a file system so that the caller can read it.
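The module structure of the apparatus can be pictured as a small composition of interchangeable parts. The sketch below is illustrative only: the class name `CrawlerApparatus`, the field names and the callable signatures are assumptions, and the lambdas in the usage lines are trivial stand-ins for real modules.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CrawlerApparatus:
    matcher: Callable[[dict], dict]                      # matching module 210
    downloader: Callable[[dict], dict]                   # download module 220
    extractor: Callable[[dict, dict], dict]              # extraction module 230
    persister: Optional[Callable[[dict], None]] = None   # persistence module 240

    def run(self, task_params):
        flow = self.matcher(task_params)
        raw_pages = self.downloader(flow)
        info = self.extractor(raw_pages, flow)
        if self.persister is not None:
            self.persister(info)
        return info


# Usage with stand-in modules; a list plays the role of the persistence target.
stored = []
apparatus = CrawlerApparatus(
    matcher=lambda task: {"pages": [task["task_type"]]},
    downloader=lambda flow: {page: f"<raw {page}>" for page in flow["pages"]},
    extractor=lambda raw_pages, flow: {page: len(raw) for page, raw in raw_pages.items()},
    persister=stored.append,
)
info = apparatus.run({"task_type": "hotel"})
print(info, stored)
```

Treating each module as a replaceable callable also reflects the later remark that the division into modules is exemplary: two callables could be merged into one, or one split into several, without changing `run`.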
Exemplary device
Having described the method and apparatus of exemplary embodiments of the present invention, a device according to an exemplary embodiment is described next with reference to Fig. 3.
Fig. 3 shows a block diagram of an exemplary computer system/server 30 suitable for implementing embodiments of the present invention. The computer system/server 30 shown in Fig. 3 is only an example and should not impose any limitation on the functions or scope of use of embodiments of the invention.
As shown in Fig. 3, the computer system/server 30 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 301, a system memory 302, and a bus 303 connecting the different system components (including the system memory 302 and the processing unit 301).
The computer system/server 30 typically includes a variety of computer-system-readable media. These media can be any available media accessible by the computer system/server 30, including volatile and non-volatile media and removable and non-removable media.
The system memory 302 may include computer-system-readable media in the form of volatile memory, for example a random access memory (RAM) 3021 and/or a cache memory 3022. The computer system/server 30 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the ROM 3023 can be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 3, commonly called a "hard disk drive"). Although not shown in Fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (for example a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (for example a CD-ROM, DVD-ROM or other optical medium) can be provided. In these cases, each drive can be connected to the bus 303 through one or more data media interfaces. The system memory 302 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 3025 having a set of (at least one) program modules 3024 may be stored, for example, in the system memory 302. Such program modules 3024 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 3024 generally carry out the functions and/or methods of the embodiments described in the present invention.
The computer system/server 30 can also communicate with one or more external devices 304 (for example a keyboard, a pointing device or a display). This communication can take place through an input/output (I/O) interface 305. The computer system/server 30 can also communicate with one or more networks (for example a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 306. As shown in Fig. 3, the network adapter 306 communicates with the other modules of the computer system/server 30 (such as the processing unit 301) over the bus 303. It should be understood that, although not shown in Fig. 3, other hardware and/or software modules can be used in conjunction with the computer system/server 30.
By running the computer program stored in the system memory 302, the processing unit 301 performs various functional applications and data processing, for example executing instructions that implement the steps of the above method embodiments. Specifically, the processing unit 301 can execute the computer program stored in the system memory 302, and when the computer program is executed the following instructions are run: calling the corresponding page flow configuration according to the task parameters; downloading the target pages according to the page flow configuration; and extracting the structured information from the target pages according to the page configuration.
Exemplary media
Having described the method, apparatus and equipment of the exemplary embodiments of the present invention, a computer-readable storage medium according to an exemplary embodiment of the present invention is described next with reference to Fig. 4.
The computer-readable storage medium of Fig. 4 is an optical disc 40 on which a computer program (i.e. a program product) is stored. When executed by a processor, the program implements the steps recorded in the method embodiments above, for example: calling the corresponding page stream configuration according to task parameters; downloading the target page according to the page stream configuration; and extracting the structured information in the target page according to the page configuration.
It should be noted that, although several modules of a device for crawling web page contents are mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to embodiments of the present invention, the features and functions of two or more of the modules described above may be embodied in a single module. Conversely, the features and functions of one module described above may be further divided and embodied by multiple modules.
In addition, although the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the operations shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the invention is not limited to the specific embodiments disclosed, and that the division into aspects does not mean that features in those aspects cannot be combined to advantage; the division is made only for convenience of presentation. The present invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (16)

1. A method for crawling web page contents, characterized by comprising:
calling a corresponding page stream configuration according to task parameters;
downloading a target page according to the page stream configuration;
extracting structured information in the target page according to a page configuration.
2. The method as claimed in claim 1, characterized in that the page configuration resides under the page stream configuration.
3. The method as claimed in claim 2, characterized in that the page stream configuration resides in a configuration rule library.
4. The method as claimed in claim 3, characterized in that the configuration rule library contains a custom function configuration.
5. The method as claimed in claim 4, characterized in that the data format of the custom function configuration, the page stream configuration and the page configuration is any one or more of XML format, YML format and JSON format.
6. The method as claimed in claims 1-5, characterized by further comprising: after the structured information in the target page is extracted according to the page configuration, persisting the structured information.
7. The method as claimed in claim 6, characterized in that the persisting means placing the structured information in any one or more of a database, a cache and a file system.
8. A device for crawling web page contents, characterized by comprising:
a matching module, for calling a corresponding page stream configuration according to task parameters;
a download module, for downloading a target page according to the page stream configuration;
an extraction module, for extracting structured information in the target page according to a page configuration.
9. The device as claimed in claim 8, characterized in that the page configuration resides under the page stream configuration.
10. The device as claimed in claim 9, characterized in that the page stream configuration resides in a configuration rule library.
11. The device as claimed in claim 10, characterized in that the configuration rule library contains a custom function configuration.
12. The device as claimed in claim 11, characterized in that the data format of the custom function configuration, the page stream configuration and the page configuration is any one or more of XML format, YML format and JSON format.
13. The device as claimed in claims 8-12, characterized by further comprising:
a persistence module, for persisting the structured information after the structured information in the target page is extracted according to the page configuration.
14. The device as claimed in claim 13, characterized in that the persisting means placing the structured information in any one or more of a database, a cache and a file system.
15. Equipment, comprising:
a memory, for storing a computer program;
a processor, for executing the computer program stored in the memory, wherein the method of any one of claims 1-7 is implemented when the computer program is executed.
16. A computer-readable storage medium on which a computer program is stored, wherein the method of any one of claims 1-7 is implemented when the computer program is executed by a processor.
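As an illustration of the configuration formats of claims 5 and 12 and the persistence targets of claims 7 and 14, the following Python sketch parses the same hypothetical page configuration from JSON and from XML, then writes one extracted record to a database and to a file. The configuration layout, the `records` table and the helper names are assumptions made for this example. YML is left out because parsing it needs a third-party library, and the cache target is omitted for brevity.

```python
import json
import os
import sqlite3
import tempfile
import xml.etree.ElementTree as ET
from typing import Dict

# The same hypothetical page configuration expressed in two of the claimed formats.
JSON_CONFIG = '{"fields": {"title": "<h1>(.*?)</h1>"}}'
XML_CONFIG = '<page><field name="title">&lt;h1&gt;(.*?)&lt;/h1&gt;</field></page>'


def load_page_config(text: str, fmt: str) -> Dict[str, str]:
    """Parse a page configuration into {field name: extraction pattern}."""
    if fmt == "json":
        return dict(json.loads(text)["fields"])
    if fmt == "xml":
        root = ET.fromstring(text)
        return {f.get("name"): (f.text or "") for f in root.iter("field")}
    raise ValueError(f"unsupported format: {fmt}")


def persist_to_database(conn: sqlite3.Connection, record: dict) -> None:
    """Store the record as a JSON payload in a (hypothetical) 'records' table."""
    conn.execute("CREATE TABLE IF NOT EXISTS records (payload TEXT)")
    conn.execute("INSERT INTO records (payload) VALUES (?)", (json.dumps(record),))
    conn.commit()


def persist_to_file(path: str, record: dict) -> None:
    """Append the record to a JSON Lines file on the file system."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


# Both formats yield the same in-memory configuration.
config_from_json = load_page_config(JSON_CONFIG, "json")
config_from_xml = load_page_config(XML_CONFIG, "xml")

# Persist one extracted record to an in-memory SQLite database and to a temp file.
record = {"title": "Demo item"}
conn = sqlite3.connect(":memory:")
persist_to_database(conn, record)
stored = json.loads(conn.execute("SELECT payload FROM records").fetchone()[0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.jsonl")
    persist_to_file(path, record)
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

conn.close()
```

Normalizing every format to one in-memory dictionary keeps the extraction code independent of how a given rule was written, which is a design choice of this sketch rather than something the claims require.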

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810864353.4A CN109255063A (en) 2018-08-01 2018-08-01 A kind of method and apparatus crawling web page contents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810864353.4A CN109255063A (en) 2018-08-01 2018-08-01 A kind of method and apparatus crawling web page contents

Publications (1)

Publication Number Publication Date
CN109255063A true CN109255063A (en) 2019-01-22

Family

ID=65049385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810864353.4A Pending CN109255063A (en) 2018-08-01 2018-08-01 A kind of method and apparatus crawling web page contents

Country Status (1)

Country Link
CN (1) CN109255063A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399908A (en) * 2013-07-30 2013-11-20 北京北纬通信科技股份有限公司 Method and system for fetching business data
CN105045838A (en) * 2015-07-01 2015-11-11 华东师范大学 Network crawler system based on distributed storage system
CN105955984A (en) * 2016-04-19 2016-09-21 ***股份有限公司 Network data searching method based on crawler mode

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800011A (en) * 2019-02-02 2019-05-24 深圳携程网络技术有限公司 Ticket query method, apparatus based on crawler, electronic equipment, storage medium
CN110134403A (en) * 2019-06-04 2019-08-16 厦门大学嘉庚学院 Configurable domain name mapping crawler frame and method based on asynchronous HTTP request
CN110134403B (en) * 2019-06-04 2022-08-12 厦门大学嘉庚学院 Configurable domain name resolution crawler frame and method based on asynchronous HTTP request
CN111460253A (en) * 2020-03-24 2020-07-28 国家电网有限公司 Internet data capture method suitable for big data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190122