CN106599001A - Webpage content acquisition method and system

Info

Publication number: CN106599001A
Application number: CN201510680981.3A
Authority: CN
Inventors: 庞涛; 武娟; 钱锋
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2015-10-20
Filing date: 2015-10-20
Publication date: 2017-04-26

Abstract

The invention discloses a webpage content acquisition method and system. The method comprises: acquiring a target URL; acquiring the corresponding target webpage according to the target URL; converting the content displayed on the target webpage into an image format to obtain a target image; and recognizing the text information in the target image. By converting the target webpage into an image and then recognizing the content of the image, the content of the target webpage can be acquired without acquiring its source code, so the method is broadly applicable.

Description

Webpage content acquisition method and system

Technical field

The present invention relates to the Internet field, and in particular to a webpage content acquisition method and system.

Background art

A traditional crawler starts from one or several initial uniform resource locators (URLs) and obtains the URLs and other content on the webpages corresponding to the initial URLs. New URLs found on the current page are put into a queue to be crawled in turn, until a certain stop condition of the system is met. All content captured by the crawler is stored, then classified, analyzed, and filtered according to keywords, text, pictures, audio, video, and so on, and indexed for later query and retrieval. After obtaining a target URL, an existing crawler system obtains the content of the target webpage through the flow shown in Fig. 1, which includes:

Step S102, obtain the webpage source code of the target webpage.

Step S104, parse the target information in the source code.

Step S106, save the parsed result to a database.

However, some websites take anti-crawler measures that prevent a crawler from obtaining the webpage source code, so the crawler cannot complete the acquisition of the target webpage information.

Summary of the invention

The technical problem to be solved by the embodiments of the present invention is: how to obtain the content information of a target webpage without obtaining the webpage source code.

According to one aspect of the embodiments of the present invention, a webpage content acquisition method is provided, including: obtaining a target URL; obtaining the corresponding target webpage according to the target URL; processing the content displayed by the target webpage into a picture format to obtain a target picture; and recognizing the text information in the target picture.

In one embodiment, the method also includes: obtaining the target URL using web crawler technology; and obtaining the corresponding target webpage according to the target URL using a browser.

In one embodiment, the method also includes: cropping the target picture to obtain a recognition region of the target picture; and recognizing the text information in the recognition region of the target picture.

In one embodiment, recognizing the text information in the target picture includes: recognizing the text information in the target picture by a server cluster or a cloud computing resource pool.

In one embodiment, recognizing the text information in the target picture includes: recognizing the text information in the target picture using optical character recognition technology.

In one embodiment, the method also includes: performing data cleaning, classified storage, and/or index building on the recognized text information.

According to a second aspect of the embodiments of the present invention, a webpage content acquisition system is provided, including: a URL acquisition module for obtaining a target URL; a webpage parsing module for obtaining the corresponding target webpage according to the target URL; a picture acquisition module for processing the content displayed by the target webpage into a picture format to obtain a target picture; and a recognition module for recognizing the text information in the target picture.

In one embodiment, the system also includes a cropping module for cropping the target picture to obtain a recognition region of the target picture, and the recognition module is used to recognize the text information in the recognition region of the target picture.

In one embodiment, the recognition module is used to recognize the text information in the target picture using optical character recognition technology.

In one embodiment, the system also includes: a data cleaning module for performing data cleaning on the recognized text information; a classified storage module for storing the recognized text information by category; and/or an index module for building an index on the recognized text information.

In one embodiment, the system also includes a web crawler, a browser, and a server cluster or cloud computing resource pool; the web crawler includes the URL acquisition module, the browser includes the webpage parsing module and the picture acquisition module, and the server cluster or cloud computing resource pool includes the recognition module.

The present invention has at least the following advantage: by converting the target webpage into a picture and then recognizing the content of the picture, the content of the target webpage can be obtained without obtaining the source code, so the method is more widely applicable.

Other features and advantages of the present invention will become clear from the following detailed description of exemplary embodiments of the invention with reference to the drawings.

Description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a webpage content acquisition method in the prior art.

Fig. 2 is a schematic flowchart of one embodiment of the webpage content acquisition method of the present invention.

Fig. 3 is a schematic flowchart of another embodiment of the webpage content acquisition method of the present invention.

Fig. 4 is a schematic flowchart of yet another embodiment of the webpage content acquisition method of the present invention.

Fig. 5 is a schematic diagram of the method of the present invention for acquiring content from a partial region of a webpage.

Fig. 6(a) and 6(b) are schematic diagrams of picture region cropping according to the present invention.

Fig. 7 is a schematic structural diagram of one embodiment of the webpage content acquisition system of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention or its application or use. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

A webpage content acquisition method according to one embodiment of the present invention is described below with reference to Fig. 2.

Fig. 2 is a flowchart of one embodiment of the webpage content acquisition method of the present invention. As shown in Fig. 2, the method of this embodiment includes:

Step S202, obtain a target URL.

Step S204, obtain the corresponding target webpage according to the target URL.

Step S206, process the content displayed by the target webpage into a picture format to obtain a target picture.

Step S208, recognize the text information in the target picture.

By converting the target webpage into a picture and then recognizing the content of the picture, the content of the target webpage can be obtained without obtaining the source code, so the method is more widely applicable.
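The four steps of Fig. 2 can be sketched as a small pipeline. This is a minimal illustration only, not the implementation described by the source: the names `acquire_content`, `render_to_picture`, and `recognize_text` are hypothetical, and the rendering and recognition steps are passed in as callables because the text does not fix a particular browser or recognition tool at this point.

```python
from typing import Callable

# Assumption for this sketch: a "picture" is modeled as raw bytes.
Picture = bytes


def acquire_content(
    target_url: str,
    render_to_picture: Callable[[str], Picture],
    recognize_text: Callable[[Picture], str],
) -> str:
    """Run steps S202 to S208 for one target URL.

    render_to_picture stands in for steps S204 and S206 (load the page
    and store what it displays as a picture); recognize_text stands in
    for step S208 (for example an OCR tool).
    """
    if not target_url:
        raise ValueError("a target URL is required (step S202)")
    picture = render_to_picture(target_url)  # S204 + S206
    return recognize_text(picture)           # S208


# Stub implementations so the sketch runs without a browser or OCR engine.
def fake_render(url: str) -> Picture:
    return ("picture of " + url).encode("utf-8")


def fake_ocr(picture: Picture) -> str:
    return picture.decode("utf-8").upper()


result = acquire_content("http://example.com", fake_render, fake_ocr)
```

Passing the two stages in as parameters keeps the sketch independent of any one browser or recognition engine; note that the page source is never parsed, only the rendered picture is consumed.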

In step S208, the target picture can be recognized, for example, by using optical character recognition technology to recognize the text information in the target picture. Optical character recognition (OCR) is a technology in which an electronic device optically converts the printed characters of a paper document into a black-and-white dot-matrix image file, and recognition software then converts the text in the image into a text format that word-processing software can further edit and process. In the present invention, recognition using OCR technology mainly consists of the following steps: first, the target picture is input into the recognition module; then, the target picture is preprocessed, including binarization, image noise reduction, and/or skew correction, to improve the accuracy of the subsequent recognition; finally, character features are extracted and a corresponding comparison database is selected to recognize the text information. If higher recognition accuracy is required, manual correction can be performed after the recognition software has run, to avoid obvious errors. Because the text in webpages is mostly standard printed type, OCR technology can recognize the text information in the target picture well. The OCR tool may be, for example, an open-source tool such as Tesseract or OCRFeeder.
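As an illustration of the binarization step of the preprocessing, the following sketch maps a grayscale picture to black-and-white values. It rests on assumptions that are not in the source: the picture is modeled as a nested list of 0 to 255 gray levels, the function name `binarize` is hypothetical, and a fixed threshold of 128 is used for simplicity (practical OCR tools usually choose the threshold adaptively).

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 1 if >= threshold, else 0."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in gray]


# A 2 x 3 sample picture; each inner list is one row of gray levels.
sample = [
    [10, 200, 130],
    [90, 128, 255],
]
binary = binarize(sample)
```

After this step every pixel is either 0 or 1, which is the dot-matrix form that the later feature extraction works on.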

When webpage content needs to be obtained in batches, crawler technology can be used. Another embodiment of the webpage content acquisition method of the present invention is described below with reference to Fig. 3.

Fig. 3 is a flowchart of another embodiment of the webpage content acquisition method of the present invention. As shown in Fig. 3, the method of this embodiment includes:

Step S300, obtain a target URL using web crawler technology and send it to a browser.

Step S302, the browser obtains the target URL.

Step S304, the browser obtains the corresponding target webpage according to the target URL.

Steps S206 to S208 are then performed.

By obtaining target URLs with crawler technology, the text information contained in target webpages can be obtained in batches, which makes the method suitable for the big data field. The crawler function can be implemented with open-source technology, for example WebCollector, JSpider, or Crawler4j, which are implemented in Java; a crawler script can also be written using the urllib2, cookielib, re, and threading libraries provided by Python. Where necessary, these can be further customized to simplify the crawler function so that only the part that obtains and parses URLs is retained. Most browsers can provide the functions the above method requires. If some functions of the browser need to be modified, an open-source browser such as Fifth, Dooscape, or Qupzilla can be selected so that the browser suits the execution environment of the method of the present invention.
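A crawler reduced to the part that obtains and parses URLs might look like the sketch below. It is an assumption-laden illustration: the text names the Python 2 libraries urllib2 and re, whereas this sketch uses the Python 3 standard-library modules `html.parser` and `urllib.parse` instead, and the names `LinkExtractor` and `extract_urls` are hypothetical. It works on an HTML string so that it runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_urls(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/news">News</a> <a href="http://example.org/a">A</a> <a>no link</a>'
urls = extract_urls(page, "http://example.com/index.html")
```

Only anchor tags are inspected here; the rest of the page source is ignored, since in this method the page content itself is read later from the rendered picture.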

An application scenario in which webpage content is acquired using crawler technology is further described below with reference to Fig. 4.

Fig. 4 is a flowchart of yet another embodiment of the webpage content acquisition method of the present invention. As shown in Fig. 4, the method of this embodiment includes:

Step S402, the crawler obtains the webpage corresponding to a crawled URL.

Step S404, obtain from the webpage a list of URLs that need to be parsed.

Step S406, the crawler requests the URLs in the URL list in turn. Steps S402 to S406 are repeated until a target URL list is obtained.

Step S408, the browser resolves each URL in the target URL list to a target webpage, generates a webpage view, and stores the webpage view as a picture.

Step S410, recognize the text information in the picture.
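The repeated crawl of steps S402 to S406 can be sketched as a queue-driven loop. The following is one possible reading, not the source's implementation: the names `collect_target_urls`, `get_links`, and `is_target` are hypothetical, the rule for deciding which URLs are targets is an assumption supplied by the caller, and `max_pages` is an assumed stop condition. An in-memory dictionary stands in for a website so the sketch runs offline.

```python
from collections import deque


def collect_target_urls(seed_urls, get_links, is_target, max_pages=100):
    """Breadth-first crawl that returns the URLs accepted by is_target."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    targets = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        if is_target(url):
            targets.append(url)
        for link in get_links(url):   # S402 + S404: fetch page, list its URLs
            if link not in seen:      # skip URLs already queued
                seen.add(link)
                queue.append(link)    # S406: request these in turn
    return targets


# Hypothetical site map: page -> links found on that page.
site = {
    "/": ["/list", "/about"],
    "/list": ["/item/1", "/item/2", "/"],
    "/about": [],
    "/item/1": [],
    "/item/2": ["/item/1"],
}
targets = collect_target_urls(
    ["/"],
    lambda url: site.get(url, []),
    lambda url: url.startswith("/item/"),
)
```

The resulting list plays the role of the target URL list that the browser consumes in step S408.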

In step S410, because recognizing text information is computationally intensive and places high demands on the performance of the recognizing device, the following approach can be adopted, for example: the text information in the target picture is recognized by a server cluster or a cloud computing resource pool. A server cluster can use multiple computers for parallel computation to achieve very high computing speed, and can also improve system stability by having multiple computers back each other up. A cloud computing resource pool integrates and allocates resources, which can improve resource utilization. Recognition efficiency can thus be improved, further improving the performance of the system.
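The idea of spreading recognition work over several workers can be illustrated on a single machine. In the sketch below a local thread pool from the standard library stands in for the server cluster or cloud computing resource pool, which is a simplifying assumption; the names `recognize_all` and `fake_recognize` are hypothetical, and the stub recognizer replaces a real OCR call.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_all(pictures, recognize, workers=4):
    """Apply recognize to every picture in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(recognize, pictures))


# Stub recognizer: a real system would call an OCR engine here.
def fake_recognize(picture):
    return picture.decode("utf-8")


texts = recognize_all([b"page one", b"page two", b"page three"], fake_recognize)
```

Because each picture is recognized independently of the others, the same structure carries over to distributing pictures across machines.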

In addition, the recognized text information can be further processed, for example through steps S412 to S414:

Step S412, perform data cleaning, classified storage, and/or index building on the recognized text information.

Step S414, save the processed text information to a database.

The text information obtained directly from recognizing a picture is raw information. When the data volume is large, the system implementing the webpage content acquisition method can be connected to another big data platform that continues to process the raw information to obtain more usable information. For example, data cleaning of the text information can detect incomplete, erroneous, and duplicate data and correct it; classifying the text information allows the information to be imported into a data warehouse or relational database tables according to business requirements; and building an index on the text information allows fast retrieval in later use. Obviously, a person skilled in the art will understand that, besides the text information processing methods described above, other methods can also be adopted as needed; they are not enumerated exhaustively here.
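Data cleaning and index building on the recognized text can be sketched as follows. The specific rules are assumptions chosen for illustration, not taken from the source: cleaning here means normalizing whitespace and dropping empty or exactly duplicated records, and the index is a simple inverted index from lowercase word to record positions. The function names `clean` and `build_index` are hypothetical.

```python
def clean(records):
    """Normalize whitespace and drop empty or duplicate records."""
    seen = set()
    cleaned = []
    for text in records:
        normalized = " ".join(text.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


def build_index(records):
    """Build an inverted index: lowercase word -> positions of records containing it."""
    index = {}
    for position, text in enumerate(records):
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(position)
    return index


raw = ["Hello  world", "", "hello world", "Hello world", "World news"]
records = clean(raw)
index = build_index(records)
```

Looking up `index["news"]` then gives the positions of the records to retrieve, which is the kind of fast retrieval the indexing step is meant to support.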

Sometimes, for webpages with a specified layout, the region where the required text information is located is fixed, and the text outside that region is not information that needs to be obtained. In that case, only part of the region needs to be recognized. A method for acquiring content from a partial region of a webpage is described below with reference to Fig. 5.

Fig. 5 is a flowchart of one embodiment of the method of the present invention for acquiring content from a partial region of a webpage. As shown in Fig. 5, the method of this embodiment includes:

Step S202, obtain a target URL.

Step S508, crop the target picture to obtain the recognition region of the target picture.

Step S510, recognize the text information in the recognition region of the target picture.

This approach avoids recognizing useless information, which improves recognition efficiency and the performance of the system.

Specifically, in step S508, the target picture can be cropped, for example, through the following steps: first, define a coordinate system in the picture, including the origin position, the positive x-axis direction, and the positive y-axis direction; second, input the vertex coordinates of the cropping region; finally, connect the vertices according to a preset rule and crop the enclosed region. Taking Fig. 6(a) and 6(b) as an example: first, the upper-left corner of Fig. 6(a) is set as the coordinate origin, the horizontal rightward direction as the positive x-axis direction, and the vertical downward direction as the positive y-axis direction; then, the upper-left coordinate (x₁, y₁) and the lower-right coordinate (x₂, y₂) of the cropping region are obtained, from which the upper-right coordinate of the rectangular cropping region is determined to be (x₂, y₁) and the lower-left coordinate to be (x₁, y₂); finally, because the target region is a rectangle, the rectangular region is determined from the coordinates of its four vertices and the target picture is cropped accordingly, yielding Fig. 6(b). Obviously, a person skilled in the art will understand that, as needed or according to the specific configuration rules of the module, other coordinate systems can be defined and regions of other shapes can be cropped; details are not repeated here.
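The rectangular cropping described above can be sketched on a picture modeled as a nested list of pixel rows, with the origin at the upper-left corner, x increasing to the right, and y increasing downward. The names `corners` and `crop` are hypothetical, and treating the lower-right coordinate as exclusive is an assumption made so that the cropped width is x₂ − x₁ and the height is y₂ − y₁.

```python
def corners(top_left, bottom_right):
    """Derive all four vertices of the rectangle from two opposite corners."""
    (x1, y1), (x2, y2) = top_left, bottom_right
    return {
        "top_left": (x1, y1),
        "top_right": (x2, y1),
        "bottom_left": (x1, y2),
        "bottom_right": (x2, y2),
    }


def crop(picture, top_left, bottom_right):
    """Return the rows y1..y2-1 and columns x1..x2-1 of the picture."""
    x1, y1 = top_left
    x2, y2 = bottom_right
    if not (0 <= x1 < x2 and 0 <= y1 < y2):
        raise ValueError("top_left must be above and to the left of bottom_right")
    return [row[x1:x2] for row in picture[y1:y2]]


# A 3-row by 4-column picture; each value encodes its own position.
picture = [
    [0, 1, 2, 3],
    [10, 11, 12, 13],
    [20, 21, 22, 23],
]
region = crop(picture, (1, 0), (3, 2))
```

Only `region` would then be passed to recognition, so text outside the rectangle is never processed.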

Step S206 can also be implemented by the browser, that is, the browser processes the content displayed by the target webpage into a picture format to obtain the target picture. For example, Dooscape has a screenshot function. A non-open-source browser such as the IE browser can also implement a screenshot function by using an interface (API) provided by the browser in coordination with other modules. For example, for the IE browser, the function can be implemented based on the PrintWindow function in the Win32 API, after integration with other interfaces, modules, or functions.

Implementing this method requires the modules to cooperate, so a workflow can be formed among the modules corresponding to the steps. For example, the method of the present invention can include: using a module with a management and scheduling function to call the modules in sequence so that they perform the steps of the foregoing method one after another. The modules can thus be chained so that webpage content acquisition is completed automatically. The module with the management and scheduling function can be, for example, crontab, the scheduled-task tool of the Linux system. With crontab, the execution time of each command is set in the crontab file; after the crond start command is executed, the system makes the corresponding module execute the corresponding command at the set time. For example, if the startup commands of the crawler, browser, and content recognition modules are crawler start, browser start, and ocr start respectively, the startup scripts are all located in the /etc/init.d directory, and the modules are to be executed at 8:10, 8:30, and 8:50 respectively, then the entries in the crontab file that execute the modules can be:

10 8 * * * /etc/init.d/crawler start

30 8 * * * /etc/init.d/browser start

50 8 * * * /etc/init.d/ocr start
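A crontab entry consists of five time fields (minute, hour, day of month, month, day of week) followed by the command to run. As a small illustration of how the three daily entries follow from the stated start times, the sketch below generates them; the helper name `cron_line` is hypothetical, and the schedule simply restates the 8:10, 8:30, and 8:50 times and the /etc/init.d script paths from the example.

```python
def cron_line(hour, minute, command):
    """Format a daily crontab entry: minute hour day-of-month month day-of-week command."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"{minute} {hour} * * * {command}"


# (hour, minute, command) for the crawler, browser, and recognition modules.
schedule = [
    (8, 10, "/etc/init.d/crawler start"),
    (8, 30, "/etc/init.d/browser start"),
    (8, 50, "/etc/init.d/ocr start"),
]
entries = [cron_line(hour, minute, command) for hour, minute, command in schedule]
```

The 20-minute gaps give each module time to finish before the next one starts; the source does not say how the intervals should be chosen, so they would need to be sized to the actual workload.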

In addition, each of the foregoing steps can be performed on a general-purpose server or a cloud host, which provides higher security, stability, and reliability.

A webpage content acquisition system according to one embodiment of the present invention is described below with reference to Fig. 7.

Fig. 7 is a structural diagram of one embodiment of the webpage content acquisition system of the present invention. As shown in Fig. 7, the system of this embodiment includes: a URL acquisition module 72 for obtaining a target URL; a webpage parsing module 74 for obtaining the corresponding target webpage according to the target URL; a picture acquisition module 76 for processing the content displayed by the target webpage into a picture format to obtain a target picture; and a recognition module 78 for recognizing the text information in the target picture.

The system can also include a cropping module for cropping the target picture to obtain a recognition region of the target picture, in which case the recognition module is used to recognize the text information in the recognition region of the target picture.

The recognition module can also be used to recognize the text information in the target picture using optical character recognition technology.

The system can also include: a data cleaning module for performing data cleaning on the recognized text information; a classified storage module for storing the recognized text information by category; and/or an index module for building an index on the recognized text information.

The system can also include: a web crawler, a browser, and a server cluster or cloud computing resource pool. The web crawler includes the URL acquisition module, the browser includes the webpage parsing module and the picture acquisition module, and the server cluster or cloud computing resource pool includes the recognition module.

In addition, the method according to the present invention can also be implemented as a computer program product that includes a computer-readable medium on which is stored a computer program for performing the above functions defined in the method of the present invention. A person skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A webpage content acquisition method, including:

obtaining a target URL;

obtaining the corresponding target webpage according to the target URL;

processing the content displayed by the target webpage into a picture format to obtain a target picture; and

recognizing the text information in the target picture.

2. The method according to claim 1, characterized in that the method also includes:

obtaining the target URL using web crawler technology; and

obtaining the corresponding target webpage according to the target URL using a browser.

3. The method according to claim 1, characterized in that the method also includes:

cropping the target picture to obtain a recognition region of the target picture; and

recognizing the text information in the recognition region of the target picture.

4. The method according to claim 1, characterized in that recognizing the text information in the target picture includes:

recognizing the text information in the target picture by a server cluster or a cloud computing resource pool.

5. The method according to claim 1 or 4, characterized in that recognizing the text information in the target picture includes:

recognizing the text information in the target picture using optical character recognition technology.

6. The method according to claim 1, characterized in that the method also includes:

performing data cleaning, classified storage, and/or index building on the recognized text information.

7. A webpage content acquisition system, including:

a URL acquisition module for obtaining a target URL;

a webpage parsing module for obtaining the corresponding target webpage according to the target URL;

a picture acquisition module for processing the content displayed by the target webpage into a picture format to obtain a target picture; and

a recognition module for recognizing the text information in the target picture.

8. The system according to claim 7, characterized in that it also includes a cropping module for cropping the target picture to obtain a recognition region of the target picture, the recognition module being used to recognize the text information in the recognition region of the target picture.

9. The system according to claim 7, characterized in that the recognition module is used to recognize the text information in the target picture using optical character recognition technology.

10. The system according to claim 7, characterized in that it also includes:

a data cleaning module for performing data cleaning on the recognized text information,

a classified storage module for storing the recognized text information by category,

and/or an index module for building an index on the recognized text information.

11. The system according to claim 7, characterized in that the system includes a web crawler, a browser, and a server cluster or cloud computing resource pool;

the web crawler includes the URL acquisition module, the browser includes the webpage parsing module and the picture acquisition module, and the server cluster or cloud computing resource pool includes the recognition module.