CN106599001A - Webpage content acquisition method and system - Google Patents

Info

Publication number
CN106599001A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510680981.3A
Other languages
Chinese (zh)
Inventor
庞涛
武娟
钱锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-10-20
Filing date
2015-10-20
Publication date
2017-04-26
Application filed by China Telecom Corp Ltd
Priority to CN201510680981.3A
Publication of CN106599001A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage content acquisition method and system. The method comprises: acquiring a target URL; acquiring the corresponding target webpage according to the target URL; processing the content shown on the target webpage into an image format to obtain a target image; and recognizing the text information in the target image. By converting the target webpage into an image and recognizing the content of the image, the content of the target webpage can be acquired without acquiring the source code, so the method has relatively high universality.

Description

Webpage content acquisition method and system
Technical field
The present invention relates to the field of the Internet, and in particular to a webpage content acquisition method and system.
Background
A traditional crawler starts from one or several initial URLs (Uniform Resource Locators) and obtains the URLs and other content on the webpages corresponding to those initial URLs. It also places each new URL found on the current page into a queue and continues crawling until a stop condition of the system is met. All content captured by the crawler is stored, then classified, analyzed, and filtered according to keywords, text, pictures, audio, video, and so on, and indexed for later query and retrieval. After an existing crawler system obtains a target URL, it acquires the content of the target webpage through the flow shown in Fig. 1, which includes:
Step S102: obtain the webpage source code of the target webpage.
Step S104: parse the target information in the source code.
Step S106: save the parsed result to a database.
However, some websites adopt anti-crawler measures that prevent a crawler from obtaining the webpage source code, so the crawler cannot acquire the information of the target webpage.
Summary of the invention
The technical problem to be solved by an embodiment of the present invention is how to obtain the content information of a target webpage without obtaining the webpage source code.
According to a first aspect of the embodiments of the present invention, a webpage content acquisition method is provided, including: obtaining a target URL; obtaining the corresponding target webpage according to the target URL; processing the content displayed by the target webpage into a picture format to obtain a target picture; and recognizing the text information in the target picture.
In one embodiment, the method further includes: obtaining the target URL using web crawler technology; and obtaining the corresponding target webpage according to the target URL using a browser.
In one embodiment, the method further includes: cropping the target picture to obtain a recognition region of the target picture; and recognizing the text information in the recognition region of the target picture.
In one embodiment, recognizing the text information in the target picture includes: recognizing the text information in the target picture by a server cluster or a cloud computing resource pool.
In one embodiment, recognizing the text information in the target picture includes: recognizing the text information in the target picture using optical character recognition technology.
In one embodiment, the method further includes: performing data cleaning, classified storage, and/or indexing on the recognized text information.
According to a second aspect of the embodiments of the present invention, a webpage content acquisition system is provided, including: a URL acquisition module for obtaining a target URL; a webpage parsing module for obtaining the corresponding target webpage according to the target URL; a picture acquisition module for processing the content displayed by the target webpage into a picture format to obtain a target picture; and a recognition module for recognizing the text information in the target picture.
In one embodiment, the system further includes a cropping module for cropping the target picture to obtain a recognition region of the target picture, and the recognition module is used to recognize the text information in the recognition region of the target picture.
In one embodiment, the recognition module is used to recognize the text information in the target picture using optical character recognition technology.
In one embodiment, the system further includes: a data cleaning module for performing data cleaning on the recognized text information; a classified storage module for storing the recognized text information by category; and/or an index module for building an index on the recognized text information.
In one embodiment, the system further includes a web crawler, a browser, and a server cluster or a cloud computing resource pool. The web crawler includes the URL acquisition module, the browser includes the webpage parsing module and the picture acquisition module, and the server cluster or cloud computing resource pool includes the recognition module.
The present invention has at least the following advantage: by converting the target webpage into a picture and then recognizing the content of the picture, the content of the target webpage can be obtained without obtaining the source code, so the approach is more general.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments of the invention with reference to the accompanying drawings.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a webpage content acquisition method in the prior art.
Fig. 2 is a schematic flowchart of one embodiment of the webpage content acquisition method of the present invention.
Fig. 3 is a schematic flowchart of another embodiment of the webpage content acquisition method of the present invention.
Fig. 4 is a schematic flowchart of yet another embodiment of the webpage content acquisition method of the present invention.
Fig. 5 is a schematic diagram of the method of the present invention for acquiring content from a partial region of a webpage.
Fig. 6(a) and 6(b) are schematic diagrams of picture region cropping in the present invention.
Fig. 7 is a schematic structural diagram of one embodiment of the webpage content acquisition system of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention or its application or use. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The webpage content acquisition method of one embodiment of the present invention is described below with reference to Fig. 2.
Fig. 2 is a flowchart of one embodiment of the webpage content acquisition method of the present invention. As shown in Fig. 2, the method of this embodiment includes:
Step S202: obtain the target URL.
Step S204: obtain the corresponding target webpage according to the target URL.
Step S206: process the content displayed by the target webpage into a picture format to obtain a target picture.
Step S208: recognize the text information in the target picture.
By converting the target webpage into a picture and then recognizing the content of the picture, the content of the target webpage can be obtained without obtaining the source code, so the approach is more general.
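As a rough illustration of steps S202 to S208, the flow can be sketched in Python as below. This is not the patent's implementation: the function name acquire_page_text and the two callables render_to_picture and recognize_text are hypothetical placeholders for a browser screenshot step and an OCR engine, passed in so that the sketch runs without either.

```python
from typing import Callable


def acquire_page_text(
    url: str,
    render_to_picture: Callable[[str], bytes],
    recognize_text: Callable[[bytes], str],
) -> str:
    """Run steps S202-S208 for one target URL (hypothetical helper)."""
    # S202: accept the target URL; only web URLs make sense for a browser.
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"not a web URL: {url!r}")
    # S204 + S206: load the page and capture what it displays as a picture.
    picture = render_to_picture(url)
    # S208: recognize the text in the picture; no page source is read.
    return recognize_text(picture)
```

Passing the rendering and recognition steps as parameters mirrors the module split described later (browser for rendering, a separate recognition module for OCR) and lets each be replaced independently.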
In step S208, the target picture can be recognized, for example, as follows: the text information in the target picture is recognized using optical character recognition technology. Optical character recognition (OCR) is a technology in which an electronic device optically converts the printed characters in a paper document into a black-and-white dot-matrix image file, and recognition software then converts the text in the image into a text format that word-processing software can further edit and process. In the present invention, recognition with OCR mainly consists of the following steps. First, the target picture is input into the recognition module. Then the target picture is preprocessed, including binarization, image noise reduction, and/or skew correction, to improve the accuracy of the subsequent recognition. Finally, character features are extracted and compared against a corresponding reference database to recognize the text information. If higher recognition accuracy is required, manual proofreading can be performed after the recognition software has run, to avoid obvious errors. Because most text in webpages is in standard printed typefaces, OCR can recognize the text information in the target picture well. The OCR tool can be, for example, an open-source tool such as Tesseract or OCRFeeder.
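The binarization part of the preprocessing can be illustrated with a minimal sketch. It assumes the picture has already been decoded into rows of grayscale values from 0 to 255; the function names and the choice of a mean-based threshold are illustrative assumptions, not something the patent specifies.

```python
def mean_threshold(gray):
    """Return the average grayscale value of a non-empty picture."""
    values = [value for row in gray for value in row]
    return sum(values) / len(values)


def binarize(gray, threshold=128):
    """Map every pixel to 0 (black) or 255 (white) around the threshold."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray]


# Example: dark text pixels (10, 30) on a light background (200, 220).
picture = [[10, 200], [30, 220]]
black_and_white = binarize(picture, mean_threshold(picture))
```

A fixed or mean-based threshold is the simplest option; real OCR tools typically apply their own, more robust preprocessing.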
When webpage content needs to be acquired in batches, crawler technology can be used. An embodiment of the webpage content acquisition method of the present invention is described below with reference to Fig. 3.
Fig. 3 is a flowchart of another embodiment of the webpage content acquisition method of the present invention. As shown in Fig. 3, the method of this embodiment includes:
Step S300: obtain the target URL using web crawler technology and send it to the browser.
Step S302: the browser obtains the target URL.
Step S304: the browser obtains the corresponding target webpage according to the target URL.
Steps S206 to S208 can then be performed.
By obtaining target URLs with crawler technology, the text information contained in target webpages can be acquired in bulk, which suits the big data field. The crawler function can be implemented with open-source technology, for example WebCollector, JSpider, or Crawler4j, which are implemented in Java; a crawler script can also be written with the urllib2, cookielib, re, and threading libraries provided by Python. When needed, these can be further customized to simplify the crawler function so that only the part that obtains and parses URLs is retained. Most browsers can implement the functions required by the above method. If some browser functions need to be modified, an open-source browser such as Fifth, Dooscape, or Qupzilla can be selected so that the browser suits the execution environment of the method of the present invention.
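A crawler reduced to "obtaining and parsing URLs" can be sketched with the Python 3 standard library, where the functionality of urllib2 now lives in urllib.request and urllib.parse. The class and function names below are hypothetical, and the sketch only extracts links from HTML that has already been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web = absolute.startswith(("http://", "https://"))
                if is_web and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the de-duplicated links found in html, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Non-web schemes such as mailto: are skipped because a browser cannot render them into a page picture.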
An application scenario that uses crawler technology to acquire webpage content is further described below with reference to Fig. 4.
Fig. 4 is a flowchart of yet another embodiment of the webpage content acquisition method of the present invention. As shown in Fig. 4, the method of this embodiment includes:
Step S402: the crawler obtains the webpage corresponding to the crawled URL.
Step S404: obtain from the webpage the list of URLs that need to be parsed.
Step S406: the crawler requests the URLs in the URL list one by one. Steps S402 to S406 are repeated until the target URL list is obtained.
Step S408: the browser parses each URL in the target URL list into a target webpage, generates a webpage view, and stores the webpage view as a picture.
Step S410: recognize the text information in the picture.
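The loop formed by steps S402 to S406 can be sketched as a breadth-first traversal. The names here are hypothetical, and two details are assumptions rather than part of the patent: get_links stands in for fetching a page and listing its URLs, and is_target decides which URLs belong in the target URL list.

```python
from collections import deque


def collect_target_urls(seed_urls, get_links, is_target, max_pages=100):
    """Crawl outward from seed_urls and return the target URL list."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    targets = []
    fetched = 0
    while queue and fetched < max_pages:   # stop condition of the crawl
        url = queue.popleft()
        fetched += 1
        if is_target(url):
            targets.append(url)
        for link in get_links(url):        # S402 + S404: page -> URL list
            if link not in seen:
                seen.add(link)
                queue.append(link)         # S406: request it in a later round
    return targets
```

The resulting list is what step S408 would hand to the browser for rendering. The max_pages limit is one possible stop condition; the patent leaves the stop condition open.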
In step S410, recognizing text information is computationally expensive and places high demands on the performance of the recognizing device. Therefore, for example, the following approach can be used: recognize the text information in the target picture by a server cluster or a cloud computing resource pool. A server cluster can use multiple computers to compute in parallel and thereby achieve very high computing speed, and can also improve system stability by having multiple computers back each other up. A cloud computing resource pool integrates and allocates resources, which can raise resource utilization. Recognition efficiency can thus be improved and system performance further raised.
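The idea of spreading recognition work over several workers can be illustrated on a single machine with a thread pool. This is only a stand-in for a server cluster or cloud resource pool, and recognize_text is a hypothetical callable that wraps whatever OCR engine is used.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_all(pictures, recognize_text, max_workers=4):
    """Recognize many pictures concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(recognize_text, pictures))
```

If the recognizer is CPU-bound pure Python code, swapping in ProcessPoolExecutor from the same module may parallelize better; a real cluster would instead distribute pictures through a job queue or similar mechanism.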
In addition, the recognized text information can be processed further; for example, steps S412 to S414 can also be included:
Step S412: perform data cleaning, classified storage, and/or indexing on the recognized text information.
Step S414: save the processed text information to a database.
The text information obtained directly from recognizing the picture is raw information. When the data volume is large, the system implementing the webpage content acquisition method can be connected to other big data platforms to continue processing the raw information and obtain more usable information. For example, data cleaning of the text information can detect incomplete, erroneous, and duplicate data and correct it; classifying the text information allows it to be imported into a data warehouse or relational database tables according to business needs; and indexing the text information enables quick retrieval in later use. Obviously, those skilled in the art will understand that, besides the text information processing methods described above, other methods can be adopted as needed, which are not listed exhaustively here.
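A very small version of the cleaning and indexing steps might look like the following. The function names are hypothetical, cleaning is limited to whitespace normalization and removal of empty or duplicate records, and the index is a plain in-memory inverted index rather than a database or big data platform.

```python
import re
from collections import defaultdict


def clean_records(texts):
    """Normalize whitespace and drop empty or duplicate records."""
    cleaned, seen = [], set()
    for text in texts:
        normalized = " ".join(text.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


def build_index(records):
    """Map each lowercase word to the set of record positions containing it."""
    index = defaultdict(set)
    for record_id, text in enumerate(records):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(record_id)
    return index
```

Note that the `\w+` tokenizer treats an unbroken run of Chinese characters as a single token, so Chinese text would need a word segmenter before indexing.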
Sometimes, for a webpage with a specific layout, the region containing the required text information is fixed, and the text outside that region is not information that needs to be acquired. Therefore, only part of the picture may be recognized. A method for acquiring content from a partial region of a webpage is described below with reference to Fig. 5.
Fig. 5 is a flowchart of one embodiment of the method of the present invention for acquiring content from a partial region of a webpage. As shown in Fig. 5, the method of this embodiment includes:
Step S202: obtain the target URL.
Step S204: obtain the corresponding target webpage according to the target URL.
Step S206: process the content displayed by the target webpage into a picture format to obtain a target picture.
Step S508: crop the target picture to obtain the recognition region of the target picture.
Step S510: recognize the text information in the recognition region of the target picture.
This avoids recognizing useless information, which improves recognition efficiency and system performance.
Specifically, in step S508, the target picture can be cropped, for example, with the following steps. First, define the coordinate system in the picture, including the origin position, the positive x-axis direction, and the positive y-axis direction. Second, input the vertex coordinates of the cropping region. Finally, connect the vertices according to a preset rule and crop the closed region. Taking Fig. 6(a) and 6(b) as an example: first, the upper-left corner of Fig. 6(a) is set as the coordinate origin, the horizontal rightward direction as the positive x-axis direction, and the vertical downward direction as the positive y-axis direction. Then the upper-left coordinate (x1, y1) and lower-right coordinate (x2, y2) of the cropping region are obtained, from which the upper-right coordinate of the rectangular cropping region is determined to be (x2, y1) and the lower-left coordinate (x1, y2). Finally, because the target region is a rectangle, the rectangular region is determined from the coordinates of its four vertices and the target picture is cropped accordingly, yielding Fig. 6(b). Obviously, those skilled in the art will understand that, as needed or according to the specific rules set for the module, other coordinate systems can be defined and other shapes can be cropped, which is not described further here.
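The rectangular case can be sketched as follows, using the same coordinate system (origin at the upper-left corner, x to the right, y downward). The picture is modeled as a list of pixel rows, the function names are hypothetical, and treating (x2, y2) as an exclusive bound is an assumption of this sketch.

```python
def clip_rectangle(x1, y1, x2, y2):
    """Return the four vertices: upper-left, upper-right, lower-right, lower-left."""
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def crop(pixels, x1, y1, x2, y2):
    """Cut the rectangle out of a picture given as rows of pixels."""
    if not (0 <= x1 < x2 and 0 <= y1 < y2):
        raise ValueError("expected upper-left (x1, y1) and lower-right (x2, y2)")
    # Rows are selected by y, columns within each row by x.
    return [row[x1:x2] for row in pixels[y1:y2]]


# A 4x3 picture whose pixel values encode row * 10 + column.
picture = [[r * 10 + c for c in range(4)] for r in range(3)]
region = crop(picture, 1, 0, 3, 2)
```

Because Python slicing clamps to the available range, a rectangle that extends past the picture edge simply yields a smaller region rather than an error.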
Step S206 can also be implemented by the browser; that is, the browser processes the content displayed by the target webpage into a picture format to obtain the target picture. For example, Dooscape has a screenshot function. For a non-open-source browser such as IE, the interfaces (APIs) provided by the browser can be combined with other modules to implement the screenshot function. For example, this can be implemented based on the PrintWindow function in the Win32 API available to the IE browser, integrated with other interfaces, modules, or functions.
Implementing this method requires the modules to cooperate, so a workflow can be formed among the modules corresponding to the implementation steps. For example, the method of the present invention can include: calling the modules in sequence using a module with a management and scheduling function, so that the modules perform the steps of the foregoing method in order. In this way the modules are chained together and webpage content acquisition is completed automatically. The module with the management and scheduling function can be, for example, crontab, the scheduled-task tool in Linux systems. With crontab, the execution time of each command is set in the crontab file; after the crond start command is executed, the system makes the corresponding module execute the corresponding command at the set time. For example, if the start commands of the crawler, browser, and content recognition modules are crawler start, browser start, and ocr start respectively, the start scripts are all located in the /etc/init.d directory, and the modules are to be executed at 8:10, 8:30, and 8:50 respectively, then the entries in the crontab file that execute the modules can be:
10 8 * * * /etc/init.d/crawler start
30 8 * * * /etc/init.d/browser start
50 8 * * * /etc/init.d/ocr start
In addition, each of the foregoing steps can be performed on a general-purpose server or a cloud host, which provides higher security, stability, and reliability.
A webpage content acquisition system of one embodiment of the present invention is described below with reference to Fig. 7.
Fig. 7 is a structural diagram of one embodiment of the webpage content acquisition system of the present invention. As shown in Fig. 7, the system of this embodiment includes: a URL acquisition module 72 for obtaining a target URL; a webpage parsing module 74 for obtaining the corresponding target webpage according to the target URL; a picture acquisition module 76 for processing the content displayed by the target webpage into a picture format to obtain a target picture; and a recognition module 78 for recognizing the text information in the target picture.
The system can also include a cropping module for cropping the target picture to obtain a recognition region of the target picture, in which case the recognition module is used to recognize the text information in the recognition region of the target picture.
The recognition module can also be used to recognize the text information in the target picture using optical character recognition technology.
The system can also include: a data cleaning module for performing data cleaning on the recognized text information; a classified storage module for storing the recognized text information by category; and/or an index module for building an index on the recognized text information.
The system can also include a web crawler, a browser, and a server cluster or a cloud computing resource pool. The web crawler includes the URL acquisition module, the browser includes the webpage parsing module and the picture acquisition module, and the server cluster or cloud computing resource pool includes the recognition module.
In addition, the method according to the present invention can also be implemented as a computer program product that includes a computer-readable medium on which a computer program for performing the above functions defined in the method of the present invention is stored. Those skilled in the art will also appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (11)

1. A webpage content acquisition method, comprising:
obtaining a target URL;
obtaining the corresponding target webpage according to the target URL;
processing the content displayed by the target webpage into a picture format to obtain a target picture; and
recognizing the text information in the target picture.
2. The method according to claim 1, characterized in that the method further comprises:
obtaining the target URL using web crawler technology; and
obtaining the corresponding target webpage according to the target URL using a browser.
3. The method according to claim 1, characterized in that the method further comprises:
cropping the target picture to obtain a recognition region of the target picture; and
recognizing the text information in the recognition region of the target picture.
4. The method according to claim 1, characterized in that recognizing the text information in the target picture comprises:
recognizing the text information in the target picture by a server cluster or a cloud computing resource pool.
5. The method according to claim 1 or 4, characterized in that recognizing the text information in the target picture comprises:
recognizing the text information in the target picture using optical character recognition technology.
6. The method according to claim 1, characterized in that the method further comprises:
performing data cleaning, classified storage, and/or indexing on the recognized text information.
7. A webpage content acquisition system, comprising:
a URL acquisition module for obtaining a target URL;
a webpage parsing module for obtaining the corresponding target webpage according to the target URL;
a picture acquisition module for processing the content displayed by the target webpage into a picture format to obtain a target picture; and
a recognition module for recognizing the text information in the target picture.
8. The system according to claim 7, characterized by further comprising a cropping module for cropping the target picture to obtain a recognition region of the target picture, wherein the recognition module is used to recognize the text information in the recognition region of the target picture.
9. The system according to claim 7, characterized in that the recognition module is used to recognize the text information in the target picture using optical character recognition technology.
10. The system according to claim 7, characterized by further comprising:
a data cleaning module for performing data cleaning on the recognized text information,
a classified storage module for storing the recognized text information by category,
and/or an index module for building an index on the recognized text information.
11. The system according to claim 7, characterized in that the system comprises a web crawler, a browser, and a server cluster or a cloud computing resource pool;
the web crawler comprises the URL acquisition module, the browser comprises the webpage parsing module and the picture acquisition module, and the server cluster or cloud computing resource pool comprises the recognition module.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510680981.3A CN106599001A (en) 2015-10-20 2015-10-20 Webpage content acquisition method and system

Publications (1)

Publication Number Publication Date
CN106599001A (en) 2017-04-26

Family

ID=58555112

Country Status (1)

Country Link
CN (1) CN106599001A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927334A (en) * 2014-03-24 2014-07-16 小米科技有限责任公司 Webpage acquiring method and device
CN104156490A (en) * 2014-09-01 2014-11-19 北京奇虎科技有限公司 Method and device for detecting suspicious fishing webpage based on character recognition
CN104933138A (en) * 2015-06-16 2015-09-23 携程计算机技术(上海)有限公司 Webpage crawler system and webpage crawling method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高鹏 等: "《高性能LINUX平台建构实践指南》", 31 July 2014, 中国铁道出版社 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020066A (en) * 2017-07-31 2019-07-16 北京国双科技有限公司 A kind of method and device of past crawler platform note task
CN109144567A (en) * 2018-08-03 2019-01-04 苏州麦迪斯顿医疗科技股份有限公司 Cross-platform webpage rendering method, device, server and storage medium
CN109144567B (en) * 2018-08-03 2021-09-14 苏州麦迪斯顿医疗科技股份有限公司 Cross-platform webpage rendering method and device, server and storage medium
CN109639770A (en) * 2018-11-22 2019-04-16 山东中创软件工程股份有限公司 A kind of data access method, device, equipment and medium
CN109656563A (en) * 2018-11-28 2019-04-19 北京旷视科技有限公司 Code inspection method, apparatus, system and storage medium
CN109753907A (en) * 2018-12-27 2019-05-14 金现代信息产业股份有限公司 Information flag method and system on a kind of line based on image recognition
CN110069688A (en) * 2019-03-16 2019-07-30 平安城市建设科技(深圳)有限公司 Page display method, server, storage medium and the device of anti-crawler
CN110516191A (en) * 2019-08-30 2019-11-29 深圳点猫科技有限公司 Webpage data is converted into the method and apparatus of picture file
CN112131448A (en) * 2020-08-06 2020-12-25 亿存(北京)信息科技有限公司 Network information acquisition method and device and electronic equipment
CN115657916A (en) * 2022-12-20 2023-01-31 北京数智新天信息技术咨询有限公司 Method and device for acquiring e-commerce data and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170426