CN106599001A - Webpage content acquisition method and system - Google Patents
- Publication number
- CN106599001A (application number CN201510680981.3A)
- Authority
- CN
- China
- Prior art keywords
- target
- word message
- web
- module
- identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a web page content acquisition method and system. The method comprises: acquiring a target URL; acquiring the corresponding target web page according to the target URL; rendering the content displayed on the target web page into an image format to obtain a target image; and recognizing the text information in the target image. By converting the target web page into an image and recognizing the content of that image, the content of the target web page can be acquired without acquiring its source code, so the method is comparatively general.
Description
Technical field
The present invention relates to the Internet field, and in particular to a web page content acquisition method and system.
Background
A traditional crawler starts from one or several initial uniform resource locators (URLs) and obtains the URLs and other content on the web pages corresponding to the initial URLs. New URLs found on the current page are put into a queue and crawled in turn until a stop condition of the system is met. All content captured by the crawler is stored, then classified, analyzed and filtered by keyword, text, picture, audio, video and so on, and indexed for later query and retrieval. After obtaining a target URL, an existing crawler system obtains the content of the target web page by the flow shown in Fig. 1, which includes:
Step S102: obtain the web page source code of the target web page.
Step S104: parse the target information in the source code.
Step S106: save the parsed result to a database.
However, some websites take anti-crawler measures that prevent a crawler from obtaining the web page source code, so the crawler cannot acquire the information of the target web page.
Summary of the invention
The technical problem to be solved by an embodiment of the present invention is how to obtain the content information of a target web page without obtaining the web page source code.
According to one aspect of the embodiments of the present invention, a web page content acquisition method is provided, including: obtaining a target URL; obtaining the corresponding target web page according to the target URL; rendering the content displayed by the target web page into an image format to obtain a target image; and recognizing the text information in the target image.
In one embodiment, the method also includes: obtaining the target URL using web crawler technology; and obtaining the corresponding target web page according to the target URL using a browser.
In one embodiment, the method also includes: cropping the target image to obtain a recognition region of the target image; and recognizing the text information in the recognition region of the target image.
In one embodiment, recognizing the text information in the target image includes: recognizing the text information in the target image by means of a server cluster or a cloud computing resource pool.
In one embodiment, recognizing the text information in the target image includes: recognizing the text information in the target image using optical character recognition.
In one embodiment, the method also includes: performing data cleaning on the recognized text information, storing it by category, and/or building an index for it.
According to a second aspect of the embodiments of the present invention, a web page content acquisition system is provided, including: a URL acquisition module for obtaining a target URL; a web page parsing module for obtaining the corresponding target web page according to the target URL; an image acquisition module for rendering the content displayed by the target web page into an image format to obtain a target image; and a recognition module for recognizing the text information in the target image.
In one embodiment, the system also includes a cropping module for cropping the target image to obtain a recognition region of the target image, and the recognition module is used to recognize the text information in the recognition region of the target image.
In one embodiment, the recognition module is used to recognize the text information in the target image using optical character recognition.
In one embodiment, the system also includes: a data cleaning module for performing data cleaning on the recognized text information; a classified storage module for storing the recognized text information by category; and/or an index module for building an index for the recognized text information.
In one embodiment, the system also includes a web crawler, a browser, and a server cluster or cloud computing resource pool. The web crawler includes the URL acquisition module, the browser includes the web page parsing module and the image acquisition module, and the server cluster or cloud computing resource pool includes the recognition module.
The present invention has at least the following advantage: by converting the target web page into an image and then recognizing the content of the image, the content of the target web page can be obtained without obtaining the source code, so the approach is more general.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the drawings.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a web page content acquisition method in the prior art.
Fig. 2 is a flow chart of one embodiment of the web page content acquisition method of the present invention.
Fig. 3 is a flow chart of another embodiment of the web page content acquisition method of the present invention.
Fig. 4 is a flow chart of yet another embodiment of the web page content acquisition method of the present invention.
Fig. 5 is a schematic diagram of the method of the present invention for acquiring content from a partial region of a web page.
Fig. 6(a) and Fig. 6(b) are schematic diagrams of image region cropping in the present invention.
Fig. 7 is a structural diagram of one embodiment of the web page content acquisition system of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention, its application or its use. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The web page content acquisition method of one embodiment of the present invention is described below with reference to Fig. 2.
Fig. 2 is a flow chart of one embodiment of the web page content acquisition method of the present invention. As shown in Fig. 2, the method of this embodiment includes:
Step S202: obtain a target URL.
Step S204: obtain the corresponding target web page according to the target URL.
Step S206: render the content displayed by the target web page into an image format to obtain a target image.
Step S208: recognize the text information in the target image.
By converting the target web page into an image and then recognizing the content of the image, the content of the target web page can be obtained without obtaining the source code, so the approach is more general.
In step S208, the target image can be recognized, for example, by using optical character recognition to recognize the text information in it. Optical character recognition (hereinafter OCR) is a technique in which an electronic device converts the text of a printed paper document by optical means into a black-and-white dot-matrix image file, and recognition software then converts the text in the image into a text format that word-processing software can edit and process further. In the present invention, recognition with OCR mainly consists of the following steps. First, the target image is input into the recognition module. Then the target image is preprocessed, including binarization, image noise reduction and/or skew correction, to improve the accuracy of the subsequent recognition. Finally, character features are extracted and a corresponding comparison database is selected to recognize the text information. If higher recognition accuracy is required, manual correction can be performed after the recognition software has run, to avoid obvious errors. Because most text in web pages is standard printed type, OCR can recognize the text information in the target image well. Open-source OCR tools such as Tesseract and OCRFeeder can be used.
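The preprocessing stage named above (binarization followed by noise reduction) can be sketched as follows. This is only an illustration under stated assumptions: the image is modeled as a list of rows of 0-255 grayscale values, the threshold is the global mean intensity, and the noise filter simply clears ink pixels with no ink neighbour. The function names are hypothetical, and the patent does not specify which binarization or denoising algorithm is used; a real pipeline would normally hand this work to an image library or to the OCR tool itself.

```python
from typing import List

Gray = List[List[int]]  # grayscale image: rows of 0-255 intensities (assumed model)


def mean_threshold(image: Gray) -> float:
    """Return the global mean intensity, used here as a simple threshold."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def binarize(image: Gray) -> Gray:
    """Map every pixel to 0 (ink) or 255 (background) around the mean."""
    t = mean_threshold(image)
    return [[0 if p < t else 255 for p in row] for row in image]


def remove_isolated_ink(image: Gray) -> Gray:
    """Clear ink pixels that have no ink neighbour (a crude noise filter)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if image[y][x] != 0:
                continue
            # Collect the up-to-eight surrounding pixels, clipped at the border.
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            ]
            if 0 not in neighbours:
                out[y][x] = 255
    return out
```

Calling `remove_isolated_ink(binarize(image))` yields a two-level image in which single stray dots have been dropped, which is the kind of input the recognition step is described as expecting.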
When web page content needs to be obtained in batches, crawler technology can be used. Another embodiment of the web page content acquisition method of the present invention is described below with reference to Fig. 3.
Fig. 3 is a flow chart of another embodiment of the web page content acquisition method of the present invention. As shown in Fig. 3, the method of this embodiment includes:
Step S300: obtain target URLs using web crawler technology and send them to the browser.
Step S302: the browser receives a target URL.
Step S304: the browser obtains the corresponding target web page according to the target URL.
Steps S206 to S208 can then be performed.
By obtaining target URLs with crawler technology, the text information contained in target web pages can be obtained in bulk, which suits the big data field. The crawler function can be implemented with open-source technology, for example WebCollector, JSpider or Crawler4j, which are implemented in Java; a crawler script can also be written with the urllib2, cookielib, re and threading libraries provided by Python. Where necessary, these tools can be customized further to simplify the crawler function so that only the part that obtains and parses URLs is kept. Most browsers provide the functions required by the above method. If some browser functions need to be modified, an open-source browser such as Fifth, Dooscape or Qupzilla can be chosen so that the browser suits the execution environment of the method of the present invention.
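A crawler reduced to "the part that obtains and parses URLs" can be sketched with the Python 3 standard library alone. The sketch uses `html.parser` and `urllib.parse` (the `urllib2` and `cookielib` modules named in the text are their Python 2 era counterparts). The class and function names are hypothetical, and the choice to keep only `http`/`https` links and to drop fragments is an assumption of this example rather than something the patent prescribes.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) links from the <a href> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the de-duplicated links found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The returned list is what would be handed to the browser component; fetching the page itself is deliberately left out, since the method only needs the crawler to produce URLs.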
An application scenario in which crawler technology is used for web page content acquisition is described further below with reference to Fig. 4.
Fig. 4 is a flow chart of yet another embodiment of the web page content acquisition method of the present invention. As shown in Fig. 4, the method of this embodiment includes:
Step S402: the crawler obtains the web page corresponding to a URL to be crawled.
Step S404: the list of URLs that need to be parsed is obtained from the web page.
Step S406: the crawler requests the URLs in the URL list one after another. Steps S402 to S406 are repeated until the target URL list is obtained.
Step S408: the browser resolves each URL in the target URL list into a target web page, generates a web page view, and stores the web page view as an image.
Step S410: recognize the text information in the image.
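The repeated crawl of steps S402 to S406 amounts to a breadth-first traversal with a stop condition. The following is a minimal sketch under stated assumptions: `fetch_links` (returns the URLs found on a page) and `is_target` (decides whether a URL belongs in the target list) are hypothetical callables supplied by the caller, and `max_pages` stands in for whatever stop condition a real system would use.

```python
from collections import deque


def collect_target_urls(seeds, fetch_links, is_target, max_pages=100):
    """Breadth-first crawl from `seeds`; return target URLs in visiting order."""
    queue = deque(seeds)
    seen = set(seeds)       # URLs already queued, to avoid revisiting
    targets = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()            # S402: take the next URL to crawl
        visited += 1
        if is_target(url):
            targets.append(url)
        for link in fetch_links(url):    # S404: URLs found on that page
            if link not in seen:         # S406: queue each new URL once
                seen.add(link)
                queue.append(link)
    return targets
```

Injecting `fetch_links` keeps the loop independent of any particular HTTP client or parser, so it can be exercised against an in-memory link graph before being pointed at a real site.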
In step S410, recognizing text information is computationally expensive and places high performance demands on the equipment that performs it. One option is therefore to recognize the text information in the target image by means of a server cluster or a cloud computing resource pool. A server cluster can use multiple computers for parallel computation to obtain high computing speed, and can also improve system stability by having multiple computers back each other up. A cloud computing resource pool integrates and allocates resources and can raise resource utilization. Recognition efficiency, and with it the performance of the system, can thereby be improved.
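On a single machine, the idea of spreading recognition over several workers can be approximated with a worker pool. This is a sketch only: `recognize` is a hypothetical callable standing in for whatever OCR call is actually used (for example a request to a remote recognition service), and a thread pool is assumed because such calls are typically I/O-bound. A real server cluster or cloud resource pool would need a job queue and scheduler that this example does not attempt to model.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_batch(image_paths, recognize, max_workers=4):
    """Run `recognize` over all images concurrently; return {path: text}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as image_paths.
        texts = list(pool.map(recognize, image_paths))
    return dict(zip(image_paths, texts))
```

For CPU-bound local recognition, `ProcessPoolExecutor` from the same module offers the same `map` interface, provided `recognize` is a picklable top-level function.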
In addition, the recognized text information can be processed further; for example, the method can also include steps S412 to S414:
Step S412: perform data cleaning on the recognized text information, store it by category, and/or build an index for it.
Step S414: save the processed text information to a database.
The text information obtained directly from recognizing an image is raw information. When the data volume is large, the system implementing the web page content acquisition method can be connected to other big data platforms that continue processing the raw information to obtain more usable information. For example, data cleaning of the text information can detect incomplete, erroneous and duplicate data and correct it; classifying the text information allows it to be imported into a data warehouse or relational database tables according to business needs; and indexing the text information allows fast retrieval in later use. A person skilled in the art will understand that, besides the text processing methods described above, other methods can be adopted as needed; they are not listed exhaustively here.
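Two of the post-processing operations described above, removing duplicate or empty records and building an index for later retrieval, can be sketched as follows. The function names are hypothetical, and the cleaning rules (collapse whitespace, drop empty strings and exact duplicates) and the word-level inverted index are assumptions chosen for illustration; the patent leaves the concrete rules to the implementer.

```python
import re
from collections import defaultdict


def clean_records(records):
    """Normalise whitespace, then drop empty records and exact duplicates."""
    cleaned, seen = [], set()
    for text in records:
        normalised = re.sub(r"\s+", " ", text).strip()
        if normalised and normalised not in seen:
            seen.add(normalised)
            cleaned.append(normalised)
    return cleaned


def build_index(records):
    """Map each lower-cased word to the sorted positions of records containing it."""
    index = defaultdict(set)
    for position, text in enumerate(records):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(position)
    return {word: sorted(ids) for word, ids in index.items()}
```

Looking up a word in the returned dictionary gives the positions of the cleaned records that contain it, which is the "fast retrieval" role the index plays in the description.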
Sometimes, for web pages with a specific layout, the region containing the required text information is fixed, and the text outside that region is not information that needs to be obtained. In that case only the partial region needs to be recognized. The method of acquiring content from a partial region of a web page is described below with reference to Fig. 5.
Fig. 5 is a flow chart of one embodiment of the method of the present invention for acquiring content from a partial region of a web page. As shown in Fig. 5, the method of this embodiment includes:
Step S202: obtain a target URL.
Step S204: obtain the corresponding target web page according to the target URL.
Step S206: render the content displayed by the target web page into an image format to obtain a target image.
Step S508: crop the target image to obtain a recognition region of the target image.
Step S510: recognize the text information in the recognition region of the target image.
In this way, recognition of useless information is avoided, recognition efficiency is improved, and system performance is improved.
Specifically, in step S508 the target image can be cropped, for example, by the following steps. First, a coordinate system is defined in the image, including the origin position, the positive x-axis direction and the positive y-axis direction. Second, the vertex coordinates of the cropping region are input. Finally, the vertices are connected according to a preset rule and the enclosed region is cropped. Take Fig. 6(a) and Fig. 6(b) as an example. First, the upper-left corner of Fig. 6(a) is set as the coordinate origin, the horizontal direction to the right is set as the positive x-axis direction, and the vertical direction downward is set as the positive y-axis direction. Then the upper-left coordinate (x1, y1) and the lower-right coordinate (x2, y2) of the cropping region are obtained, from which the upper-right coordinate of the cropping rectangle is determined to be (x2, y1) and the lower-left coordinate to be (x1, y2). Finally, because the target region is a rectangle, the rectangular region is determined from the coordinates of its four vertices and the target image is cropped to obtain Fig. 6(b). A person skilled in the art will understand that other coordinate systems can be defined and other shapes can be cropped, depending on needs or on the specific rules of the module; details are not repeated here.
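The rectangular crop described above, with the origin at the upper-left corner, x increasing to the right and y increasing downward, reduces to slicing rows and columns. In this sketch the image is modeled as a list of pixel rows, and the lower-right corner (x2, y2) is treated as exclusive; both are assumptions of the example, as is the hypothetical function name `crop`.

```python
def crop(image, top_left, bottom_right):
    """Return the sub-image spanning columns x1..x2-1 and rows y1..y2-1."""
    x1, y1 = top_left
    x2, y2 = bottom_right
    height = len(image)
    width = len(image[0]) if height else 0
    # Reject rectangles that are empty, inverted, or outside the image.
    if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
        raise ValueError("crop rectangle lies outside the image")
    # The other two corners, (x2, y1) and (x1, y2), are implied by the slices.
    return [row[x1:x2] for row in image[y1:y2]]
```

Only the two diagonal corners are needed as input because, as the text notes, the remaining two vertices of an axis-aligned rectangle follow from them.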
Step S206 can also be implemented by the browser, that is, the browser renders the content displayed by the target web page into an image format to obtain the target image. For example, Dooscape has a screenshot function. With a browser that is not open source, such as the IE browser, the screenshot function can be implemented by using the interface (API) provided by the browser in cooperation with other modules. For example, with the IE browser it can be built on the PrintWindow function in the Win32 API, integrated with other interfaces, modules or functions.
Implementing the method requires the modules to cooperate, so the modules corresponding to the steps can be organized into a workflow. For example, the method of the present invention can include: using a module with a management and scheduling function to call the other modules in order, so that they perform the steps of the foregoing method in sequence. The modules are thereby chained together and web page content acquisition is completed automatically. The module with the management and scheduling function can be, for example, crontab, the timed-task tool of Linux systems. With crontab, the execution time of each command is set in a crontab file; once the crond service has been started, the system makes the corresponding module execute the corresponding command at the set time. For example, if the start commands of the crawler, the browser and the content recognition module are crawler start, browser start and ocr start respectively, the start scripts are all located in the /etc/init.d directory, and the modules are to run at 8:10, 8:30 and 8:50 respectively, the relevant entries in the crontab file can be:
10 8 * * * /etc/init.d/crawler start
30 8 * * * /etc/init.d/browser start
50 8 * * * /etc/init.d/ocr start
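A crontab entry has five time fields (minute, hour, day of month, month, day of week) followed by the command, so a daily job at 8:10 is written with minute 10 and hour 8. The small helper below generates such daily entries from (hour, minute, command) triples. The helper names are hypothetical and the script paths are the ones named in the text; this is a convenience sketch, not part of the described method.

```python
def cron_entry(hour, minute, command):
    """Format one crontab line that runs `command` every day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Field order: minute hour day-of-month month day-of-week command
    return f"{minute} {hour} * * * {command}"


def daily_schedule(stages):
    """Join the entries for several (hour, minute, command) stages."""
    return "\n".join(cron_entry(h, m, cmd) for h, m, cmd in stages)


if __name__ == "__main__":
    print(daily_schedule([
        (8, 10, "/etc/init.d/crawler start"),
        (8, 30, "/etc/init.d/browser start"),
        (8, 50, "/etc/init.d/ocr start"),
    ]))
```

Fixed start times only chain the stages if each stage finishes before the next begins; the 20-minute spacing in the example is the patent's own choice and would need tuning for larger crawls.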
In addition, each of the foregoing steps can be performed on a general-purpose server or a cloud host, which gives higher security, stability and reliability.
The web page content acquisition system of one embodiment of the present invention is described below with reference to Fig. 7.
Fig. 7 is a structural diagram of one embodiment of the web page content acquisition system of the present invention. As shown in Fig. 7, the system of this embodiment includes: a URL acquisition module 72 for obtaining a target URL; a web page parsing module 74 for obtaining the corresponding target web page according to the target URL; an image acquisition module 76 for rendering the content displayed by the target web page into an image format to obtain a target image; and a recognition module 78 for recognizing the text information in the target image.
The system can also include a cropping module for cropping the target image to obtain a recognition region of the target image, in which case the recognition module is used to recognize the text information in the recognition region of the target image.
The recognition module can also be used to recognize the text information in the target image using optical character recognition.
The system can also include: a data cleaning module for performing data cleaning on the recognized text information; a classified storage module for storing the recognized text information by category; and/or an index module for building an index for the recognized text information.
The system can also include a web crawler, a browser, and a server cluster or cloud computing resource pool. The web crawler includes the URL acquisition module, the browser includes the web page parsing module and the image acquisition module, and the server cluster or cloud computing resource pool includes the recognition module.
In addition, the method according to the present invention can also be implemented as a computer program product that includes a computer-readable medium on which a computer program for performing the above functions defined in the method of the present invention is stored. A person skilled in the art will also understand that the various illustrative logical blocks, modules, circuits and algorithm steps described in connection with this disclosure can be implemented as electronic hardware, computer software, or a combination of both.
The above are only preferred embodiments of the present invention and do not limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (11)
1. A web page content acquisition method, including:
obtaining a target URL;
obtaining the corresponding target web page according to the target URL;
rendering the content displayed by the target web page into an image format to obtain a target image;
recognizing the text information in the target image.
2. The method according to claim 1, characterized in that the method also includes:
obtaining the target URL using web crawler technology;
obtaining the corresponding target web page according to the target URL using a browser.
3. The method according to claim 1, characterized in that the method also includes:
cropping the target image to obtain a recognition region of the target image;
recognizing the text information in the recognition region of the target image.
4. The method according to claim 1, characterized in that recognizing the text information in the target image includes:
recognizing the text information in the target image by means of a server cluster or a cloud computing resource pool.
5. The method according to claim 1 or 4, characterized in that recognizing the text information in the target image includes:
recognizing the text information in the target image using optical character recognition.
6. The method according to claim 1, characterized in that the method also includes:
performing data cleaning on the recognized text information, storing it by category, and/or building an index for it.
7. A web page content acquisition system, including:
a URL acquisition module for obtaining a target URL;
a web page parsing module for obtaining the corresponding target web page according to the target URL;
an image acquisition module for rendering the content displayed by the target web page into an image format to obtain a target image;
a recognition module for recognizing the text information in the target image.
8. The system according to claim 7, characterized in that it also includes a cropping module for cropping the target image to obtain a recognition region of the target image, the recognition module being used to recognize the text information in the recognition region of the target image.
9. The system according to claim 7, characterized in that the recognition module is used to recognize the text information in the target image using optical character recognition.
10. The system according to claim 7, characterized in that it also includes:
a data cleaning module for performing data cleaning on the recognized text information,
a classified storage module for storing the recognized text information by category,
and/or an index module for building an index for the recognized text information.
11. The system according to claim 7, characterized in that the system includes a web crawler, a browser, and a server cluster or cloud computing resource pool;
the web crawler includes the URL acquisition module, the browser includes the web page parsing module and the image acquisition module, and the server cluster or cloud computing resource pool includes the recognition module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510680981.3A CN106599001A (en) | 2015-10-20 | 2015-10-20 | Webpage content acquisition method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106599001A true CN106599001A (en) | 2017-04-26 |
Family
ID=58555112
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510680981.3A Pending CN106599001A (en) | 2015-10-20 | 2015-10-20 | Webpage content acquisition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599001A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109144567A (en) * | 2018-08-03 | 2019-01-04 | 苏州麦迪斯顿医疗科技股份有限公司 | Cross-platform webpage rendering method, device, server and storage medium |
CN109639770A (en) * | 2018-11-22 | 2019-04-16 | 山东中创软件工程股份有限公司 | A kind of data access method, device, equipment and medium |
CN109656563A (en) * | 2018-11-28 | 2019-04-19 | 北京旷视科技有限公司 | Code inspection method, apparatus, system and storage medium |
CN109753907A (en) * | 2018-12-27 | 2019-05-14 | 金现代信息产业股份有限公司 | Information flag method and system on a kind of line based on image recognition |
CN110020066A (en) * | 2017-07-31 | 2019-07-16 | 北京国双科技有限公司 | A kind of method and device of past crawler platform note task |
CN110069688A (en) * | 2019-03-16 | 2019-07-30 | 平安城市建设科技(深圳)有限公司 | Page display method, server, storage medium and the device of anti-crawler |
CN110516191A (en) * | 2019-08-30 | 2019-11-29 | 深圳点猫科技有限公司 | Webpage data is converted into the method and apparatus of picture file |
CN112131448A (en) * | 2020-08-06 | 2020-12-25 | 亿存(北京)信息科技有限公司 | Network information acquisition method and device and electronic equipment |
CN115657916A (en) * | 2022-12-20 | 2023-01-31 | 北京数智新天信息技术咨询有限公司 | Method and device for acquiring e-commerce data and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103927334A (en) * | 2014-03-24 | 2014-07-16 | 小米科技有限责任公司 | Webpage acquiring method and device |
CN104156490A (en) * | 2014-09-01 | 2014-11-19 | 北京奇虎科技有限公司 | Method and device for detecting suspicious fishing webpage based on character recognition |
CN104933138A (en) * | 2015-06-16 | 2015-09-23 | 携程计算机技术(上海)有限公司 | Webpage crawler system and webpage crawling method |
Non-Patent Citations (1)
Title |
---|
高鹏 等: "《高性能LINUX平台建构实践指南》", 31 July 2014, 中国铁道出版社 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110020066A (en) * | 2017-07-31 | 2019-07-16 | 北京国双科技有限公司 | A kind of method and device of past crawler platform note task |
CN109144567A (en) * | 2018-08-03 | 2019-01-04 | 苏州麦迪斯顿医疗科技股份有限公司 | Cross-platform webpage rendering method, device, server and storage medium |
CN109144567B (en) * | 2018-08-03 | 2021-09-14 | 苏州麦迪斯顿医疗科技股份有限公司 | Cross-platform webpage rendering method and device, server and storage medium |
CN109639770A (en) * | 2018-11-22 | 2019-04-16 | 山东中创软件工程股份有限公司 | A kind of data access method, device, equipment and medium |
CN109656563A (en) * | 2018-11-28 | 2019-04-19 | 北京旷视科技有限公司 | Code inspection method, apparatus, system and storage medium |
CN109753907A (en) * | 2018-12-27 | 2019-05-14 | 金现代信息产业股份有限公司 | Information flag method and system on a kind of line based on image recognition |
CN110069688A (en) * | 2019-03-16 | 2019-07-30 | 平安城市建设科技(深圳)有限公司 | Page display method, server, storage medium and the device of anti-crawler |
CN110516191A (en) * | 2019-08-30 | 2019-11-29 | 深圳点猫科技有限公司 | Webpage data is converted into the method and apparatus of picture file |
CN112131448A (en) * | 2020-08-06 | 2020-12-25 | 亿存(北京)信息科技有限公司 | Network information acquisition method and device and electronic equipment |
CN115657916A (en) * | 2022-12-20 | 2023-01-31 | 北京数智新天信息技术咨询有限公司 | Method and device for acquiring e-commerce data and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106599001A (en) | Webpage content acquisition method and system | |
CN108595583B (en) | Dynamic graph page data crawling method, device, terminal and storage medium | |
CN110334346B (en) | Information extraction method and device of PDF (Portable document Format) file | |
US8838657B1 (en) | Document fingerprints using block encoding of text | |
KR20160132842A (en) | Detecting and extracting image document components to create flow document | |
JP6827116B2 (en) | Web page clustering method and equipment | |
US10296552B1 (en) | System and method for automated identification of internet advertising and creating rules for blocking of internet advertising | |
CN102902693A (en) | Method for detecting repeat mode on webpages | |
CN111966868B (en) | Data management method based on identification analysis and related equipment | |
WO2019041442A1 (en) | Method and system for structural extraction of figure data, electronic device, and computer readable storage medium | |
CN111639648A (en) | Certificate identification method and device, computing equipment and storage medium | |
CN109977337A (en) | A kind of webpage design control methods, device, equipment and readable storage medium storing program for executing | |
CN110738049A (en) | Similar text processing method and device and computer readable storage medium | |
CN112818200A (en) | Data crawling and event analyzing method and system based on static website | |
CN115828874A (en) | Industry table digital processing method based on image recognition technology | |
CN109710224A (en) | Page processing method, device, equipment and storage medium | |
CN114912417A (en) | Service data processing method, device, equipment and storage medium | |
CN107391650A (en) | A kind of structuring method for splitting of document, apparatus and system | |
CN110688315A (en) | Interface code detection report generation method, electronic device, and storage medium | |
US10963690B2 (en) | Method for identifying main picture in web page | |
CN105790967A (en) | Weblog processing method and device | |
CN117423124A (en) | Table data processing method, device, equipment and medium based on table image | |
CN111581299A (en) | Inter-library data conversion system and method of multi-source data warehouse based on big data | |
CN110147516A (en) | The intelligent identification Method and relevant device of front-end code in Pages Design | |
US20130163873A1 (en) | Detecting Separator Lines in a Web Page |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20170426 |