CN102682085A - Method for removing duplicated web page - Google Patents

Method for removing duplicated web page

Info

Publication number
CN102682085A
Authority
CN
China
Prior art keywords
string
characteristic
value
word
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012101142636A
Other languages
Chinese (zh)
Inventor
李鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KUYUN INTERACTIVE TECHNOLOGY LIMITED
Original Assignee
TENFEN Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-04-18
Filing date
2012-04-18
Publication date
2012-09-19
Application filed by TENFEN Inc
Priority to CN2012101142636A
Publication of CN102682085A

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for removing duplicate web pages. The method comprises the following steps: extracting the text of a web page; performing word segmentation on the text; counting the segmentation results and sorting them by word frequency; selecting the words whose frequency exceeds a preset value as a feature-word result string; performing an MD5 operation on the feature-word result string to obtain the unique feature value of the web page; and comparing that MD5 value with the MD5 values of the feature-word result strings of all web pages stored in a feature-string deduplication judgment system. If an identical value exists, the web page is removed as a duplicate; otherwise, the MD5 value of its feature-word result string is stored in the feature-string deduplication judgment system. With the technical solution provided by the invention, near-duplicate web pages, repeated web pages, and mirror web pages in an existing system can be removed effectively.

Description

A method for removing duplicate web pages
Technical field
The present invention relates to the field of Internet technology, and in particular to a method for removing duplicate web pages.
Background art
With the development of Internet technology, the Internet has become an important source from which people obtain all kinds of information, but much of the information on the Internet is duplicated. Among the tens of billions of web pages that exist today, a large number repeat the same information, and these duplicate pages cause considerable trouble for information processing.
Current web page deduplication techniques are all based on the same basic idea: a set of fingerprints is computed for each web document; if two documents share a certain number of identical fingerprints, their content is considered to overlap heavily, that is, one is a repost of the other.
The document fingerprints are obtained with a full-text segment signature algorithm. The algorithm divides a web document into N segments according to some rule (for example, every n lines form one segment) and then signs each segment (that is, computes its fingerprint), so that each document can be represented by N fingerprints.
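As an illustration of the segment-signature approach described above, the following Python sketch splits a document into segments of n lines and fingerprints each segment. The use of MD5 as the signature function, the function names, and the default segment size are assumptions made for this example; the background discussion does not specify them.

```python
import hashlib


def segment_fingerprints(text: str, lines_per_segment: int = 3) -> set:
    """Split text into segments of n non-empty lines and hash each segment."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fingerprints = set()
    for start in range(0, len(lines), lines_per_segment):
        segment = "\n".join(lines[start:start + lines_per_segment])
        # MD5 is an assumed signature function, chosen only for illustration.
        fingerprints.add(hashlib.md5(segment.encode("utf-8")).hexdigest())
    return fingerprints


def shared_fingerprint_count(doc_a: str, doc_b: str, lines_per_segment: int = 3) -> int:
    """Count the fingerprints that two documents have in common."""
    fa = segment_fingerprints(doc_a, lines_per_segment)
    fb = segment_fingerprints(doc_b, lines_per_segment)
    return len(fa & fb)
```

Each document keeps N fingerprints in this scheme, and comparing two documents means intersecting two sets, which is consistent with the memory and computation cost that the background attributes to this approach.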
However, this deduplication scheme is computationally complex and consumes a large amount of memory.
Summary of the invention
The object of the present invention is to propose a method for removing duplicate web pages that can effectively remove reposted web pages, repeated web pages, and mirror web pages from an existing system.
To achieve this object, the present invention adopts the following technical solution:
A method for removing duplicate web pages comprises the following steps:
A. extracting the text of a web page;
B. performing word segmentation on the text;
C. counting the word segmentation results and sorting them by word frequency;
D. selecting the words whose frequency exceeds a preset value as a feature-word result string;
E. performing an MD5 operation on the feature-word result string to obtain the unique feature value of the web page;
F. comparing the MD5 value of the feature-word result string with the MD5 values of the feature-word result strings of all web pages in a feature-string deduplication judgment system; if an identical value exists, removing the web page as a duplicate; if no identical value exists, storing the MD5 value of the feature-word result string of the web page in the feature-string deduplication judgment system.
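Steps B through F can be sketched end to end in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: whitespace splitting stands in for word segmentation, ties in frequency are broken alphabetically so that the result is deterministic, and the separator, the default threshold, and all function names are hypothetical.

```python
import hashlib
from collections import Counter


def page_fingerprint(text: str, threshold: int = 2) -> str:
    """Return the MD5 of the feature-word result string for one page."""
    words = text.split()              # step B (assumed stand-in for segmentation)
    counts = Counter(words)           # step C: count the words
    # Step C: sort by descending frequency; alphabetical tie-break is an assumption.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Step D: keep the words whose frequency exceeds the preset value.
    feature_words = [w for w, c in ranked if c > threshold]
    feature_string = " ".join(feature_words)
    # Step E: MD5 of the feature-word result string.
    return hashlib.md5(feature_string.encode("utf-8")).hexdigest()


def is_duplicate(text: str, seen: set, threshold: int = 2) -> bool:
    """Step F: report a duplicate, or record the new fingerprint."""
    fp = page_fingerprint(text, threshold)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

One consequence worth noting: two pages whose high-frequency words and their ranking coincide get the same fingerprint even if low-frequency words differ, which is how the scheme catches near-duplicates. Pages in which no word exceeds the threshold all map to the MD5 of the empty string, so a practical system would probably need to treat that case separately.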
Step E further comprises:
performing an MD5 calculation on each feature word in the feature-word result string.
In step F, the MD5 value of the feature-word result string is compared first, and the MD5 values of the individual feature words in the feature-word result string are then compared within the results of that comparison.
The MD5 values of the feature-word result strings of all web pages in the feature-string deduplication judgment system are stored in a hash table.
In step B, a forward maximum matching word segmentation algorithm is applied to the web page text, using the combined terminology dictionary and general-purpose dictionary.
In step C, a trie (dictionary tree) is used to count the word segmentation results.
In step C, the counts are sorted with an in-memory quicksort, in descending order of word frequency.
Domain-specific terms are weighted before the word-frequency sort is performed.
The technical solution of the present invention has the following technical effects:
(1) reposted web pages, repeated web pages, and mirror web pages in an existing system can be removed effectively;
(2) hash-based lookup is fast and efficient, so that repeated content and similar content can be reasonably distinguished and handled;
(3) the more web pages there are to handle, the more clearly the hierarchical system shows its advantage;
(4) duplicates in small batches of web pages can be removed simply and quickly, and batch queries and batch additions are fast;
(5) file-based storage of the deduplication system allows large volumes of web pages to be deduplicated, and the hash lookup system can quickly store the hash deduplication structure to a file offline.
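Effect (5) refers to keeping the hash deduplication structure in a file. The text does not describe the file format, so the following Python sketch simply assumes one hexadecimal MD5 digest per line; the function names and the format are hypothetical.

```python
from pathlib import Path


def save_fingerprints(fingerprints: set, path: str) -> None:
    """Write the fingerprint set to a text file, one digest per line."""
    Path(path).write_text("\n".join(sorted(fingerprints)), encoding="utf-8")


def load_fingerprints(path: str) -> set:
    """Load a fingerprint set saved earlier; return an empty set if no file exists."""
    p = Path(path)
    if not p.exists():
        return set()
    return {line for line in p.read_text(encoding="utf-8").splitlines() if line}
```

Loading the file at start-up and saving it after a batch lets a later deduplication run reuse the fingerprints collected by an earlier one.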
Brief description of the drawings
Fig. 1 is a flowchart of web page deduplication in a specific embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is further described below through an embodiment and with reference to the accompanying drawing.
As shown in Fig. 1, the web page deduplication flow comprises the following steps:
Step 101: extract the text of the web page using existing web page recognition and text extraction techniques.
Step 102: apply a forward maximum matching word segmentation algorithm to the web page text, using the combined terminology dictionary and general-purpose dictionary.
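Forward maximum matching scans the text from left to right and, at each position, takes the longest dictionary word that starts there. The Python sketch below illustrates the idea; the two small dictionaries are invented for the example, and falling back to a single character when nothing matches is a common convention that the patent text does not spell out.

```python
def forward_max_match(text: str, dictionary: set) -> list:
    """Segment text by greedily taking the longest dictionary word at each position."""
    max_len = max([len(w) for w in dictionary] + [1])
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback (assumption).
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionaries: domain terminology plus general vocabulary.
terminology = {"搜索引擎", "网页去重"}
general = {"搜索", "引擎", "网页", "方法", "的"}
print(forward_max_match("搜索引擎的网页去重方法", terminology | general))
# ['搜索引擎', '的', '网页去重', '方法']
```

Because the longest match wins, the terminology entry "搜索引擎" is kept whole instead of being split into the general words "搜索" and "引擎".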
Step 103: count the word segmentation results using a trie (dictionary tree), which reduces memory use and improves efficiency.
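A trie stores words that share a prefix along a common path, which is where the memory saving comes from. The following Python sketch counts words with a trie; the class and method names are hypothetical, and the node layout (a dictionary of children plus a counter) is one simple choice among several.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # character -> child node
        self.count = 0       # number of times a word ends at this node


class WordCountTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word: str) -> None:
        """Walk or create the path for word and increment its end counter."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = TrieNode()
            node = child
        node.count += 1

    def items(self):
        """Yield (word, count) for every word stored in the trie."""
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                yield prefix, node.count
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
```

The pairs produced by `items()` are the input to the frequency sort in the next step.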
Step 104: sort the counts with an in-memory quicksort, in descending order of word frequency.
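The sort in this step can be sketched as an in-place quicksort over (word, count) pairs. The pivot choice (middle element) and the alphabetical tie-break are assumptions made so that the example is deterministic; the text only says that quicksort is used and that the order is descending by frequency.

```python
def quicksort_by_frequency(pairs: list) -> None:
    """Sort (word, count) pairs in place, highest count first."""

    def key(pair):
        # Negating the count gives descending order; ties fall back to the word.
        return (-pair[1], pair[0])

    def sort(lo: int, hi: int) -> None:
        while lo < hi:
            pivot = key(pairs[(lo + hi) // 2])
            i, j = lo, hi
            while i <= j:
                while key(pairs[i]) < pivot:
                    i += 1
                while key(pairs[j]) > pivot:
                    j -= 1
                if i <= j:
                    pairs[i], pairs[j] = pairs[j], pairs[i]
                    i += 1
                    j -= 1
            # Recurse into the smaller part and loop on the larger one
            # to keep the recursion depth logarithmic.
            if j - lo < hi - i:
                sort(lo, j)
                lo = i
            else:
                sort(i, hi)
                hi = j

    sort(0, len(pairs) - 1)
```

In everyday Python code the built-in `sorted` would do the same job; the explicit version is shown only because the text names quicksort.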
Step 105: select the words whose frequency exceeds a preset value (for example, 10 occurrences) as the feature-word result string. To improve accuracy, domain-specific terms can be weighted so that they rank higher; this avoids interference from high-frequency stop words but increases the sorting time.
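The selection and weighting in this step might look like the Python sketch below. The text does not give a weighting formula, so multiplying the count of a domain term by a fixed factor is an assumption, as are the factor itself, the decision to apply the threshold to the weighted score, and the function name.

```python
def select_feature_words(counts: dict, threshold: float = 10,
                         domain_terms: frozenset = frozenset(),
                         domain_weight: float = 2.0) -> list:
    """Rank words by (optionally weighted) frequency and keep those above threshold."""
    # Assumed weighting: multiply the count of each domain term by a fixed factor.
    scores = {
        word: count * domain_weight if word in domain_terms else float(count)
        for word, count in counts.items()
    }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return [w for w in ranked if scores[w] > threshold]


counts = {"the": 40, "trie": 8, "hash": 12, "cat": 3}
print(select_feature_words(counts))
# ['the', 'hash']
print(select_feature_words(counts, domain_terms=frozenset({"trie", "hash"}), domain_weight=4.0))
# ['hash', 'the', 'trie']
```

With weighting, the domain terms move ahead of the stop word "the" and "trie" clears the threshold, which matches the stated goal of reducing stop-word interference.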
Step 106: perform an MD5 operation on the feature-word result string to obtain a fixed-length value that serves as the unique feature value of the web page. In addition, compute the MD5 value of each feature word in the feature-word result string and store these values in a back-end database.
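Both kinds of MD5 value in this step can be computed with Python's standard `hashlib` module, as in the sketch below. The separator used to join the feature words and the function names are assumptions; the text does not say how the words are concatenated.

```python
import hashlib


def md5_hex(value: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def fingerprint(feature_words: list, separator: str = "|"):
    """Return (page-level MD5, list of per-word MD5 values)."""
    page_md5 = md5_hex(separator.join(feature_words))
    word_md5s = [md5_hex(word) for word in feature_words]
    return page_md5, word_md5s
```

Whatever the length of the feature-word result string, the page-level value is always 128 bits (32 hexadecimal characters), which is what makes it convenient as a hash-table key.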
Step 107: the MD5 values of the feature-word result strings of all web pages in the feature-string deduplication judgment system are stored in a hash table.
Step 108: compare the MD5 value of the feature-word result string of the web page with the MD5 values of the feature-word result strings of all web pages in the feature-string deduplication judgment system. If an identical value exists, remove the web page as a duplicate; if not, store the MD5 value of its feature-word result string in the hash table of the feature-string deduplication judgment system for use in subsequent deduplication.
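The hash-table store and the check in these two steps can be sketched with a Python `dict`, which is itself a hash table. Recording a page identifier alongside each MD5 value is an addition made for illustration so that a duplicate can be traced to the page first seen; the class name, method name, and identifiers are hypothetical.

```python
class FingerprintStore:
    """In-memory stand-in for the feature-string deduplication judgment system."""

    def __init__(self):
        self._first_seen = {}   # page-level MD5 -> identifier of the first page

    def check_and_add(self, page_md5: str, page_id: str):
        """Return the id of the earlier page if page_md5 is known, else store it."""
        original = self._first_seen.get(page_md5)
        if original is not None:
            return original          # duplicate: caller discards the page
        self._first_seen[page_md5] = page_id
        return None                  # new page: fingerprint is now stored
```

Lookup and insertion both take constant time on average, so the cost of checking a page does not grow with the number of pages already stored.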
To improve efficiency when there is a large amount of web page source data and memory is limited, a two-level lookup can be used: the first lookup locates a group, and the second lookup locates an entry within that group. The MD5 value of the feature-word result string of the web page is compared first; then, within the results of that comparison, the MD5 values of the individual feature words in the feature-word result string are compared to determine whether the page is a duplicate.
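One possible reading of the two-level lookup is sketched below in Python: the page-level MD5 selects a group, and the per-word MD5 values are then compared inside that group. The grouping structure, the separator, and all names are assumptions; the text does not specify how groups are formed or stored.

```python
import hashlib
from collections import defaultdict


def md5_hex(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()


class TwoLevelIndex:
    def __init__(self):
        # Level 1 key: page-level MD5. Level 2: set of per-word MD5 tuples.
        self._groups = defaultdict(set)

    def is_duplicate(self, feature_words: list) -> bool:
        page_md5 = md5_hex("|".join(feature_words))
        word_md5s = tuple(md5_hex(w) for w in feature_words)
        group = self._groups.get(page_md5)        # first lookup: find the group
        if group is not None and word_md5s in group:
            return True                           # second lookup: match in group
        self._groups[page_md5].add(word_md5s)
        return False
```

In this reading the second comparison acts as a confirmation step: a page counts as a duplicate only if both its page-level value and its per-word values match an earlier page.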
The above is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims (8)

1. A method for removing duplicate web pages, characterized by comprising the following steps:
A. extracting the text of a web page;
B. performing word segmentation on the text;
C. counting the word segmentation results and sorting them by word frequency;
D. selecting the words whose frequency exceeds a preset value as a feature-word result string;
E. performing an MD5 operation on the feature-word result string to obtain the unique feature value of the web page;
F. comparing the MD5 value of the feature-word result string with the MD5 values of the feature-word result strings of all web pages in a feature-string deduplication judgment system; if an identical value exists, removing the web page as a duplicate; if no identical value exists, storing the MD5 value of the feature-word result string of the web page in the feature-string deduplication judgment system.
2. The method for removing duplicate web pages according to claim 1, characterized in that step E further comprises:
performing an MD5 calculation on each feature word in the feature-word result string;
and in that, in step F, the MD5 value of the feature-word result string is compared first, and the MD5 values of the individual feature words in the feature-word result string are then compared within the results of that comparison.
3. The method for removing duplicate web pages according to claim 1 or 2, characterized in that the MD5 values of the feature-word result strings of all web pages in the feature-string deduplication judgment system are stored in a hash table.
4. The method for removing duplicate web pages according to claim 1, characterized in that, in step B, a forward maximum matching word segmentation algorithm is applied to the web page text, using the combined terminology dictionary and general-purpose dictionary.
5. The method for removing duplicate web pages according to claim 1, characterized in that, in step C, a trie (dictionary tree) is used to count the word segmentation results.
6. The method for removing duplicate web pages according to claim 1 or 5, characterized in that, in step C, the counts are sorted with an in-memory quicksort, in descending order of word frequency.
7. The method for removing duplicate web pages according to claim 4, characterized in that domain-specific terms are weighted before the word-frequency sort is performed.
8. The method for removing duplicate web pages according to claim 1, characterized in that, in step F, the deduplication result is stored in the feature-string deduplication judgment system in a hash table.
CN2012101142636A 2012-04-18 2012-04-18 Method for removing duplicated web page Pending CN102682085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101142636A CN102682085A (en) 2012-04-18 2012-04-18 Method for removing duplicated web page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012101142636A CN102682085A (en) 2012-04-18 2012-04-18 Method for removing duplicated web page

Publications (1)

Publication Number Publication Date
CN102682085A true CN102682085A (en) 2012-09-19

Family

ID=46814010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101142636A Pending CN102682085A (en) 2012-04-18 2012-04-18 Method for removing duplicated web page

Country Status (1)

Country Link
CN (1) CN102682085A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694658A (en) * 2009-10-20 2010-04-14 浙江大学 Method for constructing webpage crawler based on repeated removal of news
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering
CN102402537A (en) * 2010-09-15 2012-04-04 盛乐信息技术(上海)有限公司 Chinese web page text deduplication system and method

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778163A (en) * 2012-10-26 2014-05-07 广州市邦富软件有限公司 Rapid webpage de-weight algorithm based on fingerprints
CN103399872B (en) * 2013-07-10 2016-09-28 北京奇虎科技有限公司 The method and apparatus that webpage capture is optimized
CN103399872A (en) * 2013-07-10 2013-11-20 北京奇虎科技有限公司 Method and device for optimizing webpage capture
CN103399874A (en) * 2013-07-10 2013-11-20 北京奇虎科技有限公司 Method and device for optimizing capture of webpages under same domain name
CN103399874B (en) * 2013-07-10 2016-12-28 北京奇虎科技有限公司 The method and apparatus that webpage capture under same domain name is optimized
CN104636340A (en) * 2013-11-06 2015-05-20 腾讯科技(深圳)有限公司 Webpage URL filtering method, device and system
CN104021178A (en) * 2014-06-04 2014-09-03 深圳市腾讯计算机系统有限公司 Multimedia information filtering method and device
CN104021178B (en) * 2014-06-04 2018-02-02 深圳市腾讯计算机系统有限公司 Multimedia messages filter method and device
CN104050294A (en) * 2014-06-30 2014-09-17 北京奇虎科技有限公司 Method and device for exploiting rare resources of internet
CN105574004B (en) * 2014-10-10 2019-06-21 阿里巴巴集团控股有限公司 A kind of removing duplicate webpages method and apparatus
CN105574004A (en) * 2014-10-10 2016-05-11 阿里巴巴集团控股有限公司 Webpage deduplication method and device
CN106547764A (en) * 2015-09-18 2017-03-29 北京国双科技有限公司 The method and device of web data duplicate removal
CN106547777A (en) * 2015-09-21 2017-03-29 北京国双科技有限公司 The statistical method and device of article reprinting amount
CN106547780A (en) * 2015-09-21 2017-03-29 北京国双科技有限公司 Article reprints statistics of variables method and device
CN107045513A (en) * 2016-02-05 2017-08-15 北京迅奥科技有限公司 Web page title denoising
CN105681046A (en) * 2016-02-29 2016-06-15 郑州悉知信息科技股份有限公司 UGC fingerprint signature determination method and device as well as UGC deduplication method and device
CN105956068A (en) * 2016-04-27 2016-09-21 湖南蚁坊软件有限公司 Webpage URL repetition elimination method based on distributed database
CN106569989A (en) * 2016-10-20 2017-04-19 北京智能管家科技有限公司 De-weighting method and apparatus for short text
CN106528510A (en) * 2016-11-18 2017-03-22 山东浪潮云服务信息科技有限公司 Method and device for processing data
CN110147363A (en) * 2019-04-09 2019-08-20 华迪计算机集团有限公司 A kind of the data deduplication method for cleaning and system of information full-text search
CN112307303A (en) * 2020-10-29 2021-02-02 扆亮海 Efficient and accurate network page duplicate removal system based on cloud computing
CN113704240A (en) * 2021-09-23 2021-11-26 世纪龙信息网络有限责任公司 Data deduplication method
CN116263792A (en) * 2023-04-21 2023-06-16 云目未来科技(湖南)有限公司 Method and system for crawling complex internet data
CN116263792B (en) * 2023-04-21 2023-07-18 云目未来科技(湖南)有限公司 Method and system for crawling complex internet data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: BEIJING KUYUN INTERACTION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: BEIJING TENFEN TECHNOLOGY CO., LTD.

Effective date: 20140123

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100004 CHAOYANG, BEIJING TO: 100007 DONGCHENG, BEIJING

TA01 Transfer of patent application right

Effective date of registration: 20140123

Address after: 100007 Beijing City, Dongcheng District Andingmen East Street, No. 28, building B block 15 layer

Applicant after: KUYUN INTERACTIVE TECHNOLOGY LIMITED

Address before: No. 7 East Hanwei building 18A1 100004 Beijing City Guanghua Road Chaoyang District

Applicant before: Tenfen Inc.

C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120919