CN102682085A - Method for removing duplicated web page

Info

Publication number: CN102682085A
Application number: CN2012101142636A
Authority: CN
Inventors: 李鹏
Original assignee: TENFEN Inc
Current assignee: KUYUN INTERACTIVE TECHNOLOGY LIMITED
Priority date: 2012-04-18
Filing date: 2012-04-18
Publication date: 2012-09-19

Abstract

The invention discloses a method for removing duplicate web pages. The method comprises the following steps: extracting the text of a web page; performing word segmentation on the text; counting the segmentation results and sorting them by word frequency; selecting the words whose frequency exceeds a preset value to form a feature-word string; computing the MD5 of the feature-word string as the unique feature value of the web page; and comparing that MD5 value with the MD5 values of the feature-word strings of all web pages held in a feature-string deduplication judgment system. If an identical value exists, the web page is removed as a duplicate; otherwise, the MD5 value of its feature-word string is stored in the feature-string deduplication judgment system. With the technical scheme provided by the invention, near-duplicate web pages, repeated web pages and mirror web pages in an existing system can be removed effectively.

Description

A method for removing duplicate web pages

Technical field

The present invention relates to the field of Internet technology, and in particular to a method for removing duplicate web pages.

Background technology

With the development of Internet technology, the Internet has become an important source from which people obtain all kinds of information. However, much of the information on the Internet is duplicated. Among the more than ten billion web pages that currently exist, a large number repeat information found elsewhere, and these duplicate pages greatly complicate information processing.

Current web page deduplication techniques are all based on the same basic idea: a set of fingerprints is computed for each web document, and if two documents share a certain number of identical fingerprints, their content is considered to overlap heavily, that is, one is taken to be a reprint of the other.

The fingerprints of a web document are obtained with an algorithm that signs the full text segment by segment. The algorithm divides a web document into N segments according to some rule (for example, every n lines form one segment) and then signs each segment (that is, computes its fingerprint), so that each document can be represented by its N segment fingerprints.
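The segment-signature approach described above can be illustrated with a minimal Python sketch. The function names, the choice of MD5 as the segment signature and the default of three lines per segment are assumptions made for illustration; the background section does not name a specific hash or segment size.

```python
import hashlib


def segment_fingerprints(text: str, lines_per_segment: int = 3) -> set[str]:
    """Split text into segments of n non-empty lines and sign each segment.

    MD5 is used here only as an example signature function (assumption).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    segments = [
        "\n".join(lines[i:i + lines_per_segment])
        for i in range(0, len(lines), lines_per_segment)
    ]
    return {hashlib.md5(seg.encode("utf-8")).hexdigest() for seg in segments}


def shared_fingerprints(doc_a: str, doc_b: str, lines_per_segment: int = 3) -> int:
    """Count the segment fingerprints that two documents have in common."""
    return len(
        segment_fingerprints(doc_a, lines_per_segment)
        & segment_fingerprints(doc_b, lines_per_segment)
    )
```

Every document carries N fingerprints in this scheme, and each pairwise comparison is a set intersection, which is consistent with the computation and memory cost that the background section attributes to this approach.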

However, this deduplication scheme is computationally complex and consumes a large amount of memory.

Summary of the invention

The object of the present invention is to provide a method for removing duplicate web pages that can effectively remove reprinted web pages, repeated web pages and mirror web pages from an existing system.

To achieve this object, the present invention adopts the following technical scheme:

A method for removing duplicate web pages comprises the following steps:

A. extracting the text of a web page;

B. performing word segmentation on the web page text;

C. counting the word segmentation results and sorting them by word frequency;

D. selecting the words whose frequency exceeds a preset value to form a feature-word string;

E. computing the MD5 of the feature-word string as the unique feature value of the web page;

F. comparing the MD5 value of the feature-word string with the MD5 values of the feature-word strings of all web pages in a feature-string deduplication judgment system; if an identical value exists, removing the web page as a duplicate; if not, storing the MD5 value of the web page's feature-word string in the feature-string deduplication judgment system.
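Steps C to F can be sketched in Python as follows, assuming that text extraction and word segmentation (steps A and B) have already produced a list of tokens. The function names, the "|" separator, the alphabetical tie-break between equally frequent words and the use of an in-memory set as the judgment system are illustrative assumptions that the method itself does not specify.

```python
import hashlib
from collections import Counter


def feature_string(tokens: list[str], min_count: int = 10, sep: str = "|") -> str:
    """Steps C and D: count words, sort by frequency, keep those above the preset value."""
    counts = Counter(tokens)
    # Highest frequency first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sep.join(word for word, n in ranked if n > min_count)


def page_fingerprint(tokens: list[str], min_count: int = 10) -> str:
    """Step E: MD5 of the feature-word string, as a 32-character hex digest."""
    return hashlib.md5(feature_string(tokens, min_count).encode("utf-8")).hexdigest()


def is_duplicate(tokens: list[str], seen: set[str], min_count: int = 10) -> bool:
    """Step F: report a duplicate if the fingerprint is known, otherwise record it."""
    fp = page_fingerprint(tokens, min_count)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

In this sketch, a page with no word above the preset value yields an empty feature-word string, so all such pages would share one fingerprint; a caller would probably want to treat that case separately.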

Step E further comprises the following step:

computing the MD5 of each feature word in the feature-word string.

In step F, the MD5 value of the feature-word string is compared first, and the MD5 values of the individual feature words in the feature-word string are then compared within the results of that first comparison.
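One possible reading of this two-stage comparison is sketched below in Python: pages are first grouped by the MD5 of the whole feature-word string, and a page counts as a duplicate only if the per-word MD5 values also match an entry in that group. The class name, the "|" separator and the dictionary-of-lists layout are assumptions made for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 of a string as a hex digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class TwoStageIndex:
    """Group entries by the string MD5, then confirm with the per-word MD5 values."""

    def __init__(self) -> None:
        # string MD5 -> list of per-word MD5 tuples seen under that string MD5
        self._groups: dict[str, list[tuple[str, ...]]] = {}

    def check_and_add(self, feature_words: list[str]) -> bool:
        """Return True if the page is a duplicate; otherwise record it and return False."""
        string_md5 = md5_hex("|".join(feature_words))
        word_md5s = tuple(md5_hex(word) for word in feature_words)
        candidates = self._groups.setdefault(string_md5, [])  # first comparison
        if word_md5s in candidates:                           # second comparison
            return True
        candidates.append(word_md5s)
        return False
```

Under these assumptions the second stage separates pages whose joined strings coincide although their word lists differ, for example `["web", "page"]` and a single token `"web|page"`.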

The MD5 values of the feature-word strings of all web pages in the feature-string deduplication judgment system are stored in a hash table.
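A minimal Python sketch of such a hash-table store is given below. Python's built-in `set` is itself a hash table, so membership tests take constant time on average. The class name `FingerprintStore` and the one-digest-per-line file used to keep the table between runs are assumptions made for illustration, not details given by the method.

```python
import os


class FingerprintStore:
    """Hash-table store of MD5 digests, appended to a text file (hypothetical layout)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._seen: set[str] = set()
        # Reload digests saved by an earlier run, one hex digest per line.
        if os.path.exists(path):
            with open(path, encoding="ascii") as fh:
                self._seen = {line.strip() for line in fh if line.strip()}

    def __contains__(self, digest: str) -> bool:
        return digest in self._seen

    def __len__(self) -> int:
        return len(self._seen)

    def add(self, digest: str) -> bool:
        """Store a digest; return False if it was already present."""
        if digest in self._seen:
            return False
        self._seen.add(digest)
        with open(self.path, "a", encoding="ascii") as fh:
            fh.write(digest + "\n")
        return True
```

Appending to the file on every insert keeps the sketch short; a batch-oriented system would more likely buffer writes.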

In step B, the web page text is segmented with a forward maximum matching word segmentation algorithm, using the union of a domain-term dictionary and a general-purpose dictionary.
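Forward maximum matching scans the text from left to right and, at each position, takes the longest dictionary word that starts there. A minimal Python sketch follows; the function name, the fallback of emitting a single character when no dictionary word matches, and the tiny example dictionaries are assumptions made for illustration.

```python
def forward_max_match(text: str, dictionary: set[str]) -> list[str]:
    """Segment text by greedily taking the longest dictionary word at each position."""
    # Longest candidate to try; at least 1 so that the scan always advances.
    max_len = max(1, max((len(word) for word in dictionary), default=1))
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionaries: the union of domain terms and general words is used.
domain_terms = {"搜索引擎"}
general_words = {"搜索", "引擎", "网页", "去重"}
tokens = forward_max_match("搜索引擎网页去重", domain_terms | general_words)
```

With the domain term present, the four-character word is kept whole; with the general dictionary alone, the same text would be split into two two-character words at that point.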

In step C, a trie (dictionary tree) is used to count the word segmentation results.
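A trie stores words character by character, so words with a common prefix share nodes. The Python sketch below keeps an occurrence counter at the node where each word ends; the class and method names are assumptions made for illustration.

```python
class TrieNode:
    """One character position in the trie; count is non-zero where a word ends."""

    __slots__ = ("children", "count")

    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        self.count = 0


class WordCountTrie:
    """Counts word occurrences; words with a common prefix share nodes."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str) -> None:
        """Record one occurrence of word."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
            node = child
        node.count += 1

    def count(self, word: str) -> int:
        """Return how many times word was added (0 if never)."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count

    def items(self):
        """Yield (word, count) pairs for every stored word, in no particular order."""
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                yield prefix, node.count
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
```

The pairs produced by `items()` can be passed directly to the sorting step.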

In step C, the counts are sorted with an in-memory quicksort, from the highest word frequency to the lowest.
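An in-memory quicksort over (word, count) pairs, ordered from highest to lowest count, might look like the following Python sketch. The function name and the Hoare-style partition with a middle pivot are assumptions; the method only states that a quicksort is used.

```python
def quicksort_by_count(pairs: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Sort (word, count) pairs in place, highest count first, and return the list."""

    def sort_range(lo: int, hi: int) -> None:
        while lo < hi:
            pivot = pairs[(lo + hi) // 2][1]
            i, j = lo, hi
            while i <= j:
                while pairs[i][1] > pivot:   # already on the high-count side
                    i += 1
                while pairs[j][1] < pivot:   # already on the low-count side
                    j -= 1
                if i <= j:
                    pairs[i], pairs[j] = pairs[j], pairs[i]
                    i += 1
                    j -= 1
            # Recurse into the smaller part and loop on the larger one,
            # which keeps the recursion depth logarithmic.
            if j - lo < hi - i:
                sort_range(lo, j)
                lo = i
            else:
                sort_range(i, hi)
                hi = j

    sort_range(0, len(pairs) - 1)
    return pairs
```

Quicksort is not stable, so words with equal counts may appear in any relative order. In everyday Python code, `sorted(pairs, key=lambda p: -p[1])` would be the usual choice; the explicit version is shown only to mirror the step as described.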

Domain-specific terms are weighted before the words are sorted by frequency.
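The weighting scheme is not specified. One simple possibility, shown in the Python sketch below, is to multiply the count of each domain-specific term by a fixed factor before sorting; the function name, the multiplicative form and the factor of 2.0 are all assumptions made for illustration.

```python
def weighted_ranking(
    counts: dict[str, int],
    domain_terms: set[str],
    weight: float = 2.0,
) -> list[tuple[str, float]]:
    """Boost domain-term counts by a fixed factor, then sort from high to low score."""
    scored = {
        word: n * (weight if word in domain_terms else 1.0)
        for word, n in counts.items()
    }
    # Ties are broken alphabetically so that the ranking is deterministic.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical counts: the stop word "the" is the most frequent raw token.
ranking = weighted_ranking({"the": 12, "crawler": 8, "page": 5}, {"crawler"})
```

Here the domain term `crawler` moves ahead of the more frequent stop word `the` once its count is doubled.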

The technical scheme of the present invention has the following technical effects:

(1) reprinted web pages, repeated web pages and mirror web pages in an existing system can be removed effectively;

(2) the hash lookup system can be processed quickly and efficiently, so that duplicate content and merely similar content are reasonably distinguished;

(3) the larger the volume of Internet web pages to be handled, the more clearly the hierarchical system shows its advantage;

(4) small batches of web pages can be deduplicated simply and quickly, and batch queries and batch additions are handled quickly;

(5) the deduplication system uses file storage for fast storage management so that web page deduplication can cope with large data volumes, and the hash lookup system can quickly store the hash deduplication structure file offline.

Description of drawings

Fig. 1 is a flowchart of web page deduplication in a specific embodiment of the present invention.

Embodiment

The technical scheme of the present invention is further described below through an embodiment and with reference to the accompanying drawing.

Fig. 1 is a flowchart of web page deduplication in a specific embodiment of the present invention. As shown in Fig. 1, the deduplication flow comprises the following steps:

Step 101: the text of the web page is extracted using existing web page identification and web page text extraction techniques.

Step 102: the web page text is segmented with a forward maximum matching word segmentation algorithm, using the union of a domain-term dictionary and a general-purpose dictionary.

Step 103: the word segmentation results are counted. A trie (dictionary tree) is used for the counting, which reduces memory use and improves efficiency.

Step 104: the counts are sorted with an in-memory quicksort, from the highest word frequency to the lowest.

Step 105: the words whose frequency exceeds a preset value (for example, 10 occurrences) are selected to form the feature-word string. To improve accuracy, domain-specific terms can be weighted so that they rank higher. This avoids interference from high-frequency stop words, at the cost of a longer sorting time.

Step 106: the MD5 of the feature-word string is computed to obtain a fixed-length value that serves as the unique feature value of the web page. The MD5 of each feature word in the feature-word string is also computed and stored in a back-end database.

Step 107: the MD5 values of the feature-word strings of all web pages in the feature-string deduplication judgment system are stored in a hash table.

Step 108: the MD5 value of the web page's feature-word string is compared with the MD5 values of the feature-word strings of all web pages in the feature-string deduplication judgment system. If an identical value exists, the web page is removed as a duplicate. If not, the MD5 value of the web page's feature-word string is stored in the hash table of the feature-string deduplication judgment system, where it is available for the next round of deduplication.

To improve efficiency when there is a large amount of source web page data and memory is limited, a two-pass hierarchical lookup can be used: the first pass locates the group, and the second pass locates the entry within the group. The MD5 value of the web page's feature-word string is compared first, and the MD5 values of the individual feature words in the feature-word string are then compared within the results of that comparison to determine whether the page is a duplicate.

The above is merely a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention is therefore defined by the claims.

Claims

1. A method for removing duplicate web pages, characterized in that it comprises the following steps:

A. extracting the text of a web page;

B. performing word segmentation on the web page text;

2. The method for removing duplicate web pages according to claim 1, characterized in that step E further comprises the following steps:

3. The method for removing duplicate web pages according to claim 1 or 2, characterized in that the MD5 values of the feature-word strings of all web pages in the feature-string deduplication judgment system are stored in a hash table.

4. The method for removing duplicate web pages according to claim 1, characterized in that, in step B, the web page text is segmented with a forward maximum matching word segmentation algorithm, using the union of a domain-term dictionary and a general-purpose dictionary.

5. The method for removing duplicate web pages according to claim 1, characterized in that, in step C, a trie (dictionary tree) is used to count the word segmentation results.

6. The method for removing duplicate web pages according to claim 1 or 5, characterized in that, in step C, the counts are sorted with an in-memory quicksort, from the highest word frequency to the lowest.

7. The method for removing duplicate web pages according to claim 4, characterized in that domain-specific terms are weighted before the words are sorted by frequency.

8. The method for removing duplicate web pages according to claim 1, characterized in that, in step F, the deduplication result is stored in the feature-string deduplication judgment system in the form of a hash table.