CN109635084A

CN109635084A - Real-time fast deduplication method and system for multi-source data documents

Info

Publication number: CN109635084A
Application number: CN201811456999.5A
Authority: CN
Inventors: 柴志伟; 丑晓慧; 许冠宇; 宋乐安; 许涵洋
Original assignee: Shanghai Shenqin Information Technology Co Ltd; Ningbo Shenqin Information Technology Co Ltd
Current assignee: Shanghai Shenqin Information Technology Co Ltd; Ningbo Shenqin Information Technology Co Ltd
Priority date: 2018-11-30
Filing date: 2018-11-30
Publication date: 2019-04-16
Anticipated expiration: 2038-11-30
Also published as: CN109635084B

Abstract

The invention belongs to the technical field of information processing, and in particular relates to a real-time fast deduplication method and system for multi-source data documents. The method comprises the following steps: receiving a current document and filtering it to obtain filtered document data; calculating a feature word of the document data with a locality-sensitive hashing algorithm; judging, according to the feature word and the document data, whether the current document is similar to the previous documents stored in a database; and, if it is dissimilar, storing the feature word and the document data of the current document in the database, and otherwise not storing them. The invention can deduplicate similar document data from different sources quickly and in real time, avoiding repeated storage of similar documents.

Description

Real-time fast deduplication method and system for multi-source data documents

Technical field

The invention belongs to the technical field of information processing, and in particular relates to a real-time fast deduplication method and system for multi-source data documents.

Background art

Document data from different websites often have identical or highly repetitive article content because of forwarding or borrowing. In practical applications, such similar articles need to be screened and filtered out. The traditional approach of manual editing consumes a large amount of labor, and for data that must be pushed in real time, such as news, manual deduplication is far too slow. General deduplication algorithms have high memory usage when computing online, and an excessive data volume easily causes memory overflow; offline computation solves the memory overflow problem but cannot guarantee timely deduplication.

Summary of the invention

In view of the defects in the prior art, the present invention provides a real-time fast deduplication method and system for multi-source data documents, which can deduplicate similar document data from different sources quickly and in real time and avoid repeated storage of similar documents.

In a first aspect, the present invention provides a real-time fast deduplication method for multi-source data documents, comprising the following steps:

receiving a current document and filtering the current document to obtain filtered document data;

calculating a feature word of the document data by a locality-sensitive hashing algorithm;

judging, according to the feature word and the document data, whether the current document is similar to the previous documents stored in a database;

if dissimilar, storing the feature word and the document data of the current document in the database, and otherwise not storing them.

Preferably, calculating the feature word of the document data by the locality-sensitive hashing algorithm comprises the following specific steps:

segmenting the body content of the document data to obtain a number of words;

determining the weight of each word by word-frequency statistics;

mapping each word to a hash value with a hash algorithm;

weighting the hash value of each word according to its weight to obtain a weighted numeric string;

summing the numeric strings of all words bit by bit to obtain a final numeric string;

converting the final numeric string into a 64-bit feature word in binary (0/1) form.
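The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the text does not name the hash algorithm, so MD5 truncated to 64 bits is an arbitrary choice here; the input is assumed to be already segmented with meaningless words removed; the function name `feature_word` is hypothetical; and a position whose sum is exactly 0 is mapped to 0, a case the text leaves open. The overall scheme resembles the widely known SimHash fingerprint.

```python
import hashlib
from collections import Counter

BITS = 64  # length of the feature word stated in the text


def feature_word(words):
    """Return a 64-bit feature word (as an int) for a list of segmented words."""
    if not words:
        return 0
    total = len(words)
    sums = [0.0] * BITS
    for word, count in Counter(words).items():
        weight = count / total  # word frequency used as the weight
        # Assumption: first 8 bytes of MD5 serve as the 64-bit word hash.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            bit = (h >> (BITS - 1 - i)) & 1
            # A 1 bit contributes +weight, a 0 bit contributes -weight.
            sums[i] += weight if bit else -weight
    value = 0
    for s in sums:
        # Positive sums become 1; negative (and, by assumption, zero) sums become 0.
        value = (value << 1) | (1 if s > 0 else 0)
    return value


fp = feature_word(["robot", "intelligent", "robot", "factory"])
print(f"{fp:064b}")
```

Because the per-word contributions are summed position by position, the result does not depend on the order in which the words are visited, and documents that share most of their words tend to produce feature words that differ in only a few bits.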

Preferably, judging, according to the feature word and the document data, whether the current document is similar to the previous documents stored in the database comprises the following specific steps:

calculating the Hamming distance between the feature word of the current document and the feature word of a previous document; if the Hamming distance is greater than or equal to N, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;

calculating the word-count difference between the body content of the current document and the body content of the preliminary similar document; if the word-count difference is greater than M, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained.
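The two checks above can be sketched as follows. The function names are hypothetical; the defaults N = 3 and M = 500 are the values given later in the embodiment; and measuring the word count with `len()` on the body string (that is, counting characters) is an assumption that suits Chinese text but may need a tokenizer for other languages.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two feature words differ."""
    return bin(a ^ b).count("1")


def passes_feature_word_check(fp_a: int, fp_b: int, n: int = 3) -> bool:
    """Similar only if the Hamming distance is strictly less than N."""
    return hamming_distance(fp_a, fp_b) < n


def passes_length_check(body_a: str, body_b: str, m: int = 500) -> bool:
    """Similar only if the length difference does not exceed M."""
    return abs(len(body_a) - len(body_b)) <= m


print(hamming_distance(0b1011, 0b1000))             # two differing bits
print(passes_feature_word_check(0b0000, 0b0111))    # distance 3 is not below N = 3
print(passes_length_check("a" * 1000, "a" * 500))   # difference 500 is not above M
```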

Preferably, judging, according to the feature word and the document data, whether the current document is similar to the previous documents stored in the database further comprises:

extracting the keywords of the current document and the keywords of the second-degree similar document;

when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;

when the number of keywords of the current document and the number of keywords of the second-degree similar document are both greater than 3, if the number of identical keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;

calculating the data-value occupancy of the current document and the data-value occupancy of the third-degree similar document; if the data-value occupancy of the two documents is the same but the data differ, the current document and the third-degree similar document are dissimilar; otherwise they are similar.
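The keyword rule above can be sketched as a single predicate. The function name is hypothetical, and keyword extraction itself is assumed to have happened elsewhere; treating each keyword list as a set (so repeated keywords count once) is also an assumption.

```python
def passes_keyword_check(keywords_a, keywords_b) -> bool:
    """Apply the two-case keyword overlap rule to two keyword lists."""
    a, b = set(keywords_a), set(keywords_b)
    shared = len(a & b)
    if len(a) <= 3 or len(b) <= 3:
        # Few keywords in at least one document: need at least 2 in common.
        return shared >= 2
    # Both documents have more than 3 keywords: need at least 3 in common.
    return shared >= 3


print(passes_keyword_check(["robot", "ai"], ["robot", "ai", "factory", "chip"]))
print(passes_keyword_check(["a", "b", "c", "d"], ["a", "b", "x", "y"]))
```

The first call falls into the few-keyword case and passes with two shared keywords; the second falls into the many-keyword case and fails because only two keywords are shared.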

Preferably, the document data include an article title, an ID number, body content and a data source identifier, and storing the feature word and the document data of the current document in the database comprises the following specific steps:

combining the feature word and the ID number of the current document into the key of the current document;

storing the article title, ID number, body content, data source identifier and key of the current document in a Redis database.

Preferably, before the current document is compared for similarity with a previous document, the key of the previous document is extracted from the database, and the feature word of the previous document is obtained from the key.

In a second aspect, the present invention provides a real-time fast deduplication system for multi-source data documents, suitable for the real-time fast deduplication method for multi-source data documents of the first aspect, comprising:

a data processing unit, configured to receive a current document and filter the current document to obtain filtered document data;

a computing unit, configured to calculate a feature word of the document data by a locality-sensitive hashing algorithm;

a similarity judging unit, configured to judge, according to the feature word and the document data, whether the current document is similar to the previous documents stored in a database;

a deduplication access unit, configured to store the feature word and the document data of the current document in the database if the current document is dissimilar to the previous documents, and otherwise not to store them.

Preferably, the similarity judging unit is specifically configured to:

calculate the Hamming distance between the feature word of the current document and the feature word of a previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;

calculate the word-count difference between the body content of the current document and the body content of the preliminary similar document; if the word-count difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained;

extract the keywords of the current document and the keywords of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained; when the number of keywords of the current document and the number of keywords of the second-degree similar document are both greater than 3, if the number of identical keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained.

In a third aspect, the present invention provides a terminal comprising a processor and a memory connected to the processor, the memory being used to store a computer program, the computer program comprising program instructions, and the processor being configured to call the program instructions to execute the method described in the first aspect.

In the technical solution of the present invention, similarity is judged between the current document and the previously stored documents; a dissimilar current document is stored, and a similar current document is not stored, thereby deduplicating similar documents from different sources quickly and in real time and avoiding repeated storage of similar documents.

Brief description of the drawings

To illustrate the specific embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. In all the drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, elements or parts are not necessarily drawn to scale.

Fig. 1 is a flow diagram of the real-time fast deduplication method for multi-source data documents in this embodiment;

Fig. 2 is a structural diagram of the real-time fast deduplication system for multi-source data documents in this embodiment.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprises" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.

It should also be understood that the terms used in this description of the invention are for the purpose of describing specific embodiments only and are not intended to limit the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" used in the description of the invention and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

Embodiment one:

This embodiment provides a real-time fast deduplication method for multi-source data documents which, as shown in Fig. 1, comprises the following steps:

S1, receiving a current document and filtering the current document to obtain filtered document data;

S2, calculating a feature word of the document data by a locality-sensitive hashing algorithm;

S3, judging, according to the feature word and the document data, whether the current document is similar to the previous documents stored in a database;

S4, if dissimilar, storing the feature word and the document data of the current document in the database, and otherwise not storing them.
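Steps S1 to S4 can be read as one per-document routine. The sketch below is only a skeleton under stated assumptions: `process_document`, `clean`, `fingerprint` and `is_similar` are hypothetical names, the three callables are placeholders supplied by the caller, and a plain Python list stands in for the database.

```python
def process_document(raw, store, clean, fingerprint, is_similar) -> bool:
    """Run S1-S4 for one incoming document; return True if it was stored."""
    doc = clean(raw)                  # S1: filter the incoming document
    fp = fingerprint(doc)             # S2: compute its feature word
    for old_doc, old_fp in store:     # S3: compare with every stored document
        if is_similar(doc, fp, old_doc, old_fp):
            return False              # S4: similar, so do not store
    store.append((doc, fp))           # S4: dissimilar to all, so store
    return True


# Toy stand-ins, purely for illustration: the "feature word" is the text length.
store = []
results = [
    process_document(raw, store, str.strip, len, lambda d, f, od, of: f == of)
    for raw in ["  abc  ", "xyz", "abcd"]
]
print(results, len(store))
```

With these toy stand-ins the second document is rejected because it has the same length as the first, while the third is stored.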

In this embodiment, suppose a user enters a website from a social feed (circle of friends), reads an article, finds it good and reposts the article from that website. The back end receives the reposted current document and filters it, removing content such as the website header and footer, to obtain the filtered document data. The document data include the content related to the article, for example the article title, ID number, body content and data source identifier.

The feature word is then calculated from the filtered document data. In S2, calculating the feature word of the document data by the locality-sensitive hashing algorithm comprises the following specific steps:

S21, segmenting the body content of the document data to obtain a number of words;

S22, determining the weight of each word by word-frequency statistics;

S23, mapping each word to a hash value with a hash algorithm;

S24, weighting the hash value of each word according to its weight to obtain a weighted numeric string;

S25, summing the numeric strings of all words bit by bit to obtain a final numeric string;

S26, converting the final numeric string into a 64-bit feature word in binary (0/1) form.

In this embodiment, the body content is segmented and meaningless words (filler words such as "uh") are removed. The number of times each word occurs in the text is counted and divided by the total number of words in the full text to obtain the word frequency, which serves as the weight of the word. Each word is mapped to a hash value with a hash algorithm; for example, the word "robot" is mapped to 11001 (the example uses 5 bits; the actual value has 64 bits). The hash value of each word is then weighted according to its weight. If the weight of "robot" is 3, the weighted numeric string is 3 3 -3 -3 3 (a 1 in the hash becomes 1 x 3, and a 0 in the hash becomes -1 x 3). After the numeric string of each word has been calculated, the numeric strings of all words are summed bit by bit. For example, if the numeric string of "intelligent" is 6 -6 6 6 -6, the bitwise sum for the two words "intelligent" and "robot" is:

(3+6) (3-6) (-3+6) (-3+6) (3-6)->9 -3 3 3 -3

Summing bit by bit over all the words of the full text in the same way yields the final numeric string. The final numeric string is then converted into a 64-bit feature word in binary (0/1) form: each position greater than 0 is recorded as 1 and each position less than 0 is recorded as 0, so the example above yields 10110.
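The 5-bit worked example can be reproduced in a few lines of Python. The helper name `weighted_vector` is hypothetical, and the hash 10110 with weight 6 for "intelligent" is inferred from its weighted string 6 -6 6 6 -6 rather than stated directly in the text.

```python
def weighted_vector(hash_bits: str, weight: int):
    """Turn each 1 bit into +weight and each 0 bit into -weight."""
    return [weight if bit == "1" else -weight for bit in hash_bits]


robot = weighted_vector("11001", 3)
intelligent = weighted_vector("10110", 6)

# Sum the two weighted strings position by position.
summed = [r + i for r, i in zip(robot, intelligent)]

# Positions greater than 0 become 1; positions less than 0 become 0.
feature_bits = "".join("1" if s > 0 else "0" for s in summed)

print(robot)         # [3, 3, -3, -3, 3]
print(intelligent)   # [6, -6, 6, 6, -6]
print(summed)        # [9, -3, 3, 3, -3]
print(feature_bits)  # 10110
```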

After the feature word has been obtained, similarity is judged between the current document and the previous documents. In S3, judging, according to the feature word and the document data, whether the current document is similar to the previous documents stored in the database comprises the following specific steps:

S31, calculating the Hamming distance between the feature word of the current document and the feature word of a previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;

S32, calculating the word-count difference between the body content of the current document and the body content of the preliminary similar document; if the word-count difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained;

S33, extracting the keywords of the current document and the keywords of the second-degree similar document;

S34, calculating the data-value occupancy of the current document and the data-value occupancy of the third-degree similar document; if the data-value occupancy of the two documents is the same but the data differ, the current document and the third-degree similar document are dissimilar; otherwise they are similar.

In this embodiment, the similarity judgment proceeds through four steps in sequence. When a step determines that the documents are dissimilar, the subsequent steps need not be carried out; only when a step determines that they are similar is the next step carried out.
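The early-exit behaviour of the four-step judgment can be sketched with a list of predicates evaluated lazily. The names `is_similar`, `cheap_check` and `costly_check` are hypothetical, and the two toy checks only record that they were called; real checks would compare feature words, lengths, keywords and data values.

```python
from typing import Callable, Iterable

Check = Callable[[dict, dict], bool]


def is_similar(current: dict, previous: dict, checks: Iterable[Check]) -> bool:
    """Return True only if every check passes; stop at the first failure."""
    # all() over a generator evaluates checks one at a time and stops early.
    return all(check(current, previous) for check in checks)


calls = []


def cheap_check(current, previous):
    calls.append("cheap")
    return False  # pretend the feature words are too far apart


def costly_check(current, previous):
    calls.append("costly")
    return True


result = is_similar({}, {}, [cheap_check, costly_check])
print(result, calls)
```

Ordering the checks from cheapest to most expensive means that most dissimilar pairs are rejected by the feature word comparison alone, which is consistent with the step order described above.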

The first step judges by the feature word. The key of a previous document is extracted from the Redis database, and the feature word of the previous document is obtained from the key. The Hamming distance between the feature word of the current document and the feature word of the previous document is calculated. If the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar. If the Hamming distance is less than 3, the current document is similar to the previous document, and a preliminary similar document is obtained. Because a similarity judgment based only on the feature word obtained by locality-sensitive hashing has certain shortcomings, further judgments follow to make the result more accurate.

The second step judges by word count. The body content of the previous document is extracted from the Redis database and the word-count difference between the body contents of the two documents is compared. If the word-count difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained. Because a long document contains many words, it may contain a short document in its entirety, so two documents whose lengths differ greatly might still be considered similar by the first step; documents whose lengths differ by more than 500 words are therefore judged dissimilar.

The third step judges by keywords; adding the keyword dimension improves the accuracy of the judgment. The keywords of each document are extracted with the TextRank algorithm, and two cases are distinguished by the number of keywords extracted, few or many. If either document has 3 or fewer keywords and the two documents share fewer than 2 identical keywords, the two documents are dissimilar; otherwise they are similar, and a third-degree similar document is obtained. If both documents have more than 3 keywords and the two documents share fewer than 3 identical keywords, the two documents are dissimilar; otherwise they are similar, and a third-degree similar document is obtained.

The fourth step judges by data. News data contain many template articles whose content is highly repetitive but whose data differ; such documents should be considered dissimilar. If the data-value occupancy of two documents is the same but the data differ, the two documents are judged dissimilar. The data-value occupancy is the number of bytes occupied by the data. For example, stock-report articles use the same template every day, and only the data differ.
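One possible reading of the fourth step is sketched below. The text does not say how data values are located or how occupancy is counted, so everything here is an assumption: data values are taken to be decimal numbers found with a regular expression, occupancy is the total number of UTF-8 bytes those numbers occupy, and the function names are hypothetical.

```python
import re

# Assumption: a "data value" is an integer or decimal number in the body text.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def data_values(text: str):
    """Return the numeric tokens of a text in order of appearance."""
    return NUMBER.findall(text)


def data_occupancy(text: str) -> int:
    """Total number of bytes occupied by the numeric tokens."""
    return sum(len(value.encode("utf-8")) for value in data_values(text))


def passes_data_check(body_a: str, body_b: str) -> bool:
    """Dissimilar when occupancy matches but the values themselves differ."""
    same_occupancy = data_occupancy(body_a) == data_occupancy(body_b)
    different_data = data_values(body_a) != data_values(body_b)
    return not (same_occupancy and different_data)


monday = "The index closed at 3021.45, up 1.20% on the day."
tuesday = "The index closed at 2987.10, up 0.85% on the day."
print(data_occupancy(monday), data_occupancy(tuesday))
print(passes_data_check(monday, tuesday))
```

The two sample reports share a template and the same occupancy but carry different figures, so the check treats them as dissimilar and both would be kept.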

When the current document is dissimilar to the previous documents, the current document is stored; when it is similar, it is not stored. Storing the feature word and the document data of the current document in the database specifically proceeds as follows:

In this embodiment, a Redis database is used for data storage; Redis is well suited to caching massive amounts of data, and its read speed is very fast. The article title, ID number, body content, data source identifier and feature word of the document are stored in Redis. To guarantee uniqueness, the combination of the feature word and the ID is used as the Redis key. Later, when previous documents need to be extracted from Redis for the similarity calculation, only the key has to be extracted to obtain the feature word, which is several times faster than accessing the feature word stored as a Redis value.
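The key scheme can be sketched as follows. The text only says that the feature word and the ID are combined into the key, so the concrete layout used here (16 hexadecimal digits, a colon, then the ID) is an assumption, as are the function names; a plain dictionary stands in for Redis so that the sketch runs without a server.

```python
def make_key(feature_word: int, doc_id: str) -> str:
    """Combine the 64-bit feature word and the document ID into one key."""
    return f"{feature_word:016x}:{doc_id}"


def feature_word_from_key(key: str) -> int:
    """Recover the feature word from a key without reading the stored value."""
    return int(key.split(":", 1)[0], 16)


store = {}  # stand-in for the Redis database (key -> document fields)


def save(doc: dict, feature_word: int) -> str:
    key = make_key(feature_word, doc["id"])
    store[key] = doc
    return key


def stored_feature_words():
    """Collect feature words by iterating over keys only."""
    return [feature_word_from_key(key) for key in store]


key = save({"id": "42", "title": "Daily report", "body": "...", "source": "site-a"}, 0xABC)
print(key, stored_feature_words())
```

With a real Redis deployment the same idea would mean iterating over key names (for example with the SCAN command) instead of fetching every stored value, which is the saving the paragraph above describes.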

In summary, the method of this embodiment judges similarity between the current document and the previously stored documents, stores a dissimilar current document and does not store a similar one, thereby deduplicating similar documents from different sources quickly and in real time and avoiding repeated storage of similar documents.

Embodiment two:

This embodiment provides a real-time fast deduplication system for multi-source data documents, suitable for the real-time fast deduplication method for multi-source data documents described in embodiment one, which, as shown in Fig. 2, comprises:

The feature word is then calculated from the filtered document data, wherein the computing unit is specifically configured to:

determine the weight of each word by word-frequency statistics;

map each word to a hash value with a hash algorithm;

(3+6) (3-6) (-3+6) (-3+6) (3-6)->9 -3 3 3 -3

After the feature word has been obtained, similarity is judged between the current document and the previous documents, wherein the similarity judging unit is specifically configured to:

calculate the word-count difference between the body content of the current document and the body content of the preliminary similar document; if the word-count difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained;

extract the keywords of the current document and the keywords of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained; when the number of keywords of the current document and the number of keywords of the second-degree similar document are both greater than 3, if the number of identical keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained.

When the current document is dissimilar to the previous documents, the current document is stored; when it is similar, it is not stored. Storing the feature word and the document data of the current document in the database specifically proceeds as follows:

In this embodiment, a Redis database is used for data storage, and the deduplication access unit writes data to Redis and extracts data from Redis; Redis is well suited to caching massive amounts of data, and its read speed is very fast. The article title, ID number, body content, data source identifier and feature word of the document are stored in Redis. To guarantee uniqueness, the combination of the feature word and the ID is used as the Redis key. Later, when previous documents need to be extracted from Redis for the similarity calculation, only the key has to be extracted to obtain the feature word, which is several times faster than accessing the feature word stored as a Redis value.

In summary, the system of this embodiment judges similarity between the current document and the previously stored documents, stores a dissimilar current document and does not store a similar one, thereby deduplicating similar documents from different sources quickly and in real time and avoiding repeated storage of similar documents.

Embodiment three:

This embodiment provides a terminal comprising a processor and a memory connected to the processor, the memory being used to store a computer program, the computer program comprising program instructions, and the processor being configured to call the program instructions to execute the method described in embodiment one.

It should be understood that, in this embodiment, the processor may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.

The memory may include read-only memory and random access memory, and provides instructions and data to the processor. Part of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.

The terminal of this embodiment executes the method described in embodiment one: it judges similarity between the current document and the previously stored documents, stores a dissimilar current document and does not store a similar one, thereby deduplicating similar documents from different sources quickly and in real time and avoiding repeated storage of similar documents.

Those of ordinary skill in the art will appreciate that the system units and method steps described in connection with the examples disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.

In the several embodiments provided in this application, it should be understood that the disclosed method and system may be implemented in other ways. For example, the division into the above units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. The above units may or may not be physically separate, and a component shown as a unit may or may not be a physical unit; that is, it may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the claims and the description of the present invention.

Claims

1. A real-time fast deduplication method for multi-source data documents, characterized by comprising the following steps:

2. The real-time fast deduplication method for multi-source data documents according to claim 1, characterized in that calculating the feature word of the document data by the locality-sensitive hashing algorithm comprises the following specific steps:

determining the weight of each word by word-frequency statistics;

mapping each word to a hash value with a hash algorithm;

3. The real-time fast deduplication method for multi-source data documents according to claim 1, characterized in that judging, according to the feature word and the document data, whether the current document is similar to the previous documents stored in the database comprises the following specific steps:

calculating the Hamming distance between the feature word of the current document and the feature word of a previous document; if the Hamming distance is greater than or equal to N, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;

calculating the word-count difference between the body content of the current document and the body content of the preliminary similar document; if the word-count difference is greater than M, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained.

4. The real-time fast deduplication method for multi-source data documents according to claim 3, characterized in that judging, according to the feature word and the document data, whether the current document is similar to the previous documents stored in the database further comprises:

when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;

when the number of keywords of the current document and the number of keywords of the second-degree similar document are both greater than 3, if the number of identical keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained.

5. The real-time fast deduplication method for multi-source data documents according to claim 4, characterized in that judging, according to the feature word and the document data, whether the current document is similar to the previous documents stored in the database further comprises:

calculating the data-value occupancy of the current document and the data-value occupancy of the third-degree similar document; if the data-value occupancy of the two documents is the same but the data differ, the current document and the third-degree similar document are dissimilar; otherwise they are similar.

6. The real-time fast deduplication method for multi-source data documents according to claim 1, characterized in that the document data include an article title, an ID number, body content and a data source identifier, and storing the feature word and the document data of the current document in the database comprises the following specific steps:

storing the article title, ID number, body content, data source identifier and key of the current document in a Redis database.

7. The real-time fast deduplication method for multi-source data documents according to claim 6, characterized in that, before the current document is compared for similarity with a previous document, the key of the previous document is extracted from the database, and the feature word of the previous document is obtained from the key.

8. A real-time fast deduplication system for multi-source data documents, characterized by comprising:

a data processing unit, configured to receive a current document and filter the current document to obtain filtered document data;

a similarity judging unit, configured to judge, according to the feature word and the document data, whether the current document is similar to the previous documents stored in a database;

a deduplication access unit, configured to store the feature word and the document data of the current document in the database if the current document is dissimilar to the previous documents, and otherwise not to store them.

9. The real-time fast deduplication system for multi-source data documents according to claim 8, characterized in that the similarity judging unit is specifically configured to:

calculate the Hamming distance between the feature word of the current document and the feature word of a previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;

extract the keywords of the current document and the keywords of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained; when the number of keywords of the current document and the number of keywords of the second-degree similar document are both greater than 3, if the number of identical keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained.

10. A terminal comprising a processor and a memory connected to the processor, the memory being used to store a computer program, the computer program comprising program instructions, characterized in that the processor is configured to call the program instructions to execute the method according to any one of claims 1 to 7.