CN109635084A - Real-time fast deduplication method and system for multi-source data documents - Google Patents

Real-time fast deduplication method and system for multi-source data documents

Info

Publication number
CN109635084A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811456999.5A
Other languages
Chinese (zh)
Other versions
CN109635084B (en
Inventor
柴志伟
丑晓慧
许冠宇
宋乐安
许涵洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shenqin Information Technology Co Ltd
Ningbo Shenqin Information Technology Co Ltd
Original Assignee
Shanghai Shenqin Information Technology Co Ltd
Ningbo Shenqin Information Technology Co Ltd
Application filed by Shanghai Shenqin Information Technology Co Ltd and Ningbo Shenqin Information Technology Co Ltd
Priority to CN201811456999.5A
Publication of CN109635084A
Application granted
Publication of CN109635084B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of information processing, and in particular relates to a real-time fast deduplication method and system for multi-source data documents. The method comprises the following steps: receiving a current document and filtering it to obtain filtered document data; calculating a feature word of the document data by a locality-sensitive hashing algorithm; judging, according to the feature word and the document data, whether the current document is similar to a previous document stored in a database; and, if they are dissimilar, storing the feature word and the document data of the current document in the database, and otherwise not storing them. The invention can perform real-time fast deduplication on similar document data from different sources and avoids storing similar documents repeatedly.

Description

Real-time fast deduplication method and system for multi-source data documents
Technical field
The invention belongs to the technical field of information processing, and in particular relates to a real-time fast deduplication method and system for multi-source data documents.
Background
Document data from different websites often has identical or highly repetitive article content because of reposting or borrowing. In practical applications, such similar articles need to be screened and filtered. The previous approach of manual editing consumes a large amount of labor, and for data that must be pushed in real time, such as news, manual deduplication is far too slow. General deduplication algorithms have high memory usage during online computation and easily cause memory overflow when the data volume is large; offline computation solves the memory overflow problem but cannot guarantee timely deduplication.
Summary of the invention
In view of the defects in the prior art, the present invention provides a real-time fast deduplication method and system for multi-source data documents, which can perform real-time fast deduplication on similar document data from different sources and avoid storing similar documents repeatedly.
In a first aspect, the present invention provides a real-time fast deduplication method for multi-source data documents, comprising the following steps:
receiving a current document and filtering the current document to obtain filtered document data;
calculating a feature word of the document data by a locality-sensitive hashing algorithm;
judging, according to the feature word and the document data, whether the current document is similar to a previous document stored in a database;
if they are dissimilar, storing the feature word and the document data of the current document in the database; otherwise, not storing them.
Preferably, calculating the feature word of the document data by the locality-sensitive hashing algorithm specifically comprises:
segmenting the body content of the document data to obtain several words;
determining the weight of each word by a word-frequency statistics method;
mapping each word to a hash value with a hash algorithm;
weighting the hash value of each word according to its weight to obtain a weighted numeric string;
summing the numeric strings of all words position by position to obtain a final numeric string;
converting the final numeric string into a 64-bit feature word in 0/1 form.
Preferably, judging, according to the feature word and the document data, whether the current document is similar to a previous document stored in the database specifically comprises:
calculating the Hamming distance between the feature word of the current document and the feature word of the previous document; if the Hamming distance is greater than or equal to N, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;
calculating the word-count difference between the body content of the current document and the body content of the preliminary similar document; if the word-count difference is greater than M, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained.
Preferably, judging, according to the feature word and the document data, whether the current document is similar to a previous document stored in the database further comprises:
extracting the keywords of the current document and the keywords of the second-degree similar document;
when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3: if the number of shared keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;
when the numbers of keywords of both the current document and the second-degree similar document are greater than 3: if the number of shared keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained.
Preferably, judging, according to the feature word and the document data, whether the current document is similar to a previous document stored in the database further comprises:
calculating the data-value placeholder size of the current document and the data-value placeholder size of the third-degree similar document; if the two documents have the same data-value placeholder size but different data, the current document and the third-degree similar document are dissimilar; otherwise they are similar.
Preferably, the document data includes an article title, an ID number, body content and a data source identifier, and storing the feature word and the document data of the current document in the database specifically comprises:
combining the feature word and the ID number of the current document into the key value of the current document;
storing the article title, ID number, body content, data source identifier and key value of the current document in a redis database.
Preferably, before the current document is compared for similarity with a previous document, the key value of the previous document is extracted from the database, and the feature word of the previous document is obtained from the key value.
In a second aspect, the present invention provides a real-time fast deduplication system for multi-source data documents, suitable for the real-time fast deduplication method for multi-source data documents of the first aspect, comprising:
a data processing unit, configured to receive a current document and filter the current document to obtain filtered document data;
a computing unit, configured to calculate a feature word of the document data by a locality-sensitive hashing algorithm;
a similarity judging unit, configured to judge, according to the feature word and the document data, whether the current document is similar to a previous document stored in a database;
a deduplication access unit, configured to store the feature word and the document data of the current document in the database if the current document is dissimilar to the previous document, and otherwise not to store them.
Preferably, the similarity judging unit is specifically configured to:
calculate the Hamming distance between the feature word of the current document and the feature word of the previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;
calculate the word-count difference between the body content of the current document and the body content of the preliminary similar document; if the word-count difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained;
extract the keywords of the current document and the keywords of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of shared keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained; when the numbers of keywords of both the current document and the second-degree similar document are greater than 3, if the number of shared keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;
calculate the data-value placeholder size of the current document and the data-value placeholder size of the third-degree similar document; if the two documents have the same data-value placeholder size but different data, the current document and the third-degree similar document are dissimilar; otherwise they are similar.
In a third aspect, the present invention provides a terminal comprising a processor and a memory connected to the processor, wherein the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method described in the first aspect.
The technical solution of the present invention judges the similarity between the current document and previously stored documents, stores a dissimilar current document, and does not store a similar current document, thereby achieving real-time fast deduplication of similar documents from different sources and avoiding repeated storage of similar documents.
Description of the drawings
In order to explain the specific embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. In all the drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, elements or parts are not necessarily drawn to scale.
Fig. 1 is a flow diagram of the real-time fast deduplication method for multi-source data documents in this embodiment;
Fig. 2 is a structural schematic diagram of the real-time fast deduplication system for multi-source data documents in this embodiment.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be understood that, when used in this specification and the appended claims, the terms "include" and "comprise" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terms used in this description are only for the purpose of describing specific embodiments and are not intended to limit the present invention. As used in this description and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this description and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Embodiment one:
This embodiment provides a real-time fast deduplication method for multi-source data documents which, as shown in Fig. 1, comprises the following steps:
S1: receive a current document and filter the current document to obtain filtered document data;
S2: calculate a feature word of the document data by a locality-sensitive hashing algorithm;
S3: judge, according to the feature word and the document data, whether the current document is similar to a previous document stored in the database;
S4: if they are dissimilar, store the feature word and the document data of the current document in the database; otherwise, do not store them.
In this embodiment, suppose a user opens an article on some website from a circle-of-friends feed, finds the article good, and reposts it. The back end receives the reposted current document and filters it, removing content such as the website header and footer, to obtain filtered document data. The document data includes the content relevant to the article, for example: article title, ID number, body content, data source identifier, and so on.
The feature word is then calculated from the filtered document data. In S2, calculating the feature word of the document data by the locality-sensitive hashing algorithm specifically comprises:
S21: segment the body content of the document data to obtain several words;
S22: determine the weight of each word by a word-frequency statistics method;
S23: map each word to a hash value with a hash algorithm;
S24: weight the hash value of each word according to its weight to obtain a weighted numeric string;
S25: sum the numeric strings of all words position by position to obtain a final numeric string;
S26: convert the final numeric string into a 64-bit feature word in 0/1 form.
In this embodiment, the body content is segmented and some meaningless words are removed, for example filler words such as "uh". The number of times each word occurs in the text is counted and divided by the total number of words in the full text to obtain the word frequency, which serves as the weight of the word. Each word is mapped to a hash value with a hash algorithm; for example, the word "robot" is mapped to 11001 (the example uses 5 bits; the actual value has 64 bits). The hash value of each word is then weighted according to its weight. If the weight of "robot" is 3, the weighted numeric string is 3 3 -3 -3 3 (a 1 in the hash becomes 1 x 3, a 0 in the hash becomes -1 x 3). After the numeric string of each word is calculated, the numeric strings are summed position by position. For example, if the numeric string of "intelligent" is 6 -6 6 6 -6, the position-wise sum for the two words "intelligent robot" is:
(3+6) (3-6) (-3+6) (-3+6) (3-6) -> 9 -3 3 3 -3
The words of the full text are summed position by position in the same way to obtain the final numeric string. The final numeric string is then converted into a 64-bit feature word in 0/1 form (the rule: a position greater than 0 is recorded as 1 and a position less than 0 is recorded as 0; the result for the example is 10110).
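The steps S22 to S26 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the use of MD5 as the word hash, the function names, and the treatment of a position that sums to exactly 0 (recorded here as 0) are all assumptions, and word segmentation and stop-word removal are assumed to have happened already.

```python
import hashlib
from collections import Counter


def word_hash(word: str, bits: int = 64) -> int:
    """S23: map a word to a `bits`-bit integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def weighted_vector(hash_value: int, weight: float, bits: int) -> list:
    """S24: +weight where the hash bit is 1, -weight where it is 0 (MSB first)."""
    return [weight if (hash_value >> (bits - 1 - i)) & 1 else -weight
            for i in range(bits)]


def to_feature_word(sums: list) -> int:
    """S26: record 1 for a sum greater than 0, otherwise 0 (assumption for 0)."""
    value = 0
    for s in sums:
        value = (value << 1) | (1 if s > 0 else 0)
    return value


def feature_word(words: list, bits: int = 64) -> int:
    """S22 to S26 for an already segmented list of words."""
    counts = Counter(words)
    total = len(words)
    sums = [0.0] * bits
    for word, count in counts.items():
        weight = count / total  # S22: term frequency as the weight
        vector = weighted_vector(word_hash(word, bits), weight, bits)
        sums = [s + v for s, v in zip(sums, vector)]  # S25: position-wise sum
    return to_feature_word(sums)


# The 5-bit worked example from the text: "robot" = 11001 with weight 3,
# "intelligent" with weighted string 6 -6 6 6 -6 (hash 10110, weight 6).
robot = weighted_vector(0b11001, 3, 5)
intelligent = weighted_vector(0b10110, 6, 5)
example_sum = [r + i for r, i in zip(robot, intelligent)]
example_feature = to_feature_word(example_sum)
```

With the numbers from the text, `example_sum` is `[9, -3, 3, 3, -3]` and `example_feature` is `0b10110`, matching the result 10110 given above.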
After the feature word is obtained, the similarity between the current document and previous documents is judged. In S3, judging, according to the feature word and the document data, whether the current document is similar to a previous document stored in the database specifically comprises:
S31: calculate the Hamming distance between the feature word of the current document and the feature word of the previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;
S32: calculate the word-count difference between the body content of the current document and the body content of the preliminary similar document; if the word-count difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained;
S33: extract the keywords of the current document and the keywords of the second-degree similar document;
when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3: if the number of shared keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;
when the numbers of keywords of both the current document and the second-degree similar document are greater than 3: if the number of shared keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;
S34: calculate the data-value placeholder size of the current document and the data-value placeholder size of the third-degree similar document; if the two documents have the same data-value placeholder size but different data, the current document and the third-degree similar document are dissimilar; otherwise they are similar.
In this embodiment, the similarity judgment proceeds through four steps in sequence. When a step determines that the documents are dissimilar, the subsequent steps are not carried out; only when a step determines that the documents are similar is the next step carried out.
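The early-exit behaviour of the four-step judgment might be sketched as below. The `Check` type and the `is_similar` name are hypothetical; each check is assumed to be a function that returns True when its step considers the two documents similar.

```python
from typing import Callable, Iterable

# A check takes (current document, candidate document) and returns True if similar.
Check = Callable[[dict, dict], bool]


def is_similar(current: dict, candidate: dict, checks: Iterable[Check]) -> bool:
    """Run the checks in order and stop at the first one that reports dissimilar."""
    for check in checks:
        if not check(current, candidate):
            return False  # later, more expensive steps are skipped
    return True
```

Ordering the checks from the cheapest (comparing feature words) to the most expensive (keyword extraction, data comparison) is what lets most candidates be rejected early.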
The first step judges by feature word. The key value of the previous document is extracted from the redis database, and the feature word of the previous document is obtained from the key value. The Hamming distance between the feature word of the current document and the feature word of the previous document is calculated. If the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar. If the Hamming distance is less than 3, the current document is similar to the previous document, and a preliminary similar document is obtained. Because a similarity judgment based only on the feature word produced by the locality-sensitive hashing algorithm has certain defects, further judgments follow to make the result more accurate.
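A minimal sketch of this first step, assuming feature words are held as Python integers and previous documents are supplied as a mapping from a hypothetical document ID to its feature word:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two feature words differ."""
    return bin(a ^ b).count("1")


def preliminary_similar(current: int, previous: dict, threshold: int = 3) -> list:
    """Return IDs of previous documents whose Hamming distance is below the threshold."""
    return [doc_id for doc_id, feature in previous.items()
            if hamming_distance(current, feature) < threshold]
```

The threshold of 3 is the value given in this embodiment; the claims leave it as a parameter N.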
The second step judges by word count. The body content of the previous document is extracted from the redis database and the word counts of the body content of the two documents are compared. If the word-count difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained. Because a long document contains many words, it may contain a short document in its entirety, in which case two documents of very different length could still be considered similar by the first step. Documents whose lengths differ by more than 500 words are therefore judged dissimilar.
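The length check could look roughly like this. Counting characters with `len` is an assumption (it approximates a word count for Chinese text); a different tokenisation could be substituted.

```python
def passes_length_check(current_body: str, candidate_body: str,
                        max_diff: int = 500) -> bool:
    """True (still similar) unless the body lengths differ by more than max_diff."""
    return abs(len(current_body) - len(candidate_body)) <= max_diff
```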
The third step judges by keyword; adding the keyword dimension improves the accuracy of the judgment. The keywords of each document are extracted with the textrank algorithm, and two cases are distinguished according to the number of keywords extracted. If either document has 3 or fewer keywords: when the two documents share fewer than 2 keywords they are dissimilar; otherwise they are similar, and a third-degree similar document is obtained. If both documents have more than 3 keywords: when the two documents share fewer than 3 keywords they are dissimilar; otherwise they are similar, and a third-degree similar document is obtained.
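The two keyword cases can be expressed compactly. In this sketch the keywords are assumed to have been extracted already (for example by a TextRank implementation, which is not shown) and are passed in as sets; the function name is hypothetical.

```python
def passes_keyword_check(current_keywords: set, candidate_keywords: set) -> bool:
    """Apply the shared-keyword rule: need 2 shared if either set is small, else 3."""
    shared = len(current_keywords & candidate_keywords)
    if len(current_keywords) <= 3 or len(candidate_keywords) <= 3:
        return shared >= 2
    return shared >= 3
```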
The fourth step judges by data. News data contains many template articles whose content is highly repetitive but whose data differ; such documents should be regarded as dissimilar. If the data-value placeholder sizes of two documents are the same but the data differ, the two documents are judged dissimilar. The data-value placeholder size is the number of digits the data occupies. Stock report articles are an example: the template is the same every day and only the data differ.
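One possible reading of this step is sketched below. The text does not say how data values are located or measured, so the regular expression, the function names, and the definition of placeholder size as the total number of characters occupied by the numeric values are all assumptions.

```python
import re

# Assumed definition of a "data value": an integer or decimal number in the text.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def data_values(text: str) -> list:
    """Extract the numeric values from the text, in order of appearance."""
    return NUMBER.findall(text)


def placeholder_size(values: list) -> int:
    """Total number of characters occupied by the extracted values."""
    return sum(len(value) for value in values)


def passes_data_check(current_body: str, candidate_body: str) -> bool:
    """Dissimilar (False) when the placeholder sizes match but the values differ."""
    current, candidate = data_values(current_body), data_values(candidate_body)
    same_size = placeholder_size(current) == placeholder_size(candidate)
    return not (same_size and current != candidate)
```

Under this reading, two daily reports such as "Index closed at 3021.5, up 1.2%" and "Index closed at 2987.4, up 0.8%" occupy the same number of digit positions but carry different values, so they are kept as distinct documents.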
When the current document is dissimilar to the previous documents, the current document is stored; when it is similar, it is not stored. Storing the feature word and the document data of the current document in the database specifically comprises:
combining the feature word and the ID number of the current document into the key value of the current document;
storing the article title, ID number, body content, data source identifier and key value of the current document in the redis database.
In this embodiment, a redis database is used for data storage; redis is suited to caching large amounts of data and reads are very fast. The article title, ID number, body content, data source identifier and feature word of the document are stored in redis. To guarantee uniqueness, the combination of the feature word and the ID is used as the redis key. Later, when previous documents need to be extracted from redis for a similarity calculation, only the key needs to be extracted to obtain the feature word, which is several times faster than accessing the feature word stored as a redis value.
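The key layout might be sketched as follows. The text only says the key combines the feature word and the ID, so the `<16 hex digits>:<id>` format is an assumption, and a plain dict stands in for the redis keyspace so the example runs without a server.

```python
def make_key(feature_word: int, doc_id: str) -> str:
    """Combine the 64-bit feature word (as 16 hex digits) and the ID number."""
    return f"{feature_word:016x}:{doc_id}"


def feature_word_from_key(key: str) -> int:
    """Recover the feature word from the key alone, without reading the value."""
    return int(key.split(":", 1)[0], 16)


store = {}  # stand-in for redis: key -> document fields


def save_document(feature_word: int, document: dict) -> str:
    """Store title, ID, body and source under the combined key and return the key."""
    key = make_key(feature_word, document["id"])
    store[key] = document
    return key


def stored_feature_words() -> list:
    """Feature words of all previous documents, read from the keys only."""
    return [feature_word_from_key(key) for key in store]
```

Because the feature word is recoverable from the key, the first similarity step only needs to enumerate keys; document bodies are fetched later, and only for candidates that pass the Hamming-distance check.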
In summary, the method of this embodiment judges the similarity between the current document and previously stored documents, stores a dissimilar current document, and does not store a similar current document, thereby achieving real-time fast deduplication of similar documents from different sources and avoiding repeated storage of similar documents.
Embodiment two:
This embodiment provides a real-time fast deduplication system for multi-source data documents, suitable for the real-time fast deduplication method for multi-source data documents described in embodiment one. As shown in Fig. 2, the system comprises:
a data processing unit, configured to receive a current document and filter the current document to obtain filtered document data;
a computing unit, configured to calculate a feature word of the document data by a locality-sensitive hashing algorithm;
a similarity judging unit, configured to judge, according to the feature word and the document data, whether the current document is similar to a previous document stored in the database;
a deduplication access unit, configured to store the feature word and the document data of the current document in the database if the current document is dissimilar to the previous document, and otherwise not to store them.
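The way the four units cooperate might be sketched as below. The class and field names are hypothetical, the units are modelled as plain callables, and an in-memory list stands in for the database behind the deduplication access unit.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DedupSystem:
    filter_document: Callable[[dict], dict]             # data processing unit
    compute_feature_word: Callable[[dict], int]         # computing unit
    is_similar: Callable[[int, dict, int, dict], bool]  # similarity judging unit
    stored: list = field(default_factory=list)          # database stand-in

    def handle(self, raw_document: dict) -> bool:
        """Process one incoming document; return True if it was stored."""
        document = self.filter_document(raw_document)
        feature = self.compute_feature_word(document)
        for old_feature, old_document in self.stored:
            if self.is_similar(feature, document, old_feature, old_document):
                return False  # similar to a previous document: do not store
        # deduplication access unit: store only dissimilar documents
        self.stored.append((feature, document))
        return True
```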
In this embodiment, suppose a user opens an article on some website from a circle-of-friends feed, finds the article good, and reposts it. The back end receives the reposted current document and filters it, removing content such as the website header and footer, to obtain filtered document data. The document data includes the content relevant to the article, for example: article title, ID number, body content, data source identifier, and so on.
The feature word is then calculated from the filtered document data. The computing unit is specifically configured to:
segment the body content of the document data to obtain several words;
determine the weight of each word by a word-frequency statistics method;
map each word to a hash value with a hash algorithm;
weight the hash value of each word according to its weight to obtain a weighted numeric string;
sum the numeric strings of all words position by position to obtain a final numeric string;
convert the final numeric string into a 64-bit feature word in 0/1 form.
In this embodiment, the body content is segmented and some meaningless words are removed, for example filler words such as "uh". The number of times each word occurs in the text is counted and divided by the total number of words in the full text to obtain the word frequency, which serves as the weight of the word. Each word is mapped to a hash value with a hash algorithm; for example, the word "robot" is mapped to 11001 (the example uses 5 bits; the actual value has 64 bits). The hash value of each word is then weighted according to its weight. If the weight of "robot" is 3, the weighted numeric string is 3 3 -3 -3 3 (a 1 in the hash becomes 1 x 3, a 0 in the hash becomes -1 x 3). After the numeric string of each word is calculated, the numeric strings are summed position by position. For example, if the numeric string of "intelligent" is 6 -6 6 6 -6, the position-wise sum for the two words "intelligent robot" is:
(3+6) (3-6) (-3+6) (-3+6) (3-6) -> 9 -3 3 3 -3
The words of the full text are summed position by position in the same way to obtain the final numeric string. The final numeric string is then converted into a 64-bit feature word in 0/1 form (the rule: a position greater than 0 is recorded as 1 and a position less than 0 is recorded as 0; the result for the example is 10110).
After the feature word is obtained, the similarity between the current document and previous documents is judged. The similarity judging unit is specifically configured to:
calculate the Hamming distance between the feature word of the current document and the feature word of the previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;
calculate the word-count difference between the body content of the current document and the body content of the preliminary similar document; if the word-count difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained;
extract the keywords of the current document and the keywords of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of shared keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained; when the numbers of keywords of both the current document and the second-degree similar document are greater than 3, if the number of shared keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;
calculate the data-value placeholder size of the current document and the data-value placeholder size of the third-degree similar document; if the two documents have the same data-value placeholder size but different data, the current document and the third-degree similar document are dissimilar; otherwise they are similar.
In the present embodiment, four steps are passed sequentially through to the judgement of similarity and are carried out, when previous step determine it is dissimilar, The judgement for just no longer needing to carry out subsequent step then carries out the judgement of subsequent step when previous step is determined as similar again.
The first step judged by tagged word, the key value of document before extracting in redis database, according to key The tagged word of document before value obtains.The Hamming distances of the tagged word and the tagged word of document before of current document are calculated, if Hamming distances are more than or equal to 3, then current document and document before are dissimilar.If Hamming distances, less than 3, current document is therewith Preceding document is similar, obtains preliminary similar document.Because of the similarity carried out by the tagged word that local sensitivity hash algorithm obtains Judgement, have the defects that it is certain, therefore in order to judge more acurrate, subsequent further progress judgement.
Second step is judged that the body matter of document, compares two before extracting in redis database by number of words The number of words of the body matter of document is poor, if number of words difference is greater than 500, current document and preliminary similar document are dissimilar, otherwise It is similar, and obtain two degree of similar documents.Since the word number of long length document is more, it is possible to include short articles document, two at this time Length, which differs greatly, may also be considered similar, therefore the document of 500 words or more is differed to length length, then is judged as not phase Seemingly.
Third step is judged by keyword, is increased this dimension of keyword to be judged, is improved the accurate of judgement Property.The keyword that document is extracted using textrank algorithm, is divided into that keyword is less and keyword according to keyword extraction quantity More two kinds of situations: the keyword of one of document is less than or equal to 3, if the identical keyword number of the two documents is less than 2, then two documents are dissimilar, otherwise similar, and obtain three degree of similar documents;The keyword number of two documents is all larger than 3, if this The identical keyword number of two documents is less than 3, then two documents are dissimilar, otherwise similar, and obtains three degree of similar documents.
In the fourth step, the judgment is made by data. News data contains many templated articles whose content is highly repetitive but whose data differ; such documents are considered dissimilar. If the data-value occupancy of two documents is the same but the data differ, the two documents are judged dissimilar. The data occupancy is the number of bytes occupied by the data. For example, stock report articles use the same template every day, and only the data differ.
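The source does not say how the data values are located or how their byte size is measured, so the following is only one possible reading. It assumes that "data" means the numeric tokens found by a regular expression and that the occupancy is their total UTF-8 byte length; all names are hypothetical.

```python
import re
from typing import List

# Assumption: "data values" are integer or decimal numbers in the body text.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_data_values(body: str) -> List[str]:
    """Return the numeric tokens of a document body, in order of appearance."""
    return _NUMBER.findall(body)


def data_occupancy(values: List[str]) -> int:
    """Total number of bytes occupied by the data values."""
    return sum(len(value.encode("utf-8")) for value in values)


def is_same_template_different_data(current_body: str, candidate_body: str) -> bool:
    """Fourth-step check: True means the two documents are judged dissimilar."""
    current = extract_data_values(current_body)
    candidate = extract_data_values(candidate_body)
    return data_occupancy(current) == data_occupancy(candidate) and current != candidate
```

Two daily stock reports built from one template would typically have the same occupancy but different values, so this check keeps both.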
When the current document is similar to a previous document, the current document is not stored; when it is dissimilar, it is stored. Storing the tagged word and the document data of the current document to the database specifically comprises:
combining the tagged word and the ID number of the current document into the key value of the current document;
storing the article title, ID number, body content, data source identifier and key value of the current document in the Redis database.
In this embodiment, a Redis database is used for data storage, and the deduplication access unit writes data to Redis and reads data from Redis. Redis is well suited to caching large amounts of data and has a very fast read speed. The article title, ID number, body content, data source identifier and tagged word of each document are stored in Redis. To guarantee uniqueness, the tagged word and the ID number are combined to form the Redis key. When a previous document later needs to be extracted from Redis for a similarity calculation, the tagged word can be obtained by extracting the key alone, which is several times faster than reading the tagged word from the Redis value.
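The key layout can be sketched as follows. The source does not give the exact key format, so the `tagged_word:id` layout with a colon separator is an assumption, and a plain dictionary stands in for the Redis database so that the sketch runs without a server. All names are hypothetical.

```python
from typing import Dict


def make_key(tagged_word: str, doc_id: str) -> str:
    """Combine the tagged word and the ID number into one key (assumed layout)."""
    return f"{tagged_word}:{doc_id}"


def tagged_word_from_key(key: str) -> str:
    """Recover the tagged word from a key without reading the stored value."""
    return key.split(":", 1)[0]


def save_document(store: Dict[str, dict], tagged_word: str, doc_id: str,
                  title: str, body: str, source: str) -> str:
    """Store the document fields under the combined key; `store` mimics Redis."""
    key = make_key(tagged_word, doc_id)
    store[key] = {"title": title, "id": doc_id, "body": body, "source": source}
    return key
```

With this layout, the first similarity step only needs to enumerate keys; the body content is read only for candidates that pass the Hamming-distance check.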
In conclusion the system of the present embodiment, the identification of similarity is carried out to current document and stored document before Judgement stores dissimilar current document, similar current document is not stored then, to realize to separate sources Similar document real-time quick duplicate removal processing, avoid the repetition of similar document from storing.
Embodiment three:
This embodiment provides a terminal including a processor and a memory connected to the processor. The memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method described in embodiment one.
It should be understood that, in this embodiment, the processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
The memory may include read-only memory and random access memory, and provides instructions and data to the processor. A part of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
The terminal of this embodiment executes the method described in embodiment one: it judges the similarity between the current document and the previously stored documents, stores the current document if it is dissimilar, and does not store it if it is similar. This achieves real-time, fast deduplication of similar documents from different sources and avoids storing similar documents repeatedly.
Those of ordinary skill in the art will appreciate that the system units and method steps described in connection with the examples disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed method and system may be implemented in other ways. For example, the division into the above units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. The above units may or may not be physically separate, and a component shown as a unit may or may not be a physical unit; that is, it may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and all of them shall be covered by the claims and the description of the present invention.

Claims (10)

1. A real-time fast deduplication method for multi-source data documents, characterized by comprising the following steps:
receiving a current document and filtering the current document to obtain filtered document data;
calculating a tagged word of the document data by a locality-sensitive hashing algorithm;
judging, according to the tagged word and the document data, whether the current document is similar to a previous document stored in a database;
if they are dissimilar, storing the tagged word and the document data of the current document to the database; otherwise, not storing them.
2. The real-time fast deduplication method for multi-source data documents according to claim 1, characterized in that calculating the tagged word of the document data by the locality-sensitive hashing algorithm specifically comprises:
segmenting the body content in the document data to obtain several words;
computing the weight of each word by a word-frequency statistics method;
mapping each word to a hash value with a hash algorithm;
weighting the hash value of each word according to its weight to obtain a weighted numeric string;
summing the numeric strings of all words bit by bit to obtain a final numeric string;
converting the final numeric string into a 64-bit tagged word in 0/1 form.
3. The real-time fast deduplication method for multi-source data documents according to claim 1, characterized in that judging, according to the tagged word and the document data, whether the current document is similar to a previous document stored in the database specifically comprises:
calculating the Hamming distance between the tagged word of the current document and the tagged word of the previous document; if the Hamming distance is greater than or equal to N, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;
calculating the difference between the word count of the body content of the current document and the word count of the body content of the preliminary similar document; if the difference is greater than M, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained.
4. The real-time fast deduplication method for multi-source data documents according to claim 3, characterized in that judging, according to the tagged word and the document data, whether the current document is similar to a previous document stored in the database further comprises:
extracting the keywords of the current document and the keywords of the second-degree similar document;
when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of shared keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;
when the number of keywords of the current document and the number of keywords of the second-degree similar document are both greater than 3, if the number of shared keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained.
5. The real-time fast deduplication method for multi-source data documents according to claim 4, characterized in that judging, according to the tagged word and the document data, whether the current document is similar to a previous document stored in the database further comprises:
calculating the data-value occupancy of the current document and the data-value occupancy of the third-degree similar document; if the data-value occupancy of the two documents is the same and the data differ, the current document and the third-degree similar document are dissimilar; otherwise they are similar.
6. The real-time fast deduplication method for multi-source data documents according to claim 1, characterized in that the document data includes an article title, an ID number, body content and a data source identifier, and storing the tagged word and the document data of the current document to the database specifically comprises:
combining the tagged word and the ID number of the current document into the key value of the current document;
storing the article title, ID number, body content, data source identifier and key value of the current document in a Redis database.
7. The real-time fast deduplication method for multi-source data documents according to claim 6, characterized in that, before the similarity of the current document and a previous document is compared, the key value of the previous document is extracted from the database, and the tagged word of the previous document is obtained from the key value.
8. A real-time fast deduplication system for multi-source data documents, characterized by comprising:
a data processing unit, configured to receive a current document and filter the current document to obtain filtered document data;
a computing unit, configured to calculate a tagged word of the document data by a locality-sensitive hashing algorithm;
a similarity judging unit, configured to judge, according to the tagged word and the document data, whether the current document is similar to a previous document stored in a database;
a deduplication access unit, configured to store the tagged word and the document data of the current document to the database if the current document and the previous document are dissimilar, and otherwise not to store them.
9. The real-time fast deduplication system for multi-source data documents according to claim 8, characterized in that the similarity judging unit is specifically configured to:
calculate the Hamming distance between the tagged word of the current document and the tagged word of the previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, and a preliminary similar document is obtained;
calculate the difference between the word count of the body content of the current document and the word count of the body content of the preliminary similar document; if the difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, and a second-degree similar document is obtained;
extract the keywords of the current document and the keywords of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of shared keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained; when the number of keywords of the current document and the number of keywords of the second-degree similar document are both greater than 3, if the number of shared keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, and a third-degree similar document is obtained;
calculate the data-value occupancy of the current document and the data-value occupancy of the third-degree similar document; if the data-value occupancy of the two documents is the same and the data differ, the current document and the third-degree similar document are dissimilar; otherwise they are similar.
10. A terminal, including a processor and a memory connected to the processor, the memory being used to store a computer program, the computer program including program instructions, characterized in that the processor is configured to call the program instructions to execute the method according to any one of claims 1 to 7.
CN201811456999.5A 2018-11-30 2018-11-30 Real-time rapid duplicate removal method and system for multi-source data document Active CN109635084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811456999.5A CN109635084B (en) 2018-11-30 2018-11-30 Real-time rapid duplicate removal method and system for multi-source data document

Publications (2)

Publication Number Publication Date
CN109635084A true CN109635084A (en) 2019-04-16
CN109635084B CN109635084B (en) 2020-11-24

Family

ID=66070616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811456999.5A Active CN109635084B (en) 2018-11-30 2018-11-30 Real-time rapid duplicate removal method and system for multi-source data document

Country Status (1)

Country Link
CN (1) CN109635084B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750217A (en) * 2019-10-18 2020-02-04 北京浪潮数据技术有限公司 Information management method and related device
CN111368521A (en) * 2020-02-29 2020-07-03 重庆百事得大牛机器人有限公司 Management method for legal advisor service
CN111597178A (en) * 2020-05-18 2020-08-28 山东浪潮通软信息科技有限公司 Method, system, equipment and medium for cleaning repeating data
CN111737966A (en) * 2020-06-11 2020-10-02 北京百度网讯科技有限公司 Document repetition degree detection method, device, equipment and readable storage medium
CN114661771A (en) * 2022-04-14 2022-06-24 广州经传多赢投资咨询有限公司 Stock data storage and reading method, equipment and readable storage medium
CN115422125A (en) * 2022-09-29 2022-12-02 浙江星汉信息技术股份有限公司 Electronic document automatic filing method and system based on intelligent algorithm

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8661341B1 (en) * 2011-01-19 2014-02-25 Google, Inc. Simhash based spell correction
CN104346443A (en) * 2014-10-20 2015-02-11 北京国双科技有限公司 Web text processing method and device
EP2846499A1 (en) * 2013-09-06 2015-03-11 Alcatel Lucent Method And Device For Classifying A Message
CN105224518A (en) * 2014-06-17 2016-01-06 腾讯科技(深圳)有限公司 The lookup method of the computing method of text similarity and system, Similar Text and system
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities
US20180068023A1 (en) * 2016-09-07 2018-03-08 Facebook, Inc. Similarity Search Using Polysemous Codes
CN108009152A (en) * 2017-12-04 2018-05-08 陕西识代运筹信息科技股份有限公司 A kind of data processing method and device of the text similarity analysis based on Spark-Streaming
CN108573045A (en) * 2018-04-18 2018-09-25 同方知网数字出版技术股份有限公司 A kind of alignment matrix similarity retrieval method based on multistage fingerprint
CN108763486A (en) * 2018-05-30 2018-11-06 湖南写邦科技有限公司 Paper duplicate checking method, terminal and storage medium based on terminal
CN108776654A (en) * 2018-05-30 2018-11-09 昆明理工大学 One kind being based on improved simhash transcription comparison methods

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
游春晖: ""基于语义情感倾向的文本相似度计算"", 《电子科技大学硕士学位论文》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750217A (en) * 2019-10-18 2020-02-04 北京浪潮数据技术有限公司 Information management method and related device
CN111368521A (en) * 2020-02-29 2020-07-03 重庆百事得大牛机器人有限公司 Management method for legal advisor service
CN111368521B (en) * 2020-02-29 2023-04-07 重庆百事得大牛机器人有限公司 Management method for legal advisor service
CN111597178A (en) * 2020-05-18 2020-08-28 山东浪潮通软信息科技有限公司 Method, system, equipment and medium for cleaning repeating data
CN111737966A (en) * 2020-06-11 2020-10-02 北京百度网讯科技有限公司 Document repetition degree detection method, device, equipment and readable storage medium
CN111737966B (en) * 2020-06-11 2024-03-01 北京百度网讯科技有限公司 Document repetition detection method, device, equipment and readable storage medium
CN114661771A (en) * 2022-04-14 2022-06-24 广州经传多赢投资咨询有限公司 Stock data storage and reading method, equipment and readable storage medium
CN115422125A (en) * 2022-09-29 2022-12-02 浙江星汉信息技术股份有限公司 Electronic document automatic filing method and system based on intelligent algorithm

Also Published As

Publication number Publication date
CN109635084B (en) 2020-11-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant