CN109635084A - Method and system for real-time rapid deduplication of multi-source document data - Google Patents
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of information processing, and in particular relates to a method and system for real-time rapid deduplication of multi-source document data, comprising the following steps: receiving a current document and filtering it to obtain filtered document data; computing a tagged word (fingerprint) of the document data by a locality-sensitive hashing algorithm; judging, from the tagged word and the document data, whether the current document is similar to any previously stored document in the database; and, if they are dissimilar, storing the tagged word and document data of the current document in the database, otherwise not storing them. The invention can perform real-time rapid deduplication of similar document data from different sources and avoids storing duplicates of similar documents.
Description
Technical field
The invention belongs to the technical field of information processing, and in particular relates to a method and system for real-time rapid deduplication of multi-source document data.
Background technique
Because of forwarding or borrowing, document data from different websites often have identical or highly repetitive article content. In practical applications such similar articles need to be screened out and filtered. The previous approach of manual editing consumes a great deal of labor, and for data such as news that must be pushed in real time, the timeliness of manual deduplication is very poor. General deduplication algorithms have high memory usage during online computation and, when the data volume is too large, easily cause memory overflow, while offline computation solves the memory-overflow problem but cannot guarantee the timeliness of deduplication.
Summary of the invention
In view of the defects in the prior art, the present invention provides a method and system for real-time rapid deduplication of multi-source document data, which can perform real-time rapid deduplication of similar documents from different sources and avoid storing duplicates of similar documents.
In a first aspect, the present invention provides a method for real-time rapid deduplication of multi-source document data, comprising the following steps:
receiving a current document and filtering it to obtain filtered document data;
computing the tagged word of the document data by a locality-sensitive hashing algorithm;
judging, from the tagged word and the document data, whether the current document is similar to any previously stored document in the database;
if they are dissimilar, storing the tagged word and document data of the current document in the database; otherwise not storing them.
Preferably, computing the tagged word of the document data by the locality-sensitive hashing algorithm comprises the following specific steps:
segmenting the body content of the document data into a number of words;
computing the weight of each word by word-frequency statistics;
mapping each word to a hash value with a hash algorithm;
weighting the hash value of each word by its weight to obtain a weighted digit string;
summing the digit strings of all words bitwise to obtain a final digit string;
converting the final digit string into a 64-bit tagged word in 0/1 form.
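As a concrete illustration, the six steps above amount to a standard SimHash computation. The sketch below assumes term frequency as the word weight and MD5 (truncated to 64 bits) as the per-word hash; the patent does not name a specific tokenizer or hash algorithm, so those choices are illustrative only.

```python
import hashlib
from collections import Counter

def simhash(words, nbits=64):
    """Compute a SimHash fingerprint (the "tagged word") from a word list."""
    counts = Counter(words)
    total = sum(counts.values())
    acc = [0.0] * nbits
    for word, n in counts.items():
        weight = n / total  # word frequency serves as this word's weight
        # Map the word to an nbits-bit hash value (MD5 here is an assumption).
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << nbits) - 1)
        for i in range(nbits):
            # Weighted digit string: +weight where the bit is 1, -weight where 0,
            # summed bitwise across all words.
            acc[i] += weight if (h >> i) & 1 else -weight
    # Sign of each summed digit gives one bit of the final 0/1 tagged word.
    fp = 0
    for i, v in enumerate(acc):
        if v > 0:
            fp |= 1 << i
    return fp
```

Because the fingerprint is locality-sensitive, two documents with mostly overlapping words produce fingerprints that differ in only a few bit positions.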
Preferably, judging from the tagged word and the document data whether the current document is similar to a previously stored document in the database comprises the following specific steps:
computing the Hamming distance between the tagged word of the current document and the tagged word of a previous document; if the Hamming distance is greater than or equal to N, the current document and the previous document are dissimilar; otherwise they are similar, yielding a preliminary similar document;
computing the word-count difference between the body content of the current document and that of the preliminary similar document; if the difference is greater than M, the current document and the preliminary similar document are dissimilar; otherwise they are similar, yielding a second-degree similar document.
Preferably, judging from the tagged word and the document data whether the current document is similar to a previously stored document in the database further comprises:
extracting the keywords of the current document and the keywords of the second-degree similar document;
when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3: if the number of identical keywords is less than 2, the current document and the second-degree similar document are dissimilar; otherwise they are similar, yielding a third-degree similar document;
when the numbers of keywords of both the current document and the second-degree similar document are greater than 3: if the number of identical keywords is less than 3, the current document and the second-degree similar document are dissimilar; otherwise they are similar, yielding a third-degree similar document.
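The two keyword cases above reduce to a small threshold rule. A minimal sketch (the function name and the use of sets are our own; the patent only specifies the counts):

```python
def keywords_similar(kw_a, kw_b):
    """Third-stage check on extracted keyword lists.

    With few keywords (either side has at most 3), require at least 2
    in common; with more than 3 on both sides, require at least 3.
    """
    common = len(set(kw_a) & set(kw_b))
    if len(kw_a) <= 3 or len(kw_b) <= 3:
        return common >= 2
    return common >= 3
```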
Preferably, judging from the tagged word and the document data whether the current document is similar to a previously stored document in the database further comprises:
computing the data-placeholder amount of the current document and that of the third-degree similar document; if the data-placeholder amounts of the two documents are identical but the numeric data differ, the current document and the third-degree similar document are dissimilar; otherwise they are similar.
Preferably, the document data includes the article title, ID number, body content and data-source identifier, and storing the tagged word and document data of the current document in the database comprises the following specific steps:
combining the tagged word and the ID number of the current document into the key value of the current document;
storing the article title, ID number, body content, data-source identifier and key value of the current document in a redis database.
Preferably, before the current document is compared with previous documents for similarity, the key values of the previous documents are extracted from the database, and the tagged words of the previous documents are obtained from those key values.
In a second aspect, the present invention provides a system for real-time rapid deduplication of multi-source document data, adapted to the method of the first aspect and comprising:
a data-processing unit, configured to receive a current document and filter it to obtain filtered document data;
a computing unit, configured to compute the tagged word of the document data by a locality-sensitive hashing algorithm;
a similarity-judging unit, configured to judge, from the tagged word and the document data, whether the current document is similar to any previously stored document in the database;
a deduplication access unit, configured to store the tagged word and document data of the current document in the database if the current document is dissimilar to the previous documents, and otherwise not to store them.
Preferably, the similarity-judging unit is specifically configured to:
compute the Hamming distance between the tagged word of the current document and the tagged word of a previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, yielding a preliminary similar document;
compute the word-count difference between the body content of the current document and that of the preliminary similar document; if the difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, yielding a second-degree similar document;
extract the keywords of the current document and of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2 the documents are dissimilar, otherwise similar, yielding a third-degree similar document; when the numbers of keywords of both documents are greater than 3, if the number of identical keywords is less than 3 the documents are dissimilar, otherwise similar, yielding a third-degree similar document;
compute the data-placeholder amount of the current document and that of the third-degree similar document; if the data-placeholder amounts of the two documents are identical but the numeric data differ, the documents are dissimilar; otherwise they are similar.
In a third aspect, the present invention provides a terminal comprising a processor and a memory connected to the processor, the memory storing a computer program that includes program instructions, the processor being configured to invoke the program instructions to execute the method of the first aspect.
In the technical solution of the present invention, a similarity identification judgment is made between the current document and previously stored documents; dissimilar current documents are stored while similar ones are not, thereby achieving real-time rapid deduplication of similar documents from different sources and avoiding duplicate storage of similar documents.
Brief description of the drawings
In order to illustrate the specific embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. Throughout the drawings, similar elements or parts are generally identified by similar reference numerals, and the elements or parts are not necessarily drawn to scale.
Fig. 1 is a schematic flowchart of the method for real-time rapid deduplication of multi-source document data in this embodiment;
Fig. 2 is a schematic structural diagram of the system for real-time rapid deduplication of multi-source document data in this embodiment.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be understood that, when used in this specification and the appended claims, the terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in the description of the invention and the appended claims refers to, and includes, any and all possible combinations of one or more of the associated listed items.
Embodiment one:
This embodiment provides a method for real-time rapid deduplication of multi-source document data which, as shown in Fig. 1, comprises the following steps:
S1: receiving a current document and filtering it to obtain filtered document data;
S2: computing the tagged word of the document data by a locality-sensitive hashing algorithm;
S3: judging, from the tagged word and the document data, whether the current document is similar to any previously stored document in the database;
S4: if they are dissimilar, storing the tagged word and document data of the current document in the database; otherwise not storing them.
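Steps S1 to S4 can be sketched as the following control flow. The helper signatures are assumptions of ours, not the patent's: `fingerprint_fn` stands for the S2 hash computation, `similar_fn` for the four-stage judgment of S3, and the S1 filtering is taken as already done.

```python
def deduplicate(doc, db, fingerprint_fn, similar_fn):
    """Store doc in db only if no stored document is judged similar.

    db maps fingerprints to documents; returns True if doc was stored.
    """
    fp = fingerprint_fn(doc)                   # S2: tagged word
    for stored_fp, stored_doc in db.items():   # S3: compare with stored docs
        if similar_fn(fp, doc, stored_fp, stored_doc):
            return False                       # similar: do not store
    db[fp] = doc                               # S4: dissimilar: store it
    return True
```

Note that once any stored document is judged similar, storage is skipped immediately, which matches the "otherwise not storing" branch of S4.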
In this embodiment, suppose a user follows a link from their WeChat Moments to a website, reads an article, finds it good, and reposts the article on another website. The back end receives the reposted current document and filters it, removing content such as the website header and footer, to obtain the filtered document data. The document data includes the relevant content of the article, e.g. the article title, ID number, body content and data-source identifier.
The tagged word is then computed from the filtered document data. Specifically, computing the tagged word of the document data by the locality-sensitive hashing algorithm in S2 comprises:
S21: segmenting the body content of the document data into a number of words;
S22: computing the weight of each word by word-frequency statistics;
S23: mapping each word to a hash value with a hash algorithm;
S24: weighting the hash value of each word by its weight to obtain a weighted digit string;
S25: summing the digit strings of all words bitwise to obtain a final digit string;
S26: converting the final digit string into a 64-bit tagged word in 0/1 form.
In this embodiment, the body content is segmented and meaningless filler words, such as "uh" and other stop words, are removed. The number of occurrences of each word in the text is counted and divided by the total word count of the full text to obtain its word frequency, which serves as the word's weight. Each word is then mapped to a hash value with a hash algorithm; for example, the word "robot" maps to 11001 (the example uses 5 bits; in practice 64 bits are used). The hash value of each word is then weighted by the word's weight: if the weight of "robot" is 3, its weighted digit string is 3 3 -3 -3 3 (a 1 bit in the hash contributes 1x3, a 0 bit contributes -1x3). After the digit string of every word has been computed, the digit strings are summed bitwise. For example, if the digit string of "intelligent" is 6 -6 6 6 -6, the bitwise sum of the two words of "intelligent robot" is:
(3+6) (3-6) (-3+6) (-3+6) (3-6) -> 9 -3 3 3 -3
Proceeding in this way, the digit strings of all words of the full text are summed to obtain the final digit string. The final digit string is then converted into a 64-bit tagged word in 0/1 form (the rule being: a value greater than 0 becomes 1, a value less than 0 becomes 0; the result here is 10110).
After the tagged word is obtained, a similarity identification judgment is made between the current document and the previous documents. Specifically, judging in S3 from the tagged word and the document data whether the current document is similar to a previously stored document comprises:
S31: computing the Hamming distance between the tagged word of the current document and the tagged word of a previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, yielding a preliminary similar document;
S32: computing the word-count difference between the body content of the current document and that of the preliminary similar document; if the difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, yielding a second-degree similar document;
S33: extracting the keywords of the current document and of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2 the documents are dissimilar, otherwise similar, yielding a third-degree similar document; when the numbers of keywords of both documents are greater than 3, if the number of identical keywords is less than 3 the documents are dissimilar, otherwise similar, yielding a third-degree similar document;
S34: computing the data-placeholder amount of the current document and that of the third-degree similar document; if the data-placeholder amounts of the two documents are identical but the numeric data differ, the documents are dissimilar; otherwise they are similar.
In this embodiment, the similarity judgment proceeds through the four steps in sequence: once a step determines that the documents are dissimilar, the subsequent steps need not be carried out; only when a step judges them similar does the judgment proceed to the next step.
The first step judges by tagged word. The key values of the previous documents are extracted from the redis database, and the tagged word of each previous document is obtained from its key value. The Hamming distance between the tagged word of the current document and that of a previous document is computed; if it is greater than or equal to 3, the current document and the previous document are dissimilar; if it is less than 3, the current document is similar to the previous document, yielding a preliminary similar document. Because similarity judgment based on the tagged word obtained by locality-sensitive hashing has certain defects, further judgments follow to make the result more accurate.
The second step judges by word count. The body content of the previous document is extracted from the redis database and the word-count difference between the body contents of the two documents is computed; if the difference is greater than 500, the current document and the preliminary similar document are dissimilar, otherwise similar, yielding a second-degree similar document. Because a long document contains many words, it may subsume the content of a short document, and two documents of very different lengths could otherwise be judged similar; documents whose lengths differ by more than 500 words are therefore judged dissimilar.
The third step judges by keywords, adding the keyword dimension to the judgment and improving its accuracy. The keywords of a document are extracted with the TextRank algorithm, and two cases are distinguished by the number of keywords extracted: if either document has at most 3 keywords, the two documents are dissimilar when they share fewer than 2 identical keywords, and otherwise similar, yielding a third-degree similar document; if both documents have more than 3 keywords, they are dissimilar when they share fewer than 3 identical keywords, and otherwise similar, yielding a third-degree similar document.
The fourth step judges by data. News feeds contain many template articles whose content is highly repetitive but whose data differ; such documents should be considered dissimilar. If the data-placeholder amounts of two documents are identical but the data differ, the two documents are judged dissimilar. The data-placeholder amount is the number of digit positions occupied by the data; for example, stock-report articles use the same template every day and differ only in their figures.
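The patent does not define exactly how the data-placeholder amount is extracted. One plausible reading, treating it as the digit widths of the numbers appearing in the text, can be sketched as follows (the regular expression and the whole extraction scheme are our assumption):

```python
import re

def digit_shape(text):
    """Return the numbers in a text and their digit widths
    (one reading of the "data-placeholder amount")."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    widths = [len(n) for n in numbers]
    return numbers, widths

def fourth_stage_dissimilar(text_a, text_b):
    # Same placeholder widths but different numbers: template articles
    # that differ only in their figures, so judge them dissimilar.
    nums_a, widths_a = digit_shape(text_a)
    nums_b, widths_b = digit_shape(text_b)
    return widths_a == widths_b and nums_a != nums_b
```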
If the current document is similar to a previous document it is not stored; if dissimilar, it is stored. Specifically, storing the tagged word and document data of the current document in the database comprises:
combining the tagged word and the ID number of the current document into the key value of the current document;
storing the article title, ID number, body content, data-source identifier and key value of the current document in the redis database.
In this embodiment, a redis database is used to store the data; redis is characteristically suited to caching large volumes of data and offers very fast reads. The article title, ID number, body content, data-source identifier and tagged word of a document are stored in redis. To guarantee uniqueness, the combination of the tagged word and the ID is used as the redis Key; later, when a previous document must be extracted from redis for a similarity computation, extracting the key value alone yields the tagged word, which is several times faster than storing the tagged word as a redis Value and reading it back.
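The key layout described above can be sketched against an in-memory dict standing in for redis; with the redis-py client, `store` would be replaced by calls such as `r.hset(key, mapping=...)` (an assumption about the client API; the patent does not name a client). The separator, field names and helper names are ours.

```python
def make_key(tagged_word, doc_id):
    """Combine the tagged word and the ID number into the redis Key,
    so similarity checks can read fingerprints from keys alone."""
    return f"{tagged_word}:{doc_id}"

def store_document(store, tagged_word, doc):
    """Store the document fields under the combined key.
    `store` stands in for the redis database (a plain dict here)."""
    key = make_key(tagged_word, doc["id"])
    store[key] = {"title": doc["title"], "id": doc["id"],
                  "body": doc["body"], "source": doc["source"]}
    return key

def stored_fingerprints(store):
    """Recover every stored tagged word from the keys alone,
    without fetching any values."""
    return [key.split(":", 1)[0] for key in store]
```

This reflects the design choice in the text: because the fingerprint lives in the key, the first-stage Hamming comparison never needs a value lookup.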
In conclusion the method for the present embodiment, the identification of similarity is carried out to current document and stored document before
Judgement stores dissimilar current document, similar current document is not stored then, to realize to separate sources
Similar document real-time quick duplicate removal processing, avoid the repetition of similar document from storing.
Embodiment two:
This embodiment provides a system for real-time rapid deduplication of multi-source document data, adapted to the method of embodiment one and, as shown in Fig. 2, comprising:
a data-processing unit, configured to receive a current document and filter it to obtain filtered document data;
a computing unit, configured to compute the tagged word of the document data by a locality-sensitive hashing algorithm;
a similarity-judging unit, configured to judge, from the tagged word and the document data, whether the current document is similar to any previously stored document in the database;
a deduplication access unit, configured to store the tagged word and document data of the current document in the database if the current document is dissimilar to the previous documents, and otherwise not to store them.
In this embodiment, suppose a user follows a link from their WeChat Moments to a website, reads an article, finds it good, and reposts the article on another website. The back end receives the reposted current document and filters it, removing content such as the website header and footer, to obtain the filtered document data. The document data includes the relevant content of the article, e.g. the article title, ID number, body content and data-source identifier.
The tagged word is then computed from the filtered document data. Specifically, the computing unit is configured to:
segment the body content of the document data into a number of words;
compute the weight of each word by word-frequency statistics;
map each word to a hash value with a hash algorithm;
weight the hash value of each word by its weight to obtain a weighted digit string;
sum the digit strings of all words bitwise to obtain a final digit string;
convert the final digit string into a 64-bit tagged word in 0/1 form.
In this embodiment, the body content is segmented and meaningless filler words, such as "uh" and other stop words, are removed. The number of occurrences of each word in the text is counted and divided by the total word count of the full text to obtain its word frequency, which serves as the word's weight. Each word is then mapped to a hash value with a hash algorithm; for example, the word "robot" maps to 11001 (the example uses 5 bits; in practice 64 bits are used). The hash value of each word is then weighted by the word's weight: if the weight of "robot" is 3, its weighted digit string is 3 3 -3 -3 3 (a 1 bit in the hash contributes 1x3, a 0 bit contributes -1x3). After the digit string of every word has been computed, the digit strings are summed bitwise. For example, if the digit string of "intelligent" is 6 -6 6 6 -6, the bitwise sum of the two words of "intelligent robot" is:
(3+6) (3-6) (-3+6) (-3+6) (3-6) -> 9 -3 3 3 -3
Proceeding in this way, the digit strings of all words of the full text are summed to obtain the final digit string. The final digit string is then converted into a 64-bit tagged word in 0/1 form (the rule being: a value greater than 0 becomes 1, a value less than 0 becomes 0; the result here is 10110).
After the tagged word is obtained, a similarity identification judgment is made between the current document and the previous documents. Specifically, the similarity-judging unit is configured to:
compute the Hamming distance between the tagged word of the current document and the tagged word of a previous document; if the Hamming distance is greater than or equal to 3, the current document and the previous document are dissimilar; otherwise they are similar, yielding a preliminary similar document;
compute the word-count difference between the body content of the current document and that of the preliminary similar document; if the difference is greater than 500, the current document and the preliminary similar document are dissimilar; otherwise they are similar, yielding a second-degree similar document;
extract the keywords of the current document and of the second-degree similar document; when the number of keywords of the current document or of the second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2 the documents are dissimilar, otherwise similar, yielding a third-degree similar document; when the numbers of keywords of both documents are greater than 3, if the number of identical keywords is less than 3 the documents are dissimilar, otherwise similar, yielding a third-degree similar document;
compute the data-placeholder amount of the current document and that of the third-degree similar document; if the data-placeholder amounts of the two documents are identical but the numeric data differ, the documents are dissimilar; otherwise they are similar.
In this embodiment, the similarity judgment proceeds through the four steps in sequence: once a step determines that the documents are dissimilar, the subsequent steps need not be carried out; only when a step judges them similar does the judgment proceed to the next step.
The first step judges by tagged word. The key values of the previous documents are extracted from the redis database, and the tagged word of each previous document is obtained from its key value. The Hamming distance between the tagged word of the current document and that of a previous document is computed; if it is greater than or equal to 3, the current document and the previous document are dissimilar; if it is less than 3, the current document is similar to the previous document, yielding a preliminary similar document. Because similarity judgment based on the tagged word obtained by locality-sensitive hashing has certain defects, further judgments follow to make the result more accurate.
The second step judges by word count. The body content of the previous document is extracted from the redis database and the word-count difference between the body contents of the two documents is computed; if the difference is greater than 500, the current document and the preliminary similar document are dissimilar, otherwise similar, yielding a second-degree similar document. Because a long document contains many words, it may subsume the content of a short document, and two documents of very different lengths could otherwise be judged similar; documents whose lengths differ by more than 500 words are therefore judged dissimilar.
The third step judges by keywords, adding the keyword dimension to the judgment and improving its accuracy. The keywords of a document are extracted with the TextRank algorithm, and two cases are distinguished by the number of keywords extracted: if either document has at most 3 keywords, the two documents are dissimilar when they share fewer than 2 identical keywords, and otherwise similar, yielding a third-degree similar document; if both documents have more than 3 keywords, they are dissimilar when they share fewer than 3 identical keywords, and otherwise similar, yielding a third-degree similar document.
The fourth step judges by data. News feeds contain many template articles whose content is highly repetitive but whose data differ; such documents should be considered dissimilar. If the data-placeholder amounts of two documents are identical but the data differ, the two documents are judged dissimilar. The data-placeholder amount is the number of digit positions occupied by the data; for example, stock-report articles use the same template every day and differ only in their figures.
Current document stores current document with when document is similar before, dissimilar then do not store.Wherein, described
By the tagged word of current document and document datastore to database, specifically:
The tagged word of current document and number ID are combined into the key value of current document;
By the article title of current document, ID number, body matter, data source mark and the storage of key value to redis number
According to library.
In the present embodiment, the redis database is used for data storage: the deduplication access unit writes data to redis and reads data from redis. Redis is well suited to caching large volumes of data and offers very fast reads. The article title, ID number, body content, data-source identifier and tagged word of each document are stored in redis. To guarantee uniqueness, the tagged word is combined with the ID number to form the redis key; later, when earlier documents must be fetched from redis for similarity calculation, extracting the key alone yields the tagged word, which is several times faster than storing the tagged word as the redis value and reading it back.
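A minimal sketch of the key scheme described here, using a plain dict in place of a live redis connection (with redis-py the write would be `r.hset(key, mapping=...)`); all names are illustrative assumptions:

```python
# Stand-in for a redis connection: a plain dict, so the sketch runs
# without a redis server.
store = {}

def make_key(tag_word, doc_id):
    """Combine the simhash tag word and the document ID into the key,
    so the tag word can be read back from the key itself without
    fetching the stored value."""
    return f"{tag_word}:{doc_id}"

def save_document(tag_word, doc_id, title, body, source):
    key = make_key(tag_word, doc_id)
    store[key] = {"title": title, "id": doc_id,
                  "body": body, "source": source}
    return key

def tag_words_in_store():
    """Recover every stored tag word by splitting the keys -- no value
    lookups are needed when comparing a new document to the store."""
    return [k.split(":", 1)[0] for k in store]

key = save_document("0110...01", "10086", "Daily report", "text", "siteA")
```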
In conclusion the system of the present embodiment, the identification of similarity is carried out to current document and stored document before
Judgement stores dissimilar current document, similar current document is not stored then, to realize to separate sources
Similar document real-time quick duplicate removal processing, avoid the repetition of similar document from storing.
Embodiment three:
The present embodiment provides a terminal comprising a processor and a memory connected to the processor. The memory stores a computer program that includes program instructions, and the processor is configured to call the program instructions to execute the method described in embodiment one.
It should be appreciated that in the present embodiment the processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so on.
The memory may include read-only memory and random-access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random-access memory; for example, the memory may store device-type information.
The terminal of the present embodiment executes the method described in embodiment one: it performs similarity identification between the current document and previously stored documents, stores dissimilar current documents and does not store similar ones, thereby achieving real-time, fast deduplication of similar documents from different sources and avoiding duplicate storage of similar documents.
Those of ordinary skill in the art will appreciate that the system units and method steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate this interchangeability of hardware and software clearly, the composition and steps of each example have been described above in terms of function. Whether these functions are implemented in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed method and system may be implemented in other ways. For example, the division of units described above is only a logical functional division; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. The units described as separate parts may or may not be physically separate, and a component shown as a unit may or may not be a physical unit; it may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and replacements do not depart the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present invention, and shall all be covered within the scope of the claims and the description of the invention.
Claims (10)
1. A real-time fast deduplication method for multi-source data documents, characterized by comprising the following steps:
receiving a current document and filtering the current document to obtain filtered document data;
calculating a tagged word of the document data by a locality-sensitive hashing algorithm;
judging, according to the tagged word and the document data, whether the current document is similar to earlier documents stored in a database;
if dissimilar, storing the tagged word and document data of the current document to the database, and otherwise not storing them.
2. The real-time fast deduplication method for multi-source data documents according to claim 1, characterized in that calculating the tagged word of the document data by the locality-sensitive hashing algorithm comprises the following specific steps:
segmenting the body content of the document data to obtain a number of words;
computing the weight of each word by a word-frequency statistics method;
mapping each word to a hash value with a hash algorithm;
weighting the hash value of each word by its weight to obtain a weighted numeric string;
summing the numeric strings of all words bit by bit to obtain a final numeric string;
converting the final numeric string into a 64-bit tagged word in 01 form.
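For illustration only (not part of the claims), the hashing steps recited above can be sketched as follows; MD5 stands in for the unspecified per-word hash, and raw word frequency is used as the weight:

```python
import hashlib
from collections import Counter

def simhash64(words):
    """Follow the claimed steps: weight each word by frequency, map it
    to a 64-bit hash, weight the bits (+w for a 1 bit, -w for a 0 bit),
    sum the bit columns, and binarize into a 64-character 01 tag word."""
    weights = Counter(words)              # word-frequency weights
    totals = [0] * 64
    for word, w in weights.items():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(64):
            bit = (h >> i) & 1
            totals[i] += w if bit else -w
    return "".join("1" if t > 0 else "0" for t in totals)

tag = simhash64("the market rose the market fell".split())
```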
3. The real-time fast deduplication method for multi-source data documents according to claim 1, characterized in that judging, according to the tagged word and the document data, whether the current document is similar to the earlier documents stored in the database comprises the following specific steps:
calculating the Hamming distance between the tagged word of the current document and the tagged word of each earlier document; if the Hamming distance is greater than or equal to N, the current document and the earlier document are dissimilar, otherwise similar, and the preliminary similar documents are obtained;
calculating the difference in word count between the body content of the current document and the body content of each preliminary similar document; if the word-count difference is greater than M, the current document and the preliminary similar document are dissimilar, otherwise similar, and the second-degree similar documents are obtained.
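For illustration only (not part of the claims), the Hamming-distance test above can be sketched as follows; the function names are assumptions, and the default threshold of 3 comes from claim 9 while the claim here leaves N open:

```python
def hamming_distance(tag_a, tag_b):
    """Count differing bit positions between two equal-length
    01 tag words."""
    assert len(tag_a) == len(tag_b)
    return sum(a != b for a, b in zip(tag_a, tag_b))

def preliminary_similar(tag_a, tag_b, n=3):
    # Distances >= n mark the documents as dissimilar; distances
    # below n yield preliminary similar documents.
    return hamming_distance(tag_a, tag_b) < n

print(hamming_distance("0110", "0101"))  # 2
```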
4. The real-time fast deduplication method for multi-source data documents according to claim 3, characterized in that judging, according to the tagged word and the document data, whether the current document is similar to the earlier documents stored in the database further comprises:
extracting the keywords of the current document and the keywords of the second-degree similar documents;
when the number of keywords of the current document or of a second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2, the current document and the second-degree similar document are dissimilar, otherwise similar, and the third-degree similar documents are obtained;
when the numbers of keywords of the current document and of a second-degree similar document are both greater than 3, if the number of identical keywords is less than 3, the current document and the second-degree similar document are dissimilar, otherwise similar, and the third-degree similar documents are obtained.
5. The real-time fast deduplication method for multi-source data documents according to claim 4, characterized in that judging, according to the tagged word and the document data, whether the current document is similar to the earlier documents stored in the database further comprises:
calculating the data-value placeholder amount of the current document and the data-value placeholder amount of each third-degree similar document; if the data-value placeholder amounts of the two documents are identical but the data values differ, the current document and the third-degree similar document are dissimilar, otherwise similar.
6. The real-time fast deduplication method for multi-source data documents according to claim 1, characterized in that the document data comprises an article title, an ID number, body content and a data-source identifier, and storing the tagged word and document data of the current document to the database comprises the following specific steps:
combining the tagged word and ID number of the current document into a key value of the current document;
storing the article title, ID number, body content, data-source identifier and key value of the current document to a redis database.
7. The real-time fast deduplication method for multi-source data documents according to claim 6, characterized in that, before the current document is compared for similarity with the earlier documents, the key values of the earlier documents are extracted from the database and the tagged words of the earlier documents are obtained from the key values.
8. A real-time fast deduplication system for multi-source data documents, characterized by comprising:
a data processing unit, configured to receive a current document and filter the current document to obtain filtered document data;
a computing unit, configured to calculate a tagged word of the document data by a locality-sensitive hashing algorithm;
a similarity judging unit, configured to judge, according to the tagged word and the document data, whether the current document is similar to earlier documents stored in a database;
a deduplication access unit, configured to store the tagged word and document data of the current document to the database if the current document is dissimilar to the earlier documents, and otherwise not to store them.
9. The real-time fast deduplication system for multi-source data documents according to claim 8, characterized in that the similarity judging unit is specifically configured to:
calculate the Hamming distance between the tagged word of the current document and the tagged word of each earlier document; if the Hamming distance is greater than or equal to 3, the current document and the earlier document are dissimilar, otherwise similar, and the preliminary similar documents are obtained;
calculate the difference in word count between the body content of the current document and the body content of each preliminary similar document; if the word-count difference is greater than 500, the current document and the preliminary similar document are dissimilar, otherwise similar, and the second-degree similar documents are obtained;
extract the keywords of the current document and the keywords of the second-degree similar documents; when the number of keywords of the current document or of a second-degree similar document is less than or equal to 3, if the number of identical keywords is less than 2, the current document and the second-degree similar document are dissimilar, otherwise similar, and the third-degree similar documents are obtained; when the numbers of keywords of the current document and of a second-degree similar document are both greater than 3, if the number of identical keywords is less than 3, the current document and the second-degree similar document are dissimilar, otherwise similar, and the third-degree similar documents are obtained;
calculate the data-value placeholder amount of the current document and the data-value placeholder amount of each third-degree similar document; if the data-value placeholder amounts of the two documents are identical but the data values differ, the current document and the third-degree similar document are dissimilar, otherwise similar.
10. A terminal, comprising a processor and a memory connected to the processor, the memory storing a computer program that includes program instructions, characterized in that the processor is configured to call the program instructions to execute the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811456999.5A CN109635084B (en) | 2018-11-30 | 2018-11-30 | Real-time rapid duplicate removal method and system for multi-source data document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109635084A true CN109635084A (en) | 2019-04-16 |
CN109635084B CN109635084B (en) | 2020-11-24 |
Family
ID=66070616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811456999.5A Active CN109635084B (en) | 2018-11-30 | 2018-11-30 | Real-time rapid duplicate removal method and system for multi-source data document |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110750217A (en) * | 2019-10-18 | 2020-02-04 | 北京浪潮数据技术有限公司 | Information management method and related device |
CN111368521A (en) * | 2020-02-29 | 2020-07-03 | 重庆百事得大牛机器人有限公司 | Management method for legal advisor service |
CN111597178A (en) * | 2020-05-18 | 2020-08-28 | 山东浪潮通软信息科技有限公司 | Method, system, equipment and medium for cleaning repeating data |
CN111737966A (en) * | 2020-06-11 | 2020-10-02 | 北京百度网讯科技有限公司 | Document repetition degree detection method, device, equipment and readable storage medium |
CN114661771A (en) * | 2022-04-14 | 2022-06-24 | 广州经传多赢投资咨询有限公司 | Stock data storage and reading method, equipment and readable storage medium |
CN115422125A (en) * | 2022-09-29 | 2022-12-02 | 浙江星汉信息技术股份有限公司 | Electronic document automatic filing method and system based on intelligent algorithm |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8661341B1 (en) * | 2011-01-19 | 2014-02-25 | Google, Inc. | Simhash based spell correction |
CN104346443A (en) * | 2014-10-20 | 2015-02-11 | 北京国双科技有限公司 | Web text processing method and device |
EP2846499A1 (en) * | 2013-09-06 | 2015-03-11 | Alcatel Lucent | Method And Device For Classifying A Message |
CN105224518A (en) * | 2014-06-17 | 2016-01-06 | 腾讯科技(深圳)有限公司 | The lookup method of the computing method of text similarity and system, Similar Text and system |
CN106873964A (en) * | 2016-12-23 | 2017-06-20 | 浙江工业大学 | A kind of improved SimHash detection method of code similarities |
US20180068023A1 (en) * | 2016-09-07 | 2018-03-08 | Facebook, Inc. | Similarity Search Using Polysemous Codes |
CN108009152A (en) * | 2017-12-04 | 2018-05-08 | 陕西识代运筹信息科技股份有限公司 | A kind of data processing method and device of the text similarity analysis based on Spark-Streaming |
CN108573045A (en) * | 2018-04-18 | 2018-09-25 | 同方知网数字出版技术股份有限公司 | A kind of alignment matrix similarity retrieval method based on multistage fingerprint |
CN108763486A (en) * | 2018-05-30 | 2018-11-06 | 湖南写邦科技有限公司 | Paper duplicate checking method, terminal and storage medium based on terminal |
CN108776654A (en) * | 2018-05-30 | 2018-11-09 | 昆明理工大学 | One kind being based on improved simhash transcription comparison methods |
Non-Patent Citations (1)
Title |
---|
You Chunhui: "Text Similarity Calculation Based on Semantic Sentiment Orientation", Master's Thesis, University of Electronic Science and Technology of China *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109635084A (en) | A kind of real-time quick De-weight method of multi-source data document and system | |
US10394956B2 (en) | Methods, devices, and systems for constructing intelligent knowledge base | |
CN104008106B (en) | A kind of method and device obtaining much-talked-about topic | |
CN109033200A (en) | Method, apparatus, equipment and the computer-readable medium of event extraction | |
TW201546633A (en) | Method and Apparatus of Matching Text Information and Pushing a Business Object | |
CN102890689A (en) | Method and system for building user interest model | |
CN111460153A (en) | Hot topic extraction method and device, terminal device and storage medium | |
CN103593418A (en) | Distributed subject finding method and system for big data | |
KR20190075962A (en) | Data processing method and data processing apparatus | |
CN112650923A (en) | Public opinion processing method and device for news events, storage medium and computer equipment | |
CN110737821B (en) | Similar event query method, device, storage medium and terminal equipment | |
CN104679738A (en) | Method and device for mining Internet hot words | |
Wu et al. | Extracting topics based on Word2Vec and improved Jaccard similarity coefficient | |
CN110309234A (en) | A kind of client of knowledge based map holds position method for early warning, device and storage medium | |
CN109800292A (en) | The determination method, device and equipment of question and answer matching degree | |
CN111061837A (en) | Topic identification method, device, equipment and medium | |
CN107527289B (en) | Investment portfolio industry configuration method, device, server and storage medium | |
CN113204953A (en) | Text matching method and device based on semantic recognition and device readable storage medium | |
CN112035449A (en) | Data processing method and device, computer equipment and storage medium | |
CN114328800A (en) | Text processing method and device, electronic equipment and computer readable storage medium | |
CN109241361A (en) | Data processing method based on block chain | |
CN111324725B (en) | Topic acquisition method, terminal and computer readable storage medium | |
CN112579781A (en) | Text classification method and device, electronic equipment and medium | |
CN108628875A (en) | A kind of extracting method of text label, device and server | |
CN114691835A (en) | Audit plan data generation method, device and equipment based on text mining |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||