CN105589962A - Method and device for generating text fingerprint information - Google Patents

Method and device for generating text fingerprint information

Publication number
CN105589962A
Authority
CN
China
Prior art keywords
text
vector
initial characteristics
words
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510975957.2A
Other languages
Chinese (zh)
Other versions
CN105589962B (en)
Inventor
张伸正
魏少俊
陈培军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201510975957.2A
Publication of CN105589962A
Application granted
Publication of CN105589962B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for generating text fingerprint information. The method comprises the following steps: extracting an initial feature vector of a text; assigning at least one element of the initial feature vector a weight value that is a multiple of the minimum weight value, and assigning the other elements the minimum weight value; adding copies of the corresponding elements to the initial feature vector according to the multiple to form a new feature vector; and applying a min-hash (MinHash) operation to the new feature vector to generate the fingerprint information of the text. The disclosed method and device can improve the accuracy of the fingerprint information, so that information clustering achieves a better result.

Description

Method and device for generating text fingerprint information
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for generating fingerprint information.
Background
With the development and growing popularity of Internet technology, the volume of text information that users face, such as news, is increasing at a surprising rate, and the demand for easily obtaining the text information one is interested in is becoming more and more urgent.
Because the amount of text information grows rapidly, text categories are increasingly fine-grained, and text is highly real-time (it is updated quickly and its period of relevance is extremely short), it is very important to cluster text effectively so that it can be provided to different users or to different applications.
Clustering refers to the process of dividing a set of physical or abstract objects into multiple classes composed of similar objects. A cluster generated by clustering is a set of data objects that are similar to the other objects in the same cluster and different from the objects in other clusters.
In the prior art, one technique for detecting quickly and efficiently whether vectors are similar is the min-hash algorithm (MinHash).
Suppose there are vectors A and B. The Jaccard similarity coefficient J of these two vectors is defined as:
J(A,B)=|A∩B|/|A∪B|
In the min-hash algorithm, suppose vector A=(a1,a2...ai...aN) is an N-dimensional vector. For each element ai in the vector, H(ai) is a hash function that maps ai to an integer, and hmin(A) is the minimum hash value obtained after the elements of vector A are processed by the hash function. For vectors A and B, hmin(A)=hmin(B) holds exactly when the element of A∪B that has the minimum hash value also lies in A∩B. This presupposes that H is a good hash function with good uniformity that maps different elements to different integers.
Therefore: Pr(hmin(A)=hmin(B))=J(A,B), where Pr denotes probability. That is, the probability that the min-hash value of vector A equals the min-hash value of vector B is equal to the Jaccard similarity coefficient of A and B. Vectors with identical min-hash values can therefore be clustered into one class.
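The relationship between the min-hash value and the Jaccard coefficient can be checked numerically. The following is a minimal Python sketch, not taken from the patent: the function names, the use of SHA-1 as the hash function H, and the idea of using several seeds to approximate the probability are all assumptions made for illustration. Inputs are assumed to be non-empty collections of strings.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard coefficient J(A,B) = |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def h(element, seed):
    """Hypothetical hash function H: maps an element to a 64-bit integer.

    A different seed gives a different, independent-looking hash function.
    """
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def min_hash(elements, seed=0):
    """hmin(A): the smallest hash value over all elements of A."""
    return min(h(e, seed) for e in elements)


def estimate_jaccard(a, b, num_hashes=256):
    """Fraction of hash functions for which hmin(A) == hmin(B).

    By Pr(hmin(A)=hmin(B)) = J(A,B), this fraction approximates J(A,B).
    """
    matches = sum(min_hash(a, s) == min_hash(b, s) for s in range(num_hashes))
    return matches / num_hashes
```

With more hash functions the estimate tends to move closer to the exact coefficient, at the cost of more hashing work.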
However, when the existing min-hash algorithm is used to cluster text information, it does not consider the importance of each word element in the text. As a result, two pieces of text information that the public is not actually interested in at the same time may be clustered together.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for generating text fingerprint information that overcome the above problems or at least partially solve or mitigate them.
According to one aspect of the present invention, a method for generating text fingerprint information is provided, comprising: extracting an initial feature vector of a text; assigning at least one element of the initial feature vector a weight value that is a multiple of the minimum weight value, and assigning the other elements the minimum weight value; adding copies of the corresponding elements to the initial feature vector according to the multiple to form a new feature vector; and performing a min-hash operation on the new feature vector to generate the fingerprint information of the text.
Optionally, in the method for generating text fingerprint information according to an embodiment of the present invention, forming the initial feature vector specifically comprises selecting representative words to form the feature vector of a news item.
Optionally, in the method for generating text fingerprint information according to an embodiment of the present invention, word segmentation is performed on the text, and garbage is then removed to form the feature vector of the text.
Optionally, in the method for generating text fingerprint information according to an embodiment of the present invention, forming the initial feature vector specifically comprises sorting the words in the word sequence of the text by frequency of occurrence from high to low and taking a predetermined number of words from the front as the initial feature vector of the text.
Optionally, in the method for generating text fingerprint information according to an embodiment of the present invention, the word frequency of each element in the initial feature vector is calculated, the word frequency being the number of times the element occurs in the text, and the minimum weight value of the initial feature vector and its corresponding elements, as well as the multiples of the minimum weight value and their corresponding elements, are determined according to the word frequency.
Optionally, in the method for generating text fingerprint information according to an embodiment of the present invention, the document frequency of each element in the initial feature vector is calculated, the document frequency being the number of texts that contain the element and the inverse document frequency being a function value inversely related to the document frequency, and the minimum weight value of the initial feature vector and its corresponding elements, as well as the multiples of the minimum weight value and their corresponding elements, are determined according to the inverse document frequency.
Optionally, in the method for generating text fingerprint information according to an embodiment of the present invention, the word frequency and the inverse document frequency of each element in the initial feature vector are calculated, and the minimum weight value of the feature vector and its corresponding elements, as well as the multiples of the minimum weight value and their corresponding elements, are determined according to the calculated word frequency and inverse document frequency.
Optionally, in the method for generating text fingerprint information according to an embodiment of the present invention, the weight value of each element in the initial feature vector is determined according to the position of the element in the text.
Optionally, in the method for generating text fingerprint information according to an embodiment of the present invention, the position includes the text title, the text abstract, and the text body.
According to another aspect of the present invention, a device for generating text fingerprint information is provided, comprising: an extraction device, for extracting an initial feature vector of a text; a valuator device, for assigning at least one element of the initial feature vector a weight value that is a multiple of the minimum weight value, the other elements being assigned the minimum weight value; a feature vector transformation device, for adding copies of the corresponding elements to the initial feature vector according to the multiple to form a new feature vector; and a fingerprint information generating device, for performing a min-hash operation on the new feature vector to generate the fingerprint information of the text.
Optionally, in the device for generating text fingerprint information according to an embodiment of the present invention, the extraction device is used for selecting representative words to form the initial feature vector of a news item.
Optionally, in the device for generating text fingerprint information according to an embodiment of the present invention, the extraction device is used for performing word segmentation on the text and then removing garbage to form the initial feature vector of the text.
Optionally, in the device for generating text fingerprint information according to an embodiment of the present invention, the extraction device is used for sorting the words in the word sequence of the text by frequency of occurrence from high to low and taking a predetermined number of words from the front as the initial feature vector of the text.
Optionally, in the device for generating text fingerprint information according to an embodiment of the present invention, the valuator device is used for calculating the word frequency of each element in the initial feature vector, the word frequency being the number of times the element occurs in the text, and for determining, according to the word frequency, the minimum weight value of the initial feature vector and its corresponding elements, as well as the multiples of the minimum weight value and their corresponding elements.
Optionally, in the device for generating text fingerprint information according to an embodiment of the present invention, the valuator device is used for calculating the document frequency of each element in the initial feature vector, the document frequency being the number of texts that contain the element and the inverse document frequency being a function value inversely related to the document frequency, and for determining, according to the inverse document frequency, the minimum weight value of the initial feature vector and its corresponding elements, as well as the multiples of the minimum weight value and their corresponding elements.
Optionally, in the device for generating text fingerprint information according to an embodiment of the present invention, the valuator device is used for calculating the word frequency and the inverse document frequency of each element in the initial feature vector, and for determining, according to the calculated word frequency and inverse document frequency, the minimum weight value of the feature vector and its corresponding elements, as well as the multiples of the minimum weight value and their corresponding elements.
Optionally, in the device for generating text fingerprint information according to an embodiment of the present invention, the valuator device is used for determining the weight value of each element in the initial feature vector according to the position of the element in the text.
Optionally, in the device for generating text fingerprint information according to an embodiment of the present invention, the position includes the text title, the text abstract, and the text body.
The beneficial effect of the present invention is that, when fingerprint information is generated from text information, the accuracy of the fingerprint information is improved, so that information clustering achieves a better result.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order that the above and other objects, features and advantages of the present invention may become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 schematically shows a flow chart of the method for generating text fingerprint information according to an embodiment of the present invention;
Fig. 2 schematically shows a flow chart of extracting the feature vector of text information according to an embodiment of the present invention;
Fig. 3 schematically shows a block diagram of the device for generating text fingerprint information according to an embodiment of the present invention;
Fig. 4 schematically shows a block diagram of the extraction device according to an embodiment of the present invention.
Specific embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that the scope of the present disclosure can be fully conveyed to those skilled in the art.
Referring to Fig. 1, a method for clustering text information provided by a specific embodiment of the present invention comprises:
Step 110: extract the feature vector of the text information.
Specifically, step 110 optionally comprises the following steps (see Fig. 2):
Step 1101: perform word segmentation on the text information.
This embodiment may first perform word segmentation to obtain multiple words. The words obtained by segmentation include, for example, "Ma Yili", "new film" and "yardstick", and also include garbage.
Step 1102: remove garbage from the words obtained by word segmentation.
Garbage can be divided into punctuation marks and meaningless Chinese vocabulary such as structural auxiliary words and function words. In a specific embodiment of the present invention, word segmentation may be further followed by garbage removal on the segmented words.
Step 1103: select representative words to form the feature vector of the news item.
Optionally, the words obtained after garbage removal can be used as the feature vector of the news item. Alternatively, representative words can be extracted from the words obtained after garbage removal to form the feature vector of the news item.
For example, for a news report web page, word segmentation and garbage removal yield a word sequence S=(s1,s2,s3......,sN), where s1, s2, s3, etc. denote the words remaining after word segmentation and garbage removal.
The word sequence S may contain identical words. Word frequency statistics can therefore be computed for the words in the sequence, the words can then be sorted by number of occurrences from high to low, and a predetermined number of words can be taken from the front as the feature vector of this news text.
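Steps 1102 and 1103 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the patent's implementation: the input is assumed to be already segmented into words (word segmentation itself is out of scope here), and the `GARBAGE` set, the function names and the parameter `k` are hypothetical.

```python
from collections import Counter

# Hypothetical garbage list: punctuation marks and a few function words.
GARBAGE = {",", ".", "!", "?", "the", "of", "a"}


def remove_garbage(words):
    """Step 1102 (sketch): drop punctuation and meaningless words."""
    return [w for w in words if w not in GARBAGE]


def initial_feature_vector(words, k):
    """Step 1103 (sketch): keep the k most frequent remaining words.

    Counter.most_common sorts by count from high to low; words with equal
    counts keep the order in which they were first seen.
    """
    counts = Counter(remove_garbage(words))
    return [w for w, _ in counts.most_common(k)]
```

For a segmented input such as `["film", ",", "star", "film", "the", "star", "film", "award", "."]` and `k = 2`, the sketch keeps `"film"` and `"star"`, the two words with the highest counts.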
Step 120: at least one element of the feature vector is assigned a weight value that is a multiple of the minimum weight value, and the other elements are assigned the minimum weight value.
For example, the feature vector S of a certain piece of text information is (Ma Yili new film yardstick large job market imperial elder sister's model must so be worn). "Ma Yili" is assigned a weight of 0.4, "new film" a weight of 0.2, and each other element a weight of 0.1.
Here, the weight value 0.1 of the other elements is the minimum weight value, the weight value of "new film" is 2 times the minimum weight value, and the weight value of "Ma Yili" is 4 times the minimum weight value.
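The conversion from weight values to multiples of the minimum weight value can be illustrated as follows. This is a small sketch: the function name is hypothetical, and rounding to the nearest integer is an assumption added to absorb floating-point error, since the source only describes weights that are exact multiples of the minimum.

```python
def weight_multiples(weights):
    """Express each weight as an integer multiple of the minimum weight.

    `weights` maps each element of the feature vector to its weight value.
    Rounding is an assumption made here to absorb floating-point error.
    """
    w_min = min(weights.values())
    return {element: round(w / w_min) for element, w in weights.items()}
```

Applied to the example above, the weights 0.4, 0.2 and 0.1 give the multiples 4, 2 and 1.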
In one embodiment of the present invention, the weights can be determined by the following method:
The word frequency TF represents the frequency with which a certain word Ti occurs in a certain document Dj. The more often Ti occurs, the higher TFi is, which indicates that the word is more important to the document as a whole. For example, for a document Dj about Ma Yili, the frequency TFi with which "Ma Yili" occurs in the document is relatively high.
The weight value of each element in the feature vector is determined according to the word frequency of each word in the feature vector.
In a certain embodiment of the present invention, the weights can be determined by the following method:
The document frequency DF represents the number of documents that contain a certain word Ti. For a word Ti, the more documents contain it, the larger DFi is and the less useful Ti is for distinguishing different documents; such a word is a non-focus word.
The inverse document frequency IDF is inversely related to the document frequency DF. For example, and without limitation, for a certain word one can set IDFi=log(N/DFi), where N is the total number of documents. If a word appears in only one document, DFi is 1 and IDFi is log N, and the word then has the greatest effect in distinguishing between documents.
The weight value of each element in the feature vector is determined according to the inverse document frequency of each word in the feature vector.
In a certain embodiment of the present invention, the weights can be determined by the following method:
The weight value of each element in the feature vector is determined according to both the word frequency and the inverse document frequency of each word in the feature vector. For example, and without limitation, the product of TF and IDF can be used as the parameter for determining the weight value of each element in the feature vector.
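The TF, DF and IDF quantities described above can be computed as in the following sketch. The function names are hypothetical, documents are assumed to be lists of already segmented words, and the natural logarithm is an assumption because the source writes only "log".

```python
import math
from collections import Counter


def word_frequency(document):
    """TF: number of times each word occurs in one segmented document."""
    return Counter(document)


def document_frequency(word, corpus):
    """DF: number of documents in the corpus that contain the word."""
    return sum(1 for doc in corpus if word in doc)


def inverse_document_frequency(word, corpus):
    """IDF = log(N / DF), with N the total number of documents."""
    df = document_frequency(word, corpus)
    if df == 0:
        raise ValueError(f"{word!r} does not occur in the corpus")
    return math.log(len(corpus) / df)


def tf_idf(document, corpus):
    """Product of TF and IDF for every word of a document in the corpus."""
    tf = word_frequency(document)
    return {w: tf[w] * inverse_document_frequency(w, corpus) for w in tf}
```

A word that occurs in every document gets an IDF of 0 under this formula, so a real weighting scheme would probably need a floor before the values can serve as multiples of a positive minimum weight; how the patent handles this case is not stated.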
In a specific embodiment of the present invention, the weights can be determined by the following method:
A word may appear in different positions, such as the title, the text abstract or the text body. Its importance differs by position, and so does how well it represents the text. The weight value of each element in the feature vector can therefore be determined according to the position of the element in the text, where the position can include, but is not limited to, the text title, the text abstract and the text body.
In a certain embodiment of the present invention, the weights can be determined by the following method:
The weight value of each element in the feature vector is determined according to the position of the word in the text together with the word frequency and/or the inverse document frequency.
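One possible way to combine position and word frequency is to let every occurrence of a word contribute a factor that depends on where it occurs. The sketch below is only an illustration: the factors in `POSITION_FACTOR`, the section names and the additive combination are all assumptions, since the source does not give concrete values or a formula.

```python
# Hypothetical position factors: a title occurrence counts more than an
# abstract occurrence, which counts more than a body occurrence.
POSITION_FACTOR = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def position_weights(sections):
    """Weight each word by where, and how often, it occurs.

    `sections` maps a position name to the segmented words found there.
    Each occurrence adds the factor of its position to the word's weight.
    """
    weights = {}
    for position, words in sections.items():
        factor = POSITION_FACTOR[position]
        for word in words:
            weights[word] = weights.get(word, 0.0) + factor
    return weights
```

For example, a word that occurs once in the title and once in the body ends up with weight 4.0, while a word that occurs only once in the body has weight 1.0.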
Step 130: add copies of the corresponding elements to the feature vector according to the multiple to form a new feature vector.
For example, adding 3 copies of "Ma Yili" and 1 copy of "new film" to the original feature vector (Ma Yili new film yardstick large job market imperial elder sister's model must so be worn) forms the new feature vector (Ma Yili Ma Yili Ma Yili Ma Yili new film new film yardstick large job market imperial elder sister's model must so be worn).
As those skilled in the art will appreciate, other ways of adding elements to the feature vector according to the multiple to form a new feature vector are also feasible, for example adding 6 copies of "Ma Yili" and 2 copies of "new film" to the original feature vector, or adding 2 copies of "Ma Yili" and 2 copies of "new film" to the original feature vector.
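Step 130 amounts to repeating each element as many times as its multiple. A minimal sketch follows; the function name is hypothetical, and elements without an explicit multiple are assumed to keep the minimum weight (multiple 1).

```python
def expand_feature_vector(vector, multiples):
    """Repeat each element according to its multiple of the minimum weight.

    Elements missing from `multiples` are assumed to have multiple 1.
    Copies are placed next to the original element, as in the example.
    """
    new_vector = []
    for element in vector:
        new_vector.extend([element] * multiples.get(element, 1))
    return new_vector
```

With the multiples 4 for "Ma Yili" and 2 for "new film", this adds 3 copies of the first and 1 copy of the second, matching the example above.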
Step 140: perform a min-hash operation on the new feature vector to generate the fingerprint information of the text information.
If the min-hash operation is performed on the original feature vector (Ma Yili new film yardstick large job market imperial elder sister's model must so be worn), the resulting value is the fingerprint information of the text information without weight assignment. This fingerprint has a high degree of similarity to the fingerprint information of the feature vector (European and American style clothing matching promoted to job market imperial elder sister's model). In practice, however, the probability that the public pays attention to both of these pieces of text information at the same time is not high.
When the min-hash operation is instead performed on the new feature vector (Ma Yili Ma Yili Ma Yili Ma Yili new film new film yardstick large job market imperial elder sister's model must so be worn), the resulting fingerprint information of the text information is more clearly distinguished from the fingerprint information of the feature vector (European and American style clothing matching promoted to job market imperial elder sister's model), and therefore reflects more accurately the relevance of text information in practical applications.
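The source does not say how repeated elements are hashed. If identical copies were hashed identically, they would all produce the same hash value and the minimum would not change. The sketch below therefore makes an explicit assumption: each copy is tagged with its occurrence index before hashing, so that the n-th copy of a word is a distinct element. The function names, the use of SHA-1 and the use of several seeds to build a multi-value fingerprint are likewise illustrative choices, not the patent's stated method.

```python
import hashlib


def tag_occurrences(vector):
    """Turn repeated elements into distinct (element, index) pairs."""
    seen = {}
    tagged = []
    for element in vector:
        index = seen.get(element, 0)
        tagged.append((element, index))
        seen[element] = index + 1
    return tagged


def _hash(item, seed):
    """Hypothetical seeded hash of an (element, index) pair to 64 bits."""
    data = f"{seed}|{item[0]}|{item[1]}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def fingerprint(vector, num_hashes=64):
    """Fingerprint: one min-hash value per seed over the tagged elements."""
    tagged = tag_occurrences(vector)
    return tuple(min(_hash(t, s) for t in tagged) for s in range(num_hashes))


def similarity(fp_a, fp_b):
    """Fraction of positions where two equal-length fingerprints agree."""
    return sum(x == y for x, y in zip(fp_a, fp_b)) / len(fp_a)
```

Under this tagging, the extra copies of a heavily weighted word enlarge the union of two vectors that do not share that word, which lowers their expected fingerprint similarity. This is consistent with the effect described above.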
Fig. 3 shows a block diagram of the device for generating text fingerprint information provided by an embodiment of the present invention.
As shown in Fig. 3, the device for generating text fingerprint information comprises an extraction device 210, a valuator device 220, a feature vector transformation device 230 and a fingerprint information generating device 240.
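In software, the four devices can be thought of as four stages of a pipeline. The following sketch only illustrates that composition: the class name, field names and callable signatures are hypothetical, and the device in the patent may equally be realized in hardware.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FingerprintGenerator:
    """Sketch of the device in Fig. 3 as four pluggable stages."""

    extract: Callable[[str], List[str]]            # extraction device 210
    assign: Callable[[List[str]], Dict[str, int]]  # valuator device 220
    transform: Callable[[List[str], Dict[str, int]], List[str]]  # device 230
    generate_fp: Callable[[List[str]], Tuple[int, ...]]          # device 240

    def generate(self, text: str) -> Tuple[int, ...]:
        vector = self.extract(text)                  # initial feature vector
        multiples = self.assign(vector)              # multiples of min weight
        new_vector = self.transform(vector, multiples)
        return self.generate_fp(new_vector)          # fingerprint information
```

Each stage can then be replaced independently, for example by swapping a word-frequency-based assignment for a position-based one.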
The extraction device 210 is used for extracting the initial feature vector of the text.
Fig. 4 schematically shows a block diagram of the extraction device according to an embodiment of the present invention. As shown in the figure, in this embodiment the extraction device comprises a word segmentation device 2101, a garbage removal device 2102 and an initial feature vector generator 2103.
The word segmentation device 2101 is used for performing word segmentation on the text to form a word sequence.
For example, the words obtained from the text information by word segmentation include "Ma Yili", "new film" and "yardstick".
Optionally, the extraction device also comprises the garbage removal device 2102.
The garbage removal device is used for removing garbage from the words obtained by word segmentation. Garbage can be divided into punctuation marks and meaningless Chinese vocabulary such as structural auxiliary words and function words.
The initial feature vector generator 2103 is used for generating the initial feature vector.
Optionally, the words obtained after garbage removal can be used as the initial feature vector of the text. Alternatively, representative words can be extracted from the words obtained after garbage removal to form the feature vector of the news item.
For example, for a news report web page, word segmentation and garbage removal yield a word sequence S=(s1,s2,s3......,sN), where s1, s2, s3, etc. denote the words remaining after word segmentation and garbage removal.
The word sequence S may contain identical words. Word frequency statistics can therefore be computed for the words in the sequence, the words can then be sorted by number of occurrences from high to low, and a predetermined number of words can be taken from the front as the feature vector of this news text.
The valuator device 220 is used for assigning at least one element of the initial feature vector a weight value that is a multiple of the minimum weight value, the other elements being assigned the minimum weight value.
For example, the feature vector S of a certain piece of text information is (Ma Yili new film yardstick large job market imperial elder sister's model must so be worn). "Ma Yili" is assigned a weight of 0.4, "new film" a weight of 0.2, and each other element a weight of 0.1.
Here, the weight value 0.1 of the other elements is the minimum weight value, the weight value of "new film" is 2 times the minimum weight value, and the weight value of "Ma Yili" is 4 times the minimum weight value.
In one embodiment of the present invention, the weights can be determined by the following method:
The word frequency TF represents the frequency with which a certain word Ti occurs in a certain document Dj. The more often Ti occurs, the higher TFi is, which indicates that the word is more important to the document as a whole. For example, for a document Dj about Ma Yili, the frequency TFi with which "Ma Yili" occurs in the document is relatively high.
The weight value of each element in the feature vector is determined according to the word frequency of each word in the feature vector.
In a certain embodiment of the present invention, the weights can be determined by the following method:
The document frequency DF represents the number of documents that contain a certain word Ti. For a word Ti, the more documents contain it, the larger DFi is and the less useful Ti is for distinguishing different documents; such a word is a non-focus word.
The inverse document frequency IDF is inversely related to the document frequency DF. For example, and without limitation, for a certain word one can set IDFi=log(N/DFi), where N is the total number of documents. If a word appears in only one document, DFi is 1 and IDFi is log N, and the word then has the greatest effect in distinguishing between documents.
The weight value of each element in the feature vector is determined according to the inverse document frequency of each word in the feature vector.
In a certain embodiment of the present invention, the weights can be determined by the following method:
The weight value of each element in the feature vector is determined according to both the word frequency and the inverse document frequency of each word in the feature vector. For example, and without limitation, the product of TF and IDF can be used as the parameter for determining the weight value of each element in the feature vector.
In a specific embodiment of the present invention, the weights can be determined by the following method:
A word may appear in different positions, such as the title, the text abstract or the text body. Its importance differs by position, and so does how well it represents the text. The weight value of each element in the feature vector can therefore be determined according to the position of the element in the text, where the position can include, but is not limited to, the text title, the text abstract and the text body.
In a certain embodiment of the present invention, the weights can be determined by the following method:
The weight value of each element in the feature vector is determined according to the position of the word in the text together with the word frequency and/or the inverse document frequency.
The feature vector transformation device 230 is used for adding copies of the corresponding elements to the initial feature vector according to the multiple of the minimum weight value to form a new feature vector.
For example, adding 3 copies of "Ma Yili" and 1 copy of "new film" to the original feature vector (Ma Yili new film yardstick large job market imperial elder sister's model must so be worn) forms the new feature vector (Ma Yili Ma Yili Ma Yili Ma Yili new film new film yardstick large job market imperial elder sister's model must so be worn).
As those skilled in the art will appreciate, other ways of adding elements to the feature vector according to the multiple to form a new feature vector are also feasible, for example adding 6 copies of "Ma Yili" and 2 copies of "new film" to the original feature vector, or adding 2 copies of "Ma Yili" and 2 copies of "new film" to the original feature vector.
The fingerprint information generating device 240 is used for performing a min-hash operation on the new feature vector to generate the fingerprint information of the text.
If the min-hash operation is performed on the original feature vector (Ma Yili new film yardstick large job market imperial elder sister's model must so be worn), the resulting value is the fingerprint information of the text information without weight assignment. This fingerprint has a high degree of similarity to the fingerprint information of the feature vector (European and American style clothing matching promoted to job market imperial elder sister's model). In practice, however, the probability that the public pays attention to both of these pieces of text information at the same time is not high.
When the min-hash operation is instead performed on the new feature vector (Ma Yili Ma Yili Ma Yili Ma Yili new film new film yardstick large job market imperial elder sister's model must so be worn), the resulting fingerprint information of the text information is more clearly distinguished from the fingerprint information of the feature vector (European and American style clothing matching promoted to job market imperial elder sister's model), and therefore reflects more accurately the relevance of text information in practical applications.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for generating text fingerprint information according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
Reference herein to "one embodiment", "an embodiment", or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. In addition, note that instances of the phrase "in one embodiment" here do not necessarily all refer to the same embodiment.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.
In addition, it should also be noted that the language used in this description has been selected principally for readability and instructional purposes, and not to delineate or limit the subject matter of the present invention. Therefore, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the present invention, the disclosure made herein is illustrative and not restrictive, and the scope of the present invention is defined by the appended claims.

Claims (10)

1. A method for generating text fingerprint information, comprising:
extracting an initial feature vector of a text;
assigning weight values to the elements of the initial feature vector, wherein the weight value of at least one element is a multiple of a minimum weight value and the weight value of the other elements is the minimum weight value;
adding the corresponding elements to the initial feature vector according to the multiple to form a new feature vector;
performing a min-hash operation on the new feature vector and then generating the fingerprint information of the text.
2. The method for generating text fingerprint information according to claim 1, characterized in that the initial feature vector is formed by selecting representative words to constitute the feature vector of the text.
3. The method for generating text fingerprint information according to any one of claims 1-2, characterized in that word segmentation is performed on the text and useless information is then removed to constitute the feature vector of the text.
4. The method for generating text fingerprint information according to any one of claims 1-3, characterized in that the initial feature vector is formed by arranging the words in the word sequence of the text in descending order of word frequency and taking a predetermined number of words from front to back as the initial feature vector of the text.
5. The method for generating text fingerprint information according to any one of claims 1-4, characterized in that the word frequency of each element in the initial feature vector is calculated, the word frequency being the number of times the element occurs in the text, and the minimum weight value of the initial feature vector and its corresponding elements, as well as the weight values that are multiples of the minimum weight value and their corresponding elements, are determined according to the word frequency.
6. The method for generating text fingerprint information according to any one of claims 1-5, characterized in that the document frequency of each element in the initial feature vector is calculated, the document frequency being the number of texts that contain the element and the inverse document frequency being a function value inversely proportional to the document frequency, and the minimum weight value of the initial feature vector and its corresponding elements, as well as the weight values that are multiples of the minimum weight value and their corresponding elements, are determined according to the inverse document frequency.
7. The method for generating text fingerprint information according to any one of claims 1-6, characterized in that the word frequency and the inverse document frequency of each element in the initial feature vector are calculated, and the minimum weight value of the feature vector and its corresponding elements, as well as the weight values that are multiples of the minimum weight value and their corresponding elements, are determined according to the calculated word frequency and inverse document frequency.
8. The method for generating text fingerprint information according to any one of claims 1-7, characterized in that the weight value of each element in the initial feature vector is determined according to the position of the element in the text.
9. The method for generating text fingerprint information according to any one of claims 1-8, characterized in that the position comprises the text title, the text abstract, and the text body.
10. An apparatus for generating text fingerprint information, comprising:
an extraction device, configured to extract an initial feature vector of a text;
an assignment device, configured to assign a multiple of a minimum weight value as the weight value of at least one element in the initial feature vector, the weight value of the other elements being the minimum weight value;
a feature vector transformation device, configured to add the corresponding elements to the initial feature vector according to the multiple to form a new feature vector;
a fingerprint information generating device, configured to perform a min-hash operation on the new feature vector and then generate the fingerprint information of the text.
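
The claims leave open how word frequency and inverse document frequency are turned into a minimum weight value and its multiples. One possible reading is sketched below in Python. The smoothed inverse document frequency formula, the rounding of each weight to the nearest integer multiple of the smallest weight, and the names `top_features` and `weight_multiples` are all assumptions made for illustration and are not stated in the claims.

```python
import math
from collections import Counter


def top_features(tokens, n):
    """Take the n most frequent words of a text as its initial feature vector."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def weight_multiples(features, tokens, corpus):
    """Map each feature to an integer multiple of the minimum weight.

    `corpus` is a list of word sets, one per text, used for document frequency.
    """
    tf = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for word in features:
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed inverse document frequency (an assumed formula); it
        # decreases as the document frequency increases and stays positive.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[word] = tf[word] * idf
    min_weight = min(weights.values())
    return {word: max(1, round(w / min_weight)) for word, w in weights.items()}


tokens = ["a", "a", "a", "b", "b", "c"]
corpus = [set(tokens), {"b", "c"}, {"c"}]

features = top_features(tokens, 3)
multiples = weight_multiples(features, tokens, corpus)
```

Rounding to integers is a design choice of this sketch: it lets each multiple be used directly as the number of times the element appears in the new feature vector.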
CN201510975957.2A 2015-12-22 2015-12-22 Method and device for generating text fingerprint information Active CN105589962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510975957.2A CN105589962B (en) 2015-12-22 2015-12-22 A kind of generation method and device of text fingerprints information

Publications (2)

Publication Number Publication Date
CN105589962A true CN105589962A (en) 2016-05-18
CN105589962B CN105589962B (en) 2018-11-02

Family

ID=55929541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510975957.2A Active CN105589962B (en) 2015-12-22 2015-12-22 Method and device for generating text fingerprint information

Country Status (1)

Country Link
CN (1) CN105589962B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100017850A1 (en) * 2008-07-21 2010-01-21 Workshare Technology, Inc. Methods and systems to fingerprint textual information using word runs
CN102187642B (en) * 2011-04-14 2015-01-07 华为技术有限公司 Method and device for adding, searching for and deleting key in hash table
CN104239306A (en) * 2013-06-08 2014-12-24 华为技术有限公司 Multimedia fingerprint Hash vector construction method and device
CN103971061B (en) * 2014-05-26 2017-06-30 中电长城网际***应用有限公司 Text fingerprint acquisition methods and its device, data managing method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019098454A1 (en) * 2017-11-15 2019-05-23 주식회사 세진마인드 Technique for generating and utilizing virtual fingerprint representing text data
US11373043B2 (en) 2017-11-15 2022-06-28 Sejin Mind Inc. Technique for generating and utilizing virtual fingerprint representing text data
CN109145080A (en) * 2018-07-26 2019-01-04 新华三信息安全技术有限公司 A kind of text fingerprints preparation method and device
CN109145080B (en) * 2018-07-26 2021-01-01 新华三信息安全技术有限公司 Text fingerprint obtaining method and device

Also Published As

Publication number Publication date
CN105589962B (en) 2018-11-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220728

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.