CN105824955B - Short message clustering method and device - Google Patents
- Publication number
- CN105824955B (application CN201610191407.6A)
- Authority
- CN
- China
- Prior art keywords
- short message
- category quantity
- similarity
- cluster result
- default
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/72—Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
- H04M1/724—User interfaces specially adapted for cordless or mobile telephones
- H04M1/72403—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
- H04M1/7243—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
- H04M1/72436—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for text messaging, e.g. short messaging services [SMS] or e-mails
Abstract
The disclosure relates to a short message clustering method and device. The method includes: constructing a similarity matrix of a short message set according to the similarity between any two short messages in the set; performing hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity; determining category quantities, including determining a first category quantity and a second category quantity according to the reference category quantity; a spectral clustering process, including performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the cluster quantity, to obtain short message clustering results; and, when the short message clustering results meet a preset condition, determining the clustering result corresponding to the reference category quantity as the target short message clustering result. Because the method clusters the short messages in the set based on their structural similarity, it takes full account of the structural and semantic similarity between short message sentences, which improves clustering accuracy.
Description
Technical field
The present disclosure relates to the field of data classification technology, and in particular to a short message clustering method and device.
Background
Ordinary users' short messages involve a great deal of personal privacy and have complex, varied sentence structures, so they are rarely the subject of text mining. Notification-type short messages have a comparatively rigid structure and are therefore usually an important object of text mining.
Text clustering methods include k-means, hierarchical clustering and the like. However, these methods have difficulty taking the structural and semantic similarity between sentences into account during clustering, so the clustering results they produce have low accuracy, which is a significant limitation.
Summary of the invention
To overcome the problems in the related art, the present disclosure provides a short message clustering method and device.
According to a first aspect of the embodiments of the present disclosure, a short message clustering method is provided, including:
constructing a similarity matrix of a short message set according to the similarity between any two short messages in the short message set;
performing hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determining category quantities, including: determining a first category quantity and a second category quantity according to the reference category quantity;
a spectral clustering process, including: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the cluster quantity, to obtain short message clustering results; and
when the short message clustering results meet a preset condition, determining the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
With this method, the short messages in the short message set are clustered based on their structural similarity, which takes full account of the structural and semantic similarity between short message sentences and so improves clustering accuracy. The categories obtained by this method are in turn clearly separated, which facilitates subsequent operations on the clustered categories, such as batch marking and batch deletion.
Optionally, the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity.
Optionally, performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the cluster quantity, to obtain the clustering results, includes:
obtaining the eigenvalues of the similarity matrix and the eigenvector corresponding to each eigenvalue;
selecting three groups of eigenvectors from the eigenvectors in ascending order, according to the reference category quantity, the first category quantity and the second category quantity respectively;
forming three eigenvector spaces from the three selected groups of eigenvectors;
clustering the eigenvectors in each eigenvector space with the K-means clustering algorithm, and taking the resulting three groups of cluster categories as the short message clustering results.
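The sub-steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: it uses NumPy, it takes the eigenvectors of the unnormalized graph Laplacian L = D - W (a common spectral clustering variant, whereas the text speaks of the eigenvectors of the similarity matrix itself), and the helper names `kmeans` and `spectral_candidates` as well as the farthest-point initialization are hypothetical choices.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain K-means with deterministic farthest-point initialization (an assumption)."""
    centers = [X[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_candidates(W, n):
    """Cluster with k = n-1, n and n+1; return a dict mapping k to labels."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W       # unnormalized Laplacian (assumption)
    _, vecs = np.linalg.eigh(L)          # eigenvalues come back in ascending order
    results = {}
    for k in (n - 1, n, n + 1):
        if 1 <= k <= len(W):
            # Rows of the first k eigenvectors form the eigenvector space for k.
            results[k] = kmeans(vecs[:, :k], k)
    return results

# Toy similarity matrix: two groups of three mutually similar messages.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
results = spectral_candidates(W, 2)
```

Here `results[2]` holds the labels for the reference category quantity, while `results[1]` and `results[3]` hold the labels for the two adjacent quantities.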
In this way, clustering is performed not only with the reference category quantity but also with the two category quantities adjacent to it, yielding three groups of cluster categories as the clustering results. The differences among the three groups can then be evaluated to determine whether the reference category quantity determined this time is appropriate.
Optionally, the method includes:
calculating, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories in that group;
calculating the ratio of the weighted averages of the three groups of cluster categories using a preset ratio formula;
judging whether the ratio is greater than a first threshold;
when the ratio is greater than the first threshold, determining that the short message clustering results meet the preset condition.
In the method provided by the embodiments of the present disclosure, after clustering with the reference category quantity and the two adjacent category quantities, the relationship among the weighted averages is calculated to determine whether the current clustering is accurate. If it is not, the operation can be iterated until the optimal categories are obtained.
Optionally, constructing the similarity matrix of the short message set according to the similarity between any two short messages in the short message set includes:
calculating the similarity between any two short messages in the short message set using a preset similarity formula;
generating a matrix containing all the similarities as the similarity matrix of the short message set.
Optionally, the preset similarity formula is:
Sim(A, B) = Sim_struct(A, B) × (α·Sim_t(A, B) + β·Sim_gram(A, B))
where α ∈ [0, 1] and β ∈ [0, 1];
when short message A and short message B have the same structure, Sim_struct(A, B) = 1;
when short message A and short message B do not have the same structure, Sim_struct(A, B) = 0;
Sim_t(A, B) = cos(vec(A), vec(B));
where vec(A) is the Latent Dirichlet Allocation (LDA) topic vector of short message A, and vec(B) is the LDA topic vector of short message B;
Sim_gram(A, B) = |D(A) ∩ D(B)| / |D(A) ∪ D(B)|;
where D(A) is the set of 2-gram word pairs of short message A, and D(B) is the set of 2-gram word pairs of short message B.
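As a rough illustration of the formula, the sketch below computes Sim(A, B) in plain Python. Several things are assumptions for the sake of the example: each message is a dict with the hypothetical keys `structure` (any comparable label for the message's structure; the text does not say here how structure is determined), `topics` (a precomputed LDA topic vector) and `tokens` (a word list), and the weights default to α = β = 0.5.

```python
import math

def bigrams(tokens):
    """Set of adjacent word pairs (2-grams) of a tokenized message."""
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def jaccard(s, t):
    """|s ∩ t| / |s ∪ t|, defined as 0 when both sets are empty."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def sim(a, b, alpha=0.5, beta=0.5):
    """Sim(A, B) = Sim_struct × (alpha × Sim_t + beta × Sim_gram)."""
    sim_struct = 1.0 if a["structure"] == b["structure"] else 0.0
    sim_t = cosine(a["topics"], b["topics"])
    sim_gram = jaccard(bigrams(a["tokens"]), bigrams(b["tokens"]))
    return sim_struct * (alpha * sim_t + beta * sim_gram)

# Hypothetical toy messages.
m1 = {"structure": "S1", "topics": [1.0, 0.0], "tokens": ["your", "code", "is", "1234"]}
m2 = {"structure": "S1", "topics": [1.0, 0.0], "tokens": ["your", "code", "is", "5678"]}
m3 = {"structure": "S2", "topics": [1.0, 0.0], "tokens": ["your", "code", "is", "1234"]}
```

With these inputs, `m1` and `m2` share two of four distinct word pairs and identical topic vectors, so their similarity is 0.5 × 1 + 0.5 × 0.5 = 0.75, while `m1` and `m3` differ in structure and score 0.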
Optionally, performing hierarchical clustering on the similarity matrix using the preset similarity threshold to obtain the reference category quantity includes:
comparing the value of each similarity in the similarity matrix with the preset similarity threshold;
extracting all similarities in the similarity matrix that are greater than the preset similarity threshold;
determining, as one category, short messages for which the similarity between any two of them is greater than the preset similarity threshold;
determining the number of categories obtained as the reference category quantity.
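One possible reading of these sub-steps is a greedy grouping in which a message joins a category only if its similarity to every member of that category exceeds the threshold. The sketch below follows that reading; the greedy visiting order and the function name `reference_category_count` are assumptions, and other hierarchical clustering procedures would fit the description equally well.

```python
def reference_category_count(W, threshold):
    """Greedily group messages whose pairwise similarities all exceed threshold.

    W is a square similarity matrix given as a list of lists.
    Returns (number of categories, list of categories as index lists).
    """
    groups = []
    for i in range(len(W)):
        for group in groups:
            # Join only if similar enough to every existing member.
            if all(W[i][j] > threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])  # no suitable category: start a new one
    return len(groups), groups

# Toy matrix: messages 0 and 1 are similar, message 2 stands alone.
W = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
n, groups = reference_category_count(W, 0.5)
```

For this toy matrix the sketch yields two categories, `[0, 1]` and `[2]`, so the reference category quantity would be 2.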
Spectral clustering is built on spectral graph theory. Compared with traditional clustering algorithms, it can cluster on sample spaces of arbitrary shape and converges to a globally optimal solution. However, spectral clustering can only run once the number of classes is given, and that number usually cannot be known in advance. An approximate number of classes can therefore be obtained based on the preset similarity threshold.
Optionally, the method further includes:
when the short message clustering results do not meet the preset condition, correcting the value of the reference category quantity and iteratively executing the step of determining category quantities and the spectral clustering process until the short message clustering results obtained meet the preset condition.
Spectral clustering is built on spectral graph theory. Compared with traditional clustering algorithms, it can cluster on sample spaces of arbitrary shape and converges to a globally optimal solution. However, spectral clustering can only run once the number of classes is given, and that number usually cannot be known in advance. An approximate number of classes can therefore be obtained based on the preset similarity threshold, then corrected, and the cluster quantity learned iteratively until the clustering effect finally stabilizes.
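The iteration described here can be pictured as a small control loop. The sketch below is only an outline under assumptions: `cluster_with`, `meets_condition` and `correct` are hypothetical callables standing in for the spectral clustering process, the preset-condition check and the correction of the reference category quantity, and the `max_rounds` safety cap is not part of the text.

```python
def cluster_until_stable(n, cluster_with, meets_condition, correct, max_rounds=20):
    """Repeat clustering with n-1, n, n+1 until the preset condition holds.

    n               -- initial reference category quantity
    cluster_with    -- callable k -> clustering result for k clusters
    meets_condition -- callable results_dict -> bool
    correct         -- callable (n, results_dict) -> corrected n
    """
    for _ in range(max_rounds):
        results = {k: cluster_with(k) for k in (n - 1, n, n + 1)}
        if meets_condition(results):
            # The result for the reference quantity is the target result.
            return n, results[n]
        n = correct(n, results)
    raise RuntimeError("no stable clustering found within max_rounds")

# Stub callables, purely to show the control flow.
final_n, final_result = cluster_until_stable(
    3,
    cluster_with=lambda k: f"clusters-{k}",
    meets_condition=lambda results: sorted(results)[1] == 5,
    correct=lambda n, results: n + 1,
)
```

With these stubs the loop tries n = 3 and n = 4, corrects upward each time, and stops at n = 5.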
Optionally, correcting the value of the reference category quantity includes:
obtaining the preset ratio from the short message clustering results;
when the preset ratio is less than a second threshold, subtracting a first preset value from the value of the reference category quantity;
when the preset ratio is greater than the second threshold, adding a second preset value to the value of the reference category quantity.
In the embodiments of the present disclosure, the second threshold is less than the first threshold. That is, once the ratio is found to be less than the first threshold, it is further compared with the second threshold, and the result of that comparison determines whether the reference category quantity is increased or decreased.
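The correction rule can be written as a short function. In this sketch the default preset values of 1, the lower bound of 1 on the result, and leaving the quantity unchanged when the ratio equals the second threshold are all assumptions; the text only specifies the two strict comparisons.

```python
def correct_reference_count(n, ratio, second_threshold,
                            first_preset=1, second_preset=1):
    """Return the corrected reference category quantity."""
    if ratio < second_threshold:
        # Ratio below the second threshold: reduce the quantity.
        return max(n - first_preset, 1)  # lower bound of 1 is an assumption
    if ratio > second_threshold:
        # Ratio above the second threshold: increase the quantity.
        return n + second_preset
    return n  # equality is not covered by the text; left unchanged here

lower = correct_reference_count(10, ratio=0.2, second_threshold=0.5)
higher = correct_reference_count(10, ratio=0.8, second_threshold=0.5)
```

Here `lower` is 9 and `higher` is 11.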
According to a second aspect of the embodiments of the present disclosure, a short message clustering device is provided, including:
a matrix construction module, configured to construct a similarity matrix of a short message set according to the similarity between any two short messages in the short message set;
a hierarchical clustering determining module, configured to perform hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
a category quantity determining module, configured to determine category quantities, including: determining a first category quantity and a second category quantity according to the reference category quantity;
a spectral clustering module, configured to carry out a spectral clustering process, including: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the cluster quantity, to obtain short message clustering results;
a result determining module, configured to determine the short message clustering result corresponding to the reference category quantity as the target short message clustering result when the short message clustering results meet a preset condition.
Optionally, the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity.
Optionally, the spectral clustering module includes:
a feature acquisition submodule, configured to obtain the eigenvalues of the similarity matrix and the eigenvector corresponding to each eigenvalue;
a vector selection submodule, configured to select three groups of eigenvectors from the eigenvectors in ascending order, according to the reference category quantity, the first category quantity and the second category quantity respectively;
a vector space forming submodule, configured to form three eigenvector spaces from the three selected groups of eigenvectors;
a clustering submodule, configured to cluster the eigenvectors in each eigenvector space with the K-means clustering algorithm and take the resulting three groups of cluster categories as the short message clustering results.
Optionally, the device includes:
a mean value calculation module, configured to calculate, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories in that group;
a ratio calculation module, configured to calculate the ratio of the weighted averages of the three groups of cluster categories using a preset ratio formula;
a ratio judging module, configured to judge whether the ratio is greater than a first threshold;
a first determining module, configured to determine that the short message clustering results meet the preset condition when the ratio is greater than the first threshold.
Optionally, the matrix construction module includes:
a similarity calculation submodule, configured to calculate the similarity between any two short messages in the short message set using a preset similarity formula;
a matrix generation submodule, configured to generate a matrix containing all the similarities as the similarity matrix of the short message set.
Optionally, the preset similarity formula is:
Sim(A, B) = Sim_struct(A, B) × (α·Sim_t(A, B) + β·Sim_gram(A, B))
where α ∈ [0, 1] and β ∈ [0, 1];
when short message A and short message B have the same structure, Sim_struct(A, B) = 1;
when short message A and short message B do not have the same structure, Sim_struct(A, B) = 0;
Sim_t(A, B) = cos(vec(A), vec(B));
where vec(A) is the LDA topic vector of short message A, and vec(B) is the LDA topic vector of short message B;
Sim_gram(A, B) = |D(A) ∩ D(B)| / |D(A) ∪ D(B)|;
where D(A) is the set of 2-gram word pairs of short message A, and D(B) is the set of 2-gram word pairs of short message B.
Optionally, the hierarchical clustering determining module includes:
a comparison submodule, configured to compare the value of each similarity in the similarity matrix with the preset similarity threshold;
an extraction submodule, configured to extract all similarities in the similarity matrix that are greater than the preset similarity threshold;
a category determining submodule, configured to determine, as one category, short messages for which the similarity between any two of them is greater than the preset similarity threshold;
a reference category quantity determining submodule, configured to take the number of categories determined as the reference category quantity.
Optionally, the device further includes:
a correction module, configured to correct the value of the reference category quantity when the short message clustering results do not meet the preset condition;
after the correction module performs the correction, the step of determining category quantities and the spectral clustering process are executed iteratively until the short message clustering results obtained meet the preset condition.
Optionally, the correction module includes:
a ratio acquisition submodule, configured to obtain the preset ratio from the short message clustering results;
a first correction submodule, configured to subtract a first preset value from the value of the reference category quantity when the preset ratio is less than a second threshold;
a second correction submodule, configured to add a second preset value to the value of the reference category quantity when the preset ratio is greater than the second threshold.
According to a third aspect of the embodiments of the present disclosure, a terminal is provided, including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
construct a similarity matrix of a short message set according to the similarity between any two short messages in the short message set;
perform hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determine category quantities, including: determining a first category quantity and a second category quantity according to the reference category quantity;
carry out a spectral clustering process, including: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the cluster quantity, to obtain short message clustering results;
when the short message clustering results meet a preset condition, determine the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
According to a fourth aspect of the embodiments of the present disclosure, a server is provided, including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
construct a similarity matrix of a short message set according to the similarity between any two short messages in the short message set;
perform hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determine category quantities, including: determining a first category quantity and a second category quantity according to the reference category quantity;
carry out a spectral clustering process, including: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the cluster quantity, to obtain short message clustering results;
when the short message clustering results meet a preset condition, determine the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
In the method provided by the embodiments of the present disclosure, a similarity matrix of a short message set is constructed according to the similarity between any two short messages in the set; hierarchical clustering is performed on the similarity matrix using a preset similarity threshold to obtain a reference category quantity; category quantities are determined, including determining a first category quantity and a second category quantity according to the reference category quantity; a spectral clustering process is carried out, including performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the cluster quantity, to obtain short message clustering results; and, when the short message clustering results meet a preset condition, the short message clustering result corresponding to the reference category quantity is determined as the target short message clustering result.
With this method, the short messages in the short message set are clustered based on their structural similarity, which takes full account of the structural and semantic similarity between short message sentences and so improves clustering accuracy. The categories obtained by this method are in turn clearly separated, which facilitates subsequent operations on the clustered categories, such as batch marking and batch deletion.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings are incorporated into and form part of this specification. They show embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the disclosure.
To illustrate the technical solutions of the embodiments of the present disclosure or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. A person of ordinary skill in the art can obtain other drawings from these drawings without any creative effort.
Fig. 1 is a flow chart of a short message clustering method according to an exemplary embodiment;
Fig. 2 is a flow diagram of step S104 in Fig. 1;
Fig. 3 is a flow chart of another short message clustering method according to an exemplary embodiment;
Fig. 4 is a flow diagram of step S102 in Fig. 1;
Fig. 5 is a flow chart of yet another short message clustering method according to an exemplary embodiment;
Fig. 6 is a structural diagram of a short message clustering device according to an exemplary embodiment;
Fig. 7 is a structural diagram of the spectral clustering module 14 in Fig. 6;
Fig. 8 is a structural diagram of another short message clustering device according to an exemplary embodiment;
Fig. 9 is a structural diagram of the matrix construction module 11 in Fig. 6;
Fig. 10 is a structural diagram of the hierarchical clustering determining module 12 in Fig. 6;
Fig. 11 is a block diagram of a terminal 1100 according to an exemplary embodiment;
Fig. 12 is a block diagram of a server 1200 for short message clustering according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the disclosure as detailed in the appended claims.
Fig. 1 is a flow chart of a short message clustering method according to an exemplary embodiment. As shown in Fig. 1, the method may include the following steps.
In step S101, a similarity matrix of a short message set is constructed according to the similarity between any two short messages in the short message set.
Short messages have a special structure that distinguishes them from other information. When calculating the similarity between short messages, a similarity formula can therefore be preset that takes the sentence structure and semantic structure of short messages into account, and the similarity between short messages is then calculated with that formula. The similarity here is the structural similarity of the short messages.
In the embodiments of the present disclosure, this step may include the following two sub-steps.
S1: calculate the similarity between any two short messages in the short message set using a preset similarity formula.
In the embodiments of the present disclosure, the preset similarity formula is:
Sim(A, B) = Sim_struct(A, B) × (α·Sim_t(A, B) + β·Sim_gram(A, B))
where α ∈ [0, 1] and β ∈ [0, 1];
when short message A and short message B have the same structure, Sim_struct(A, B) = 1;
when short message A and short message B do not have the same structure, Sim_struct(A, B) = 0;
Sim_t(A, B) = cos(vec(A), vec(B));
where vec(A) is the LDA (Latent Dirichlet Allocation) topic vector of short message A, and vec(B) is the LDA topic vector of short message B;
Sim_gram(A, B) = |D(A) ∩ D(B)| / |D(A) ∪ D(B)|;
where D(A) is the set of 2-gram word pairs of short message A, and D(B) is the set of 2-gram word pairs of short message B.
S2: generate a matrix containing all the similarities as the similarity matrix of the short message set.
After the similarity between any two short messages in the short message set has been calculated with the similarity formula, a similarity matrix containing the similarities of all short messages in the set can be constructed. In the embodiments of the present disclosure, the similarity matrix is denoted W.
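Sub-steps S1 and S2 amount to filling a square matrix with pairwise similarities. A minimal sketch follows, assuming the similarity function is symmetric (which holds for the formula above, since structure equality, cosine and the set ratio are all symmetric); the function name `build_similarity_matrix` and the toy similarity used in the example are hypothetical.

```python
def build_similarity_matrix(messages, sim):
    """Return W as a list of lists, with W[i][j] = sim(messages[i], messages[j])."""
    n = len(messages)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = sim(messages[i], messages[i])
        for j in range(i + 1, n):
            # Symmetry lets each pair be computed once and mirrored.
            W[i][j] = W[j][i] = sim(messages[i], messages[j])
    return W

# Toy similarity for demonstration only: 1 for identical strings, else 0.
toy_sim = lambda a, b: 1.0 if a == b else 0.0
W = build_similarity_matrix(["ab", "ab", "cd"], toy_sim)
```

For the three toy messages, W has ones wherever the two strings match and zeros elsewhere.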
In step S102, hierarchical clustering is performed on the similarity matrix using a preset similarity threshold to obtain a reference category quantity.
When hierarchical clustering is performed, a minimum similarity threshold can be set and then used to perform hierarchical clustering on the similarity matrix, yielding the reference category quantity. In the embodiments of the present disclosure, the reference category quantity is denoted n.
In step S103, category quantities are determined, including: determining a first category quantity and a second category quantity according to the reference category quantity.
In the embodiments of the present disclosure, the cluster quantity, denoted k, can be set equal to the reference category quantity, that is, k = n. On the basis of k = n, two further category quantities are then calculated: the first category quantity and the second category quantity. In the embodiments of the present disclosure, the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity. For example, the first category quantity is k-1 and the second category quantity is k+1.
In step S104, spectral clustering is performed: the reference category quantity, the first category quantity and the second category quantity are each used as the number of clusters for spectral clustering of the similarity matrix, which yields the short message clustering result.
Once the reference category quantity is determined, the first k eigenvectors of the similarity matrix W can be computed to construct a feature vector space, and the feature vectors in that space are then clustered with the K-means clustering algorithm to obtain one group of cluster categories.
Feature vector spaces are likewise constructed for the first category quantity and the second category quantity and clustered with the K-means clustering algorithm, each yielding one group of cluster categories.
In the embodiments of the present disclosure, the three groups of cluster categories obtained are taken as the short message clustering result.
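The K-means step applied to each feature vector space can be sketched as follows. This is a minimal illustration of Lloyd's algorithm in plain Python, not the implementation used in the disclosure; the function name `kmeans`, the `max_iter` cap and the choice to seed the centroids with the first k points are assumptions made for the example.

```python
from math import dist


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: return (labels, centroids) for the given points."""
    # Assumption: the first k points seed the centroids (deterministic, not optimal).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In the setting described above, `points` would be the rows of a feature vector space, one row per short message, and the function would be called once for each of k-1, k and k+1.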
In step S105, when the short message clustering result meets a preset condition, the short message clustering result corresponding to the reference category quantity is determined as the target short message clustering result.
After the short message clustering result is obtained, it still has to be evaluated. The evaluation mainly compares the weighted averages corresponding to the reference category quantity, the first category quantity and the second category quantity. Only when the differences among the three weighted averages meet a given requirement is the reference category quantity considered correct, and only then is the corresponding short message clustering result taken as the final result.
The method provided by the embodiments of the present disclosure constructs a similarity matrix for a short message set according to the similarity between any two short messages in the set; performs hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity; determines a first category quantity and a second category quantity according to the reference category quantity; performs spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, obtaining a short message clustering result; and, when the short message clustering result meets a preset condition, determines the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
When the short messages in a short message set are clustered with this method, the clustering is based on the structural similarity of the short messages and takes full account of the similarity between their sentence structure and semantics, which improves clustering accuracy. The resulting categories are clearly separated, which makes it easier to apply subsequent operations, such as batch marking or batch deletion, to the categories obtained.
In another embodiment of the present disclosure, as shown in Fig. 2, step S104 in the embodiment of Fig. 1 may include the following steps.
In step S1041, the eigenvalues of the similarity matrix and the eigenvector corresponding to each eigenvalue are obtained.
In step S1042, three groups of eigenvectors are selected from the eigenvectors in ascending order, according to the reference category quantity, the first category quantity and the second category quantity respectively.
In the embodiments of the present disclosure, the reference category quantity may be denoted by k, the first category quantity by k-1 and the second category quantity by k+1.
In step S1043, the three selected groups of eigenvectors form three feature vector spaces.
In step S1044, the feature vectors in each feature vector space are clustered with the K-means clustering algorithm, and the three resulting groups of cluster categories are taken as the short message clustering result.
In this way, clustering is performed not only for the reference category quantity but also for the two adjacent category quantities. The three groups of cluster categories obtained serve as the clustering result, so that the differences among them can subsequently be evaluated to decide whether the reference category quantity determined this time is appropriate.
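Steps S1042 and S1043 can be sketched as below, assuming the eigenvalues and eigenvectors of the similarity matrix have already been computed elsewhere (for example with a numerical library). The function name `build_feature_spaces`, the `(eigenvalue, eigenvector)` input layout and the rule of skipping a cluster count that cannot be served are assumptions for illustration; ordering by ascending eigenvalue is one reading of "ascending order" in the text.

```python
def build_feature_spaces(eigenpairs, k):
    """Return {cluster_count: rows}, one feature space for each of k-1, k, k+1.

    eigenpairs is a list of (eigenvalue, eigenvector) tuples; every
    eigenvector holds one component per short message.
    """
    # Order the eigenvectors by eigenvalue, ascending.
    ordered = [vec for _, vec in sorted(eigenpairs, key=lambda pair: pair[0])]
    spaces = {}
    for count in (k - 1, k, k + 1):
        if not 1 <= count <= len(ordered):
            continue  # assumption: skip counts with too few eigenvectors
        chosen = ordered[:count]
        # Row i holds message i's coordinates along the chosen eigenvectors.
        spaces[count] = [list(row) for row in zip(*chosen)]
    return spaces
```

Each value in the returned dictionary is a list of per-message feature vectors that can be handed to a K-means routine with the matching cluster count.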
In another embodiment of the present disclosure, building on the embodiment shown in Fig. 2 and as shown in Fig. 3, the method may further include the following steps.
In step S201, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories is calculated.
In step S202, the ratio of the weighted averages of the three groups of cluster categories is calculated using a preset ratio formula.
In the embodiments of the present disclosure, the weighted average of the distances between cluster centroids is $d$ for the clustering result obtained with k, $d_{-1}$ for the clustering result obtained with k-1, and $d_{+1}$ for the clustering result obtained with k+1.
From these three weighted averages, the preset ratio formula may be $a = (d_{-1} - d)/(d - d_{+1})$, where a is the ratio of the weighted averages.
In step S203, it is judged whether the ratio is greater than a first threshold.
When the ratio is greater than the first threshold, step S204 is executed; when the ratio is less than or equal to the first threshold, step S205 is executed.
In step S204, it is determined that the short message clustering result meets the preset condition.
In the embodiments of the present disclosure, the first threshold is usually relatively large. A ratio greater than the first threshold indicates that the gap between the reference category quantity and the adjacent first and second category quantities meets the requirement, that is, the classification is distinct.
In step S205, it is determined that the short message clustering result does not meet the preset condition.
Conversely, when the ratio does not exceed the first threshold, the classification is not distinct and must be redone.
With the method provided by the embodiments of the present disclosure, after clustering is performed for the reference category quantity and for the two adjacent category quantities, the relationship among the weighted averages indicates whether the clustering is accurate. If it is not, the operation can be iterated until the best classification is obtained.
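Steps S201 to S205 can be sketched as follows. The source does not state which weights enter the weighted average, so the sketch assumes that each pair of clusters is weighted by the product of the two cluster sizes; that weighting and all function names are hypothetical. The ratio follows the preset ratio formula quoted above.

```python
from itertools import combinations
from math import dist


def weighted_mean_centroid_distance(centroids, sizes):
    """Weighted average of the distances between all pairs of centroids."""
    # Assumption: a pair's weight is the product of the two cluster sizes;
    # the source does not specify the weights.
    total = 0.0
    weight_sum = 0.0
    for i, j in combinations(range(len(centroids)), 2):
        weight = sizes[i] * sizes[j]
        total += weight * dist(centroids[i], centroids[j])
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def separation_ratio(d_minus, d, d_plus):
    """a = (d_{-1} - d) / (d - d_{+1}) for the k-1, k and k+1 clusterings."""
    return (d_minus - d) / (d - d_plus)


def meets_preset_condition(ratio, first_threshold):
    """S203: the result is accepted only when the ratio exceeds the threshold."""
    return ratio > first_threshold
```

A caller would compute the weighted average once per group of cluster categories, pass the three values to `separation_ratio`, and handle the case where `d` equals `d_plus`, which this sketch leaves to raise `ZeroDivisionError`.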
In another embodiment of the present disclosure, as shown in Fig. 4, step S102 may include the following steps.
In step S1021, the value of each similarity in the similarity matrix is compared with the preset similarity threshold.
The preset similarity threshold can be set in advance by those skilled in the art; in the embodiments of the present disclosure it may be denoted by λ.
In step S1022, all similarities in the similarity matrix that are greater than the preset similarity threshold are extracted.
In step S1023, short messages whose pairwise similarities are all greater than the preset similarity threshold are determined to be one category.
In step S1024, the number of categories so determined is taken as the reference category quantity.
Spectral clustering is built on spectral graph theory. Compared with traditional clustering algorithms, it can cluster on sample spaces of arbitrary shape and converges to the globally optimal solution. However, spectral clustering can only run once the number of classes is given, and that number usually cannot be known in advance; an approximate number of classes is therefore obtained using the preset similarity threshold.
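One way to read steps S1021 to S1024 is a single greedy pass in which a short message joins an existing category only if its similarity to every member exceeds λ. The sketch below takes that reading; the first-fit strategy is an assumption (its outcome can depend on message order), and the function name is hypothetical.

```python
def reference_category_count(sim, threshold):
    """Estimate the reference category quantity n from similarity matrix sim.

    Returns (n, groups), where groups lists the message indices per category.
    """
    groups = []
    for i in range(len(sim)):
        for group in groups:
            # Join only if message i exceeds the threshold with every member.
            if all(sim[i][j] > threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])  # no compatible category: start a new one
    return len(groups), groups
```

For a 4 x 4 matrix in which messages 0 and 1 are similar to each other and messages 2 and 3 are similar to each other, a threshold of 0.5 would give n = 2, which then seeds the spectral clustering as k.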
In another embodiment of the present disclosure, as shown in Fig. 5, the method may further include the following steps.
In step S106, it is judged whether the short message clustering result meets the preset condition.
When the short message clustering result meets the preset condition, step S105 is executed; when the short message clustering result does not meet the preset condition, step S107 is executed.
In step S107, the value of the reference category quantity is corrected, and the determination of category quantities and the spectral clustering are executed iteratively until the short message clustering result obtained meets the preset condition.
Because spectral clustering needs the number of classes as input and that number usually cannot be known in advance, the preset similarity threshold supplies only an approximate number of classes. That approximate number is subsequently corrected, and the number of clusters is learned iteratively until the clustering effect finally stabilizes.
In another embodiment of the present disclosure, step S107 may include the following steps.
S1: the preset ratio in the short message clustering result is obtained.
Here the preset ratio is the ratio calculated in the embodiment of Fig. 3.
S2: when the preset ratio is less than a second threshold, the first preset value is subtracted from the value of the reference category quantity.
S3: when the preset ratio is greater than the second threshold, the second preset value is added to the value of the reference category quantity.
The first preset value and the second preset value may be the same or different. When the reference category quantity does not meet the requirement, it can be fine-tuned; in the embodiments of the present disclosure, the first preset value and the second preset value may therefore both be set to 1. In other embodiments, those skilled in the art may apply other adjustment amounts to the reference category quantity; for example, the first preset value and the second preset value may both be set to 2 or 3.
In the embodiments of the present disclosure, the second threshold is less than the first threshold. That is, once the ratio is found to be no greater than the first threshold, it is further compared with the second threshold, and the result of that comparison determines whether the reference category quantity is increased or decreased.
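The correction rule and the surrounding iteration can be sketched as follows. The callable `ratio_for`, the `max_rounds` cap, the lower bound of 2 on k and the behaviour when the ratio equals the second threshold are all assumptions added for the example; the source specifies only the two comparisons in S2 and S3.

```python
def correct_reference_count(k, ratio, second_threshold, step_down=1, step_up=1):
    """Apply S2/S3: decrease k below the second threshold, increase it above."""
    if ratio < second_threshold:
        return max(k - step_down, 2)  # assumption: keep k >= 2 so k-1 is valid
    if ratio > second_threshold:
        return k + step_up
    return k  # the source does not cover ratio == second_threshold


def search_cluster_count(k, ratio_for, first_threshold, second_threshold,
                         max_rounds=20):
    """Iterate until the ratio obtained for k exceeds the first threshold.

    ratio_for is a hypothetical callable that runs the three spectral
    clusterings for k-1, k and k+1 and returns the resulting ratio.
    """
    for _ in range(max_rounds):
        ratio = ratio_for(k)
        if ratio > first_threshold:
            return k  # preset condition met
        new_k = correct_reference_count(k, ratio, second_threshold)
        if new_k == k:
            break  # no further correction is possible
        k = new_k
    return None  # no k met the condition within max_rounds
```

With both preset values left at 1, each round moves k by one step, which matches the fine-tuning described above.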
Fig. 6 is a schematic structural diagram of a short message clustering device according to an exemplary embodiment. As shown in Fig. 6, the device may include a matrix construction module 11, a hierarchical clustering determination module 12, a category quantity determination module 13, a spectral clustering module 14 and a result determination module 15, wherein:
The matrix construction module 11 is configured to construct the similarity matrix of a short message set according to the similarity between any two short messages in the set;
The hierarchical clustering determination module 12 is configured to perform hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
The category quantity determination module 13 is configured to determine category quantities, including determining a first category quantity and a second category quantity according to the reference category quantity;
The spectral clustering module 14 is configured to perform spectral clustering, including performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain a short message clustering result;
The result determination module 15 is configured to determine, when the short message clustering result meets a preset condition, the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
In the embodiments of the present disclosure, optionally, the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is smaller than the second category quantity.
The device provided by the embodiments of the present disclosure constructs a similarity matrix for a short message set according to the similarity between any two short messages in the set; performs hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity; determines a first category quantity and a second category quantity according to the reference category quantity; performs spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, obtaining a short message clustering result; and, when the short message clustering result meets a preset condition, determines the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
When the short messages in a short message set are clustered with this device, the clustering is based on the structural similarity of the short messages and takes full account of the similarity between their sentence structure and semantics, which improves clustering accuracy. The resulting categories are clearly separated, which makes it easier to apply subsequent operations, such as batch marking or batch deletion, to the categories obtained.
In another embodiment of the present disclosure, as shown in Fig. 7, the spectral clustering module 14 may include a feature acquisition submodule 141, a vector selection submodule 142, a vector space composition submodule 143 and a clustering submodule 144, wherein:
The feature acquisition submodule 141 is configured to obtain the eigenvalues of the similarity matrix and the eigenvector corresponding to each eigenvalue;
The vector selection submodule 142 is configured to select three groups of eigenvectors from the eigenvectors in ascending order, according to the reference category quantity, the first category quantity and the second category quantity respectively;
The vector space composition submodule 143 is configured to form three feature vector spaces from the three selected groups of eigenvectors;
The clustering submodule 144 is configured to cluster the feature vectors in each feature vector space with the K-means clustering algorithm, obtaining three groups of cluster categories as the short message clustering result.
In another embodiment of the present disclosure, building on the embodiment shown in Fig. 7 and as shown in Fig. 8, the device includes a mean value calculation module 21, a ratio calculation module 22, a ratio judgment module 23, a first determination module 24 and a second determination module 25, wherein:
The mean value calculation module 21 is configured to calculate, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories;
The ratio calculation module 22 is configured to calculate the ratio of the weighted averages of the three groups of cluster categories using a preset ratio formula;
The ratio judgment module 23 is configured to judge whether the ratio is greater than a first threshold;
The first determination module 24 is configured to determine, when the ratio is greater than the first threshold, that the short message clustering result meets the preset condition;
The second determination module 25 is configured to determine, when the ratio is less than or equal to the first threshold, that the short message clustering result does not meet the preset condition.
In another embodiment of the present disclosure, as shown in Fig. 9, the matrix construction module 11 may include a similarity calculation submodule 111 and a matrix generation submodule 112, wherein:
The similarity calculation submodule 111 is configured to calculate the similarity between any two short messages in the short message set using a preset similarity formula;
The matrix generation submodule 112 is configured to generate a matrix containing all the similarities as the similarity matrix of the short message set.
In another embodiment of the present disclosure, as shown in Fig. 10, the hierarchical clustering determination module 12 includes a comparison submodule 121, an extraction submodule 122, a category determination submodule 123 and a reference category quantity determination submodule 124, wherein:
The comparison submodule 121 is configured to compare the value of each similarity in the similarity matrix with the preset similarity threshold;
The extraction submodule 122 is configured to extract all similarities in the similarity matrix that are greater than the preset similarity threshold;
The category determination submodule 123 is configured to determine short messages whose pairwise similarities are all greater than the preset similarity threshold to be one category;
The reference category quantity determination submodule 124 is configured to take the number of categories so determined as the reference category quantity.
In another embodiment of the present disclosure, the device further includes a correction module, wherein:
The correction module is configured to correct the value of the reference category quantity when the short message clustering result does not meet the preset condition;
After the correction by the correction module, the determination of category quantities and the spectral clustering are executed iteratively until the short message clustering result obtained meets the preset condition.
In another embodiment of the present disclosure, the correction module may include:
A ratio acquisition submodule, configured to obtain the preset ratio in the short message clustering result;
A first correction submodule, configured to subtract the first preset value from the value of the reference category quantity when the preset ratio is less than a second threshold;
A second correction submodule, configured to add the second preset value to the value of the reference category quantity when the preset ratio is greater than the second threshold;
The second threshold is less than the first threshold.
Fig. 11 is a block diagram of a terminal 1100 according to an exemplary embodiment. For example, the terminal 1100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 11, the terminal 1100 may include one or more of the following components: a processing component 1102, a memory 1104, a power supply component 1106, a multimedia component 1108, an audio component 1110, an input/output (I/O) interface 1112, a sensor component 1114 and a communication component 1116.
The processing component 1102 generally controls the overall operation of the terminal 1100, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 1102 may include one or more processors 1120 to execute instructions so as to perform all or part of the steps of the method described above. In addition, the processing component 1102 may include one or more modules that facilitate interaction between the processing component 1102 and other components. For example, the processing component 1102 may include a multimedia module to facilitate interaction between the multimedia component 1108 and the processing component 1102.
The memory 1104 is configured to store various types of data to support operation of the terminal 1100. Examples of such data include instructions for any application or method operating on the terminal 1100, contact data, phone book data, messages, pictures, videos and so on. The memory 1104 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.
The power supply component 1106 supplies power to the various components of the terminal 1100. The power supply component 1106 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the terminal 1100.
The multimedia component 1108 includes a screen that provides an output interface between the terminal 1100 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1108 includes a front camera and/or a rear camera. When the terminal 1100 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 1110 is configured to output and/or input audio signals. For example, the audio component 1110 includes a microphone (MIC), which is configured to receive external audio signals when the terminal 1100 is in an operating mode, such as a call mode, a recording mode or a voice recognition mode. The received audio signal may be further stored in the memory 1104 or sent via the communication component 1116. In some embodiments, the audio component 1110 further includes a loudspeaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing component 1102 and peripheral interface modules, which may be a keyboard, a click wheel, buttons and the like. The buttons may include, but are not limited to, a home button, a volume button, a start button and a lock button.
The sensor component 1114 includes one or more sensors that provide status assessments of various aspects of the terminal 1100. For example, the sensor component 1114 can detect the open/closed state of the terminal 1100 and the relative positioning of components, such as the display and keypad of the terminal 1100. The sensor component 1114 can also detect a change in position of the terminal 1100 or of a component of the terminal 1100, the presence or absence of user contact with the terminal 1100, the orientation or acceleration/deceleration of the terminal 1100, and changes in the temperature of the terminal 1100. The sensor component 1114 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 1114 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1114 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 1116 is configured to facilitate wired or wireless communication between the terminal 1100 and other devices. The terminal 1100 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 1116 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 1116 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the terminal 1100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the method described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 1104 including instructions, where the instructions can be executed by the processor 1120 of the terminal 1100 to perform the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device or the like.
A non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by the processor of a terminal, the terminal is enabled to perform a short message clustering method, the method including:
constructing the similarity matrix of a short message set according to the similarity between any two short messages in the set;
performing hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determining category quantities, including determining a first category quantity and a second category quantity according to the reference category quantity;
performing spectral clustering, including performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain a short message clustering result;
when the short message clustering result meets a preset condition, determining the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
Fig. 12 is a block diagram of a server 1200 for short message clustering according to an exemplary embodiment. For example, the device 1200 may be provided as a server. Referring to Fig. 12, the device 1200 includes a processing component 1222, which further includes one or more processors, and memory resources represented by a memory 1232 for storing instructions executable by the processing component 1222, such as application programs. The application programs stored in the memory 1232 may include one or more modules, each corresponding to a set of instructions.
The device 1200 may also include a power supply component 1226 configured to perform power management of the device 1200, a wired or wireless network interface 1250 configured to connect the device 1200 to a network, and an input/output (I/O) interface 1258. The device 1200 can operate based on an operating system stored in the memory 1232, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
A non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by the processor of a server, the server is enabled to perform a short message clustering method, the method including:
constructing the similarity matrix of a short message set according to the similarity between any two short messages in the set;
performing hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determining category quantities, including determining a first category quantity and a second category quantity according to the reference category quantity;
performing spectral clustering, including performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain a short message clustering result;
when the short message clustering result meets a preset condition, determining the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
Claims (20)
1. A short message clustering method, characterized by comprising:
constructing the similarity matrix of a short message set according to the structural similarity between any two short messages in the set and the similarity between the sentence structure and semantics of the short messages;
performing hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determining category quantities, comprising: determining a first category quantity and a second category quantity according to the reference category quantity;
performing spectral clustering, comprising: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain a short message clustering result;
when the short message clustering result meets a preset condition, determining the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
2. The method according to claim 1, characterized in that the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is smaller than the second category quantity.
3. The method according to claim 1, characterized in that performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain the clustering result, comprises:
obtaining the eigenvalues of the similarity matrix and the eigenvector corresponding to each eigenvalue;
selecting three groups of eigenvectors from the eigenvectors in ascending order, according to the reference category quantity, the first category quantity and the second category quantity respectively;
forming three feature vector spaces from the three selected groups of eigenvectors;
clustering the feature vectors in each feature vector space with the K-means clustering algorithm to obtain three groups of cluster categories as the short message clustering result.
4. The method according to claim 3, wherein the method comprises:
calculating, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories in that group;
calculating a ratio of the weighted averages of the three groups of cluster categories using a preset ratio formula;
judging whether the ratio is greater than a first threshold;
when the ratio is greater than the first threshold, determining that the short message clustering results meet the preset condition.
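The check in claim 4 might look like the sketch below. The claims do not state the weights or the preset ratio formula, so both are assumptions here: pairs of centroids are weighted by the product of the two cluster sizes, and the ratio formula is passed in by the caller. `example_ratio` is a hypothetical placeholder used only to exercise the code, not the formula of the patent.

```python
from itertools import combinations
from math import dist

def weighted_mean_centroid_distance(centroids, sizes):
    """Weighted average of the distances between every pair of cluster centroids.

    Assumption: each pair is weighted by the product of the two cluster sizes.
    """
    total, weight_sum = 0.0, 0.0
    for i, j in combinations(range(len(centroids)), 2):
        weight = sizes[i] * sizes[j]
        total += weight * dist(centroids[i], centroids[j])
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

def meets_preset_condition(averages, ratio_formula, first_threshold):
    """averages holds the weighted averages for the first, reference and second quantities."""
    return ratio_formula(*averages) > first_threshold

def example_ratio(first_avg, reference_avg, second_avg):
    """Hypothetical ratio formula, for illustration only."""
    return reference_avg / max(first_avg, second_avg)
```

With this placeholder, `meets_preset_condition((4.0, 6.0, 5.0), example_ratio, 1.1)` compares 6.0 / 5.0 against the first threshold.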
5. The method according to claim 1, wherein constructing the similarity matrix of the short message set according to the similarity between any two short messages in the short message set comprises:
calculating the similarity between any two short messages in the short message set using a preset similarity formula;
generating a matrix containing all of the similarities as the similarity matrix of the short message set.
6. The method according to claim 5, wherein the preset similarity formula is:
$$\mathrm{Sim}(A,B) = \mathrm{Sim}_{\mathrm{struct}}(A,B) \times \left(\alpha\,\mathrm{Sim}_{t}(A,B) + \beta\,\mathrm{Sim}_{\mathrm{gram}}(A,B)\right)$$
where $\alpha \in [0,1]$ and $\beta \in [0,1]$;
when short message A and short message B have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 1$;
when short message A and short message B have different structures, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 0$;
$$\mathrm{Sim}_{t}(A,B) = \cos(\mathrm{vec}(A), \mathrm{vec}(B))$$
where $\mathrm{vec}(A)$ is the latent Dirichlet allocation (LDA) topic vector of short message A, and $\mathrm{vec}(B)$ is the LDA topic vector of short message B;
$$\mathrm{Sim}_{\mathrm{gram}}(A,B) = \frac{|D(A) \cap D(B)|}{|D(A) \cup D(B)|}$$
where $D(A)$ is the set of 2-gram word pairs of short message A, and $D(B)$ is the set of 2-gram word pairs of short message B.
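The formula of claim 6 can be evaluated as in the sketch below. It assumes the LDA topic vectors and the structure comparison are produced elsewhere and passed in, and that 2-gram word pairs are adjacent token pairs. The function names and the default values of `alpha` and `beta` are illustrative assumptions, not values taken from the patent.

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bigrams(tokens):
    """Set of adjacent word pairs (2-grams) in a token list."""
    return set(zip(tokens, tokens[1:]))

def jaccard(x, y):
    """|x ∩ y| / |x ∪ y|, defined as 0.0 when both sets are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def similarity(same_structure, topics_a, topics_b, tokens_a, tokens_b,
               alpha=0.5, beta=0.5):
    """Sim(A, B) = Sim_struct * (alpha * Sim_t + beta * Sim_gram)."""
    if not same_structure:          # Sim_struct is 0, so the whole product is 0
        return 0.0
    sim_t = cosine(topics_a, topics_b)
    sim_gram = jaccard(bigrams(tokens_a), bigrams(tokens_b))
    return alpha * sim_t + beta * sim_gram
```

Because the structural term is a 0/1 gate, two messages with different structures always score 0 regardless of their topic or 2-gram overlap.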
7. The method according to claim 1, wherein performing hierarchical clustering on the similarity matrix using the preset similarity threshold, to obtain the reference category quantity, comprises:
comparing the value of each similarity in the similarity matrix with the preset similarity threshold;
extracting all similarities in the similarity matrix that are greater than the preset similarity threshold;
determining as one category the short messages whose pairwise similarities are all greater than the preset similarity threshold;
taking the number of categories so determined as the reference category quantity.
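One way to read claim 7 is sketched below. The claim does not say how categories are assembled, so the greedy pass used here is an assumption: each short message joins the first existing category in which its similarity to every member exceeds the threshold, and otherwise starts a new category. The function name is hypothetical.

```python
def reference_category_count(similarity, threshold):
    """Group messages whose pairwise similarities all exceed the threshold.

    Returns (number_of_categories, categories), where each category is a list
    of message indices. Greedy assignment is an illustrative assumption.
    """
    groups = []
    for i in range(len(similarity)):
        for group in groups:
            # Join only if message i is similar enough to every current member.
            if all(similarity[i][j] > threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])      # no suitable category: start a new one
    return len(groups), groups
```

The returned count is the reference category quantity that the later spectral clustering step starts from.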
8. The method according to any one of claims 1 to 7, wherein the method further comprises:
when the short message clustering results do not meet the preset condition, correcting the value of the reference category quantity, and iteratively executing the determining of category quantities and the spectral clustering process until the obtained short message clustering results meet the preset condition.
9. The method according to claim 8, wherein correcting the value of the reference category quantity comprises:
obtaining the preset ratio from the short message clustering results;
when the preset ratio is less than a second threshold, subtracting a first preset value from the value of the reference category quantity;
when the preset ratio is greater than the second threshold, adding a second preset value to the value of the reference category quantity.
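The correction rule of claim 9 and the iteration of claim 8 can be sketched together as below. The callables `run_spectral` and `evaluate`, the default preset values of 1, the use of k-1 and k+1 as the adjacent quantities, and the `max_rounds` safety limit are all hypothetical; the claims leave the case where the ratio equals the second threshold open, and this sketch leaves the quantity unchanged there.

```python
def correct_reference_count(reference_k, ratio, second_threshold,
                            first_preset=1, second_preset=1):
    """Apply the claim 9 correction to the reference category quantity."""
    if ratio < second_threshold:
        return reference_k - first_preset
    if ratio > second_threshold:
        return reference_k + second_preset
    return reference_k              # equality is not covered by the claim

def find_target_clustering(reference_k, run_spectral, evaluate,
                           second_threshold, max_rounds=20):
    """Iterate until the clustering results meet the preset condition.

    run_spectral(first_k, reference_k, second_k) -> dict keyed by cluster count
    evaluate(results) -> (condition_met, ratio)
    Both callables are placeholders supplied by the caller.
    """
    for _ in range(max_rounds):
        results = run_spectral(reference_k - 1, reference_k, reference_k + 1)
        condition_met, ratio = evaluate(results)
        if condition_met:
            return reference_k, results[reference_k]
        reference_k = correct_reference_count(reference_k, ratio, second_threshold)
    return None                     # no acceptable clustering within max_rounds
```

The loop returns the reference category quantity together with its clustering result, which corresponds to the target short message clustering result of claim 1.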
10. A short message clustering device, comprising:
a matrix construction module, configured to construct a similarity matrix of a short message set according to the structural similarity between any two short messages in the short message set and the similarity of the short message sentences in structure and semantics;
a hierarchical clustering determination module, configured to perform hierarchical clustering on the similarity matrix using a preset similarity threshold, to obtain a reference category quantity;
a category quantity determination module, configured to determine category quantities, comprising: determining a first category quantity and a second category quantity according to the reference category quantity;
a spectral clustering module, configured to perform a spectral clustering process, comprising: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain short message clustering results;
a result determination module, configured to determine, when the short message clustering results meet a preset condition, the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
11. The device according to claim 10, wherein the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity.
12. The device according to claim 10, wherein the spectral clustering module comprises:
a feature acquisition submodule, configured to obtain the eigenvalues of the similarity matrix and the eigenvector corresponding to each eigenvalue;
a vector selection submodule, configured to select three groups of eigenvectors from the eigenvectors, in ascending order, according to the reference category quantity, the first category quantity and the second category quantity respectively;
a vector space forming submodule, configured to form three eigenvector spaces from the three selected groups of eigenvectors respectively;
a clustering submodule, configured to cluster the eigenvectors in each eigenvector space with the K-means clustering algorithm, to obtain three groups of cluster categories as the short message clustering results.
13. The device according to claim 12, wherein the device comprises:
a mean value calculation module, configured to calculate, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories in that group;
a ratio calculation module, configured to calculate a ratio of the weighted averages of the three groups of cluster categories using a preset ratio formula;
a ratio judgment module, configured to judge whether the ratio is greater than a first threshold;
a first determination module, configured to determine, when the ratio is greater than the first threshold, that the short message clustering results meet the preset condition.
14. The device according to claim 10, wherein the matrix construction module comprises:
a similarity calculation submodule, configured to calculate the similarity between any two short messages in the short message set using a preset similarity formula;
a matrix generation submodule, configured to generate a matrix containing all of the similarities as the similarity matrix of the short message set.
15. The device according to claim 14, wherein the preset similarity formula is:
$$\mathrm{Sim}(A,B) = \mathrm{Sim}_{\mathrm{struct}}(A,B) \times \left(\alpha\,\mathrm{Sim}_{t}(A,B) + \beta\,\mathrm{Sim}_{\mathrm{gram}}(A,B)\right)$$
where $\alpha \in [0,1]$ and $\beta \in [0,1]$;
when short message A and short message B have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 1$;
when short message A and short message B have different structures, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 0$;
$$\mathrm{Sim}_{t}(A,B) = \cos(\mathrm{vec}(A), \mathrm{vec}(B))$$
where $\mathrm{vec}(A)$ is the LDA topic vector of short message A, and $\mathrm{vec}(B)$ is the LDA topic vector of short message B;
$$\mathrm{Sim}_{\mathrm{gram}}(A,B) = \frac{|D(A) \cap D(B)|}{|D(A) \cup D(B)|}$$
where $D(A)$ is the set of 2-gram word pairs of short message A, and $D(B)$ is the set of 2-gram word pairs of short message B.
16. The device according to claim 10, wherein the hierarchical clustering determination module comprises:
a comparison submodule, configured to compare the value of each similarity in the similarity matrix with the preset similarity threshold;
an extraction submodule, configured to extract all similarities in the similarity matrix that are greater than the preset similarity threshold;
a category determination submodule, configured to determine as one category the short messages whose pairwise similarities are all greater than the preset similarity threshold;
a reference category quantity determination submodule, configured to take the number of categories so determined as the reference category quantity.
17. The device according to any one of claims 10 to 16, wherein the device further comprises:
a correction module, configured to correct the value of the reference category quantity when the short message clustering results do not meet the preset condition;
after the correction by the correction module, the determining of category quantities and the spectral clustering process are executed iteratively until the obtained short message clustering results meet the preset condition.
18. The device according to claim 17, wherein the correction module comprises:
a ratio acquisition submodule, configured to obtain the preset ratio from the short message clustering results;
a first correction submodule, configured to subtract a first preset value from the value of the reference category quantity when the preset ratio is less than a second threshold;
a second correction submodule, configured to add a second preset value to the value of the reference category quantity when the preset ratio is greater than the second threshold.
19. A terminal, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
construct a similarity matrix of a short message set according to the structural similarity between any two short messages in the short message set and the similarity of the short message sentences in structure and semantics;
perform hierarchical clustering on the similarity matrix using a preset similarity threshold, to obtain a reference category quantity;
determine category quantities, comprising: determining a first category quantity and a second category quantity according to the reference category quantity;
perform a spectral clustering process, comprising: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain short message clustering results;
when the short message clustering results meet a preset condition, determine the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
20. A server, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
construct a similarity matrix of a short message set according to the structural similarity between any two short messages in the short message set and the similarity of the short message sentences in structure and semantics;
perform hierarchical clustering on the similarity matrix using a preset similarity threshold, to obtain a reference category quantity;
determine category quantities, comprising: determining a first category quantity and a second category quantity according to the reference category quantity;
perform a spectral clustering process, comprising: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain short message clustering results;
when the short message clustering results meet a preset condition, determine the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610191407.6A CN105824955B (en) | 2016-03-30 | 2016-03-30 | Short message clustering method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610191407.6A CN105824955B (en) | 2016-03-30 | 2016-03-30 | Short message clustering method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105824955A (en) | 2016-08-03 |
CN105824955B (en) | 2019-02-19 |
Family
ID=56525338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610191407.6A Active CN105824955B (en) | 2016-03-30 | 2016-03-30 | Short message clustering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105824955B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106815310B (en) * | 2016-12-20 | 2020-04-21 | 华南师范大学 | Hierarchical clustering method and system for massive document sets |
CN109800775B (en) * | 2017-11-17 | 2022-10-28 | 腾讯科技(深圳)有限公司 | File clustering method, device, equipment and readable medium |
CN108959440A (en) * | 2018-06-13 | 2018-12-07 | 福建新大陆软件工程有限公司 | A kind of short message clustering method and device |
CN112148942B (en) * | 2019-06-27 | 2024-04-09 | 北京达佳互联信息技术有限公司 | Business index data classification method and device based on data clustering |
CN110730270B (en) * | 2019-09-09 | 2021-09-14 | 上海斑马来拉物流科技有限公司 | Short message grouping method and device, computer storage medium and electronic equipment |
CN111507400B (en) * | 2020-04-16 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Application classification method, device, electronic equipment and storage medium |
CN117880765B (en) * | 2024-03-13 | 2024-05-28 | 深圳市诚立业科技发展有限公司 | Intelligent management system for short message data |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101860822A (en) * | 2010-06-11 | 2010-10-13 | 中兴通讯股份有限公司 | Method and system for monitoring spam messages |
CN104112026B (en) * | 2014-08-01 | 2017-09-08 | 中国联合网络通信集团有限公司 | A kind of short message text sorting technique and system |
CN104699668B (en) * | 2015-03-26 | 2017-09-26 | 小米科技有限责任公司 | Determine the method and device of Words similarity |
CN104778256B (en) * | 2015-04-20 | 2017-10-17 | 江苏科技大学 | A kind of the quick of field question answering system consulting can increment clustering method |
Also Published As
Publication number | Publication date |
---|---|
CN105824955A (en) | 2016-08-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105824955B (en) | Short message clustering method and device | |
WO2020232977A1 (en) | Neural network training method and apparatus, and image processing method and apparatus | |
CN104408402B (en) | Face identification method and device | |
US11455491B2 (en) | Method and device for training image recognition model, and storage medium | |
CN108227950B (en) | Input method and device | |
CN105404863B (en) | Character features recognition methods and system | |
CN111259967B (en) | Image classification and neural network training method, device, equipment and storage medium | |
CN108073303B (en) | Input method and device and electronic equipment | |
JP6051336B2 (en) | Clustering method, clustering device, terminal device, program, and recording medium | |
JP2016516251A (en) | Clustering method, clustering device, terminal device, program, and recording medium | |
JP2017513075A (en) | Method and apparatus for generating an image filter | |
CN105677731B (en) | Show method, apparatus, terminal and the server of preview picture figure | |
CN108038102A (en) | Recommendation method, apparatus, terminal and the storage medium of facial expression image | |
CN105100193B (en) | Cloud business card recommended method and device | |
WO2020192113A1 (en) | Image processing method and apparatus, electronic device, and storage medium | |
CN105678266A (en) | Method and device for combining photo albums of human faces | |
TW202036462A (en) | Method, apparatus and electronic device for image generating and storage medium thereof | |
JP2016517110A5 (en) | ||
CN104573642B (en) | Face identification method and device | |
CN112926310A (en) | Keyword extraction method and device | |
CN111797746A (en) | Face recognition method and device and computer readable storage medium | |
CN112559852A (en) | Information recommendation method and device | |
CN105786350B (en) | Choose reminding method, device and the terminal of image | |
CN106534965A (en) | Method and device for obtaining video information | |
US20150262033A1 (en) | Method and terminal device for clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||