CN105824955B - Short message clustering method and device - Google Patents

Publication number
CN105824955B
Authority
CN
China
Prior art keywords
short message
category quantity
similarity
cluster result
default
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610191407.6A
Other languages
Chinese (zh)
Other versions
CN105824955A (en
Inventor
汪平仄
张涛
陈志军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd
Priority to CN201610191407.6A
Publication of CN105824955A
Application granted
Publication of CN105824955B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72436 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for text messaging, e.g. short messaging services [SMS] or e-mails

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Human Computer Interaction (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a short message clustering method and device. The method includes: constructing a similarity matrix of a short message set according to the similarity between any two short messages in the set; performing hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity; determining category quantities, including determining a first category quantity and a second category quantity according to the reference category quantity; performing a spectral clustering process, including performing spectral clustering on the similarity matrix using the reference category quantity, the first category quantity, and the second category quantity each as the number of clusters, to obtain short message clustering results; and, when the short message clustering results satisfy a preset condition, determining the short message clustering result corresponding to the reference category quantity as the target short message clustering result. Because the method clusters the short messages in the set according to their structural similarity, it takes full account of both the structural and the semantic similarity between message sentences, which improves the accuracy of the clustering.

Description

Short message clustering method and device
Technical field
The present disclosure relates to the technical field of data classification, and in particular to a short message clustering method and device.
Background
Short messages sent by ordinary users involve a good deal of user privacy and have complex, varied sentence structures, so they are rarely the subject of text mining. Notification-type short messages have a more rigid structure and are usually an important target of text mining.
Text clustering methods include k-means, hierarchical clustering, and others. These methods, however, have difficulty accounting for the structural and semantic similarity between sentences during clustering, so the clustering results they produce have low accuracy and the methods are of limited use.
Summary of the invention
To overcome the problems in the related art, the present disclosure provides a short message clustering method and device.
According to a first aspect of the embodiments of the present disclosure, a short message clustering method is provided, including:
constructing a similarity matrix of a short message set according to the similarity between any two short messages in the set;
performing hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determining category quantities, including determining a first category quantity and a second category quantity according to the reference category quantity;
performing a spectral clustering process, including performing spectral clustering on the similarity matrix using the reference category quantity, the first category quantity, and the second category quantity each as the number of clusters, to obtain short message clustering results;
when the short message clustering results satisfy a preset condition, determining the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
With this method, the short messages in a short message set are clustered according to their structural similarity, which takes full account of both the structural and the semantic similarity between message sentences and so improves clustering accuracy. The categories obtained are clearly separated, which facilitates subsequent operations on the clustered categories, such as batch marking and batch deletion.
Optionally, the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity.
Optionally, performing spectral clustering on the similarity matrix using the reference category quantity, the first category quantity, and the second category quantity each as the number of clusters, to obtain the clustering results, includes:
obtaining the eigenvalues of the similarity matrix and the eigenvector corresponding to each eigenvalue;
selecting three groups of eigenvectors from the eigenvectors, in ascending order, according to the reference category quantity, the first category quantity, and the second category quantity respectively;
forming three eigenvector spaces from the three selected groups of eigenvectors;
clustering the eigenvectors in each eigenvector space with the K-means clustering algorithm, to obtain three groups of clusters as the short message clustering results.
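The eigenvector selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_embeddings` is hypothetical, the eigenpairs are assumed to have been computed elsewhere, and "ascending" is read here as ascending eigenvalue order, which the text does not state explicitly. Many spectral clustering variants apply this step to a graph Laplacian derived from the similarity matrix rather than to the matrix itself.

```python
def select_embeddings(eigvals, eigvecs, k):
    """Build one embedding per candidate cluster count (k-1, k, k+1).

    eigvals: list of n eigenvalues.
    eigvecs: list of n eigenvectors; eigvecs[i] belongs to eigvals[i].
    Returns {count: rows}, where rows[j] holds message j's coordinates
    in the `count` eigenvectors with the smallest eigenvalues.
    """
    n = len(eigvals)
    # Assumption: rank eigenpairs by ascending eigenvalue.
    order = sorted(range(n), key=lambda i: eigvals[i])
    spaces = {}
    for count in (k - 1, k, k + 1):
        if not 1 <= count <= n:
            continue  # skip candidate counts that are out of range
        chosen = [eigvecs[i] for i in order[:count]]
        # Transpose so that each message becomes one point in the space.
        spaces[count] = [[vec[j] for vec in chosen] for j in range(n)]
    return spaces
```

Each of the returned embeddings would then be passed to K-means with the matching number of clusters.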
In this way, clustering is performed not only with the reference number of clusters but also with the two numbers adjacent to it, yielding three groups of clusters as the clustering results, so that the differences among the three groups can subsequently be evaluated to determine whether the reference number of clusters is appropriate.
Optionally, the method includes:
calculating, for each group of clusters, the weighted average of the distances between the cluster centroids of every pair of clusters in the group;
calculating the ratio of the weighted averages of the three groups of clusters using a preset ratio formula;
judging whether the ratio is greater than a first threshold;
when the ratio is greater than the first threshold, determining that the short message clustering results satisfy the preset condition.
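The check above can be sketched as follows, under stated assumptions. The source specifies neither the weights of the weighted average nor the preset ratio formula, so this sketch weights each pair of centroids by the product of the two cluster sizes (an assumption) and takes the ratio formula as a caller-supplied function. The names `weighted_mean_centroid_distance` and `meets_preset_condition` are hypothetical.

```python
import math
from itertools import combinations


def weighted_mean_centroid_distance(centroids, sizes):
    """Weighted average of the distances between every pair of centroids.

    Assumption: the pair (i, j) is weighted by sizes[i] * sizes[j];
    the source does not say how the weights are chosen.
    """
    total = 0.0
    weight_sum = 0.0
    for i, j in combinations(range(len(centroids)), 2):
        weight = sizes[i] * sizes[j]
        total += weight * math.dist(centroids[i], centroids[j])
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def meets_preset_condition(avg_first, avg_ref, avg_second,
                           ratio_formula, first_threshold):
    """Apply a caller-supplied ratio formula and compare with the threshold."""
    ratio = ratio_formula(avg_first, avg_ref, avg_second)
    return ratio > first_threshold, ratio
```

Passing the ratio formula in as a parameter keeps the sketch from guessing at a formula the text leaves unspecified.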
With the method provided by the embodiments of the present disclosure, after clustering with the reference number of clusters and the two numbers adjacent to it, the relationship among the weighted averages indicates whether the current clustering is accurate; if it is not, the operation can be iterated until an optimal classification is obtained.
Optionally, constructing the similarity matrix of the short message set according to the similarity between any two short messages in the set includes:
calculating the similarity between any two short messages in the short message set using a preset similarity formula;
generating a matrix containing all the similarities as the similarity matrix of the short message set.
Optionally, the preset similarity formula is:
$\mathrm{Sim}(A,B) = \mathrm{Sim}_{\mathrm{struct}}(A,B) \times \left(\alpha\,\mathrm{Sim}_{t}(A,B) + \beta\,\mathrm{Sim}_{\mathrm{gram}}(A,B)\right)$
where $\alpha \in [0,1]$ and $\beta \in [0,1]$;
when short message A and short message B have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 1$;
when short message A and short message B do not have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 0$;
$\mathrm{Sim}_{t}(A,B) = \cos(\mathrm{vec}(A), \mathrm{vec}(B))$;
where $\mathrm{vec}(A)$ is the latent Dirichlet allocation (LDA) topic vector of short message A, and $\mathrm{vec}(B)$ is the LDA topic vector of short message B;
$\mathrm{Sim}_{\mathrm{gram}}(A,B) = |D(A) \cap D(B)| \,/\, |D(A) \cup D(B)|$;
where $D(A)$ is the set of 2-gram word pairs of short message A, and $D(B)$ is the set of 2-gram word pairs of short message B.
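The formula above can be sketched in code as follows. This is a minimal illustration under assumptions: the messages are already tokenized, the LDA topic vectors are computed elsewhere, and the structural comparison is supplied as a boolean because the source does not say how structures are compared. The function names and the default weights `alpha=0.5` and `beta=0.5` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bigrams(tokens):
    """Set of adjacent word pairs (2-grams) of a tokenized message."""
    return {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}


def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|, defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(tokens_a, tokens_b, topics_a, topics_b, same_structure,
               alpha=0.5, beta=0.5):
    """Sim = Sim_struct * (alpha * Sim_t + beta * Sim_gram)."""
    if not same_structure:  # Sim_struct = 0 zeroes the whole product
        return 0.0
    sim_t = cosine(topics_a, topics_b)
    sim_gram = jaccard(bigrams(tokens_a), bigrams(tokens_b))
    return alpha * sim_t + beta * sim_gram
```

Because the structural term is a 0/1 gate, two messages with different structures always score 0 regardless of topic or 2-gram overlap.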
Optionally, performing hierarchical clustering on the similarity matrix using the preset similarity threshold to obtain the reference category quantity includes:
comparing the value of each similarity in the similarity matrix with the preset similarity threshold;
extracting all similarities in the similarity matrix that are greater than the preset similarity threshold;
determining as one category the short messages whose pairwise similarities are all greater than the preset similarity threshold;
taking the number of categories so determined as the reference category quantity.
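The steps above can be sketched as a simple greedy grouping. This is an illustration under assumptions, not the procedure of the disclosure: the source does not specify the order in which messages are grouped, so the sketch places each message into the first existing group in which its similarity to every member exceeds the threshold. The name `reference_category_count` is hypothetical.

```python
def reference_category_count(W, threshold):
    """Group messages so that all pairwise similarities inside a group
    exceed `threshold`, and return (number of groups, groups).

    W is a symmetric similarity matrix given as a list of lists.
    Assumption: greedy first-fit assignment in index order.
    """
    groups = []
    for i in range(len(W)):
        for group in groups:
            # Join only if similar enough to every current member.
            if all(W[i][j] > threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])  # no suitable group, start a new one
    return len(groups), groups
```

The count returned here plays the role of the reference category quantity n that seeds the spectral clustering step.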
Spectral clustering is built on spectral graph theory. Compared with traditional clustering algorithms, it can cluster sample spaces of arbitrary shape and converges to a globally optimal solution. However, spectral clustering can only run once the number of classes is given, and the number of categories usually cannot be known in advance; an approximate number of classes can therefore be obtained using the preset similarity threshold.
Optionally, the method further includes:
when the short message clustering results do not satisfy the preset condition, correcting the value of the reference category quantity and iteratively performing the determination of category quantities and the spectral clustering process, until the short message clustering results obtained satisfy the preset condition.
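The iteration described above can be sketched as a loop over caller-supplied steps. This is a minimal sketch: `run_spectral` and `correct` are hypothetical callables standing in for the spectral clustering check and the correction step, and the `max_rounds` cap is an added safeguard that the source does not mention.

```python
def iterate_until_satisfied(n_ref, run_spectral, correct, max_rounds=20):
    """Repeat clustering and correction until the preset condition holds.

    run_spectral(n_ref) -> (satisfied, result, ratio)
    correct(n_ref, ratio) -> corrected reference category quantity
    Returns (n_ref, result); result is None if the cap is reached.
    """
    for _ in range(max_rounds):
        satisfied, result, ratio = run_spectral(n_ref)
        if satisfied:
            return n_ref, result
        n_ref = correct(n_ref, ratio)
    return n_ref, None
```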
As noted above, spectral clustering needs the number of classes as input, and the preset similarity threshold yields only an approximate number. This approximate number can subsequently be corrected and the number of clusters learned iteratively, until the clustering result finally stabilizes.
Optionally, correcting the value of the reference category quantity includes:
obtaining the ratio from the short message clustering results;
when the ratio is less than a second threshold, subtracting a first preset value from the value of the reference category quantity;
when the ratio is greater than the second threshold, adding a second preset value to the value of the reference category quantity.
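The correction rule above can be sketched as follows. The function name and the default preset values of 1 are hypothetical. Two behaviors are assumptions added for the sketch: the quantity is never reduced below 1, and it is left unchanged when the ratio equals the second threshold, a case the source does not address.

```python
def correct_reference_count(n_ref, ratio, second_threshold,
                            first_preset=1, second_preset=1):
    """Decrease or increase the reference category quantity.

    ratio < second_threshold -> subtract first_preset
    ratio > second_threshold -> add second_preset
    """
    if ratio < second_threshold:
        return max(1, n_ref - first_preset)  # assumption: keep at least 1
    if ratio > second_threshold:
        return n_ref + second_preset
    return n_ref  # assumption: unchanged when ratio equals the threshold
```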
In the embodiments of the present disclosure, the second threshold is less than the first threshold. That is, once the ratio is found not to exceed the first threshold, it can be further compared with the second threshold, and the result of that comparison determines whether the reference category quantity is increased or decreased.
According to a second aspect of the embodiments of the present disclosure, a short message clustering device is provided, including:
a matrix construction module, configured to construct a similarity matrix of a short message set according to the similarity between any two short messages in the set;
a hierarchical clustering determination module, configured to perform hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
a category quantity determination module, configured to determine category quantities, including determining a first category quantity and a second category quantity according to the reference category quantity;
a spectral clustering module, configured to perform a spectral clustering process, including performing spectral clustering on the similarity matrix using the reference category quantity, the first category quantity, and the second category quantity each as the number of clusters, to obtain short message clustering results;
a result determination module, configured to determine, when the short message clustering results satisfy a preset condition, the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
Optionally, the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity.
Optionally, the spectral clustering module includes:
an eigen acquisition submodule, configured to obtain the eigenvalues of the similarity matrix and the eigenvector corresponding to each eigenvalue;
a vector selection submodule, configured to select three groups of eigenvectors from the eigenvectors, in ascending order, according to the reference category quantity, the first category quantity, and the second category quantity respectively;
a vector space forming submodule, configured to form three eigenvector spaces from the three selected groups of eigenvectors;
a clustering submodule, configured to cluster the eigenvectors in each eigenvector space with the K-means clustering algorithm, to obtain three groups of clusters as the short message clustering results.
Optionally, the device includes:
an average calculation module, configured to calculate, for each group of clusters, the weighted average of the distances between the cluster centroids of every pair of clusters in the group;
a ratio calculation module, configured to calculate the ratio of the weighted averages of the three groups of clusters using a preset ratio formula;
a ratio judgment module, configured to judge whether the ratio is greater than a first threshold;
a first determination module, configured to determine, when the ratio is greater than the first threshold, that the short message clustering results satisfy the preset condition.
Optionally, the matrix construction module includes:
a similarity calculation submodule, configured to calculate the similarity between any two short messages in the short message set using a preset similarity formula;
a matrix generation submodule, configured to generate a matrix containing all the similarities as the similarity matrix of the short message set.
Optionally, the preset similarity formula is:
$\mathrm{Sim}(A,B) = \mathrm{Sim}_{\mathrm{struct}}(A,B) \times \left(\alpha\,\mathrm{Sim}_{t}(A,B) + \beta\,\mathrm{Sim}_{\mathrm{gram}}(A,B)\right)$
where $\alpha \in [0,1]$ and $\beta \in [0,1]$;
when short message A and short message B have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 1$;
when short message A and short message B do not have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 0$;
$\mathrm{Sim}_{t}(A,B) = \cos(\mathrm{vec}(A), \mathrm{vec}(B))$;
where $\mathrm{vec}(A)$ is the LDA topic vector of short message A, and $\mathrm{vec}(B)$ is the LDA topic vector of short message B;
$\mathrm{Sim}_{\mathrm{gram}}(A,B) = |D(A) \cap D(B)| \,/\, |D(A) \cup D(B)|$;
where $D(A)$ is the set of 2-gram word pairs of short message A, and $D(B)$ is the set of 2-gram word pairs of short message B.
Optionally, the hierarchical clustering determination module includes:
a comparison submodule, configured to compare the value of each similarity in the similarity matrix with the preset similarity threshold;
an extraction submodule, configured to extract all similarities in the similarity matrix that are greater than the preset similarity threshold;
a category determination submodule, configured to determine as one category the short messages whose pairwise similarities are all greater than the preset similarity threshold;
a reference category quantity determination submodule, configured to take the number of categories so determined as the reference category quantity.
Optionally, the device further includes:
a correction module, configured to correct the value of the reference category quantity when the short message clustering results do not satisfy the preset condition;
after the correction module makes the correction, the determination of category quantities and the spectral clustering process are performed iteratively until the short message clustering results obtained satisfy the preset condition.
Optionally, the correction module includes:
a ratio acquisition submodule, configured to obtain the ratio from the short message clustering results;
a first correction submodule, configured to subtract a first preset value from the value of the reference category quantity when the ratio is less than a second threshold;
a second correction submodule, configured to add a second preset value to the value of the reference category quantity when the ratio is greater than the second threshold.
According to a third aspect of the embodiments of the present disclosure, a terminal is provided, including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
construct a similarity matrix of a short message set according to the similarity between any two short messages in the set;
perform hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determine category quantities, including determining a first category quantity and a second category quantity according to the reference category quantity;
perform a spectral clustering process, including performing spectral clustering on the similarity matrix using the reference category quantity, the first category quantity, and the second category quantity each as the number of clusters, to obtain short message clustering results;
when the short message clustering results satisfy a preset condition, determine the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
According to a fourth aspect of the embodiments of the present disclosure, a server is provided, including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
construct a similarity matrix of a short message set according to the similarity between any two short messages in the set;
perform hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determine category quantities, including determining a first category quantity and a second category quantity according to the reference category quantity;
perform a spectral clustering process, including performing spectral clustering on the similarity matrix using the reference category quantity, the first category quantity, and the second category quantity each as the number of clusters, to obtain short message clustering results;
when the short message clustering results satisfy a preset condition, determine the short message clustering result corresponding to the reference category quantity as the target short message clustering result.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
In the method provided by the embodiments of the present disclosure, a similarity matrix of a short message set is constructed according to the similarity between any two short messages in the set; hierarchical clustering is performed on the similarity matrix using a preset similarity threshold to obtain a reference category quantity; category quantities are determined, including determining a first category quantity and a second category quantity according to the reference category quantity; a spectral clustering process is performed, including performing spectral clustering on the similarity matrix using the reference category quantity, the first category quantity, and the second category quantity each as the number of clusters, to obtain short message clustering results; and, when the short message clustering results satisfy a preset condition, the short message clustering result corresponding to the reference category quantity is determined as the target short message clustering result.
With this method, the short messages in a short message set are clustered according to their structural similarity, which takes full account of both the structural and the semantic similarity between message sentences and so improves clustering accuracy. The categories obtained are clearly separated, which facilitates subsequent operations on the clustered categories, such as batch marking and batch deletion.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the disclosure.
To illustrate the technical solutions of the embodiments of the present disclosure or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a short message clustering method according to an exemplary embodiment;
Fig. 2 is a flow diagram of step S104 in Fig. 1;
Fig. 3 is a flowchart of another short message clustering method according to an exemplary embodiment;
Fig. 4 is a flow diagram of step S102 in Fig. 1;
Fig. 5 is a flowchart of yet another short message clustering method according to an exemplary embodiment;
Fig. 6 is a structural diagram of a short message clustering device according to an exemplary embodiment;
Fig. 7 is a structural diagram of the spectral clustering module 14 in Fig. 6;
Fig. 8 is a structural diagram of another short message clustering device according to an exemplary embodiment;
Fig. 9 is a structural diagram of the matrix construction module 11 in Fig. 6;
Fig. 10 is a structural diagram of the hierarchical clustering determination module 12 in Fig. 6;
Fig. 11 is a block diagram of a terminal 1100 according to an exemplary embodiment;
Fig. 12 is a block diagram of a server 1200 for short message clustering according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings indicate the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the disclosure as detailed in the appended claims.
Fig. 1 is a flowchart of a short message clustering method according to an exemplary embodiment. As shown in Fig. 1, the method may include the following steps.
In step S101, a similarity matrix of a short message set is constructed according to the similarity between any two short messages in the set.
Because short messages have a particular structure that differs from other kinds of information, the sentence structure and semantic structure of the messages are taken into account when calculating the similarity between them. A similarity formula can be preset and then used to calculate the similarity between short messages; the similarity here is the structural similarity of the short messages.
In the embodiments of the present disclosure, this step may include the following two sub-steps.
S1: calculating the similarity between any two short messages in the short message set using a preset similarity formula.
In the embodiments of the present disclosure, the preset similarity formula is:
$\mathrm{Sim}(A,B) = \mathrm{Sim}_{\mathrm{struct}}(A,B) \times \left(\alpha\,\mathrm{Sim}_{t}(A,B) + \beta\,\mathrm{Sim}_{\mathrm{gram}}(A,B)\right)$
where $\alpha \in [0,1]$ and $\beta \in [0,1]$;
when short message A and short message B have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 1$;
when short message A and short message B do not have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 0$;
$\mathrm{Sim}_{t}(A,B) = \cos(\mathrm{vec}(A), \mathrm{vec}(B))$;
where $\mathrm{vec}(A)$ is the LDA (Latent Dirichlet Allocation) topic vector of short message A, and $\mathrm{vec}(B)$ is the LDA topic vector of short message B;
$\mathrm{Sim}_{\mathrm{gram}}(A,B) = |D(A) \cap D(B)| \,/\, |D(A) \cup D(B)|$;
where $D(A)$ is the set of 2-gram word pairs of short message A, and $D(B)$ is the set of 2-gram word pairs of short message B.
S2: generating a matrix containing all the similarities as the similarity matrix of the short message set.
After the similarity between every pair of short messages in the set has been calculated with the similarity formula, a similarity matrix containing the similarities of all short messages in the set can be constructed. In the embodiments of the present disclosure, the similarity matrix is denoted W.
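Sub-steps S1 and S2 can be sketched as follows. This is a minimal illustration: `build_similarity_matrix` is a hypothetical name, and `sim` stands for any pairwise similarity function supplied by the caller. The sketch assumes the similarity is symmetric, which holds for the preset formula because cosine similarity and the 2-gram overlap ratio are both symmetric.

```python
def build_similarity_matrix(messages, sim):
    """Return W as a list of lists, with W[i][j] = sim(messages[i], messages[j]).

    Each unordered pair is evaluated once and mirrored, relying on
    the symmetry of `sim`.
    """
    n = len(messages)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # includes the diagonal
            W[i][j] = W[j][i] = sim(messages[i], messages[j])
    return W
```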
In step S102, hierarchical clustering is performed on the similarity matrix using a preset similarity threshold to obtain a reference category quantity.
When performing hierarchical clustering, a minimum similarity threshold can be set and then used to hierarchically cluster the similarity matrix, which yields the reference category quantity. In the embodiments of the present disclosure, the reference category quantity is denoted n.
In step S103, category quantities are determined, including determining a first category quantity and a second category quantity according to the reference category quantity.
In the embodiments of the present disclosure, the number of clusters, denoted k, can be set equal to the reference category quantity, that is, k = n. On the basis of k = n, two further category quantities are calculated: the first category quantity and the second category quantity. The first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity; for example, the first category quantity is k-1 and the second category quantity is k+1.
In step S104, a spectral clustering process is performed, including performing spectral clustering on the similarity matrix using the reference category quantity, the first category quantity, and the second category quantity each as the number of clusters, to obtain short message clustering results.
After the reference category quantity is determined, an eigenvector space can be constructed by calculating the first k eigenvalues and eigenvectors of the similarity matrix W, and the eigenvectors in that space are then clustered with the K-means clustering algorithm to obtain one group of clusters.
In addition, an eigenvector space is likewise constructed for each of the first category quantity and the second category quantity and clustered with the K-means clustering algorithm, yielding one group of clusters each.
In the embodiments of the present disclosure, the three groups of clusters obtained are taken as the short message clustering results.
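The K-means step applied to each eigenvector space can be sketched as follows. This is a minimal illustration of standard K-means (Lloyd's algorithm), not the specific variant of the disclosure. The initialization from the first k points and the `max_iter` cap are assumptions chosen to keep the sketch deterministic; practical implementations usually use randomized initialization.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids).

    Assumption: the first k points serve as the initial centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids
```

Running this once per eigenvector space, with k-1, k, and k+1 clusters, would produce the three groups of clusters described above.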
In step S105, when the short message clustering results satisfy a preset condition, the short message clustering result corresponding to the reference category quantity is determined as the target short message clustering result.
After the short message clustering results are obtained, they still need to be evaluated. The evaluation mainly compares the weighted averages corresponding to the reference category quantity, the first category quantity, and the second category quantity. Only when the differences among the three weighted averages satisfy certain requirements is the reference category quantity judged to be correct, and only then is the short message clustering result corresponding to the reference category quantity taken as the final result.
In the method provided by the embodiments of the present disclosure, a similarity matrix of a short message set is constructed according to the similarity between any two short messages in the set; hierarchical clustering is performed on the similarity matrix using a preset similarity threshold to obtain a reference category quantity; category quantities are determined, including determining a first category quantity and a second category quantity according to the reference category quantity; a spectral clustering process is performed, including performing spectral clustering on the similarity matrix using the reference category quantity, the first category quantity, and the second category quantity each as the number of clusters, to obtain short message clustering results; and, when the short message clustering results satisfy a preset condition, the short message clustering result corresponding to the reference category quantity is determined as the target short message clustering result.
With this method, the short messages in a short message set are clustered according to their structural similarity, which takes full account of both the structural and the semantic similarity between message sentences and so improves clustering accuracy. The categories obtained are clearly separated, which facilitates subsequent operations on the clustered categories, such as batch marking and batch deletion.
In another embodiment of the present disclosure, as shown in Fig. 2, step S104 in the embodiment of Fig. 1 may include the following steps.
In step S1041, the eigenvalues of the similarity matrix and the eigenvectors corresponding to the eigenvalues are obtained.
In step S1042, three groups of eigenvectors are selected from the eigenvectors, in ascending order, according to the reference category quantity, the first category quantity, and the second category quantity respectively.
In the embodiments of the present disclosure, the reference category quantity may be denoted by k, the first category quantity by k-1, and the second category quantity by k+1.
In step S1043, the three selected groups of eigenvectors are formed into three feature vector spaces.
In step S1044, the feature vectors in each feature vector space are clustered separately using the K-means clustering algorithm, and the three groups of cluster categories obtained are used as the short message cluster result.
In this way, clustering is performed not only for the reference category quantity but also for the two category quantities adjacent to it, and the three groups of cluster categories obtained are used as the short message cluster result. The differences among the three groups can then be evaluated to determine whether the reference category quantity determined this time is appropriate.
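Steps S1041 to S1044 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes NumPy is available, it takes the eigenvectors of the unnormalized graph Laplacian L = D - W (a common spectral clustering choice that the text does not specify), and it uses a small deterministic K-means with farthest-point initialization. All function names are hypothetical.

```python
import numpy as np

def spectral_embedding(similarity, count):
    """Return an (n, count) array whose rows are the points handed to K-means."""
    w = np.asarray(similarity, dtype=float)
    laplacian = np.diag(w.sum(axis=1)) - w   # assumption: L = D - W
    _, vectors = np.linalg.eigh(laplacian)   # eigh returns eigenvalues in ascending order
    return vectors[:, :count]                # eigenvectors of the `count` smallest eigenvalues

def kmeans(points, count, iterations=50):
    """Plain K-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < count:
        # Squared distance from each point to its nearest chosen center.
        nearest = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(nearest))])
    centers = np.array(centers)
    for _ in range(iterations):
        distances = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(distances, axis=1)
        # An empty cluster keeps its previous center.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(count)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels, centers

def cluster_for_three_counts(similarity, k):
    """Cluster with k-1, k and k+1 clusters; returns {count: (labels, centers)}."""
    results = {}
    for count in (k - 1, k, k + 1):
        embedding = spectral_embedding(similarity, count)
        results[count] = kmeans(embedding, count)
    return results
```

The returned dictionary holds one group of cluster categories per category quantity, which matches the three-group short message cluster result described above.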
In another embodiment of the present disclosure, on the basis of the embodiment shown in Fig. 2 and as shown in Fig. 3, the method may further include the following steps.
In step S201, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories is calculated.
In step S202, the ratio of the weighted averages of the three groups of cluster categories is calculated using a preset ratio formula.
In the embodiments of the present disclosure, after the cluster result is obtained using k, the weighted average of the distances between cluster centroids is calculated as $d$; after the cluster result is obtained using k-1, the weighted average is $d_{-1}$; and after the cluster result is obtained using k+1, the weighted average is $d_{+1}$.
Based on the three weighted averages calculated above, the preset ratio formula may be $a = (d_{-1} - d) / (d - d_{+1})$, where $a$ is the ratio of the weighted averages.
In step S203, it is judged whether the ratio is greater than a first threshold.
When the ratio is greater than the first threshold, step S204 is executed; when the ratio is less than or equal to the first threshold, step S205 is executed.
In step S204, it is determined that the short message cluster result meets the preset condition.
In the embodiments of the present disclosure, the first threshold is usually large. When the ratio is greater than the first threshold, the gap between the reference category quantity and the adjacent first and second category quantities meets the requirement, that is, the classification is distinct.
In step S205, it is determined that the short message cluster result does not meet the preset condition.
Conversely, if the ratio is less than or equal to the first threshold, the classification result is not distinct and the classification needs to be performed again.
In the method provided by the embodiments of the present disclosure, after clustering is performed with the reference category quantity and the two category quantities adjacent to it, the relationship among the calculated weighted averages can be used to determine whether the current clustering is accurate. When it is not, the operation can be iterated until the optimal classification is obtained.
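The check in steps S201 to S205 can be illustrated with a short sketch. The text does not say how the average is weighted, so the weights below (the product of the two cluster sizes) are an assumption, as are the function names and the handling of a zero denominator.

```python
import math
from itertools import combinations

def weighted_mean_centroid_distance(centroids, sizes):
    """Weighted average of the distances between every pair of cluster centroids.

    Assumption: each pair is weighted by the product of its two cluster sizes.
    With fewer than two clusters there are no pairs and 0.0 is returned.
    """
    total = 0.0
    weight_sum = 0.0
    for i, j in combinations(range(len(centroids)), 2):
        weight = sizes[i] * sizes[j]
        total += weight * math.dist(centroids[i], centroids[j])
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

def preset_ratio(d_minus, d, d_plus):
    """a = (d_{-1} - d) / (d - d_{+1}), using the weighted averages for k-1, k and k+1."""
    denominator = d - d_plus
    if denominator == 0:
        # The source does not cover this case; this sketch treats it as an error.
        raise ValueError("d equals d_plus, so the ratio is undefined")
    return (d_minus - d) / denominator

def meets_preset_condition(ratio, first_threshold):
    """Step S203: the cluster result is accepted only if the ratio exceeds the first threshold."""
    return ratio > first_threshold
```

For example, with weighted averages of 10, 6 and 5 for k-1, k and k+1, the ratio is (10 - 6) / (6 - 5) = 4, which passes a first threshold of 3.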
In another embodiment of the present disclosure, as shown in Fig. 4, the aforementioned step S102 may include the following steps.
In step S1021, the value of each similarity in the similarity matrix is compared with the preset similarity threshold.
The preset similarity threshold may be set in advance by those skilled in the art. In the embodiments of the present disclosure, the preset similarity threshold may be denoted by λ.
In step S1022, all similarities in the similarity matrix that are greater than the preset similarity threshold are extracted.
In step S1023, short messages for which the similarity between any two of them is greater than the preset similarity threshold are determined as one category.
In step S1024, the number of categories obtained is determined as the reference category quantity.
Spectral clustering is built on spectral graph theory. Compared with traditional clustering algorithms, it can cluster in a sample space of arbitrary shape and converges to the globally optimal solution. However, spectral clustering can only run once the number of classes is given, and under normal circumstances the number of categories cannot be known in advance. An approximate number of classes is therefore obtained based on the preset similarity threshold.
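Steps S1021 to S1024 can be sketched as a single greedy pass over the similarity matrix. This is one possible reading, stated as an assumption: a short message joins the first existing group in which its similarity to every member exceeds λ, and otherwise starts a new group. The function name is hypothetical.

```python
def reference_category_count(similarity, threshold):
    """Group messages so that all pairs inside a group exceed the threshold.

    similarity: square list of lists, similarity[i][j] for messages i and j.
    Returns (number of groups, groups as lists of message indices).
    """
    groups = []
    for i in range(len(similarity)):
        for group in groups:
            # Join only if message i is similar enough to every current member.
            if all(similarity[i][j] > threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return len(groups), groups
```

The resulting count is only a rough estimate of k; the later spectral clustering and ratio check decide whether it needs to be adjusted.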
In another embodiment of the present disclosure, as shown in Fig. 5, the method may further include the following steps.
In step S106, it is judged whether the short message cluster result meets the preset condition.
When the short message cluster result meets the preset condition, step S105 is executed; when the short message cluster result does not meet the preset condition, step S107 is executed.
In step S107, the value of the reference category quantity is corrected, and the determining of the category quantities and the spectral clustering process are executed iteratively until the short message cluster result obtained meets the preset condition.
Because spectral clustering requires the number of classes to be given and that number usually cannot be known in advance, only an approximate number of classes is obtained from the preset similarity threshold. This approximate number is then corrected, and the number of clusters is learned iteratively until the clustering effect finally stabilizes.
In another embodiment of the present disclosure, the aforementioned step S107 may include the following steps.
S1: the preset ratio in the short message cluster result is obtained.
The preset ratio here is the ratio calculated in the embodiment of Fig. 3.
S2: when the preset ratio is less than a second threshold, the value of the reference category quantity is decreased by a first preset value.
S3: when the preset ratio is greater than the second threshold, the value of the reference category quantity is increased by a second preset value.
The first preset value and the second preset value may be the same or different. When the reference category quantity does not meet the requirement, it can be fine-tuned, so in the embodiments of the present disclosure the first preset value and the second preset value may both be set to 1. In other embodiments, those skilled in the art may set other adjustment amounts for the reference category quantity; for example, the first preset value and the second preset value may both be set to 2 or 3.
In the embodiments of the present disclosure, the second threshold is less than the first threshold. That is, when the ratio is less than or equal to the first threshold, it is further compared with the second threshold, and the result of that comparison determines whether the reference category quantity is increased or decreased.
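The correction in steps S1 to S3 and the iteration of step S107 can be sketched as below. Several details are assumptions rather than statements from the text: the `evaluate` callback (a hypothetical function that clusters with k-1, k and k+1 and returns the preset ratio), the lower bound of 2 on k, the cap on the number of rounds, and leaving k unchanged when the ratio equals the second threshold.

```python
def corrected_reference_count(k, ratio, second_threshold, first_preset=1, second_preset=1):
    """Apply S2/S3: shrink k when the ratio is below the second threshold, grow it when above."""
    if ratio < second_threshold:
        return k - first_preset
    if ratio > second_threshold:
        return k + second_preset
    return k  # equality is not covered by the source; this sketch leaves k unchanged

def refine_reference_count(k, evaluate, first_threshold, second_threshold, max_rounds=20):
    """Repeat clustering and correction until the ratio exceeds the first threshold.

    evaluate(k) is a hypothetical callback returning the preset ratio for k.
    """
    for _ in range(max_rounds):
        ratio = evaluate(k)
        if ratio > first_threshold:
            return k  # preset condition met
        # Assumption: keep k >= 2 so that k-1 is still a valid cluster count.
        adjusted = max(2, corrected_reference_count(k, ratio, second_threshold))
        if adjusted == k:
            return k  # no further change is possible
        k = adjusted
    return k
```

A usage example: with ratios of 0.5, 0.8 and 5.0 for k = 5, 4 and 3, a first threshold of 2.0 and a second threshold of 1.0, the loop steps from 5 to 4 to 3 and stops at 3.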
Fig. 6 is a schematic structural diagram of a short message clustering device according to an exemplary embodiment. As shown in Fig. 6, the device may include: a matrix construction module 11, a hierarchical clustering determining module 12, a category quantity determining module 13, a spectral clustering module 14, and a result determining module 15, wherein:
The matrix construction module 11 is configured to construct the similarity matrix of the short message set according to the similarity between any two short messages in the short message set;
The hierarchical clustering determining module 12 is configured to perform hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain the reference category quantity;
The category quantity determining module 13 is configured to determine the category quantities, which includes determining the first category quantity and the second category quantity according to the reference category quantity;
The spectral clustering module 14 is configured to perform the spectral clustering process, which includes performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity, and the second category quantity each used as the number of clusters, to obtain the short message cluster result;
The result determining module 15 is configured to determine the short message cluster result corresponding to the reference category quantity as the target message cluster result when the short message cluster result meets the preset condition.
In the embodiments of the present disclosure, optionally, the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity.
In the device provided by the embodiments of the present disclosure, the similarity matrix of the short message set is constructed according to the similarity between any two short messages in the short message set; hierarchical clustering is performed on the similarity matrix using a preset similarity threshold to obtain the reference category quantity; the category quantities are determined, which includes determining the first category quantity and the second category quantity according to the reference category quantity; the spectral clustering process is performed, which includes performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity, and the second category quantity each used as the number of clusters, to obtain the short message cluster result; and when the short message cluster result meets the preset condition, the short message cluster result corresponding to the reference category quantity is determined as the target message cluster result.
When the short messages in a short message set are clustered with this device, clustering is based on the structural similarity of the short messages and takes full account of the similarity between short messages in sentence structure and semantics, which improves clustering accuracy. The categories obtained are therefore clearly separated, which makes it convenient to subsequently perform other operations on the clustered categories, such as batch marking and batch deletion.
In another embodiment of the present disclosure, as shown in Fig. 7, the spectral clustering module 14 may include: a feature acquisition submodule 141, a vector selection submodule 142, a vector space composition submodule 143, and a clustering submodule 144, wherein:
The feature acquisition submodule 141 is configured to obtain the eigenvalues of the similarity matrix and the eigenvectors corresponding to the eigenvalues;
The vector selection submodule 142 is configured to select three groups of eigenvectors from the eigenvectors, in ascending order, according to the reference category quantity, the first category quantity, and the second category quantity respectively;
The vector space composition submodule 143 is configured to form the three selected groups of eigenvectors into three feature vector spaces respectively;
The clustering submodule 144 is configured to cluster the feature vectors in each feature vector space separately using the K-means clustering algorithm, the three groups of cluster categories obtained being used as the short message cluster result.
In another embodiment of the present disclosure, on the basis of the embodiment shown in Fig. 7 and as shown in Fig. 8, the device includes: an average value calculation module 21, a ratio calculation module 22, a ratio judgment module 23, a first determining module 24, and a second determining module 25, wherein:
The average value calculation module 21 is configured to calculate, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories;
The ratio calculation module 22 is configured to calculate the ratio of the weighted averages of the three groups of cluster categories using a preset ratio formula;
The ratio judgment module 23 is configured to judge whether the ratio is greater than a first threshold;
The first determining module 24 is configured to determine that the short message cluster result meets the preset condition when the ratio is greater than the first threshold;
The second determining module 25 is configured to determine that the short message cluster result does not meet the preset condition when the ratio is less than or equal to the first threshold.
In another embodiment of the present disclosure, as shown in Fig. 9, the matrix construction module 11 may include: a similarity calculation submodule 111 and a matrix generation submodule 112, wherein:
The similarity calculation submodule 111 is configured to calculate the similarity between any two short messages in the short message set using a preset similarity formula;
The matrix generation submodule 112 is configured to generate a matrix containing all the similarities as the similarity matrix of the short message set.
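The work of the two submodules can be illustrated with a sketch of the preset similarity formula given in this document, Sim(A, B) = Sim_struct(A, B) × (α·Sim_t(A, B) + β·Sim_gram(A, B)). The message representation is an assumption made for the example: each short message is a dictionary with a hypothetical `structure` signature, a precomputed LDA topic vector `topics`, and a token list `tokens`. How the structure signature and the topic vectors are produced is outside this sketch, and the default α = β = 0.5 is illustrative only.

```python
import math

def bigrams(tokens):
    """Set of adjacent token pairs (2-gram word pairs)."""
    return set(zip(tokens, tokens[1:]))

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def similarity(a, b, alpha=0.5, beta=0.5):
    """Sim(A, B) for two messages given as dicts (hypothetical representation)."""
    # Sim_struct is 1 for identical structures and 0 otherwise.
    if a["structure"] != b["structure"]:
        return 0.0
    grams_a, grams_b = bigrams(a["tokens"]), bigrams(b["tokens"])
    union = grams_a | grams_b
    sim_gram = len(grams_a & grams_b) / len(union) if union else 0.0
    sim_t = cosine(a["topics"], b["topics"])
    return alpha * sim_t + beta * sim_gram

def similarity_matrix(messages, alpha=0.5, beta=0.5):
    """Matrix containing the similarity of every pair of messages."""
    return [[similarity(a, b, alpha, beta) for b in messages] for a in messages]
```

Because Sim_struct acts as a gate, two messages with different structures always receive a similarity of 0 regardless of their topic or 2-gram overlap.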
In another embodiment of the present disclosure, as shown in Fig. 10, the aforementioned hierarchical clustering determining module 12 includes: a comparison submodule 121, an extraction submodule 122, a category determining submodule 123, and a reference category quantity determining submodule 124, wherein:
The comparison submodule 121 is configured to compare the value of each similarity in the similarity matrix with the preset similarity threshold;
The extraction submodule 122 is configured to extract all similarities in the similarity matrix that are greater than the preset similarity threshold;
The category determining submodule 123 is configured to determine, as one category, short messages for which the similarity between any two of them is greater than the preset similarity threshold;
The reference category quantity determining submodule 124 is configured to determine the number of categories obtained as the reference category quantity.
In another embodiment of the present disclosure, the device further includes a correction module, wherein:
The correction module is configured to correct the value of the reference category quantity when the short message cluster result does not meet the preset condition;
After the correction by the correction module, the determining of the category quantities and the spectral clustering process are executed iteratively until the short message cluster result obtained meets the preset condition.
In another embodiment of the present disclosure, the correction module may include:
A ratio acquisition submodule, configured to obtain the preset ratio in the short message cluster result;
A first correction submodule, configured to decrease the value of the reference category quantity by a first preset value when the preset ratio is less than a second threshold;
A second correction submodule, configured to increase the value of the reference category quantity by a second preset value when the preset ratio is greater than the second threshold;
The second threshold is less than the first threshold.
Fig. 11 is a block diagram of a terminal 1100 according to an exemplary embodiment. For example, the terminal 1100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 11, the terminal 1100 may include one or more of the following components: a processing component 1102, a memory 1104, a power supply component 1106, a multimedia component 1108, an audio component 1110, an input/output (I/O) interface 1112, a sensor component 1114, and a communication component 1116.
The processing component 1102 generally controls the overall operation of the terminal 1100, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 1102 may include one or more processors 1120 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 1102 may include one or more modules that facilitate interaction between the processing component 1102 and other components. For example, the processing component 1102 may include a multimedia module to facilitate interaction between the multimedia component 1108 and the processing component 1102.
The memory 1104 is configured to store various types of data to support operation of the terminal 1100. Examples of such data include instructions for any application or method operating on the terminal 1100, contact data, phone book data, messages, pictures, videos, and so on. The memory 1104 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The power supply component 1106 provides power to the various components of the terminal 1100. The power supply component 1106 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal 1100.
The multimedia component 1108 includes a screen that provides an output interface between the terminal 1100 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1108 includes a front camera and/or a rear camera. When the terminal 1100 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 1110 is configured to output and/or input audio signals. For example, the audio component 1110 includes a microphone (MIC), which is configured to receive external audio signals when the terminal 1100 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 1104 or sent via the communication component 1116. In some embodiments, the audio component 1110 further includes a speaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing component 1102 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 1114 includes one or more sensors for providing status assessments of various aspects of the terminal 1100. For example, the sensor component 1114 can detect the open/closed state of the terminal 1100 and the relative positioning of components, such as the display and keypad of the terminal 1100. The sensor component 1114 can also detect a change in position of the terminal 1100 or of one of its components, the presence or absence of user contact with the terminal 1100, the orientation or acceleration/deceleration of the terminal 1100, and temperature changes of the terminal 1100. The sensor component 1114 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 1114 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1114 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1116 is configured to facilitate wired or wireless communication between the terminal 1100 and other devices. The terminal 1100 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1116 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1116 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 1100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 1104 including instructions, which can be executed by the processor 1120 of the terminal 1100 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium, wherein when the instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to execute a short message clustering method, the method comprising:
According to the similarity in short message set between any two short message, the similarity matrix of the short message set is constructed;
Hierarchical clustering is carried out to the similarity matrix using default similarity threshold, obtains reference category quantity;
Determine categorical measure, comprising: first category quantity and second category quantity are determined according to the reference category quantity;
Spectral clustering process, comprising: using the reference category quantity, first category quantity and second category quantity as Number of clusters carries out spectral clustering to the similarity matrix, obtains short message cluster result;
When the short message cluster result meets preset condition, by the corresponding short message cluster result of the reference category quantity It is determined as target message cluster result.
Fig. 12 is a block diagram of a server 1200 for short message clustering according to an exemplary embodiment. For example, the device 1200 may be provided as a server. Referring to Fig. 12, the device 1200 includes a processing component 1222, which further includes one or more processors, and memory resources represented by a memory 1232 for storing instructions executable by the processing component 1222, such as application programs. The application programs stored in the memory 1232 may include one or more modules, each corresponding to a set of instructions.
The device 1200 may also include a power supply component 1226 configured to perform power management of the device 1200, a wired or wireless network interface 1250 configured to connect the device 1200 to a network, and an input/output (I/O) interface 1258. The device 1200 can operate based on an operating system stored in the memory 1232, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
A non-transitory computer-readable storage medium, wherein when the instructions in the storage medium are executed by a processor of a server, the server is enabled to execute a short message clustering method, the method comprising:
According to the similarity in short message set between any two short message, the similarity matrix of the short message set is constructed;
Hierarchical clustering is carried out to the similarity matrix using default similarity threshold, obtains reference category quantity;
Determine categorical measure, comprising: first category quantity and second category quantity are determined according to the reference category quantity;
Spectral clustering process, comprising: using the reference category quantity, first category quantity and second category quantity as Number of clusters carries out spectral clustering to the similarity matrix, obtains short message cluster result;
When the short message cluster result meets preset condition, by the corresponding short message cluster result of the reference category quantity It is determined as target message cluster result.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or conventional technical means in the art not disclosed herein. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

1. A short message clustering method, characterized by comprising:
constructing a similarity matrix of a short message set according to the structural similarity between any two short messages in the short message set and the similarity between the short messages in sentence structure and semantics;
performing hierarchical clustering on the similarity matrix using a preset similarity threshold to obtain a reference category quantity;
determining category quantities, comprising: determining a first category quantity and a second category quantity according to the reference category quantity;
performing a spectral clustering process, comprising: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity, and the second category quantity each used as the number of clusters, to obtain a short message cluster result;
when the short message cluster result meets a preset condition, determining the short message cluster result corresponding to the reference category quantity as a target message cluster result.
2. The method according to claim 1, characterized in that the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity.
3. The method according to claim 1, characterized in that performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity, and the second category quantity each used as the number of clusters, to obtain the cluster result, comprises:
obtaining the eigenvalues of the similarity matrix and the eigenvectors corresponding to the eigenvalues;
selecting three groups of eigenvectors from the eigenvectors, in ascending order, according to the reference category quantity, the first category quantity, and the second category quantity respectively;
forming the three selected groups of eigenvectors into three feature vector spaces;
clustering the feature vectors in each feature vector space separately using the K-means clustering algorithm, and using the three groups of cluster categories obtained as the short message cluster result.
4. The method according to claim 3, characterized in that the method comprises:
calculating, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories;
calculating the ratio of the weighted averages of the three groups of cluster categories using a preset ratio formula;
judging whether the ratio is greater than a first threshold;
when the ratio is greater than the first threshold, determining that the short message cluster result meets the preset condition.
5. The method according to claim 1, characterized in that constructing the similarity matrix of the short message set according to the similarity between any two short messages in the short message set comprises:
calculating the similarity between any two short messages in the short message set using a preset similarity formula;
generating a matrix containing all the similarities as the similarity matrix of the short message set.
6. The method according to claim 5, characterized in that the preset similarity formula is:
$Sim(A, B) = Sim_{struct}(A, B) \times (\alpha\, Sim_{t}(A, B) + \beta\, Sim_{gram}(A, B))$
where $\alpha \in [0, 1]$ and $\beta \in [0, 1]$;
when short message A and short message B have the same structure, $Sim_{struct}(A, B) = 1$;
when short message A and short message B do not have the same structure, $Sim_{struct}(A, B) = 0$;
$Sim_{t}(A, B) = \cos(vec(A), vec(B))$;
where $vec(A)$ is the latent Dirichlet allocation (LDA) topic vector of short message A, and $vec(B)$ is the LDA topic vector of short message B;
$Sim_{gram}(A, B) = |D(A) \cap D(B)| / |D(A) \cup D(B)|$;
where $D(A)$ is the set of 2-gram word pairs of short message A, and $D(B)$ is the set of 2-gram word pairs of short message B.
7. The method according to claim 1, wherein performing hierarchical clustering on the similarity matrix using the default similarity threshold to obtain the reference category quantity comprises:
Comparing the value of each similarity in the similarity matrix with the default similarity threshold;
Extracting all similarities in the similarity matrix that are greater than the default similarity threshold;
Determining short messages whose pairwise similarities are all greater than the default similarity threshold as one category;
Determining the number of categories obtained as the reference category quantity.
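One way to realize the threshold-based grouping of claim 7 is complete-linkage agglomerative clustering, which only merges two groups when every cross pair exceeds the threshold. The sketch below is an assumption about how the steps could be implemented; the function names are hypothetical and the claim does not prescribe a merge order.

```python
from typing import List, Optional, Tuple


def complete_linkage(matrix: List[List[float]], c1: List[int], c2: List[int]) -> float:
    """Smallest similarity between any member of c1 and any member of c2."""
    return min(matrix[i][j] for i in c1 for j in c2)


def reference_category_quantity(
    matrix: List[List[float]], threshold: float
) -> Tuple[int, List[List[int]]]:
    """Merge groups while all cross-pair similarities exceed the threshold."""
    clusters = [[i] for i in range(len(matrix))]
    while len(clusters) > 1:
        best: Optional[Tuple[float, int, int]] = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = complete_linkage(matrix, clusters[a], clusters[b])
                if s > threshold and (best is None or s > best[0]):
                    best = (s, a, b)
        if best is None:
            break  # no pair of groups can be merged any further
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # The number of groups left is the reference category quantity.
    return len(clusters), clusters


example = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
quantity, groups = reference_category_quantity(example, threshold=0.5)
```

With the example matrix, messages 0 and 1 form one category and messages 2 and 3 form another, so the reference category quantity is 2.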
8. The method according to any one of claims 1 to 7, wherein the method further comprises:
When the short message cluster result does not meet the preset condition, correcting the value of the reference category quantity and iteratively performing the determining of category quantities and the spectral clustering process until the obtained short message cluster result meets the preset condition.
9. The method according to claim 8, wherein correcting the value of the reference category quantity comprises:
Obtaining the default ratio in the short message cluster result;
When the default ratio is less than a second threshold, decreasing the value of the reference category quantity by a first preset value;
When the default ratio is greater than the second threshold, increasing the value of the reference category quantity by a second preset value.
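The correction rule of claim 9 and the iteration of claim 8 can be sketched together. Everything named here is hypothetical: `evaluate_ratio` stands in for the steps of determining the category quantities, running spectral clustering, and computing the ratio; the preset values default to 1; the lower bound of 1 on the quantity and the `max_rounds` guard are assumptions added so the sketch always terminates.

```python
from typing import Callable


def correct_reference_quantity(
    k: int,
    ratio: float,
    second_threshold: float,
    first_preset: int = 1,
    second_preset: int = 1,
) -> int:
    """Decrease k when the ratio is below the second threshold, increase it when above."""
    if ratio < second_threshold:
        return max(1, k - first_preset)  # assumption: never go below one category
    if ratio > second_threshold:
        return k + second_preset
    return k  # the claim does not say what happens on equality; keep k unchanged


def cluster_until_satisfied(
    k: int,
    evaluate_ratio: Callable[[int], float],
    first_threshold: float,
    second_threshold: float,
    max_rounds: int = 20,
) -> int:
    """Repeat evaluation and correction until the ratio exceeds the first threshold."""
    for _ in range(max_rounds):
        ratio = evaluate_ratio(k)
        if ratio > first_threshold:
            return k  # preset condition met: keep this reference category quantity
        k = correct_reference_quantity(k, ratio, second_threshold)
    return k  # assumption: give up after max_rounds and return the last value


# Toy stand-in for the real evaluation: a fixed ratio for each candidate k.
toy_ratios = {3: 0.2, 2: 0.9}
final_k = cluster_until_satisfied(3, toy_ratios.__getitem__, 0.8, 0.5)
```

In the toy run, k = 3 yields a ratio below both thresholds, so k is reduced to 2, whose ratio exceeds the first threshold and ends the loop.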
10. A short message clustering device, comprising:
A matrix construction module, configured to construct a similarity matrix of a short message set according to the structural similarity between any two short messages in the short message set and the similarity of the short message sentences in structure and semantics;
A hierarchical clustering determining module, configured to perform hierarchical clustering on the similarity matrix using a default similarity threshold to obtain a reference category quantity;
A category quantity determining module, configured to determine category quantities, including: determining a first category quantity and a second category quantity according to the reference category quantity;
A spectral clustering module, configured to perform a spectral clustering process, including: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain a short message cluster result;
A result determining module, configured to determine, when the short message cluster result meets a preset condition, the short message cluster result corresponding to the reference category quantity as a target message cluster result.
11. The device according to claim 10, wherein the first category quantity and the second category quantity are adjacent to the reference category quantity, and the first category quantity is less than the second category quantity.
12. The device according to claim 10, wherein the spectral clustering module comprises:
A feature acquisition submodule, configured to obtain the eigenvalues of the similarity matrix and the eigenvector corresponding to each eigenvalue;
A vector selection submodule, configured to select three groups of eigenvectors from the eigenvectors, in ascending order, according to the reference category quantity, the first category quantity and the second category quantity respectively;
A vector space forming submodule, configured to form three feature vector spaces from the three selected groups of eigenvectors respectively;
A clustering submodule, configured to cluster the feature vectors in each feature vector space using the K-means clustering algorithm, to obtain three groups of cluster categories as the short message cluster result.
13. The device according to claim 12, wherein the device comprises:
An average value calculation module, configured to calculate, for each group of cluster categories, the weighted average of the distances between the cluster centroids of any two cluster categories;
A ratio calculation module, configured to calculate the ratio of the weighted averages of the three groups of cluster categories using a default ratio formula;
A ratio judging module, configured to judge whether the ratio is greater than a first threshold;
A first determining module, configured to determine that the short message cluster result meets the preset condition when the ratio is greater than the first threshold.
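The weighted average of centroid distances used by the average value calculation module can be sketched as below. The claims do not state how the weights are chosen, so weighting each pair by the product of the two cluster sizes is purely an assumption, as are the function names; the default ratio formula itself is not reproduced here because it is not given in this section.

```python
import math
from itertools import combinations
from typing import List, Sequence


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of the points in one cluster."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def weighted_mean_centroid_distance(clusters: Sequence[Sequence[Sequence[float]]]) -> float:
    """Weighted average of the distances between every pair of cluster centroids."""
    centroids = [centroid(c) for c in clusters]
    sizes = [len(c) for c in clusters]
    total = 0.0
    total_weight = 0.0
    for i, j in combinations(range(len(clusters)), 2):
        weight = sizes[i] * sizes[j]  # assumption: weight a pair by its cluster sizes
        total += weight * math.dist(centroids[i], centroids[j])
        total_weight += weight
    return total / total_weight if total_weight else 0.0


groups = [
    [(0.0, 0.0), (0.0, 2.0)],  # centroid (0, 1)
    [(3.0, 1.0)],              # centroid (3, 1)
    [(0.0, 5.0)],              # centroid (0, 5)
]
average = weighted_mean_centroid_distance(groups)
```

One such average would be computed for each of the three groups of cluster categories, and the three values would then be combined by the ratio formula.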
14. The device according to claim 10, wherein the matrix construction module comprises:
A similarity calculation submodule, configured to calculate the similarity between any two short messages in the short message set using a default similarity formula;
A matrix generation submodule, configured to generate a matrix containing all of the similarities as the similarity matrix of the short message set.
15. The device according to claim 14, wherein the default similarity formula is:
$\mathrm{Sim}(A,B) = \mathrm{Sim}_{\mathrm{struct}}(A,B) \times \bigl(\alpha\,\mathrm{Sim}_{t}(A,B) + \beta\,\mathrm{Sim}_{\mathrm{gram}}(A,B)\bigr)$
Wherein $\alpha \in [0,1]$ and $\beta \in [0,1]$;
When short message A and short message B have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 1$;
When short message A and short message B do not have the same structure, $\mathrm{Sim}_{\mathrm{struct}}(A,B) = 0$;
$\mathrm{Sim}_{t}(A,B) = \cos(\mathrm{vec}(A), \mathrm{vec}(B))$;
Wherein vec(A) is the LDA topic vector of short message A, and vec(B) is the LDA topic vector of short message B;
$\mathrm{Sim}_{\mathrm{gram}}(A,B) = |D(A) \cap D(B)| \,/\, |D(A) \cup D(B)|$;
Wherein D(A) is the set of 2-gram word pairs of short message A, and D(B) is the set of 2-gram word pairs of short message B.
16. The device according to claim 10, wherein the hierarchical clustering determining module comprises:
A comparison submodule, configured to compare the value of each similarity in the similarity matrix with the default similarity threshold;
An extraction submodule, configured to extract all similarities in the similarity matrix that are greater than the default similarity threshold;
A category determining submodule, configured to determine short messages whose pairwise similarities are all greater than the default similarity threshold as one category;
A reference category quantity determining submodule, configured to take the number of determined categories as the reference category quantity.
17. The device according to any one of claims 10 to 16, wherein the device further comprises:
A correction module, configured to correct the value of the reference category quantity when the short message cluster result does not meet the preset condition;
After the correction by the correction module, the determining of category quantities and the spectral clustering process are performed iteratively until the obtained short message cluster result meets the preset condition.
18. The device according to claim 17, wherein the correction module comprises:
A ratio acquisition submodule, configured to obtain the default ratio in the short message cluster result;
A first correction submodule, configured to decrease the value of the reference category quantity by a first preset value when the default ratio is less than a second threshold;
A second correction submodule, configured to increase the value of the reference category quantity by a second preset value when the default ratio is greater than the second threshold.
19. A terminal, comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to:
Construct a similarity matrix of a short message set according to the structural similarity between any two short messages in the short message set and the similarity of the short message sentences in structure and semantics;
Perform hierarchical clustering on the similarity matrix using a default similarity threshold to obtain a reference category quantity;
Determine category quantities, including: determining a first category quantity and a second category quantity according to the reference category quantity;
Perform a spectral clustering process, including: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain a short message cluster result;
When the short message cluster result meets a preset condition, determine the short message cluster result corresponding to the reference category quantity as a target message cluster result.
20. A server, comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to:
Construct a similarity matrix of a short message set according to the structural similarity between any two short messages in the short message set and the similarity of the short message sentences in structure and semantics;
Perform hierarchical clustering on the similarity matrix using a default similarity threshold to obtain a reference category quantity;
Determine category quantities, including: determining a first category quantity and a second category quantity according to the reference category quantity;
Perform a spectral clustering process, including: performing spectral clustering on the similarity matrix with the reference category quantity, the first category quantity and the second category quantity each used as the number of clusters, to obtain a short message cluster result;
When the short message cluster result meets a preset condition, determine the short message cluster result corresponding to the reference category quantity as a target message cluster result.
CN201610191407.6A 2016-03-30 2016-03-30 Short message clustering method and device Active CN105824955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610191407.6A CN105824955B (en) 2016-03-30 2016-03-30 Short message clustering method and device

Publications (2)

Publication Number Publication Date
CN105824955A (en) 2016-08-03
CN105824955B (en) 2019-02-19

Family

ID=56525338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610191407.6A Active CN105824955B (en) 2016-03-30 2016-03-30 Short message clustering method and device

Country Status (1)

Country Link
CN (1) CN105824955B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets
CN109800775B (en) * 2017-11-17 2022-10-28 腾讯科技(深圳)有限公司 File clustering method, device, equipment and readable medium
CN108959440A (en) * 2018-06-13 2018-12-07 福建新大陆软件工程有限公司 Short message clustering method and device
CN112148942B (en) * 2019-06-27 2024-04-09 北京达佳互联信息技术有限公司 Business index data classification method and device based on data clustering
CN110730270B (en) * 2019-09-09 2021-09-14 上海斑马来拉物流科技有限公司 Short message grouping method and device, computer storage medium and electronic equipment
CN111507400B (en) * 2020-04-16 2023-10-31 腾讯科技(深圳)有限公司 Application classification method, device, electronic equipment and storage medium
CN117880765B (en) * 2024-03-13 2024-05-28 深圳市诚立业科技发展有限公司 Intelligent management system for short message data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101860822A (en) * 2010-06-11 2010-10-13 中兴通讯股份有限公司 Method and system for monitoring spam messages
CN104112026B (en) * 2014-08-01 2017-09-08 中国联合网络通信集团有限公司 Short message text classification method and system
CN104699668B (en) * 2015-03-26 2017-09-26 小米科技有限责任公司 Method and device for determining word similarity
CN104778256B (en) * 2015-04-20 2017-10-17 江苏科技大学 Fast incremental clustering method for consultations in a domain question answering system

Also Published As

Publication number Publication date
CN105824955A (en) 2016-08-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant