CN109739978A - Text clustering method, text clustering apparatus and terminal device - Google Patents
Text clustering method, text clustering apparatus and terminal device
- Publication number
- Publication number: CN109739978A
- Application number: CN201811508368.3A
- Authority
- CN
- China
- Prior art keywords
- text
- trained
- word
- vector
- clustered
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application relates to the field of deep learning and provides a text clustering method, a text clustering apparatus and a terminal device. The method comprises: obtaining a training text, and performing word-segmentation preprocessing on the training text to obtain a plurality of words to be trained; training a preset transformation model with the words to be trained to obtain a trained transformation model; obtaining a text to be clustered, and performing word-segmentation preprocessing on it to obtain a plurality of text feature words; converting each text feature word into a word vector using the trained transformation model, and superposing all word vectors of the text to be clustered to obtain a text vector of the text to be clustered; and clustering the text vectors to obtain a clustering result. This method can effectively improve the accuracy of text clustering results.
Description
Technical field
This application relates to the field of deep learning, and in particular to a text clustering method, a text clustering apparatus and a terminal device.
Background art
Text clustering builds on traditional cluster analysis; its premise is that documents in the same class are highly similar while documents in different classes are less similar. As an unsupervised machine learning method, clustering requires no training process and no manually labeled documents, so it offers flexibility and a high degree of automatic processing. It has therefore become an important means of organizing and identifying text information and attracts more and more researchers. Existing text clustering methods, however, still do not achieve high accuracy.
Summary of the invention
In view of this, embodiments of the present application provide a text clustering method, a text clustering apparatus and a terminal device, to solve the problem that the results of existing text clustering methods have low accuracy.
A first aspect of the embodiments of the present application provides a text clustering method, comprising:
obtaining a training text, and performing word-segmentation preprocessing on the training text to obtain a plurality of words to be trained;
training a preset transformation model with the words to be trained to obtain a trained transformation model;
obtaining a text to be clustered, and performing word-segmentation preprocessing on the text to be clustered to obtain a plurality of text feature words;
converting each text feature word into a word vector using the trained transformation model, and superposing all word vectors of the text to be clustered to obtain a text vector of the text to be clustered;
clustering the text vectors to obtain a clustering result.
A second aspect of the embodiments of the present application provides a text clustering apparatus, comprising:
an acquiring unit, configured to obtain a training text and perform word-segmentation preprocessing on the training text to obtain a plurality of words to be trained;
a training unit, configured to train a preset transformation model with the words to be trained to obtain a trained transformation model;
a preprocessing unit, configured to obtain a text to be clustered and perform word-segmentation preprocessing on the text to be clustered to obtain a plurality of text feature words;
a superposition unit, configured to convert each text feature word into a word vector using the trained transformation model, and to superpose all word vectors of the text to be clustered to obtain a text vector of the text to be clustered;
a clustering unit, configured to cluster the text vectors to obtain a clustering result.
A third aspect of the embodiments of the present application provides a terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method provided by the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, implements the steps of the method provided by the first aspect.
Compared with the prior art, the embodiments of the present application have the following beneficial effects:
A training text is obtained and word-segmentation preprocessing is performed on it to obtain a plurality of words to be trained, with which a preset transformation model is trained; this yields a trained transformation model. A text to be clustered is then obtained and preprocessed into a plurality of text feature words, each of which is converted into a word vector by the trained transformation model; the trained model can convert the feature words of the text to be clustered into word vectors more accurately. All word vectors of the text to be clustered are superposed into a text vector, and the text vectors are clustered to obtain a clustering result. In this way, accurate word vectors are obtained, which effectively improves the accuracy of the text clustering result.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the following drawings show only some embodiments of the present application; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a text clustering method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of a text clustering apparatus provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of a terminal device provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of binary trees provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of the construction of a Huffman tree provided by an embodiment of the present application.
Detailed description of the embodiments
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it will be clear to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the term "comprising", when used in this specification and the appended claims, indicates the presence of the described features, wholes, steps, operations, elements and/or components, but does not preclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in this specification and the appended claims refers to and encompasses any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" and "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
In order to illustrate the technical solutions described herein, specific embodiments are described below.
Fig. 1 is a schematic flowchart of the text clustering method provided by an embodiment of the present application. As shown, the method may comprise the following steps.
Step S101: obtain a training text, and perform word-segmentation preprocessing on the training text to obtain a plurality of words to be trained.
The minimum unit of English is the word, and words are separated by spaces. The minimum unit of Chinese, by contrast, is the character: characters run on continuously, with no explicit separation between words. From the perspective of semantic analysis, the word is the atomic semantic unit, so a text must first be correctly cut into words before its meaning can be understood; Chinese text must therefore be segmented before it can be classified or clustered. Word segmentation of Chinese text means cutting the continuous character string into individual words, each carrying a definite meaning, according to certain rules.
In one embodiment, performing word-segmentation preprocessing on the training text to obtain a plurality of words to be trained comprises:
removing the punctuation marks in the training text to obtain a first preprocessed text;
removing the stop words in the first preprocessed text to obtain a second preprocessed text;
performing word segmentation on the second preprocessed text to obtain a plurality of text feature words.
In practical applications, the text must be preprocessed before segmentation: punctuation marks such as ".", "*", "/" and "+" are removed, as are stop words, that is, function words with little standalone meaning such as "the", "a", "an", "that", "you", "I", "they", "want to", "open" and "can". What remains are the text feature words needed for training.
Stop words are words or characters that are filtered out automatically before or after processing natural-language data (or text) in information retrieval, in order to save storage space and improve search efficiency. Stop words are usually entered manually rather than generated automatically, and together they form a stop-word table.
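The preprocessing pipeline above can be sketched in a few lines. This is an illustrative sketch only: the stop-word list here is a tiny assumed example, not the patent's table, and whitespace splitting stands in for a real segmenter (Chinese text would need a word-segmentation library such as jieba, since words are not space-delimited).

```python
import re

# Illustrative stop-word list; a real system would load a manually
# curated stop-word table, as the description notes.
STOP_WORDS = {"the", "a", "an", "that", "you", "i", "they"}

def preprocess(text):
    """Remove punctuation, then stop words, and return the feature words."""
    # Step 1: strip punctuation such as ".", "*", "/", "+" (first preprocessed text).
    no_punct = re.sub(r"[^\w\s]", " ", text)
    # Steps 2-3: tokenize and drop stop words (second preprocessed text
    # -> text feature words). Whitespace splitting only works for English.
    return [w for w in no_punct.lower().split() if w not in STOP_WORDS]

tokens = preprocess("Open the door/window, please!")
```

The three intermediate results mirror the embodiment's first preprocessed text, second preprocessed text, and text feature words.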
Step S102: train a preset transformation model with the words to be trained to obtain a trained transformation model.
In one embodiment, training a preset transformation model with the words to be trained to obtain a trained transformation model comprises:
counting the frequency with which each word to be trained occurs in the training text, and constructing a Huffman tree according to the word frequencies;
obtaining initial information, and training the words to be trained according to the initial information and the constructed Huffman tree to obtain the trained transformation model.
The initial information includes a preset window, an initial parameter vector and initial word vectors.
A Huffman tree is the binary tree with the shortest weighted path length (WPL), also called the optimal binary tree. Fig. 4 is a schematic diagram of binary trees provided by an embodiment of the present application. As shown, the weighted path length in Fig. 4(a) is WPL = 5*2 + 7*2 + 2*2 + 13*2 = 54, while in Fig. 4(b) it is WPL = 5*3 + 2*3 + 7*2 + 13*1 = 48. Since the weighted path length of Fig. 4(b) is smaller, Fig. 4(b) is the Huffman tree.
In practice, a Huffman tree can be created as follows. Suppose there are n nodes with weights w1, w2, ..., wn, forming a set of binary trees F = {T1, T2, ..., Tn}; a Huffman tree with n leaf nodes is then constructed by these steps:
1) choose the two trees in F whose root weights are smallest, and use them as the left and right subtrees of a new binary tree whose root weight is the sum of the root weights of its two subtrees;
2) delete the two chosen trees from F, and add the newly constructed tree to F;
3) repeat steps 1) and 2) until F contains only one tree.
Illustratively, Fig. 5 is a schematic diagram of the construction of a Huffman tree provided by an embodiment of the present application. As shown, there are 5 nodes with weights 1, 3, 2, 5 and 4. First the two smallest weights, 1 and 2, are chosen and merged into a new binary tree whose root weight is their sum, 3. The next smallest weight is the node 3, which is merged with the new tree of weight 3 to give a node of weight 6. Continuing in the same manner yields the binary tree shown in step 5 of Fig. 5.
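The three-step procedure above maps naturally onto a min-heap. The sketch below is illustrative, not the patent's implementation; it uses the identity that the WPL equals the sum of all merged-node weights (each merge adds one level above every leaf beneath it). As a check, the leaf weights 5, 7, 2 and 13 of Fig. 4(b) give WPL = 48, matching the value computed in the description, and the five weights of Fig. 5 give a root of weight 15.

```python
import heapq

def huffman_wpl(weights):
    """Build a Huffman tree over the leaf weights and return
    (root weight, weighted path length)."""
    heap = list(weights)
    heapq.heapify(heap)
    wpl = 0
    while len(heap) > 1:
        # Step 1: take the two smallest roots and merge them into a new tree.
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        merged = a + b
        wpl += merged          # each merge deepens its leaves by one level
        # Step 2: put the new tree back; step 3: repeat until one tree remains.
        heapq.heappush(heap, merged)
    return heap[0], wpl

root, wpl = huffman_wpl([1, 3, 2, 5, 4])
```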
In one embodiment, training the words to be trained according to the initial information and the constructed Huffman tree to obtain the trained transformation model comprises:
obtaining the context of a word to be trained according to the preset window in the initial information, and summing the word vectors of the words to be trained contained in that context to obtain a sum vector;
determining the path in the Huffman tree from the root node to the word to be trained;
computing the probability corresponding to that path from the sum vector, using the Bayes formula;
taking the logarithm of the probability to obtain an objective function, which serves as the trained transformation model.
In one embodiment, after taking the logarithm of the probability to obtain the objective function, the method further comprises:
differentiating the objective function with respect to the initial parameter vector in the initial information to obtain a first increment, and updating the initial parameter vector using θ' = θ0 + αη1;
differentiating the objective function with respect to the sum vector to obtain a second increment, and updating the initial word vector using X' = X0 + βη2;
where θ' is the updated parameter vector, θ0 is the initial parameter vector, α is a first preset weight, η1 is the first increment, X' is the updated word vector of the word to be trained, X0 is the initial word vector of the word to be trained, β is a second preset weight, and η2 is the second increment.
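The two update rules can be sketched as plain vector arithmetic. In this illustrative sketch the increments η1 and η2 are passed in as given values; actually deriving them by differentiating the log-probability objective over the Huffman-tree path, as the embodiment specifies, is not shown here.

```python
def update(theta0, x0, eta1, eta2, alpha, beta):
    """Apply the embodiment's two update rules:
    theta' = theta0 + alpha * eta1   (parameter vector)
    x'     = x0     + beta  * eta2   (word vector, via the sum vector)
    eta1/eta2 are the increments obtained by differentiating the
    objective function; here they are plain lists for illustration."""
    theta = [t + alpha * g for t, g in zip(theta0, eta1)]
    x = [v + beta * g for v, g in zip(x0, eta2)]
    return theta, x

theta, x = update([0.5, -0.2], [1.0, 2.0], [0.1, 0.4], [-0.5, 0.25], 0.3, 0.2)
```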
Step S103: obtain a text to be clustered, and perform word-segmentation preprocessing on the text to be clustered to obtain a plurality of text feature words.
Step S103 is similar to step S101; for details, refer to step S101.
Step S104: convert each text feature word into a word vector using the trained transformation model, and superpose all word vectors of the text to be clustered to obtain a text vector of the text to be clustered.
In one embodiment, superposing all word vectors of the text to be clustered to obtain the text vector of the text to be clustered comprises:
calculating the weight of each text feature word using the TF-IDF algorithm;
multiplying the word vector of each text feature word by that word's weight to obtain the feature vector of the text feature word;
superposing the feature vectors of all text feature words to obtain the text vector.
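The superposition of step S104 can be sketched as a weighted sum of vectors. The word vectors and TF-IDF weights below are made-up illustrative values; in the method they would come from the trained transformation model and the TF-IDF calculation respectively.

```python
def text_vector(word_vectors, weights):
    """Superpose TF-IDF-weighted word vectors into a single text vector.
    word_vectors: one vector per text feature word (equal lengths).
    weights: the TF-IDF weight of each feature word."""
    dim = len(word_vectors[0])
    total = [0.0] * dim
    for vec, w in zip(word_vectors, weights):
        for i in range(dim):
            # feature vector = weight * word vector, accumulated per dimension
            total[i] += w * vec[i]
    return total

tv = text_vector([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]], [0.5, 0.25, 0.1])
```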
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical method for evaluating how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in proportion to its frequency in the corpus as a whole.
To calculate the weight of a text feature word, first compute its TF (term frequency), then its IDF (inverse document frequency), and finally multiply TF by IDF to obtain the weight.
Illustratively, if a document contains 100 words in total and the word "cow" occurs 3 times, then the term frequency of "cow" in this document is TF = 3/100 = 0.03. One way to compute the inverse document frequency (IDF) is to divide the total number of documents in the document set by the number of documents containing "cow". If "cow" appears in 1,000 documents out of a total of 10,000,000, the inverse document frequency is IDF = lg(10,000,000 / 1,000) = 4. The weight of "cow" is therefore 0.03 * 4 = 0.12.
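The "cow" example above can be reproduced directly; this sketch implements exactly the TF and IDF definitions given in the description (base-10 logarithm, as the "lg" notation indicates).

```python
import math

def tfidf(term_count, doc_len, total_docs, docs_with_term):
    """TF-IDF weight as described: TF = occurrences / document length,
    IDF = log10(total documents / documents containing the term)."""
    tf = term_count / doc_len
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf

# The "cow" example: TF = 3/100 = 0.03, IDF = lg(10,000,000/1,000) = 4,
# so the weight is 0.03 * 4 = 0.12.
w = tfidf(3, 100, 10_000_000, 1_000)
```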
Step S105: cluster the text vectors to obtain a clustering result.
In one embodiment, clustering the text vectors to obtain a clustering result comprises:
obtaining initialization parameters, including a preset threshold and a preset learning rate;
choosing one text vector from all text vectors and marking it as a centre vector, marking the remaining text vectors as vectors to be clustered, and feeding each vector to be clustered into the clustering model in turn;
outputting the clustering result after all vectors to be clustered have been input into the clustering model.
In practical applications, one of the text vectors is chosen at random as the centre vector, and the remaining text vectors are fed into the clustering model in sequence. The centre vector corresponds to a cluster centre in a clustering algorithm: in this embodiment, a cluster centre is first determined at random, and the other text vectors are then clustered against it in turn. The specific clustering process is described in the following embodiment.
In one embodiment, feeding each vector to be clustered into the clustering model in turn comprises:
computing the activation value between the vector to be clustered and each centre vector as net_ij = W_i * X_j, where net_ij is the activation value between the j-th vector to be clustered and the i-th centre vector, W_i is the i-th centre vector, and X_j is the j-th vector to be clustered;
selecting the maximum of the computed activation values, taking the centre vector corresponding to the maximum activation value as a target vector, and judging whether the maximum activation value is greater than the preset threshold;
if the maximum activation value is greater than the preset threshold, updating the target vector using W_t = W_t + ηX_j, where W_t is the target vector and η is the preset learning rate;
if the maximum activation value is less than or equal to the preset threshold, marking the vector to be clustered as a new centre vector and incrementing the number of centre vectors by 1.
In practical applications, if the largest of the activation values between a text vector and the cluster centres exceeds the preset threshold, the text vector belongs to the same class as the cluster centre corresponding to that maximum activation value; that cluster centre is then updated with the text vector, i.e. the new centre of the class is redetermined. If the largest activation value is less than or equal to the preset threshold, the text vector belongs to no existing cluster centre and is defined as a new cluster centre, so the number of cluster centres increases by 1.
With many text vectors, the single initial cluster centre may thus give rise to several new cluster centres during clustering; after each new centre is created, the activation values between the next input text vector and all existing cluster centres must be computed.
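The incremental procedure above can be sketched as follows. This is an illustrative simplification under assumed inputs: the first vector seeds the first centre, each later vector activates every centre via a dot product, joins the best-matching centre (nudging it by the learning rate) if the activation exceeds the threshold, and otherwise becomes a new centre; class labels are returned alongside the centres for convenience.

```python
def cluster(vectors, threshold, lr):
    """Threshold-based incremental clustering as described in the embodiment."""
    centers = [list(vectors[0])]   # the randomly chosen initial centre vector
    labels = [0]
    for x in vectors[1:]:
        # net_ij = W_i . X_j for every existing centre
        acts = [sum(w * v for w, v in zip(c, x)) for c in centers]
        best = max(range(len(acts)), key=lambda i: acts[i])
        if acts[best] > threshold:
            # W_t = W_t + eta * X_j : pull the winning centre toward x
            centers[best] = [w + lr * v for w, v in zip(centers[best], x)]
            labels.append(best)
        else:
            centers.append(list(x))           # new cluster centre
            labels.append(len(centers) - 1)   # centre count incremented by 1
    return centers, labels

centers, labels = cluster([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], 0.5, 0.1)
```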
In one embodiment, after the clustering result is obtained, the method further comprises:
obtaining the centre vectors in the clustering result and the text vectors contained in each class, counting the centre vectors, and taking the number of centre vectors as the number of classes;
computing a clustering index and judging whether it falls within a preset range; consistent with the definitions below, the index has the Davies-Bouldin form DB = (1/K) Σ_{m=1..K} max_{n≠m} (D_m + D_n) / C_mn;
if the clustering index is not within the preset range, clustering the text vectors of the text to be clustered again using the preset clustering model;
where DB is the clustering index, K is the number of classes, D_m is the average distance from all text vectors in class m to the centre vector of class m, D_n is the average distance from all text vectors in class n to the centre vector of class n, and C_mn is the distance between the centre vectors of classes m and n.
Through the foregoing embodiment, the clustering result can be evaluated: the smaller the DB value, the lower the similarity between classes, and hence the better the clustering result.
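The original formula for the clustering index is not reproduced in this text, but the definitions of DB, K, D_m, D_n and C_mn match the standard Davies-Bouldin index; the sketch below implements that index under this assumption, with made-up illustrative inputs.

```python
def db_index(d, center_dist):
    """Davies-Bouldin-style index matching the description's definitions.
    d[m]: average distance of class m's vectors to its centre (D_m).
    center_dist[m][n]: distance between the centres of classes m and n (C_mn).
    DB = (1/K) * sum over m of max over n != m of (D_m + D_n) / C_mn;
    smaller values mean less similarity between classes, i.e. a better result."""
    k = len(d)
    total = 0.0
    for m in range(k):
        total += max((d[m] + d[n]) / center_dist[m][n]
                     for n in range(k) if n != m)
    return total / k

# Two classes with within-class spreads 1.0 and 2.0 and centres 4.0 apart.
score = db_index([1.0, 2.0], [[0.0, 4.0], [4.0, 0.0]])
```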
In the embodiments of the present application, a training text is obtained and word-segmentation preprocessing is performed on it to obtain a plurality of words to be trained, with which a preset transformation model is trained to obtain a trained transformation model. A text to be clustered is then obtained and preprocessed into a plurality of text feature words, each of which is converted into a word vector by the trained transformation model; the trained model can convert the feature words of the text to be clustered into word vectors more accurately. All word vectors of the text to be clustered are superposed into a text vector, and the text vectors are clustered to obtain a clustering result. In this way, accurate word vectors are obtained, which effectively improves the accuracy of the text clustering result.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present application.
Fig. 2 is a schematic diagram of the text clustering apparatus provided by an embodiment of the present application; for ease of description, only the parts relevant to this embodiment are shown.
The text clustering apparatus shown in Fig. 2 may be a software unit, a hardware unit or a combined software/hardware unit built into an existing terminal device, may be integrated into the terminal device as an independent add-on, or may exist as an independent terminal device.
The text clustering apparatus 2 comprises:
an acquiring unit 21, configured to obtain a training text and perform word-segmentation preprocessing on the training text to obtain a plurality of words to be trained;
a training unit 22, configured to train a preset transformation model with the words to be trained to obtain a trained transformation model;
a preprocessing unit 23, configured to obtain a text to be clustered and perform word-segmentation preprocessing on the text to be clustered to obtain a plurality of text feature words;
a superposition unit 24, configured to convert each text feature word into a word vector using the trained transformation model, and to superpose all word vectors of the text to be clustered to obtain a text vector of the text to be clustered;
a clustering unit 25, configured to cluster the text vectors to obtain a clustering result.
Optionally, the acquiring unit 21 comprises:
a first removal module, configured to remove the punctuation marks in the training text to obtain a first preprocessed text;
a second removal module, configured to remove the stop words in the first preprocessed text to obtain a second preprocessed text;
a word segmentation module, configured to perform word segmentation on the second preprocessed text to obtain a plurality of text feature words.
Optionally, the training unit 22 comprises:
a statistics module, configured to count the frequency with which each word to be trained occurs in the training text, and to construct a Huffman tree according to the word frequencies;
a construction module, configured to obtain initial information, and to train the words to be trained according to the initial information and the constructed Huffman tree to obtain the trained transformation model;
wherein the initial information includes a preset window, an initial parameter vector and initial word vectors.
Optionally, the construction module comprises:
a first computation submodule, configured to obtain the context of a word to be trained according to the preset window in the initial information, and to sum the word vectors of the words to be trained contained in that context to obtain a sum vector;
a determination submodule, configured to determine the path in the Huffman tree from the root node to the word to be trained;
a second computation submodule, configured to compute the probability corresponding to that path from the sum vector, using the Bayes formula;
a third computation submodule, configured to take the logarithm of the probability to obtain an objective function serving as the trained transformation model.
Optionally, the construction module further comprises:
a fourth computation submodule, configured to differentiate the objective function with respect to the initial parameter vector in the initial information to obtain a first increment after the objective function is obtained, and to update the initial parameter vector using θ' = θ0 + αη1;
a differentiation submodule, configured to differentiate the objective function with respect to the sum vector to obtain a second increment, and to update the initial word vector using X' = X0 + βη2;
where θ' is the updated parameter vector, θ0 is the initial parameter vector, α is a first preset weight, η1 is the first increment, X' is the updated word vector of the word to be trained, X0 is the initial word vector of the word to be trained, β is a second preset weight, and η2 is the second increment.
Optionally, the superposition unit 24 comprises:
a computation module, configured to calculate the weight of each text feature word using the TF-IDF algorithm;
a product module, configured to multiply the word vector of each text feature word by that word's weight to obtain the feature vector of the text feature word;
a superposition module, configured to superpose the feature vectors of all text feature words to obtain the text vector.
Those skilled in the art will clearly appreciate that, for convenience and brevity of description, the division into the above functional units and modules is merely illustrative. In practical applications, the above functions may be allocated to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units in the embodiments may be integrated into one processing unit, may exist physically on their own, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or of a software functional unit. The specific names of the functional units and modules are only for ease of distinguishing them from one another and are not intended to limit the scope of protection of this application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
Fig. 3 is a schematic diagram of a terminal device provided by an embodiment of this application. As shown in Fig. 3, the terminal device 3 of this embodiment includes a processor 30, a memory 31, and a computer program 32 stored in the memory 31 and executable on the processor 30. When executing the computer program 32, the processor 30 implements the steps in each of the text clustering method embodiments above, for example steps S101 to S105 shown in Fig. 1; alternatively, it implements the functions of the modules/units in each of the device embodiments above, for example the functions of modules 21 to 25 shown in Fig. 2.
Illustratively, the computer program 32 may be divided into one or more modules/units, which are stored in the memory 31 and executed by the processor 30 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments describing the execution process of the computer program 32 in the terminal device 3. For example, the computer program 32 may be divided into an acquisition unit, a training unit, a preprocessing unit, a superposition unit, and a clustering unit, whose specific functions are as follows:
an acquisition unit, configured to obtain a training text and perform word-segmentation preprocessing on the training text to obtain a plurality of words to be trained;
a training unit, configured to train a preset transformation model using the words to be trained, to obtain a trained transformation model;
a preprocessing unit, configured to obtain a text to be clustered and perform word-segmentation preprocessing on the text to be clustered to obtain a plurality of text feature words;
a superposition unit, configured to convert each text feature word into a word vector using the trained transformation model, and to superpose all the word vectors in the text to be clustered to obtain the text vector of the text to be clustered; and
a clustering unit, configured to cluster the text vectors to obtain a clustering result.
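The units above form a train, embed, superpose, cluster pipeline. A minimal sketch of the superposition step alone is shown below; the toy word vectors stand in for a trained transformation model and are not from the patent:

```python
def text_vector(feature_words, word_vecs, dim=3):
    # Superposition unit: the text vector is the sum of the word vectors
    # of all feature words in the text to be clustered.
    vec = [0] * dim
    for w in feature_words:
        for i, v in enumerate(word_vecs.get(w, [0] * dim)):
            vec[i] += v
    return vec

# Toy vectors only; a real system would look these up in the trained model.
word_vecs = {"cat": [1, 0, 0], "dog": [2, 1, 0]}
print(text_vector(["cat", "dog"], word_vecs))  # [3, 1, 0]
```

Texts whose summed vectors lie close together would then be grouped by the clustering unit.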
Optionally, the acquisition unit includes:
a first removal module, configured to remove the punctuation marks in the training text to obtain a first preprocessed text;
a second removal module, configured to remove the stop words in the first preprocessed text to obtain a second preprocessed text; and
a word segmentation module, configured to perform word segmentation on the second preprocessed text to obtain a plurality of text feature words.
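The three modules of the acquisition unit can be sketched as follows. The stop-word list and the whitespace tokenizer are illustrative placeholders; a Chinese-language pipeline would substitute a dedicated segmenter, and stop-word removal is applied after splitting here simply because tokens are easiest to test:

```python
import string

STOP_WORDS = {"the", "a", "of", "is"}  # illustrative stop-word list

def preprocess(text):
    # First removal module: strip punctuation marks.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Word segmentation module: whitespace split stands in for real segmentation.
    tokens = no_punct.lower().split()
    # Second removal module: drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The quick, brown fox!"))  # ['quick', 'brown', 'fox']
```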
Optionally, the training unit includes:
a statistics module, configured to count the frequency with which each word to be trained occurs in the training text, and to construct a Huffman tree according to the word frequencies; and
a construction module, configured to obtain initial information and, according to the initial information and the constructed Huffman tree, train the words to be trained to obtain the trained transformation model.
The initial information includes a preset window, an initial parameter vector, and an initial word vector.
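The statistics module's Huffman construction can be sketched with a binary heap. This is generic Huffman coding over word frequencies (the toy frequency table is illustrative); in a word2vec-style model each word's code length equals its path length from the root, so frequent words get shorter paths and cheaper updates:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Return {word: binary code}; rarer words receive longer codes."""
    tiebreak = count()  # keeps heap comparisons away from comparing dicts
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, right = heapq.heappop(heap)
        # Merging prefixes one more branch bit onto every word in each subtree.
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"the": 50, "cat": 10, "sat": 8, "zebra": 2})
print(codes["the"], codes["zebra"])  # 1 010
```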
Optionally, the construction module includes:
a first computing submodule, configured to obtain the context of a word to be trained according to the preset window in the initial information, and to compute the sum of the word vectors of the words to be trained contained in that context, obtaining a sum vector;
a determination submodule, configured to determine the path in the Huffman tree from the root node to the word to be trained;
a second computing submodule, configured to compute the probability corresponding to the path using the Bayesian formula and the sum vector; and
a third computing submodule, configured to take the logarithm of the probability to obtain an objective function, the objective function serving as the transformation model after training.
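Computing the path probability from the sum vector and then taking its logarithm mirrors the hierarchical-softmax objective of word2vec, which this description resembles: under that assumption, each inner node on the Huffman path contributes one sigmoid factor, and the logarithm turns the product into a sum. The function below is an illustrative sketch, not the patent's formula verbatim:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_path_probability(x, path):
    """log p(word | context), summed along the Huffman path.

    x:    the context sum vector (the patent's "sum vector")
    path: list of (theta_j, d_j) pairs, where theta_j is an inner node's
          parameter vector and d_j in {0, 1} is the branch taken there.
    """
    total = 0.0
    for theta, d in path:
        z = sum(t * xi for t, xi in zip(theta, x))
        p = sigmoid(z) if d == 0 else 1.0 - sigmoid(z)  # one factor per node
        total += math.log(p)
    return total

x = [0.1, 0.2]
path = [([0.3, -0.1], 0), ([0.2, 0.4], 1)]
print(log_path_probability(x, path))
```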
Optionally, the construction module further includes:
a fourth computing submodule, configured, after the logarithm of the probability has been taken to obtain the objective function, to differentiate the objective function with respect to the initial parameter vector in the initial information to obtain a first increment, and to update the initial parameter vector using θ' = θ0 + αη1; and
a derivation submodule, configured to differentiate the objective function with respect to the sum vector to obtain a second increment, and to update the initial word vector using X' = X0 + βη2.
Here θ' is the updated parameter vector, θ0 is the initial parameter vector, α is the first preset weight, η1 is the first increment, X' is the updated word vector of the word to be trained, X0 is the initial word vector of the word to be trained, β is the second preset weight, and η2 is the second increment.
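The two update rules θ' = θ0 + αη1 and X' = X0 + βη2 are gradient-ascent steps on the log objective, with separate learning rates (the preset weights) for the parameter vector and the word vector. Assuming a log-sigmoid objective per inner node (a hierarchical-softmax-style assumption, not stated verbatim in the patent), the increments have a closed form:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_step(theta, x, d, alpha, beta):
    """One ascent step on l = log sigmoid(theta.x) for d == 0
    (or l = log(1 - sigmoid(theta.x)) for d == 1).

    dl/dtheta = (1 - d - sigmoid(theta.x)) * x, and symmetrically for x.
    """
    z = sum(t * xi for t, xi in zip(theta, x))
    g = 1.0 - d - sigmoid(z)               # common scalar factor of both gradients
    eta1 = [g * xi for xi in x]            # first increment (w.r.t. theta)
    eta2 = [g * t for t in theta]          # second increment (w.r.t. x)
    theta_new = [t + alpha * e for t, e in zip(theta, eta1)]  # theta' = theta0 + alpha*eta1
    x_new = [xi + beta * e for xi, e in zip(x, eta2)]         # X' = X0 + beta*eta2
    return theta_new, x_new

theta, x = [0.5, -0.2], [0.1, 0.3]
theta2, x2 = update_step(theta, x, d=0, alpha=0.05, beta=0.05)
```

For d = 0 the step should raise the inner product theta.x, and with it the log-probability of taking that branch.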
Optionally, the superposition unit includes:
a computing module, configured to calculate the weight of each text feature word using the TF-IDF algorithm;
a product module, configured to multiply the word vector of each text feature word by its corresponding weight to obtain the feature vector of that text feature word; and
a superposition module, configured to superpose the feature vectors of all the text feature words to obtain the text vector of the text to be clustered.
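The computing, product, and superposition modules together amount to a TF-IDF-weighted sum of word vectors. The sketch below uses one common smoothed TF-IDF variant and toy two-dimensional vectors; both are illustrative choices, since the patent fixes neither:

```python
import math

def tf_idf(word, doc, corpus):
    # Term frequency within this document, times a smoothed inverse
    # document frequency over the corpus (one common variant).
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in corpus if word in d)
    idf = math.log(len(corpus) / (1 + df)) + 1.0
    return tf * idf

def weighted_text_vector(doc, corpus, word_vecs, dim=2):
    vec = [0.0] * dim
    for word in set(doc):
        w = tf_idf(word, doc, corpus)
        for i, v in enumerate(word_vecs.get(word, [0.0] * dim)):
            vec[i] += w * v  # feature vector = weight * word vector; then superpose
    return vec

corpus = [["cat", "sat"], ["dog", "sat"], ["cat", "cat", "ran"]]
vecs = {"cat": [1.0, 0.0], "sat": [0.0, 1.0]}
print(weighted_text_vector(corpus[0], corpus, vecs))  # [0.5, 0.5]
```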
The terminal device 3 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 30 and the memory 31. Those skilled in the art will understand that Fig. 3 is merely an example of the terminal device 3 and does not constitute a limitation on it; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device may also include input/output devices, a network access device, a bus, and the like.
The processor 30 may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory 31 may be an internal storage unit of the terminal device 3, such as a hard disk or memory of the terminal device 3. The memory 31 may also be an external storage device of the terminal device 3, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the terminal device 3. Further, the memory 31 may include both an internal storage unit and an external storage device of the terminal device 3. The memory 31 stores the computer program and the other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is about to be output.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed or recorded in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled professionals may implement the described functions differently for each specific application, but such implementations should not be considered beyond the scope of this application.
In the embodiments provided in this application, it should be understood that the disclosed device/terminal device and method may be implemented in other ways. For example, the device/terminal device embodiments described above are merely illustrative. The division into modules or units is only a logical functional division; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, and some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, may each exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, this application may implement all or part of the processes in the methods of the above embodiments by means of a computer program instructing relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of each of the method embodiments above. The computer program includes computer program code, which may be in source-code form, object-code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be added to or removed from as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features replaced by equivalents; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be included within the protection scope of this application.
Claims (10)
1. A text clustering method, comprising:
obtaining a training text, and performing word-segmentation preprocessing on the training text to obtain a plurality of words to be trained;
training a preset transformation model using the words to be trained, to obtain a trained transformation model;
obtaining a text to be clustered, and performing word-segmentation preprocessing on the text to be clustered to obtain a plurality of text feature words;
converting each text feature word into a word vector using the trained transformation model, and superposing all the word vectors in the text to be clustered to obtain a text vector of the text to be clustered; and
clustering the text vectors to obtain a clustering result.
2. The text clustering method of claim 1, wherein performing word-segmentation preprocessing on the training text to obtain a plurality of words to be trained comprises:
removing the punctuation marks in the training text to obtain a first preprocessed text;
removing the stop words in the first preprocessed text to obtain a second preprocessed text; and
performing word segmentation on the second preprocessed text to obtain a plurality of text feature words.
3. The text clustering method of claim 1, wherein training the preset transformation model using the words to be trained, to obtain the trained transformation model, comprises:
counting the frequency with which each word to be trained occurs in the training text, and constructing a Huffman tree according to the word frequencies; and
obtaining initial information, and training the words to be trained according to the initial information and the constructed Huffman tree, to obtain the trained transformation model;
wherein the initial information includes a preset window, an initial parameter vector, and an initial word vector.
4. The text clustering method of claim 3, wherein training the words to be trained according to the initial information and the constructed Huffman tree, to obtain the trained transformation model, comprises:
obtaining the context of a word to be trained according to the preset window in the initial information, and computing the sum of the word vectors of the words to be trained contained in that context, to obtain a sum vector;
determining the path in the Huffman tree from the root node to the word to be trained;
computing the probability corresponding to the path using the Bayesian formula and the sum vector; and
taking the logarithm of the probability to obtain an objective function, the objective function serving as the transformation model after training.
5. The text clustering method of claim 4, further comprising, after taking the logarithm of the probability to obtain the objective function:
differentiating the objective function with respect to the initial parameter vector in the initial information to obtain a first increment, and updating the initial parameter vector using θ' = θ0 + αη1; and
differentiating the objective function with respect to the sum vector to obtain a second increment, and updating the initial word vector using X' = X0 + βη2;
wherein θ' is the updated parameter vector, θ0 is the initial parameter vector, α is the first preset weight, η1 is the first increment, X' is the updated word vector of the word to be trained, X0 is the initial word vector of the word to be trained, β is the second preset weight, and η2 is the second increment.
6. The text clustering method of any one of claims 1 to 5, wherein superposing all the word vectors in the text to be clustered to obtain the text vector of the text to be clustered comprises:
calculating the weight of each text feature word using the TF-IDF algorithm;
multiplying the word vector of each text feature word by its corresponding weight to obtain the feature vector of that text feature word; and
superposing the feature vectors of all the text feature words to obtain the text vector of the text to be clustered.
7. A text clustering device, comprising:
an acquisition unit, configured to obtain a training text and perform word-segmentation preprocessing on the training text to obtain a plurality of words to be trained;
a training unit, configured to train a preset transformation model using the words to be trained, to obtain a trained transformation model;
a preprocessing unit, configured to obtain a text to be clustered and perform word-segmentation preprocessing on the text to be clustered to obtain a plurality of text feature words;
a superposition unit, configured to convert each text feature word into a word vector using the trained transformation model, and to superpose all the word vectors in the text to be clustered to obtain a text vector of the text to be clustered; and
a clustering unit, configured to cluster the text vectors to obtain a clustering result.
8. The text clustering device of claim 7, wherein the acquisition unit comprises:
a first removal module, configured to remove the punctuation marks in the training text to obtain a first preprocessed text;
a second removal module, configured to remove the stop words in the first preprocessed text to obtain a second preprocessed text; and
a word segmentation module, configured to perform word segmentation on the second preprocessed text to obtain a plurality of text feature words.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811508368.3A CN109739978A (en) | 2018-12-11 | 2018-12-11 | A kind of Text Clustering Method, text cluster device and terminal device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811508368.3A CN109739978A (en) | 2018-12-11 | 2018-12-11 | A kind of Text Clustering Method, text cluster device and terminal device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109739978A true CN109739978A (en) | 2019-05-10 |
Family
ID=66359287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811508368.3A Pending CN109739978A (en) | 2018-12-11 | 2018-12-11 | A kind of Text Clustering Method, text cluster device and terminal device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109739978A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110941961A (en) * | 2019-11-29 | 2020-03-31 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN110990569A (en) * | 2019-11-29 | 2020-04-10 | 百度在线网络技术(北京)有限公司 | Text clustering method and device and related equipment |
CN111563097A (en) * | 2020-04-30 | 2020-08-21 | 广东小天才科技有限公司 | Unsupervised topic aggregation method and device, electronic equipment and storage medium |
CN112036176A (en) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | Text clustering method and device |
CN112612888A (en) * | 2020-12-25 | 2021-04-06 | 航天信息股份有限公司 | Method and system for intelligently clustering text files |
CN112835798A (en) * | 2021-02-03 | 2021-05-25 | 广州虎牙科技有限公司 | Cluster learning method, test step clustering method and related device |
CN112860893A (en) * | 2021-02-08 | 2021-05-28 | 国网河北省电力有限公司营销服务中心 | Short text classification method and terminal equipment |
CN113392209A (en) * | 2020-10-26 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Text clustering method based on artificial intelligence, related equipment and storage medium |
CN113420112A (en) * | 2021-06-21 | 2021-09-21 | 中国科学院声学研究所 | News entity analysis method and device based on unsupervised learning |
WO2021189974A1 (en) * | 2020-10-21 | 2021-09-30 | 平安科技(深圳)有限公司 | Model training method and apparatus, text classification method and apparatus, computer device and medium |
CN113779207A (en) * | 2020-12-03 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Visual angle layering method and device for dialect text |
CN113807073A (en) * | 2020-06-16 | 2021-12-17 | 中国电信股份有限公司 | Text content abnormity detection method, device and storage medium |
WO2023000782A1 (en) * | 2021-07-21 | 2023-01-26 | 北京有竹居网络技术有限公司 | Method and apparatus for acquiring video hotspot, readable medium, and electronic device |
CN116127077A (en) * | 2023-04-17 | 2023-05-16 | 长沙数智融媒科技有限公司 | Kmeans-based content uniform clustering method |
CN116339799A (en) * | 2023-04-06 | 2023-06-27 | 山景智能(北京)科技有限公司 | Method, system, terminal equipment and storage medium for intelligent data interface management |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104765769A (en) * | 2015-03-06 | 2015-07-08 | 大连理工大学 | Short text query expansion and indexing method based on word vector |
CN106776713A (en) * | 2016-11-03 | 2017-05-31 | 中山大学 | It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis |
CN107153642A (en) * | 2017-05-16 | 2017-09-12 | 华北电力大学 | A kind of analysis method based on neural network recognization text comments Sentiment orientation |
CN107291693A (en) * | 2017-06-15 | 2017-10-24 | 广州赫炎大数据科技有限公司 | A kind of semantic computation method for improving term vector model |
CN108197109A (en) * | 2017-12-29 | 2018-06-22 | 北京百分点信息科技有限公司 | A kind of multilingual analysis method and device based on natural language processing |
- 2018-12-11 CN CN201811508368.3A patent/CN109739978A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104765769A (en) * | 2015-03-06 | 2015-07-08 | 大连理工大学 | Short text query expansion and indexing method based on word vector |
CN106776713A (en) * | 2016-11-03 | 2017-05-31 | 中山大学 | It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis |
CN107153642A (en) * | 2017-05-16 | 2017-09-12 | 华北电力大学 | A kind of analysis method based on neural network recognization text comments Sentiment orientation |
CN107291693A (en) * | 2017-06-15 | 2017-10-24 | 广州赫炎大数据科技有限公司 | A kind of semantic computation method for improving term vector model |
CN108197109A (en) * | 2017-12-29 | 2018-06-22 | 北京百分点信息科技有限公司 | A kind of multilingual analysis method and device based on natural language processing |
Non-Patent Citations (1)
Title |
---|
Xu Shaoxie et al.: "地震预报方法实用化研究文集" (Collected Papers on the Practical Application of Earthquake Prediction Methods), 31 December 1989, Academic Books Press, pages 112-114 *
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110990569A (en) * | 2019-11-29 | 2020-04-10 | 百度在线网络技术(北京)有限公司 | Text clustering method and device and related equipment |
CN110941961B (en) * | 2019-11-29 | 2023-08-25 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN110990569B (en) * | 2019-11-29 | 2023-11-07 | 百度在线网络技术(北京)有限公司 | Text clustering method and device and related equipment |
CN110941961A (en) * | 2019-11-29 | 2020-03-31 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN111563097A (en) * | 2020-04-30 | 2020-08-21 | 广东小天才科技有限公司 | Unsupervised topic aggregation method and device, electronic equipment and storage medium |
CN113807073A (en) * | 2020-06-16 | 2021-12-17 | 中国电信股份有限公司 | Text content abnormity detection method, device and storage medium |
CN113807073B (en) * | 2020-06-16 | 2023-11-14 | 中国电信股份有限公司 | Text content anomaly detection method, device and storage medium |
CN112036176A (en) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | Text clustering method and device |
CN112036176B (en) * | 2020-07-22 | 2024-05-24 | 大箴(杭州)科技有限公司 | Text clustering method and device |
WO2021189974A1 (en) * | 2020-10-21 | 2021-09-30 | 平安科技(深圳)有限公司 | Model training method and apparatus, text classification method and apparatus, computer device and medium |
CN113392209A (en) * | 2020-10-26 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Text clustering method based on artificial intelligence, related equipment and storage medium |
CN113392209B (en) * | 2020-10-26 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Text clustering method based on artificial intelligence, related equipment and storage medium |
CN113779207A (en) * | 2020-12-03 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Visual angle layering method and device for dialect text |
CN112612888A (en) * | 2020-12-25 | 2021-04-06 | 航天信息股份有限公司 | Method and system for intelligently clustering text files |
CN112612888B (en) * | 2020-12-25 | 2023-06-16 | 航天信息股份有限公司 | Method and system for intelligent clustering of text files |
CN112835798B (en) * | 2021-02-03 | 2024-02-20 | 广州虎牙科技有限公司 | Clustering learning method, testing step clustering method and related devices |
CN112835798A (en) * | 2021-02-03 | 2021-05-25 | 广州虎牙科技有限公司 | Cluster learning method, test step clustering method and related device |
CN112860893A (en) * | 2021-02-08 | 2021-05-28 | 国网河北省电力有限公司营销服务中心 | Short text classification method and terminal equipment |
CN112860893B (en) * | 2021-02-08 | 2023-02-28 | 国网河北省电力有限公司营销服务中心 | Short text classification method and terminal equipment |
CN113420112A (en) * | 2021-06-21 | 2021-09-21 | 中国科学院声学研究所 | News entity analysis method and device based on unsupervised learning |
WO2023000782A1 (en) * | 2021-07-21 | 2023-01-26 | 北京有竹居网络技术有限公司 | Method and apparatus for acquiring video hotspot, readable medium, and electronic device |
CN116339799A (en) * | 2023-04-06 | 2023-06-27 | 山景智能(北京)科技有限公司 | Method, system, terminal equipment and storage medium for intelligent data interface management |
CN116339799B (en) * | 2023-04-06 | 2023-11-28 | 山景智能(北京)科技有限公司 | Method, system, terminal equipment and storage medium for intelligent data interface management |
CN116127077A (en) * | 2023-04-17 | 2023-05-16 | 长沙数智融媒科技有限公司 | Kmeans-based content uniform clustering method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109739978A (en) | A kind of Text Clustering Method, text cluster device and terminal device | |
CN111353310B (en) | Named entity identification method and device based on artificial intelligence and electronic equipment | |
Hashimoto et al. | Topic detection using paragraph vectors to support active learning in systematic reviews | |
CN109766437A (en) | A kind of Text Clustering Method, text cluster device and terminal device | |
US20230195773A1 (en) | Text classification method, apparatus and computer-readable storage medium | |
EP3985559A1 (en) | Entity semantics relationship classification | |
CN104714931B (en) | For selecting the method and system to represent tabular information | |
CN108133045A (en) | Keyword extracting method and system, keyword extraction model generating method and system | |
CN111221944B (en) | Text intention recognition method, device, equipment and storage medium | |
CN109684476A (en) | A kind of file classification method, document sorting apparatus and terminal device | |
CN108399227A (en) | Method, apparatus, computer equipment and the storage medium of automatic labeling | |
WO2017193685A1 (en) | Method and device for data processing in social network | |
CN107992477A (en) | Text subject determines method, apparatus and electronic equipment | |
CN111898374B (en) | Text recognition method, device, storage medium and electronic equipment | |
CN108664512B (en) | Text object classification method and device | |
CN106294618A (en) | Searching method and device | |
CN111737997A (en) | Text similarity determination method, text similarity determination equipment and storage medium | |
CN115455171B (en) | Text video mutual inspection rope and model training method, device, equipment and medium | |
Berndorfer et al. | Automated diagnosis coding with combined text representations | |
CN110297893A (en) | Natural language question-answering method, device, computer installation and storage medium | |
US11886515B2 (en) | Hierarchical clustering on graphs for taxonomy extraction and applications thereof | |
CN113806582B (en) | Image retrieval method, image retrieval device, electronic equipment and storage medium | |
CN112632261A (en) | Intelligent question and answer method, device, equipment and storage medium | |
CN113761192B (en) | Text processing method, text processing device and text processing equipment | |
WO2014130287A1 (en) | Method and system for propagating labels to patient encounter data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||