CN110321553A - Short text topic identification method, device and computer-readable storage medium
- Publication number: CN110321553A
- Application number: CN201910466244.1A
- Authority: CN (China)
- Prior art keywords: word, short text, words, topic word, text
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F16/3329: Natural language query formulation or dialogue systems
- G06F16/355: Class or cluster creation or modification
- G06F18/22: Matching criteria, e.g. proximity measures
- G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
- G06F40/216: Parsing using statistical methods
- G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a short text topic identification method, comprising: obtaining short text samples; preprocessing the short text samples to obtain a word set; calculating a term weight for each word in the word set; calculating, according to the term weights, the similarity between words in the word set; clustering the words in the word set according to the similarity between them to obtain clustered topic words; determining the word frequency of the words in the word set according to the clustered topic words; training a document topic generation model using the word frequencies of the words in the word set; obtaining a target short text message; and inputting the target short text message into the document topic generation model to obtain the topic words of the target short text message. The invention also proposes a short text topic identification device and a computer-readable storage medium. The invention can accurately identify a user's intention from short text, so that services matching the user's needs can be provided according to that intention.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a short text topic identification method, device and computer-readable storage medium.
Background
Short text is now ubiquitous across application scenarios. Short text usually refers to text of limited length, generally no more than 160 characters, such as microblog posts, chat messages, news topics, opinion comments, question text, mobile phone text messages and literature abstracts. Short text often carries the customer's intentions, so mining it in depth has great application value. The purpose of a short text classification task is to automatically process the short text entered by a user and produce valuable output. When building a chatbot, identifying the user's intention is an important part, but in the prior art the accuracy of the user intention computed from short text data is low.
Summary of the invention
The present invention provides a short text topic identification method, device and computer-readable storage medium, whose main purpose is to accurately identify a user's intention from short text, so that services matching the user's needs can be provided according to that intention.
To achieve the above object, the present invention provides a short text topic identification method, the method comprising:
Obtaining short text samples;
Preprocessing the short text samples to obtain a word set;
Calculating a term weight for each word in the word set;
Calculating, according to the term weights, the similarity between words in the word set;
Clustering the words in the word set according to the similarity between them, to obtain clustered topic words;
Determining the word frequency of the words in the word set according to the clustered topic words;
Training a document topic generation model using the word frequencies of the words in the word set;
Obtaining a target short text message;
Inputting the target short text message into the document topic generation model to obtain the topic words of the target short text message.
Preferably, preprocessing the short text samples to obtain the word set comprises:
Segmenting the short text samples into words using a word segmentation method, to obtain a plurality of words;
Removing stop words from the plurality of words, to obtain processed words;
Lemmatizing the processed words, and taking the lemmatized words as the word set.
Preferably, calculating the term weight for each word in the word set comprises:
Obtaining the length of the text in which each word occurs;
Counting the number of times each word occurs in that text;
Using the TF-IDF method to calculate the term weight of each word according to the length of the text in which the word occurs and the number of times the word occurs in that text.
Preferably, calculating, according to the term weights, the similarity between words in the word set comprises:
For any two words, obtaining the text attributes of each of the two words, the text attributes including a numerical attribute and a category attribute;
Calculating the numerical distance between the two words according to their numerical attributes;
Calculating the distance between the categories of the two words according to the categories to which the two words belong;
Computing a weighted sum of the numerical distance between the two words and the distance between their categories, to obtain the similarity between the two words;
Repeating the above steps until the similarity between every pair of words in the word set has been calculated.
Preferably, clustering the words in the word set according to the similarity between them, to obtain the clustered topic words, comprises:
Setting the number of groups to k, and randomly selecting k samples as the cluster centers of the k groups;
According to the similarity between words in the word set, iteratively recalculating the center of each group and reassigning each word to a group until the iteration stopping condition is reached, to obtain the topic word of each clustered group.
Preferably, the method further comprises:
If the topic word of the target short text message is not in a topic word database, adding the topic word of the target short text message to the category of the topic word database to which the topic word belongs.
To achieve the above object, the present invention also provides a short text topic identification device, the device comprising a memory and a processor, the memory storing a short text topic identification program executable on the processor, the short text topic identification program, when executed by the processor, implementing the following steps:
Obtaining short text samples;
Preprocessing the short text samples to obtain a word set;
Calculating a term weight for each word in the word set;
Calculating, according to the term weights, the similarity between words in the word set;
Clustering the words in the word set according to the similarity between them, to obtain clustered topic words;
Determining the word frequency of the words in the word set according to the clustered topic words;
Training a document topic generation model using the word frequencies of the words in the word set;
Obtaining a target short text message;
Inputting the target short text message into the document topic generation model to obtain the topic words of the target short text message.
Preferably, preprocessing the short text samples to obtain the word set comprises:
Segmenting the short text samples into words using a word segmentation method, to obtain a plurality of words;
Removing stop words from the plurality of words, to obtain processed words;
Lemmatizing the processed words, and taking the lemmatized words as the word set.
Preferably, calculating the term weight for each word in the word set comprises:
Obtaining the length of the text in which each word occurs;
Counting the number of times each word occurs in that text;
Using the TF-IDF method to calculate the term weight of each word according to the length of the text in which the word occurs and the number of times the word occurs in that text.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium storing a short text topic identification program, the short text topic identification program being executable by one or more processors to implement the steps of the short text topic identification method described above.
The present invention obtains short text samples; preprocesses the short text samples to obtain a word set; calculates a term weight for each word in the word set; calculates, according to the term weights, the similarity between words in the word set; clusters the words in the word set according to the similarity between them to obtain clustered topic words; determines the word frequency of the words in the word set according to the clustered topic words; trains a document topic generation model using those word frequencies; obtains a target short text message; and inputs the target short text message into the document topic generation model to obtain the topic words of the target short text message. The present invention can thereby accurately identify a user's intention from short text, so that services matching the user's needs can be provided according to that intention.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the short text topic identification method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the internal structure of the short text topic identification device provided by an embodiment of the present invention;
Fig. 3 is a schematic module diagram of the short text topic identification program in the short text topic identification device provided by an embodiment of the present invention.
The realization of the object, the functional characteristics and the advantages of the present invention are further described below with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The present invention provides a short text topic identification method. Fig. 1 is a schematic flowchart of the short text topic identification method provided by an embodiment of the present invention. The method may be executed by a device, and the device may be implemented in software and/or hardware.
In this embodiment, the short text topic identification method includes:
S10, obtaining short text samples.
In this embodiment, the short text samples include samples from multiple application scenarios, for example microblog posts, chat messages, news topics, opinion comments, question text, mobile phone text messages and literature abstracts. Of course, in other embodiments, a large number of samples may instead be collected for a single scenario.
S11, preprocessing the short text samples to obtain a word set.
In this embodiment, preprocessing the short text samples to obtain the word set comprises:
Segmenting the short text samples into words using a word segmentation method, to obtain a plurality of words;
Removing stop words from the plurality of words, to obtain processed words;
Lemmatizing the processed words, and taking the lemmatized words as the word set.
Word segmentation extracts all the words in a document by means of a segmentation function, splitting mainly on the spaces or punctuation marks between words. The words occurring in each document are extracted so that the subsequent term weight calculation or the random multinomial distribution operations of the LDA model can be carried out.
The words extracted from a document include a large number of meaningless words that do not help an analyst understand the document any further. Redundant words must therefore be filtered first, deleting these unrepresentative stop words. The benefit of removing stop words is that it increases the speed of subsequent word processing and reduces its difficulty.
The purpose of lemmatization is to convert the derived forms of English words back to their base form. Depending on the surrounding grammar, an English word may take forms such as noun, verb, adjective, singular or plural. To avoid treating derived forms as different words in subsequent processing, which would greatly increase the dimension of the subsequent feature matrix, the words of the document are lemmatized to improve the precision of word analysis.
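The three preprocessing steps can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the stop word list and the suffix rules are hypothetical and deliberately tiny, and the suffix stripping is only a crude stand-in for a real lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "my", "i"}

# Hypothetical suffix rules standing in for a real lemmatizer.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def tokenize(text):
    """Split on whitespace and punctuation, lower-casing every token."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lemmatize(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def preprocess(text):
    """Tokenize, drop stop words, then reduce each word to a base form."""
    return [lemmatize(w) for w in tokenize(text) if w not in STOP_WORDS]
```

For instance, preprocess("The payments are refunded, and queries repeated.") yields the word set payment, refund, query, repeat.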
S12, calculating a term weight for each word in the word set.
In this embodiment, calculating the term weight for each word in the word set comprises:
Obtaining the length of the text in which each word occurs;
Counting the number of times each word occurs in that text;
Using the TF-IDF method to calculate the term weight of each word according to the length of the text in which the word occurs and the number of times the word occurs in that text.
Specifically, after word segmentation, stop word removal and lemmatization have produced more meaningful words, both the frequency with which a word occurs and the length of the text affect the result of the weight calculation in the term weight step. In a preferred experiment of this embodiment, the term weight values are calculated using the TF-IDF algorithm, and each word has one fixed weight value.
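One common form of the TF-IDF weight can be sketched as follows. The exact variant used by the invention is not stated, so the formula here (term count divided by text length, multiplied by the logarithm of the inverse document frequency) and the function name tf_idf are assumptions. The sketch returns one weight per word per text; how these are reduced to a single fixed weight per word is left open.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document.

    tf  = occurrences of the word in the document / document length
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document

    weights = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc)
        weights.append({
            word: (count / length) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights
```

A word that occurs in every text receives weight 0 under this variant, which is why smoothed versions of the inverse document frequency are often preferred in practice.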
S13, calculating, according to the term weights, the similarity between words in the word set.
In this embodiment, calculating the similarity between any two words in the word set comprises:
For any two words, obtaining the text attributes of each of the two words, the text attributes including a numerical attribute and a category attribute;
Calculating the numerical distance between the two words according to their numerical attributes;
Calculating the distance between the categories of the two words according to the categories to which the two words belong;
Computing a weighted sum of the numerical distance between the two words and the distance between their categories, to obtain the similarity between the two words;
Repeating the above steps until the similarity between every pair of words in the word set has been calculated.
Calculating the similarity between words from multiple attributes measures the semantic similarity between words more accurately.
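The weighted combination of a numerical distance and a category distance can be sketched as follows. Several details are assumptions made only for illustration: the numerical attribute is taken to be the term weight, the category attribute a single label, the category distance 0 for equal labels and 1 otherwise, the two weights 0.5 each, and the final mapping from distance to similarity 1 / (1 + distance). None of these choices is specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordAttrs:
    weight: float   # numerical attribute, e.g. the word's term weight (assumed)
    category: str   # category attribute, e.g. a part-of-speech tag (assumed)


def mixed_distance(a, b, w_num=0.5, w_cat=0.5):
    """Weighted sum of a numerical distance and a category distance."""
    numeric = abs(a.weight - b.weight)
    categorical = 0.0 if a.category == b.category else 1.0
    return w_num * numeric + w_cat * categorical


def similarity(a, b, **kw):
    """Map the distance into (0, 1]; identical attributes give 1.0."""
    return 1.0 / (1.0 + mixed_distance(a, b, **kw))


def pairwise_similarity(attrs):
    """Similarity for every unordered pair of words in the word set."""
    words = sorted(attrs)
    return {
        (u, v): similarity(attrs[u], attrs[v])
        for i, u in enumerate(words)
        for v in words[i + 1:]
    }
```

With this mapping, two words that share a category and have close weights score near 1, while words from different categories are pushed apart regardless of their weights.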
S14, clustering the words in the word set according to the similarity between them, to obtain clustered topic words.
In this embodiment, clustering the words in the word set according to the similarity between them, to obtain the clustered topic words, comprises:
Setting the number of groups to k, and randomly selecting k samples as the cluster centers of the k groups;
According to the similarity between words in the word set, iteratively recalculating the center of each group and reassigning each word to a group until the iteration stopping condition is reached, to obtain the topic word of each clustered group.
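The iterative clustering can be sketched as follows. Because only pairwise distances between words are available here, the sketch uses k-medoids, a K-means-style variant in which each center is an actual member of its group; this substitution, the function name k_medoids, and the idea of reading each medoid as the topic word of its group are all assumptions rather than details given in the text.

```python
import random


def k_medoids(items, distance, k, max_iter=100, seed=0):
    """Cluster a list of hashable items around k medoids using pairwise distances."""
    rng = random.Random(seed)
    medoids = rng.sample(items, k)          # k random samples as initial centers
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each item joins its closest medoid.
        clusters = {m: [] for m in medoids}
        for item in items:
            nearest = min(medoids, key=lambda m: distance(item, m))
            clusters[nearest].append(item)
        # Update step: the new center minimizes total distance within its group.
        new_medoids = [
            min(members, key=lambda c: sum(distance(c, o) for o in members))
            for members in clusters.values()
            if members                       # skip groups left empty by ties
        ]
        if set(new_medoids) == set(medoids):  # stopping condition: centers unchanged
            break
        medoids = new_medoids
    return clusters
```

The returned dictionary maps each medoid to the members of its group, so the keys can serve as candidate topic words when the items are words.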
S15, determining the word frequency of the words in the word set according to the clustered topic words.
In this embodiment, according to the topic set and the word set obtained by clustering, the words corresponding to each topic are determined, and the number of times each word occurs in the topic set, i.e. its word frequency, is counted to obtain word frequency statistics.
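One possible reading of this step is sketched below: the occurrences of each word are counted over the tokenized short texts and then grouped under the topic word of the group the word was clustered into. The data layout (a dictionary from topic word to member words) and the function name are assumptions.

```python
from collections import Counter


def topic_word_frequencies(clusters, documents):
    """Count corpus occurrences of each word, grouped by its cluster's topic word."""
    corpus_counts = Counter(word for doc in documents for word in doc)
    return {
        topic: {word: corpus_counts[word] for word in members}
        for topic, members in clusters.items()
    }
```

Words that never occur in the given texts simply receive a count of 0, since a Counter returns 0 for missing keys.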
S16, training a document topic generation model using the word frequencies of the words in the word set.
In this embodiment, the document topic generation model LDA (Latent Dirichlet Allocation) is a probabilistic generative model for discrete data sets. It is a three-layer Bayesian model, divided into a short text collection layer, a topic layer and a feature word layer, where each layer is controlled by corresponding random variables or parameters. Its basic idea is that a text is generated as a random mixture of latent topics, each topic corresponding to a specific distribution over feature words. The LDA model assumes that all short texts share K latent topics. To generate a short text, a topic distribution for the short text is generated first, and then its words are generated. To generate a word, a topic is randomly selected according to the topic distribution of the short text, and then a word is randomly selected according to the word distribution of that topic. This process is repeated until the short text has been generated.
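The generative process just described can be written compactly as follows. The notation is the conventional one for smoothed LDA and is assumed here rather than taken from the text: θ_d is the topic distribution of short text d, φ_k the word distribution of topic k, z_{d,n} the topic of the n-th word w_{d,n}, N_d the length of short text d, D the number of short texts, and α, β the Dirichlet hyperparameters.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \varphi &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}}), & n &= 1,\dots,N_d
\end{align*}
```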
The LDA model was proposed by Blei et al. and is a three-layer Bayesian generative model of the form "text-topic-word": each text is represented as a mixture distribution over topics, and each topic is a probability distribution over words. The initial model introduced a hyperparameter only for the text-topic probability distribution, making it follow a Dirichlet distribution; Griffiths et al. subsequently introduced a hyperparameter for the topic-word probability distribution as well, making it also follow a Dirichlet distribution. The two hyperparameters are usually set to α = 50/T, where T is the number of topics, and β = 0.01. The number of parameters of the LDA model depends only on the number of topics and the number of words. Parameter estimation consists of calculating the text-topic probability distribution θ and the topic-word probability distribution φ, which are estimated indirectly by Gibbs sampling over the topic assignment variable z.
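A minimal sketch of collapsed Gibbs sampling for estimating θ and φ is given below. It is a plain-Python illustration under stated assumptions, not the training procedure of the invention: texts are assumed to be lists of integer word ids, and the function name lda_gibbs, its parameters and the fixed iteration count are hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                  # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(n_topics)]    # n_{k,w}
    topic_total = [0] * n_topics                                # n_k
    assignments = []                                            # z for every token

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_{d,t} + alpha) * (n_{t,w} + beta) / (n_t + V * beta).
                probs = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=probs)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates of theta (text-topic) and phi (topic-word) from the counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi
```

With the conventional settings mentioned above, a call would look like lda_gibbs(docs, n_topics=T, vocab_size=V, alpha=50 / T, beta=0.01). Each row of theta and each row of phi sums to 1 by construction.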
S17, obtaining a target short text message.
S18, inputting the target short text message into the document topic generation model to obtain the topic words of the target short text message.
For example, a customer service short text message is input into the LDA topic model to obtain the customer service intention. The short text message of a customer service reply is input into the trained LDA topic model to obtain the topic words of the message, and the input document is passed into the model's training text so as to update the topic words and words in the lexicon.
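One simple way to read topic words off a trained model is sketched below. It assumes a topic-word matrix phi with strictly positive entries (one row per topic, one column per vocabulary word) and scores each topic by the log-likelihood of the message's words. This is a simplification of full LDA inference, which would also estimate a topic distribution for the new message, and the function names are hypothetical.

```python
import math


def dominant_topic(word_ids, phi):
    """Return the topic index with the highest log-likelihood for the message."""
    scores = [
        sum(math.log(topic[w]) for w in word_ids)
        for topic in phi
    ]
    return max(range(len(phi)), key=scores.__getitem__)


def top_words(phi, topic, vocab, n=3):
    """Return the n most probable words of a topic as its topic words."""
    ranked = sorted(range(len(vocab)), key=lambda w: phi[topic][w], reverse=True)
    return [vocab[w] for w in ranked[:n]]
```

The topic words reported for a message would then be top_words(phi, dominant_topic(word_ids, phi), vocab).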
In this embodiment, the method further comprises: if the topic word of the target short text message is not in the topic word database, adding the topic word of the target short text message to the category of the topic word database to which the topic word belongs.
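The database update can be sketched as follows. The layout of the topic word database is not described, so a dictionary from category name to a set of topic words is assumed, and the category of the new topic word is assumed to be supplied by the caller.

```python
def update_topic_word_db(db, topic_word, category):
    """Add topic_word under category unless the database already holds it.

    db maps a category name to the set of topic words filed under it.
    Returns True when the database was changed.
    """
    if any(topic_word in words for words in db.values()):
        return False
    db.setdefault(category, set()).add(topic_word)
    return True
```

Returning a flag lets the caller decide whether downstream data, such as the training text, needs to be refreshed.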
The present invention obtains short text samples; preprocesses the short text samples to obtain a word set; calculates a term weight for each word in the word set; calculates, according to the term weights, the similarity between words in the word set; clusters the words in the word set according to the similarity between them to obtain clustered topic words; determines the word frequency of the words in the word set according to the clustered topic words; trains a document topic generation model using those word frequencies; obtains a target short text message; and inputs the target short text message into the document topic generation model to obtain the topic words of the target short text message. The present invention can thereby accurately identify a user's intention from short text, so that services matching the user's needs can be provided according to that intention.
The present invention also provides a short text topic identification device. Fig. 2 is a schematic diagram of the internal structure of the short text topic identification device provided by an embodiment of the present invention.
In this embodiment, the short text topic identification device 1 may be a PC (Personal Computer), or a terminal device such as a smartphone, tablet computer or portable computer. The short text topic identification device 1 includes at least a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, the readable storage medium including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc and the like. In some embodiments the memory 11 may be an internal storage unit of the short text topic identification device 1, for example the hard disk of the short text topic identification device 1. In other embodiments the memory 11 may be an external storage device of the short text topic identification device 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card (Flash Card) fitted to the short text topic identification device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the short text topic identification device 1. The memory 11 can be used not only to store application software installed on the short text topic identification device 1 and various kinds of data, such as the code of the short text topic identification program 01, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the short text topic identification program 01.
The communication bus 13 is used to implement connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device 1 and other electronic devices.
Optionally, the device 1 may also include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (Organic Light-Emitting Diode, OLED) touch device or the like. The display may also appropriately be called a display screen or display unit, and is used to display the information processed in the short text topic identification device 1 and to display a visual user interface.
Fig. 2 shows only the short text topic identification device 1 with components 11-14 and the short text topic identification program 01. Those skilled in the art will understand that the structure shown in Fig. 2 does not limit the short text topic identification device 1, which may include fewer or more components than illustrated, combine certain components, or arrange the components differently.
In the embodiment of the device 1 shown in Fig. 2, the short text topic identification program 01 is stored in the memory 11; when the processor 12 executes the short text topic identification program 01 stored in the memory 11, the following steps are implemented:
Obtaining short text samples.
In this embodiment, the short text samples include samples from multiple application scenarios, for example microblog posts, chat messages, news topics, opinion comments, question text, mobile phone text messages and literature abstracts. Of course, in other embodiments, a large number of samples may instead be collected for a single scenario.
Preprocessing the short text samples to obtain a word set.
In this embodiment, preprocessing the short text samples to obtain the word set comprises:
Segmenting the short text samples into words using a word segmentation method, to obtain a plurality of words;
Removing stop words from the plurality of words, to obtain processed words;
Lemmatizing the processed words, and taking the lemmatized words as the word set.
Word segmentation extracts all the words in a document by means of a segmentation function, splitting mainly on the spaces or punctuation marks between words. The words occurring in each document are extracted so that the subsequent term weight calculation or the random multinomial distribution operations of the LDA model can be carried out.
The words extracted from a document include a large number of meaningless words that do not help an analyst understand the document any further. Redundant words must therefore be filtered first, deleting these unrepresentative stop words. The benefit of removing stop words is that it increases the speed of subsequent word processing and reduces its difficulty.
The purpose of lemmatization is to convert the derived forms of English words back to their base form. Depending on the surrounding grammar, an English word may take forms such as noun, verb, adjective, singular or plural. To avoid treating derived forms as different words in subsequent processing, which would greatly increase the dimension of the subsequent feature matrix, the words of the document are lemmatized to improve the precision of word analysis.
Calculating a term weight for each word in the word set.
In this embodiment, calculating the term weight for each word in the word set comprises:
Obtaining the length of the text in which each word occurs;
Counting the number of times each word occurs in that text;
Using the TF-IDF method to calculate the term weight of each word according to the length of the text in which the word occurs and the number of times the word occurs in that text.
Specifically, after word segmentation, stop word removal and lemmatization have produced more meaningful words, both the frequency with which a word occurs and the length of the text affect the result of the weight calculation in the term weight step. In a preferred experiment of this embodiment, the term weight values are calculated using the TF-IDF algorithm, and each word has one fixed weight value.
Calculating, according to the term weights, the similarity between words in the word set.
In this embodiment, calculating the similarity between any two words in the word set comprises:
For any two words, obtaining the text attributes of each of the two words, the text attributes including a numerical attribute and a category attribute;
Calculating the numerical distance between the two words according to their numerical attributes;
Calculating the distance between the categories of the two words according to the categories to which the two words belong;
Computing a weighted sum of the numerical distance between the two words and the distance between their categories, to obtain the similarity between the two words;
Repeating the above steps until the similarity between every pair of words in the word set has been calculated.
Calculating the similarity between words from multiple attributes measures the semantic similarity between words more accurately.
Clustering the words in the word set according to the similarity between them, to obtain clustered topic words.
In this embodiment, clustering the words in the word set according to the similarity between them, to obtain the clustered topic words, comprises:
Setting the number of groups to k, and randomly selecting k samples as the cluster centers of the k groups;
According to the similarity between words in the word set, iteratively recalculating the center of each group and reassigning each word to a group until the iteration stopping condition is reached, to obtain the topic word of each clustered group.
Determining the word frequency of the words in the word set according to the clustered topic words.
In this embodiment, according to the topic set and the word set obtained by clustering, the words corresponding to each topic are determined, and the number of times each word occurs in the topic set, i.e. its word frequency, is counted to obtain word frequency statistics.
Train a document topic generation model using the word frequencies of the words in the word set.
In this embodiment, the document topic generation model LDA (Latent Dirichlet Allocation) is a probabilistic generative model for modeling discrete data sets. It is a three-layer Bayesian model, divided into a short text set layer, a topic layer and a feature word layer, each layer being controlled by corresponding random variables or parameters. Its basic idea is that a text is generated by a random mixture of latent topics, and each topic corresponds to a specific distribution over feature words. The LDA model assumes that there are K latent topics across all short texts. To generate a short text, a topic distribution for that short text is generated first, and then its set of words is generated; to generate a word, a topic is randomly selected according to the topic distribution of the short text, and then a word is randomly selected according to the word distribution of that topic. This process is repeated until the short text has been generated.
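The generative process just described can be sketched in a few lines. This is an illustration only: the vocabulary, the topic-word distributions `phi`, the symmetric prior `alpha`, and the fixed text length are made-up values, and the Dirichlet draw is built from normalized gamma samples using the standard library.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_short_text(phi, vocab, alpha, length, rng):
    """Generate one short text: topic distribution first, then each word."""
    num_topics = len(phi)
    theta = sample_dirichlet([alpha] * num_topics, rng)  # topic distribution
    words = []
    for _ in range(length):
        # Pick a topic from the text's topic distribution ...
        z = rng.choices(range(num_topics), weights=theta)[0]
        # ... then pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=phi[z])[0])
    return theta, words


# Hypothetical vocabulary and two topics (each row of phi sums to 1).
vocab = ["refund", "invoice", "delivery", "courier"]
phi = [
    [0.5, 0.4, 0.05, 0.05],  # a payment-like topic
    [0.05, 0.05, 0.5, 0.4],  # a logistics-like topic
]
theta, text = generate_short_text(phi, vocab, alpha=0.5, length=5,
                                  rng=random.Random(0))
```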
The LDA model was proposed by Blei et al. and is a three-layer "text-topic-word" Bayesian generative model: each text is represented as a mixture distribution over topics, and each topic is a probability distribution over words. The initial model introduced a hyperparameter only for the text-topic probability distribution, making it follow a Dirichlet distribution; Griffiths et al. subsequently introduced a hyperparameter for the topic-word probability distribution as well, making it also follow a Dirichlet distribution. The two hyperparameters are typically set to α = 50/T and β = 0.01, where T is the number of topics. The number of parameters of the LDA model depends only on the number of topics and the number of words. Parameter estimation consists of calculating the text-topic probability distribution θ and the topic-word probability distribution φ; θ and φ are estimated indirectly by Gibbs sampling of the topic assignment variable z.
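As a rough illustration of estimating θ and φ indirectly through z, the following is a compact collapsed Gibbs sampler of the kind commonly used for LDA. It is a sketch under assumptions, not the patent's implementation: documents are lists of integer word ids, the number of sweeps is fixed at an arbitrary value, and θ and φ are read off the final counts rather than averaged over several samples.

```python
import random


def gibbs_lda(docs, vocab_size, num_topics, alpha, beta, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi)."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # words per topic
    z = []                                                # topic of each token

    def update(d, w, k, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            update(d, w, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)  # remove the current assignment
                # Conditional probability of each topic for this token.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                z[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                update(d, w, z[d][i], +1)

    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi


# Toy corpus of word ids; alpha = 50/T and beta = 0.01 as stated in the text.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
num_topics = 2
theta, phi = gibbs_lda(docs, vocab_size=4, num_topics=num_topics,
                       alpha=50 / num_topics, beta=0.01)
```

Each row of `theta` is one text's distribution over topics, and each row of `phi` is one topic's distribution over words.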
Obtain a target short text message.
Input the target short text message into the document topic generation model to obtain the topic words of the target short text message.
For example, a customer service short text message is input into the LDA topic model to obtain the customer service intent. The short text message of a customer service reply is input into the trained LDA topic model to obtain the topic words of the short text message, and the input text is passed into the training text of the model so as to update the topic words and words in the lexicon.
In this embodiment, the method further includes: if the topic word of the target short text message is not in a topic word database, adding the topic word of the target short text message to the category in the topic word database to which that topic word belongs.
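A minimal sketch of this database update follows. The layout of the topic word database (a dict mapping a category to a set of topic words) and the function name `add_topic_word` are assumptions, and the source does not say how the category of a new topic word is determined, so the caller supplies it here.

```python
def add_topic_word(database, topic_word, category):
    """Add a topic word under its category if the database lacks it.

    Returns True when the word was added, False when it was already known.
    """
    known = any(topic_word in words for words in database.values())
    if not known:
        database.setdefault(category, set()).add(topic_word)
    return not known


# Hypothetical topic word database keyed by category.
database = {"payment": {"refund", "invoice"}}
added_new = add_topic_word(database, "courier", "logistics")
added_again = add_topic_word(database, "refund", "payment")
```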
The present invention obtains short text samples; preprocesses the short text samples to obtain a word set; calculates the word weights of the words in the word set; calculates the similarity between the words in the word set according to the word weights; clusters the words in the word set according to the similarity between them to obtain clustered topic words; determines the word frequency of the words in the word set according to the clustered topic words; trains a document topic generation model using the word frequencies of the words in the word set; obtains a target short text message; and inputs the target short text message into the document topic generation model to obtain the topic words of the target short text message. The present invention can thus accurately identify a user's intent from a short text and provide services that meet the user's needs according to that intent.
Optionally, in other embodiments, the short text topic identification program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. A module, as referred to in the present invention, is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution process of the short text topic identification program in the short text topic identification device.
For example, Fig. 3 is a schematic diagram of the program modules of the short text topic identification program in an embodiment of the short text topic identification device of the present invention. In this embodiment, the short text topic identification program may be divided into an acquisition module 10, a processing module 20, a calculation module 30, a clustering module 40, a determination module 50 and a training module 60. Illustratively:
The acquisition module 10 obtains short text samples;
The processing module 20 preprocesses the short text samples to obtain a word set;
The calculation module 30 calculates the word weights of the words in the word set;
The calculation module 30 calculates the similarity between the words in the word set according to the word weights;
The clustering module 40 clusters the words in the word set according to the similarity between the words in the word set, to obtain clustered topic words;
The determination module 50 determines the word frequency of the words in the word set according to the clustered topic words;
The training module 60 trains a document topic generation model using the word frequencies of the words in the word set;
The acquisition module 10 obtains a target short text message;
The calculation module 30 inputs the target short text message into the document topic generation model to obtain the topic words of the target short text message.
The functions or operation steps implemented when the acquisition module 10, processing module 20, calculation module 30, clustering module 40, determination module 50, training module 60 and other program modules are executed are substantially the same as those of the above embodiments and are not repeated here.
In addition, an embodiment of the present invention also provides a computer readable storage medium on which a short text topic identification program is stored, the short text topic identification program being executable by one or more processors to implement the following operations:
Obtain short text samples;
Preprocess the short text samples to obtain a word set;
Calculate the word weights of the words in the word set;
Calculate the similarity between the words in the word set according to the word weights;
Cluster the words in the word set according to the similarity between the words in the word set, to obtain clustered topic words;
Determine the word frequency of the words in the word set according to the clustered topic words;
Train a document topic generation model using the word frequencies of the words in the word set;
Obtain a target short text message;
Input the target short text message into the document topic generation model to obtain the topic words of the target short text message.
The specific embodiments of the computer readable storage medium of the present invention are essentially the same as the embodiments of the short text topic identification device and method described above, and are not repeated here.
It should be noted that the serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments. The terms "include", "comprise" and any other variants thereof herein are intended to cover non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article or method that includes the element.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A short text topic identification method, characterized in that the method comprises:
Obtaining short text samples;
Preprocessing the short text samples to obtain a word set;
Calculating the word weights of the words in the word set;
Calculating the similarity between the words in the word set according to the word weights;
Clustering the words in the word set according to the similarity between the words in the word set, to obtain clustered topic words;
Determining the word frequency of the words in the word set according to the clustered topic words;
Training a document topic generation model using the word frequencies of the words in the word set;
Obtaining a target short text message;
Inputting the target short text message into the document topic generation model to obtain the topic words of the target short text message.
2. The short text topic identification method according to claim 1, characterized in that preprocessing the short text samples to obtain a word set comprises:
Segmenting the short text samples into words using a word segmentation method to obtain a plurality of words;
Removing stop words from the plurality of words to obtain processed words;
Performing lemmatization on the processed words, and taking the lemmatized words as the word set.
3. The short text topic identification method according to claim 1, characterized in that calculating the word weights of the words in the word set comprises:
Obtaining the length of the text in which each word is located;
Calculating the number of times each word occurs in the text in which it is located;
Using the TF-IDF method, calculating the word weight of each word according to the length of the text in which the word is located and the number of times the word occurs in that text.
4. The short text topic identification method according to claim 1, characterized in that calculating the similarity between the words in the word set according to the word weights comprises:
For any two words, obtaining the text attributes of each word, the text attributes including a numerical attribute and a category attribute;
Calculating the numerical distance between the two words according to their numerical attributes;
Calculating the distance between the categories of the two words according to the category to which each word belongs;
Computing a weighted sum of the numerical distance and the category distance between the two words to obtain the similarity between the two words;
Repeating the above steps until the similarity between every pair of words in the word set has been calculated.
5. The short text topic identification method according to claim 1, characterized in that clustering the words in the word set according to the similarity between the words in the word set, to obtain clustered topic words, comprises:
Setting the number of groups to k, and randomly selecting k samples as the cluster centers of the k groups;
According to the similarity between the words in the word set, iteratively recalculating the center of each group and reassigning each word to a group until the iteration condition is met, to obtain the topic words of each group after clustering.
6. The short text topic identification method according to any one of claims 1 to 5, characterized in that the method further comprises:
If the topic word of the target short text message is not in a topic word database, adding the topic word of the target short text message to the category in the topic word database to which that topic word belongs.
7. A short text topic identification device, characterized in that the device comprises a memory and a processor, the memory storing a short text topic identification program runnable on the processor, the short text topic identification program implementing the following steps when executed by the processor:
Obtaining short text samples;
Preprocessing the short text samples to obtain a word set;
Calculating the word weights of the words in the word set;
Calculating the similarity between the words in the word set according to the word weights;
Clustering the words in the word set according to the similarity between the words in the word set, to obtain clustered topic words;
Determining the word frequency of the words in the word set according to the clustered topic words;
Training a document topic generation model using the word frequencies of the words in the word set;
Obtaining a target short text message;
Inputting the target short text message into the document topic generation model to obtain the topic words of the target short text message.
8. The short text topic identification device according to claim 7, characterized in that preprocessing the short text samples to obtain a word set comprises:
Segmenting the short text samples into words using a word segmentation method to obtain a plurality of words;
Removing stop words from the plurality of words to obtain processed words;
Performing lemmatization on the processed words, and taking the lemmatized words as the word set.
9. The short text topic identification device according to claim 7, characterized in that calculating the word weights of the words in the word set comprises:
Obtaining the length of the text in which each word is located;
Calculating the number of times each word occurs in the text in which it is located;
Using the TF-IDF method, calculating the word weight of each word according to the length of the text in which the word is located and the number of times the word occurs in that text.
10. A computer readable storage medium, characterized in that a short text topic identification program is stored on the computer readable storage medium, the short text topic identification program being executable by one or more processors to implement the short text topic identification method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910466244.1A CN110321553B (en) | 2019-05-30 | 2019-05-30 | Short text topic identification method and device and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910466244.1A CN110321553B (en) | 2019-05-30 | 2019-05-30 | Short text topic identification method and device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110321553A (en) | 2019-10-11 |
CN110321553B (en) | 2023-01-17 |
Family
ID=68119218
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910466244.1A Active CN110321553B (en) | 2019-05-30 | 2019-05-30 | Short text topic identification method and device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321553B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170192959A1 (en) * | 2015-07-07 | 2017-07-06 | Foundation Of Soongsil University-Industry Cooperation | Apparatus and method for extracting topics |
CN108829799A (en) * | 2018-06-05 | 2018-11-16 | 中国人民公安大学 | Based on the Text similarity computing method and system for improving LDA topic model |
Non-Patent Citations (1)
Title |
---|
Guo Lantian et al.: "A topic discovery method based on the LDA topic model", Journal of Northwestern Polytechnical University * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851602A (en) * | 2019-11-13 | 2020-02-28 | 精硕科技(北京)股份有限公司 | Method and device for topic clustering |
CN110941961A (en) * | 2019-11-29 | 2020-03-31 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN110941961B (en) * | 2019-11-29 | 2023-08-25 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN111061877A (en) * | 2019-12-10 | 2020-04-24 | 厦门市美亚柏科信息股份有限公司 | Text theme extraction method and device |
CN111460133A (en) * | 2020-03-27 | 2020-07-28 | 北京百度网讯科技有限公司 | Topic phrase generation method and device and electronic equipment |
CN111460133B (en) * | 2020-03-27 | 2023-08-18 | 北京百度网讯科技有限公司 | Theme phrase generation method and device and electronic equipment |
CN113807073B (en) * | 2020-06-16 | 2023-11-14 | 中国电信股份有限公司 | Text content anomaly detection method, device and storage medium |
CN113807073A (en) * | 2020-06-16 | 2021-12-17 | 中国电信股份有限公司 | Text content abnormity detection method, device and storage medium |
CN111930885B (en) * | 2020-07-03 | 2023-08-04 | 北京新联财通咨询有限公司 | Text topic extraction method and device and computer equipment |
CN111930885A (en) * | 2020-07-03 | 2020-11-13 | 北京新联财通咨询有限公司 | Method and device for extracting text topics and computer equipment |
CN111898366A (en) * | 2020-07-29 | 2020-11-06 | 平安科技(深圳)有限公司 | Document subject word aggregation method and device, computer equipment and readable storage medium |
CN111898366B (en) * | 2020-07-29 | 2022-08-09 | 平安科技(深圳)有限公司 | Document subject word aggregation method and device, computer equipment and readable storage medium |
CN114281928A (en) * | 2020-09-28 | 2022-04-05 | ***通信集团广西有限公司 | Model generation method, device and equipment based on text data |
CN112396049A (en) * | 2020-11-19 | 2021-02-23 | 平安普惠企业管理有限公司 | Text error correction method and device, computer equipment and storage medium |
CN116431814A (en) * | 2023-06-06 | 2023-07-14 | 北京中关村科金技术有限公司 | Information extraction method, information extraction device, electronic equipment and readable storage medium |
CN116431814B (en) * | 2023-06-06 | 2023-09-05 | 北京中关村科金技术有限公司 | Information extraction method, information extraction device, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110321553B (en) | 2023-01-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||