CN108090231A - A topic model optimization method based on information entropy - Google Patents
A topic model optimization method based on information entropy
- Publication number: CN108090231A
- Application number: CN201810029097.7A
- Authority: CN (China)
- Prior art keywords: theme, topic, lexical item, phrase
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a topic model optimization method based on information entropy, belonging to the field of text classification technology. The main technical scheme of the invention relates to constructing a topic model and using the constructed model to calculate the topic relevance of document content. Specifically, information entropy and mutual information are used to mine, from a topic corpus, the feature lexical items that uniquely characterize a theme; the feature lexical items meeting threshold conditions form a theme dictionary, with which the topic model is trained and the topic relevance of document content is calculated. The invention is particularly suitable for calculating the topic relevance of document content whose subject keyword items are interdependent: according to statistics such as information entropy, high-granularity, strongly characteristic theme feature lexical items can be mined and feature lexical items effectively aggregated, realizing optimized classification of document content with dependent themes.
Description
Technical field
The present invention relates to a topic model optimization method based on information entropy, and belongs to the field of text classification technology.
Background technology
In recent years, mass data information has brought great convenience, but it has likewise brought huge challenges to the analysis and retrieval of information. Against the background of big data, how to rapidly obtain the required information from mass data has become a problem that people urgently need to solve.
Data takes many complex forms. Compared with intuitive data forms such as video and audio, text data is the most abstract and condensed form of data. In machine learning and natural language processing, it is often necessary to mine the latent semantic relations contained in text lexical items from large amounts of text. Earlier information-retrieval websites performed a preliminary semantic analysis of text content through shallow semantic parsing to determine the correlation between documents and search terms. With the continuous development of society and technology, however, people hope to obtain accurate answers quickly in a question-and-answer manner, and such frequent and efficient interaction requires machines to analyze and understand text semantics more deeply.
Through the learning and prediction of topic models, the topic distribution of a text can be obtained, realizing tasks such as text clustering, classification, retrieval, expansion and recommendation, with applications in fields such as text mining, sentiment analysis, recommendation systems, digital books, public-opinion monitoring, data acquisition, social networking sites and personalized retrieval.
Traditional topic representation models mainly include the Boolean model, the vector space model, the probabilistic model and the language model. The Boolean model represents a particular topic by a set of subject keywords; the relevance of a document to a topic can be judged simply by computing the intersection with the keyword set. Although the Boolean model is easy to implement, it does not consider the weights of keywords, cannot compute similarity accurately, and its binary results cannot effectively distinguish degrees of topic relevance. The vector space model makes up for the Boolean model's defect of treating all keywords as equally important: it refines the binary keyword weight into a quantitative measure of each keyword's different contribution to the topic. However, the vector space model does not consider the semantic information of lexical items and cannot judge, at the level of semantic understanding, text content whose lexical items differ but whose semantics are related.
Owing to the close relation between document semantics and document topics, methods that model document topics from the perspective of document generation have come into being. The PLSA (Probabilistic Latent Semantic Analysis) topic model models the document generation process from the frequentist point of view: the frequentist school holds that model parameters, though unknown, are fixed, so methods such as maximum likelihood estimation can be applied to compute them. The Bayesian school, which stands opposed to the frequentist school, holds that since the parameters are unknown they are themselves random variables obeying corresponding distributions. Adding corresponding prior distributions to the parameters on the basis of the PLSA model yields the LDA (Latent Dirichlet Allocation) topic model.
As a complete generative probabilistic topic model, LDA has a three-layer Bayesian network structure of feature words, topics and documents; by modeling a corpus, it mines the latent semantic information in the corpus. With the development and application of the LDA model, extended models based on LDA have gradually been proposed. To better discover the correlation information between implicit topics, the CTM model replaces the Dirichlet distribution in LDA with a logistic-normal distribution; the PAM model uses a directed acyclic graph to represent the implicit semantic information between topics, so as to mine the hierarchical relations existing between topics more effectively; the sLDA model adds class labels, making the construction and prediction of topic structure information more accurate. These extended models make full use of the LDA model's powerful capacity for representing text. Compared with other topic models, LDA introduces probability theory into the model, has a clear hierarchical structure that accords with the actual conditions of text, and possesses strong semantic clustering characteristics in a big-data environment; meanwhile, by constructing the topic layer and the feature-word layer through Dirichlet distributions, it can quickly process huge topic corpora and effectively avoid over-fitting during training.
Whether a topic model's representation is accurate is an important factor restricting the computation of text topic relevance. In practice, however, the texts in a text library can be grouped into sets by some structured attribute, and the texts within each set share commonalities; these commonalities are ignored by topic models such as LDA, which assume that texts are independent. Therefore, when most keyword items are shared between text topics, how to exploit the advantages of the LDA model to build a new topic model that effectively distinguishes the topic category a document belongs to is the key to our research.
Content of the invention
The object of the present invention is to improve the classification accuracy of dependent themes in text classification. Aimed at the characteristic that document content may differ in lexical items yet be semantically related, a topic model optimization method based on information entropy is proposed: using information entropy and mutual information statistics, the feature lexical items that uniquely characterize a theme are mined from a topic corpus; the feature lexical items meeting threshold conditions are selected to train a topic model; the topic relevance of document content is then calculated according to this topic model to distinguish the topic category a document belongs to. The method is particularly suitable for cases where the subject keyword sets have large intersections or the themes stand in superior-subordinate inclusion relations, and can effectively distinguish the topic category a document belongs to.
A topic model optimization method based on information entropy comprises the following steps:
Step 1. Train an LDA topic model to obtain a theme dictionary.
Specifically, the LDA topic model is trained on a topic corpus, which the user selects as needed, to obtain the theme dictionary. Step 1 comprises the following sub-steps:
Step 1.1 Select a document d_m with probability p(d_m), where m ∈ [1, M] and M is the number of documents;
Step 1.2 Generate the topic multinomial distribution θ_m of document d_m from a Dirichlet prior distribution, where α is the parameter of the Dirichlet prior distribution;
Step 1.3 According to θ_m, generate the topic k = z_{m,n} of the n-th word, where z_{m,n} denotes the topic of the n-th word in the m-th document;
Step 1.4 Again use a Dirichlet prior distribution to generate the lexical-item multinomial distribution φ_k of topic k = z_{m,n}, where β is the parameter of the Dirichlet prior distribution;
Step 1.5 Generate the topic word w_{m,n} from the lexical-item multinomial distribution φ_k, where w_{m,n} denotes the n-th word in the m-th document;
Step 1.6 For the N_m words of document d_m, repeat steps 1.2 to 1.5 to generate the corresponding topic words;
Step 1.7 Repeat steps 1.1 to 1.6 for the remaining M-1 documents to generate the theme dictionary w_{M,N}, where N is the number of words in the M documents.
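The generative process of steps 1.1 to 1.7 can be sketched in code. The following is an illustrative sketch only (not part of the patent); all function and variable names are hypothetical, and NumPy is assumed for the Dirichlet and categorical draws:

```python
import numpy as np

def generate_corpus(M, K, V, N_m, alpha, beta, rng=None):
    """Sketch of the LDA generative process of steps 1.1-1.7:
    each topic draws a term distribution phi_k ~ Dirichlet(beta);
    each document draws a topic mixture theta_m ~ Dirichlet(alpha);
    each word is generated by sampling a topic z, then a term w from phi_z.
    M documents, K topics, V vocabulary size, N_m words per document."""
    rng = rng or np.random.default_rng(0)
    phi = rng.dirichlet([beta] * V, size=K)        # step 1.4: term distribution per topic
    corpus = []
    for m in range(M):
        theta = rng.dirichlet([alpha] * K)         # step 1.2: topic distribution of d_m
        doc = []
        for n in range(N_m):
            z = rng.choice(K, p=theta)             # step 1.3: topic of the n-th word
            w = rng.choice(V, p=phi[z])            # step 1.5: word drawn from phi_z
            doc.append((int(z), int(w)))
        corpus.append(doc)
    return corpus

corpus = generate_corpus(M=3, K=2, V=10, N_m=5, alpha=0.5, beta=0.1)
```

The sketch makes explicit that φ_k is shared across documents while θ_m is drawn per document, which is what the two Dirichlet priors α and β parameterize.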
Step 2. Using information entropy, mine from the topic corpus the candidate feature lexical items that uniquely characterize the theme.
Specifically, the topic corpus is scanned, and the candidate feature lexical items meeting certain threshold conditions are mined on the basis of the left/right information entropy and mutual information statistics.
Step 2 comprises two sub-steps, step 2.1 and step 2.2:
Step 2.1 Using the left/right information entropy and mutual information statistics, determine the feature lexical items capable of uniquely characterizing the theme. The computation is shown in formula (1): a candidate feature lexical item phrase must satisfy
HL(phrase) > a_1 ∩ HR(phrase) > a_2 ∩ I(phrase) > a_3, (1)
where a_1, a_2, a_3 are the thresholds of HL(phrase), HR(phrase) and I(phrase) respectively.
It should be noted that the left information entropy HL(phrase) and the right information entropy HR(phrase) of a candidate feature lexical item phrase are defined by formulas (2) and (3):
HL(phrase) = -Σ_x p(x|phrase) log p(x|phrase), (2)
HR(phrase) = -Σ_y p(y|phrase) log p(y|phrase), (3)
where character string x and character string y denote the left-adjacent and right-adjacent character strings of the candidate feature lexical item phrase respectively, p(x|phrase) denotes the probability that character string x is the left-adjacent string of phrase, and p(y|phrase) denotes the probability that character string y is the right-adjacent string of phrase.
Meanwhile, the mutual information of a candidate feature lexical item phrase is calculated by formula (4):
I(phrase) = log [ p(x, y) / (p(x) p(y)) ], (4)
where character string x and character string y denote the left and right constituent strings of the candidate feature lexical item phrase, p(x, y) denotes the probability that the candidate feature lexical item phrase occurs, and p(x) and p(y) denote the probabilities that character strings x and y occur individually. It should be noted that the larger the mutual information value I(phrase), the higher the correlation between character strings x and y, that is, the higher the probability that x and y join into a word.
Meanwhile, w_i ∈ phrase denotes a substring w_i (1 ≤ i ≤ 3) composing phrase, and Σ p(phrase) is a normalization coefficient with 0 ≤ p(phrase) ≤ 1. p(topic|w_i) can be obtained by the Bayesian formula, as shown in formula (5):
p(topic|w_i) = p(w_i|topic) p(topic) / p(w_i), (5)
where p(topic) and p(w_i) are counted from the topic corpus, and p(w_i|topic) is obtained from step 1.
Step 2.2 According to the ranking of the candidate feature lexical items' probability values p(phrase) from high to low, screen the candidate feature lexical items meeting the threshold conditions; specifically, set different threshold parameters according to formulas (1) and (5), and compute and select the top K candidate feature lexical items.
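The left/right entropy of formulas (2) and (3) and the mutual information of formula (4) can be sketched directly from their definitions. The following is a minimal illustrative sketch (not the patent's implementation); the function names and the tiny example corpus string are assumptions:

```python
import math
from collections import Counter

def left_right_entropy(corpus_text, phrase):
    """Formulas (2) and (3): entropy of the distribution of characters adjacent
    to `phrase` in the corpus. Higher entropy means the phrase occurs in freer
    contexts, i.e. behaves more like an independent word."""
    left, right = Counter(), Counter()
    start = corpus_text.find(phrase)
    while start != -1:
        if start > 0:
            left[corpus_text[start - 1]] += 1      # left-adjacent character
        end = start + len(phrase)
        if end < len(corpus_text):
            right[corpus_text[end]] += 1           # right-adjacent character
        start = corpus_text.find(phrase, start + 1)
    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0
    return entropy(left), entropy(right)

def mutual_information(p_xy, p_x, p_y):
    """Formula (4): pointwise mutual information log p(x,y) / (p(x) p(y))."""
    return math.log(p_xy / (p_x * p_y))

hl, hr = left_right_entropy("abcXYd efXYg hXYi", "XY")
```

Screening by formula (1) then amounts to keeping only phrases whose `hl`, `hr` and mutual information all exceed the chosen thresholds a_1, a_2, a_3.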
Step 3. Taking the candidate feature lexical items as the theme dictionary, train the topic model.
The training of step 3 comprises the following sub-steps:
Step 3.1 Random initialization: to each word w_i in each document of the topic corpus, randomly assign a topic number z_i;
Step 3.2 Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update it in the topic corpus, where the Gibbs sampling computation is shown in formula (6):
p(z_i = k | z_{¬i}, w) ∝ φ̂_{k,t} · θ̂_{m,k} = [(n_{k,¬i}^{(t)} + β_t) / (Σ_{t=1}^{V} (n_{k,¬i}^{(t)} + β_t))] · [(n_{m,¬i}^{(k)} + α_k) / (Σ_{k=1}^{K} (n_{m,¬i}^{(k)} + α_k))], (6)
where it is assumed that lexical item w_i = t; z_i denotes the topic variable corresponding to the i-th word; ¬i denotes rejecting the i-th assignment; n_k^{(t)} denotes the number of times lexical item t occurs in topic k; β_t is the Dirichlet prior of lexical item t; n_m^{(k)} denotes the number of times topic k occurs in document m; α_k is the Dirichlet prior of topic k; φ̂_{k,t} denotes the probability of lexical item t in topic k; θ̂_{m,k} is the probability of topic k in document m; V denotes the number of lexical items in the topic corpus; K is the number of topics in the theme dictionary; and M is the number of documents.
Step 3.3 Repeat the sampling process of step 3.2 until the Gibbs sampling converges; convergence means that the probability values obtained by sampling with formula (6) approach the joint distribution of words and topics.
Step 3.4 Count the co-occurrence frequency matrices of document topics and lexical items in the topic corpus to obtain the topic model.
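Steps 3.1 to 3.4 describe a collapsed Gibbs sampler. The following is an illustrative sketch under symmetric priors (a single α and β rather than per-topic α_k and per-term β_t), with all names hypothetical; it is not the patent's implementation:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha, beta, iters=20, seed=0):
    """Collapsed Gibbs sampler of steps 3.1-3.4: each word's topic is resampled
    from p(z_i=k|...) ∝ (n_kt + beta)/(n_k + V*beta) * (n_mk + alpha), as in
    formula (6), after removing the current word's own counts.
    docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_kt = np.zeros((K, V))                          # topic-term counts
    n_mk = np.zeros((len(docs), K))                  # document-topic counts
    z = [[int(rng.integers(K)) for _ in d] for d in docs]   # step 3.1: random init
    for m, d in enumerate(docs):
        for i, t in enumerate(d):
            n_kt[z[m][i], t] += 1
            n_mk[m, z[m][i]] += 1
    for _ in range(iters):                           # steps 3.2-3.3: sweep until convergence
        for m, d in enumerate(docs):
            for i, t in enumerate(d):
                k = z[m][i]
                n_kt[k, t] -= 1; n_mk[m, k] -= 1     # reject the i-th assignment (¬i)
                p = (n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta) * (n_mk[m] + alpha)
                k = int(rng.choice(K, p=p / p.sum()))   # sample from formula (6)
                n_kt[k, t] += 1; n_mk[m, k] += 1
                z[m][i] = k
    return n_kt, n_mk                                # step 3.4: co-occurrence matrices

n_kt, n_mk = gibbs_lda([[0, 1, 0], [2, 3, 2]], V=4, K=2, alpha=0.5, beta=0.1)
```

The per-document denominator of formula (6) is constant over k, so the sketch omits it before normalizing; the returned count matrices are the co-occurrence frequency matrices of step 3.4.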
Step 4. Calculate the topic relevance of document content: the probability distribution value of a document over a predetermined topic is taken as its topic relevance value.
Specifically, using the topic model, the topic distribution probability values of the document are taken as the relevance values for its topics.
Note that the documents in step 4 differ from the documents of the topic corpus in steps 1 to 3: the latter are training documents, used to train the topic model; the former are prediction documents, to which the topic model is applied.
Thus, through steps 1 to 4, the topic model optimization method based on information entropy is completed.
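The relevance value of step 4 can be read off from the smoothed document-topic distribution. A minimal sketch (hypothetical names; assuming the usual estimate θ̂_{m,k} = (n_mk + α)/(n_m + Kα) consistent with formula (6)):

```python
def topic_relevance(doc_topic_counts, alpha):
    """Step 4 sketch: topic relevance of a document is its smoothed
    document-topic probability theta_mk = (n_mk + alpha) / (n_m + K*alpha)."""
    K = len(doc_topic_counts)
    total = sum(doc_topic_counts) + K * alpha
    return [(n + alpha) / total for n in doc_topic_counts]

# a document whose 10 words were mostly assigned to topic 0
theta = topic_relevance([8, 1, 1], alpha=0.5)
```

The resulting values sum to one over the topics, and the document is classified to the topic with the largest relevance value.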
Advantageous effects
Compared with existing topic model optimization methods, the topic model optimization method based on information entropy of the present invention has the following advantages:
1. The method of the invention is better suited to calculating the topic relevance of document content whose subject keyword items are interdependent;
2. The method of the invention can, according to the information entropy and mutual information statistics, mine high-granularity, strongly characteristic theme feature lexical items and effectively aggregate feature lexical items, avoiding the defect that low-granularity lexical items blur the topic and let one word span multiple topics, thereby realizing optimized classification of document content with dependent themes;
3. The method of the invention only needs to compute the information entropy and mutual information statistics of lexical items before training the model; the operation is simple and the running cost of the method is small.
Description of the drawings
Fig. 1 is the flow diagram of the topic model optimization method based on information entropy of the present invention and of embodiment 1;
Fig. 2 is the schematic diagram of the LDA probabilistic model in embodiment 2 of the topic model optimization method based on information entropy of the present invention;
Fig. 3 is the schematic diagram of the doc-topic-word path probabilities sampled by the Gibbs sampling formula in embodiment 3 of the topic model optimization method based on information entropy of the present invention.
Specific embodiments
The invention will be further described below with reference to the accompanying drawings and specific embodiments. In order to make the technical solutions and advantages in the examples of the present application more clearly understood, exemplary embodiments of the application are described in more detail below in conjunction with the drawings; obviously, the described embodiments are only a part of the embodiments of the application, rather than an exhaustion of all embodiments. It should be noted that, where no conflict arises, the examples in the application may be combined with one another.
The present invention defines independent themes as themes whose subject keyword sets differ significantly and which have no superior-subordinate inclusion relations, and dependent themes as themes whose subject keyword sets have large intersections or stand in superior-subordinate inclusion relations. The purpose of defining the two concepts of independent theme and dependent theme is to further subdivide the specific category that document content belongs to.
The applicant has analyzed the existing technical methods that model document topics from the perspective of document generation. Taking the LDA model method as an example, if the model is applied directly to document content relevance computation, then when most keyword items are shared between topics, the model is not accurate enough in classifying documents of dependent themes.
The example of the present application provides a topic model optimization method based on information entropy: the information entropy statistic measures the degree of freedom of a word string from the outside, while the mutual information statistic measures the cohesion of a word string from the inside; thereby the feature lexical items that uniquely characterize a theme are determined, the theme dictionary is generated accordingly, the topic model is trained by the Gibbs sampling method, and the topic relevance values of document content are calculated. Using statistics such as information entropy, high-granularity, strongly characteristic theme feature lexical items can be mined and feature lexical items effectively aggregated, avoiding the defect that low-granularity lexical items blur the topic and let one word span multiple topics. It should be noted that high granularity here refers to combinations of multiple words, and low granularity refers to single words.
The scheme in the example of the present application can be applied to fields such as text mining, sentiment analysis, recommendation systems, digital books, public-opinion monitoring, data acquisition, social networking sites and personalized retrieval.
Embodiment 1
Embodiment 1 of the present invention elaborates the topic model optimization method based on information entropy; Fig. 1 is the implementation flow chart of the invention. The method comprises the following steps:
S1. Train the LDA topic model to obtain the theme dictionary.
For an LDA model with M documents and K topics, the generative process of a document in the LDA model is as follows:
S1.A For each document d_m, m ∈ [1, M], select document d_m with probability p(d_m), where M is the number of documents;
S1.B For each document d_m, m ∈ [1, M], sample the topic multinomial distribution θ_m of document d_m, where α is the parameter of the Dirichlet prior distribution;
S1.C For each document d_m, m ∈ [1, M], sample the topic k = z_{m,n} of the n-th word of document d_m, where z_{m,n} denotes the topic of the n-th word in the m-th document;
S1.D For each topic k ∈ [1, K], sample the lexical-item multinomial distribution φ_k of topic k, where β is the parameter of the Dirichlet prior distribution;
S1.E For each document d_m, m ∈ [1, M], generate the topic word w_{m,n} from the lexical-item multinomial distribution φ_k, where w_{m,n} denotes the n-th word in the m-th document;
S1.F For the N_m words of document d_m, repeat steps S1.B to S1.E to generate the corresponding topic words w_{m,N};
S1.G Repeat steps S1.A to S1.F for the remaining M-1 documents to generate the theme dictionary w_{M,N}, where N is the number of words in the M documents.
S2. Using information entropy, mine from the topic corpus the candidate feature lexical items that uniquely characterize the theme.
S2.A Using the information entropy and mutual information statistics, determine the feature lexical items capable of uniquely characterizing the theme. The computation is as follows: a candidate feature lexical item phrase must satisfy HL(phrase) > a_1 ∩ HR(phrase) > a_2 ∩ I(phrase) > a_3, where a_1, a_2, a_3 are the thresholds of HL(phrase), HR(phrase) and I(phrase) respectively.
The left information entropy HL(phrase) and the right information entropy HR(phrase) of a candidate feature lexical item phrase are defined as
HL(phrase) = -Σ_x p(x|phrase) log p(x|phrase),
HR(phrase) = -Σ_y p(y|phrase) log p(y|phrase),
where character string x and character string y denote the left-adjacent and right-adjacent character strings of phrase respectively, p(x|phrase) denotes the probability that character string x is the left-adjacent string of phrase, and p(y|phrase) denotes the probability that character string y is the right-adjacent string of phrase.
The mutual information of a candidate feature lexical item phrase is calculated as
I(phrase) = log [ p(x, y) / (p(x) p(y)) ],
where character string x and character string y denote the left and right constituent strings of phrase, p(x, y) denotes the probability that phrase occurs, and p(x) and p(y) denote the probabilities that x and y occur individually. It should be noted that the larger the mutual information value I(phrase), the higher the correlation between x and y, that is, the higher the probability that x and y join into a word.
Meanwhile, w_i ∈ phrase denotes a substring w_i (1 ≤ i ≤ 3) composing phrase, and Σ p(phrase) is a normalization coefficient with 0 ≤ p(phrase) ≤ 1.
Meanwhile, p(topic|w_i) can be obtained by the Bayesian formula:
p(topic|w_i) = p(w_i|topic) p(topic) / p(w_i),
where p(topic) and p(w_i) are counted from the topic corpus, and p(w_i|topic) is obtained from the joint distribution of words and topics in S1.
S2.B According to the ranking of the feature lexical items' probability values p(phrase) from high to low, screen the candidate feature lexical items meeting the threshold conditions; specifically, set different threshold parameters according to the above formulas, and compute and select the top K feature lexical items.
S3. Taking the candidate feature lexical items as the theme dictionary, train the topic model. The specific training steps include:
S3.A Random initialization: to each word w_i in each document of the topic corpus, randomly assign a topic number z_i;
S3.B Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update it in the topic corpus, where the Gibbs sampling computation is
p(z_i = k | z_{¬i}, w) ∝ φ̂_{k,t} · θ̂_{m,k} = [(n_{k,¬i}^{(t)} + β_t) / (Σ_{t=1}^{V} (n_{k,¬i}^{(t)} + β_t))] · [(n_{m,¬i}^{(k)} + α_k) / (Σ_{k=1}^{K} (n_{m,¬i}^{(k)} + α_k))],
where it is assumed that lexical item w_i = t; z_i denotes the topic variable corresponding to the i-th word; ¬i denotes rejecting the i-th assignment; n_k^{(t)} denotes the number of times lexical item t occurs in topic k; β_t is the Dirichlet prior of lexical item t; n_m^{(k)} denotes the number of times topic k occurs in document m; α_k is the Dirichlet prior of topic k; φ̂_{k,t} denotes the probability of lexical item t in topic k; θ̂_{m,k} is the probability of topic k in document m; V denotes the total number of lexical items in the topic corpus; K is the total number of topics in the theme dictionary; and M is the number of documents.
S3.C Repeat the above sampling process until the Gibbs sampling converges;
S3.D Count the co-occurrence frequency matrices of document topics and lexical items in the topic corpus to obtain the topic model.
S4. Calculate the topic relevance of document content: the probability distribution value of a document over a predetermined topic is taken as its topic relevance value.
Thus, through S1 to S4, the topic model optimization method based on information entropy of the present embodiment is completed.
Embodiment 2
The present embodiment specifically describes the theme dictionary generation process of the LDA topic model described in step 1 of the present invention. LDA, as a document-topic generative model, contains the three-layer structure of feature words, topics and documents, and is a three-layer Bayesian probabilistic model, as shown in Fig. 2. As can be seen from the probabilistic model of Fig. 2, the generation of the theme dictionary corresponds to two independent Dirichlet-Multinomial conjugate structures.
S2.A Generative process 1: α → θ_m → z_m, corresponding to step 1, generates the topic numbers corresponding to all the words in the m-th document. From the conjugate-structure property and the mutual independence of documents, the generative probability of the topics in the entire topic corpus is
p(z|α) = Π_{m=1}^{M} [ Δ(n_m + α) / Δ(α) ],
where n_m = (n_m^{(1)}, ..., n_m^{(K)}), n_m^{(k)} denotes the number of lexical items generated by the k-th topic in the m-th document, K is the number of topics, and M is the number of documents.
S2.B Generative process 2: β → φ_k → w_k, corresponding to step 1, generates all the lexical items whose topic number is k in the M documents. Since the process of generating a lexical item's topic number and the process of generating the lexical item can be exchanged with each other, the generative process of a document can be regarded as first generating the topic numbers of all lexical items in the document, and then regenerating the different lexical items for all identical topic numbers. From the conjugate-structure property and the mutual independence of documents, the generative probability of the lexical items in the entire corpus is
p(w|z, β) = Π_{k=1}^{K} [ Δ(n_k + β) / Δ(β) ],
where n_k = (n_k^{(1)}, ..., n_k^{(V)}), n_k^{(t)} denotes the number of occurrences of lexical item t among the lexical items generated by the k-th topic.
S2.C According to p(z|α) and p(w|z, β), the joint distribution of the words and topics of each keyword item is obtained as
p(w, z|α, β) = p(w|z, β) p(z|α) = Π_{k=1}^{K} [ Δ(n_k + β) / Δ(β) ] · Π_{m=1}^{M} [ Δ(n_m + α) / Δ(α) ].
Thus, through steps S2.A to S2.C, the theme dictionary generation process of the LDA topic model is completed.
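The Δ(·) normalizers in the conjugate generative probabilities above can be evaluated numerically in log space via the log-gamma function, since Δ(x) = Π_i Γ(x_i) / Γ(Σ_i x_i). A minimal sketch, assuming a symmetric prior and hypothetical function names:

```python
from math import lgamma

def log_delta(counts):
    """log Δ(x) = sum_i log Γ(x_i) - log Γ(sum_i x_i), the Dirichlet normalizer
    appearing in p(z|α) and p(w|z,β) above."""
    return sum(lgamma(x) for x in counts) - lgamma(sum(counts))

def log_p_topics(doc_topic_counts, alpha, K):
    """One factor of p(z|α) for a single document: log [Δ(n_m + α) / Δ(α)],
    with a symmetric prior α on all K topics."""
    prior = [alpha] * K
    posterior = [n + alpha for n in doc_topic_counts]
    return log_delta(posterior) - log_delta(prior)

# a 3-word document with all words assigned to topic 0, K = 2, alpha = 0.5
lp = log_p_topics([3, 0], alpha=0.5, K=2)
```

Working in log space avoids overflow of the gamma functions for realistic count vectors.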
Embodiment 3
The present embodiment specifically describes the Gibbs sampling method described in step 3 of the present invention, and gives the doc-topic-word path probability schematic diagram shown in Fig. 3. The steps are as follows:
S3.A Random initialization: to each word w_i in each document of the topic corpus, randomly assign a topic number z_i;
S3.B Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update it in the topic corpus, where the Gibbs sampling computation is
p(z_i = k | z_{¬i}, w) ∝ φ̂_{k,t} · θ̂_{m,k} = [(n_{k,¬i}^{(t)} + β_t) / (Σ_{t=1}^{V} (n_{k,¬i}^{(t)} + β_t))] · [(n_{m,¬i}^{(k)} + α_k) / (Σ_{k=1}^{K} (n_{m,¬i}^{(k)} + α_k))],
where it is assumed that lexical item w_i = t; z_i denotes the topic variable corresponding to the i-th word; ¬i denotes rejecting the i-th assignment; n_k^{(t)} denotes the number of times lexical item t occurs in topic k; β_t is the Dirichlet prior of lexical item t; n_m^{(k)} denotes the number of times topic k occurs in document m; α_k is the Dirichlet prior of topic k; φ̂_{k,t} denotes the probability of lexical item t in topic k; θ̂_{m,k} is the probability of topic k in document m; V denotes the total number of lexical items in the topic corpus; K is the total number of topics in the theme dictionary; and M is the number of documents.
The right-hand side of the formula is p(topic|doc) · p(word|topic), that is, the probability of the path doc → topic → word. Since there are K topics, the physical meaning of the Gibbs sampling formula is exactly to sample among these K paths, as shown in Fig. 3.
S3.C Repeat the above sampling process until the Gibbs sampling converges;
S3.D Count the co-occurrence frequency matrices of document topics and lexical items in the topic corpus to obtain the topic model.
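The doc → topic → word path interpretation can be made concrete: for a given word, the sampler chooses among K paths with probabilities proportional to p(topic|doc) · p(word|topic). A minimal illustrative sketch with hypothetical names and toy numbers:

```python
import numpy as np

def path_probabilities(theta_m, phi, t):
    """The K doc->topic->word path probabilities for word t in document m:
    p(z=k|doc) * p(word=t|z=k), normalized over the K paths — exactly the
    K terms the Gibbs sampling formula samples among."""
    p = np.array([theta_m[k] * phi[k][t] for k in range(len(theta_m))])
    return p / p.sum()

theta_m = [0.7, 0.3]                # p(topic|doc) for K = 2 topics
phi = [[0.5, 0.5], [0.9, 0.1]]      # p(word|topic) over a 2-word vocabulary
p = path_probabilities(theta_m, phi, t=0)
```

Here path 0 contributes 0.7 · 0.5 = 0.35 and path 1 contributes 0.3 · 0.9 = 0.27, so the word is more likely to be reassigned to topic 0.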
Embodiment 4
Based on examples detailed above 1, the present embodiment provides a kind of specifically topic model optimization method based on comentropy, the party
Method is specifically realized based on today's tops topic corpus and is calculated the document content degree of correlation using topic model.
S4.A based on experience value, quasi-definite Dirichlet prior distributions parameter alpha=0.5 and β=0.1, Gibbs
Sampling maximum iterations are 2000 times, and based on comentropy and mutual information statistic, correspondence is calculated in number of topics 70
Theme candidate feature lexical item.
Specifically, candidates' feature lexical item such as " Group Co., Ltd ", " capital market ", " China's economic " and " financial institution "
It can substantially represent finance and economic theme;Candidates' feature lexical items such as " smart mobile phones ", " Internet era " and " artificial intelligence technology "
It can substantially represent scientific and technological class theme;Candidates' feature lexical item such as " USN " and " weaponry " can substantially represent military class master
Topic;Candidates' feature lexical item such as " Cixi empress dowager " and " Chinese history " can substantially represent history class theme.
S4.B re-starts training and obtains topic model using candidate feature lexical item as the theme dictionary of topic model.
Specifically, 3 topics, 9877 terms and 121 documents are chosen for retraining, yielding:
Feature terms of the "data" topic (the 29 highest-weight terms are selected here): 11, 90, company, double ten, technology, work, platform, AI, more than, China, Internet, artificial intelligence, Alibaba, Tmall, product, enterprise, JD.com, development, Ali, future, brand, user, global, 100 million yuan, 2017, 10, number, automation, Didi.
Feature terms of the "equipment" topic (the 29 highest-weight terms are selected here): DNF, player, version, damage, occupation, skill, Lu Ke, game, trade council, upgrade, attribute, increase, epic, amplification, authentic, update, abyss, robot, Korean server, time, sweeping, stone, cross-region, clean, 90, strength, correction, national server, weapon.
Feature terms of the "vehicle" topic (the 29 highest-weight terms are selected here): automobile, design, price, engine, engine oil, configuration, use, standard, Alphard, power, space, car buying, 4S shop, interior, sales, market, feel, BMW, friend, influence, domestic, 310w, car owner, automatic, styling, vehicle, consumer, SUV, car purchase.
S4.C Apply the topic model to compute the topic relevance value of document content, taking the document's probability distribution value on a predetermined topic as its topic relevance value.
Specifically, using the topic dictionaries of the "data", "equipment" and "vehicle" topics above, technology, automobile and game corpora are chosen and the topic relevance values of document content in each corpus are computed.
Topic relevance values of document content in the "technology" corpus (the first 3 documents are cited here): doc:0, topic:0 (0.520574), topic:1 (0.314514), topic:2 (0.164912); doc:1, topic:0 (0.738012), topic:1 (0.135914), topic:2 (0.126075); doc:2, topic:0 (0.813056), topic:2 (0.122989), topic:1 (0.063955).
Topic relevance values of document content in the "automobile" corpus (the first 3 documents are cited here): doc:0, topic:2 (0.755955), topic:0 (0.143801), topic:1 (0.100244); doc:1, topic:2 (0.736144), topic:0 (0.144676), topic:1 (0.119180); doc:2, topic:2 (0.614256), topic:0 (0.298078), topic:1 (0.087666).
Topic relevance values of document content in the "game" corpus (the first 3 documents are cited here): doc:0, topic:0 (0.395853), topic:1 (0.336999), topic:2 (0.267147); doc:1, topic:1 (0.607892), topic:2 (0.252507), topic:0 (0.139601); doc:2, topic:1 (0.420732), topic:0 (0.314079), topic:2 (0.265189).
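The relevance values above can be ranked per document as in the following sketch (illustration only), here using the first "technology" document's distribution from the embodiment:

```python
def subject_relativity(topic_dist):
    """Rank the topics of one document by probability (step S4.C): the
    document's topic relevance value for a topic is its probability mass."""
    return sorted(enumerate(topic_dist), key=lambda kv: kv[1], reverse=True)

# doc:0 of the "technology" corpus from the embodiment above.
ranked = subject_relativity([0.520574, 0.314514, 0.164912])
```

The top-ranked entry identifies the document's topic category; here topic 0 (the "data" topic) dominates, matching the listing above.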
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (5)
1. A topic model optimization method based on information entropy, characterized in that: feature terms that can uniquely characterize a topic are mined from a topic corpus using information entropy and mutual information statistics, feature terms meeting threshold conditions are selected, the topic model is trained, the topic relevance of document content is computed, and the topic category of each document is distinguished; the method is particularly suitable for cases where topic keyword sets overlap heavily or topics stand in superior-subordinate inclusion relations, and it can effectively distinguish the topic category a document belongs to; the method comprises the following steps:
Step 1. Train an LDA topic model to obtain a topic dictionary;
specifically, the LDA topic model is trained on the topic corpus to obtain the topic dictionary;
Step 2. Mine, using information entropy, the candidate feature terms that uniquely characterize the topic from the topic corpus;
specifically, the topic corpus is scanned, and candidate feature terms meeting specific threshold conditions are mined based on left/right information entropy and mutual information statistics;
Step 3. Using the candidate feature terms as the topic dictionary, train to obtain the topic model;
Step 4. Compute the topic relevance of document content, taking the document's probability distribution value on a predetermined topic as its topic relevance value;
specifically, using the topic model, the document's topic distribution probability value is taken as its topic relevance value;
thus, from step 1 to step 4, a topic model optimization method based on information entropy is completed.
2. The topic model optimization method based on information entropy according to claim 1, characterized in that: the topic corpus in step 1 is selected by the user as needed, and step 1 specifically comprises the following sub-steps:
Step 1.1 Select a document d_m according to probability p(d_m), where m ∈ [1, M] and M is the number of documents;
Step 1.2 Generate the topic multinomial distribution θ_m of document d_m from a Dirichlet prior distribution, where α is the parameter of the Dirichlet prior distribution;
Step 1.3 According to θ_m, generate the topic k = z_{m,n} of the n-th word of document d_m, where z_{m,n} denotes the n-th topic in document m;
Step 1.4 Again using a Dirichlet prior distribution, generate the term multinomial distribution φ_k of topic k = z_{m,n}, where β is the parameter of the Dirichlet prior distribution;
Step 1.5 Generate the topic word w_{m,n} from the term multinomial distribution φ_k, where w_{m,n} denotes the n-th word in document m;
Step 1.6 For the N_m words in document d_m, repeat steps 1.2 to 1.5 to generate the corresponding topic words;
Step 1.7 Repeat steps 1.1 to 1.6 another M−1 times for the M documents to generate the topic dictionary w_{M,N}, where N is the number of words in the M documents.
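For illustration only (not part of the claim), the generative process of steps 1.1 to 1.7 can be sketched in Python with numpy; the corpus sizes, priors and seed below are arbitrary assumptions:

```python
import numpy as np

def generate_corpus(M, N, K, V, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process of steps 1.1-1.7."""
    rng = np.random.default_rng(seed)
    # Step 1.4: term multinomial distribution phi_k for every topic k.
    phi = rng.dirichlet([beta] * V, size=K)
    docs = []
    for m in range(M):                                 # step 1.7: all M documents
        theta = rng.dirichlet([alpha] * K)             # step 1.2: theta_m ~ Dir(alpha)
        words = []
        for n in range(N):                             # step 1.6: N_m words
            k = rng.choice(K, p=theta)                 # step 1.3: topic z_{m,n}
            words.append(int(rng.choice(V, p=phi[k]))) # step 1.5: word w_{m,n}
        docs.append(words)
    return docs

docs = generate_corpus(M=3, N=5, K=2, V=10, alpha=0.5, beta=0.1)
```

In real LDA training this process is inverted: the words are observed and the hidden θ and φ are inferred, which is what the Gibbs sampling of claim 4 does.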
3. The topic model optimization method based on information entropy according to claim 1, characterized in that step 2 specifically comprises the following sub-steps:
Step 2.1 Use information entropy and mutual information statistics to determine the feature terms that can uniquely characterize a topic; the calculation method is shown in formula (1):
$$p(\mathrm{phrase}) = \sum_{w_i \in \mathrm{phrase}} p(\mathrm{topic} \mid w_i) \Big/ \sum p(\mathrm{phrase}) \tag{1}$$
A candidate feature term phrase must satisfy: HL(phrase) > a_1 ∩ HR(phrase) > a_2 ∩ I(phrase) > a_3, where a_1, a_2 and a_3 are the thresholds of HL(phrase), HR(phrase) and I(phrase), respectively;
The left information entropy HL(phrase) and right information entropy HR(phrase) of a candidate feature term phrase are defined by formulas (2) and (3):
$$HR(\mathrm{phrase}) = -\sum_{y} p(y \mid \mathrm{phrase}) \log p(y \mid \mathrm{phrase}) \tag{2}$$
$$HL(\mathrm{phrase}) = -\sum_{x} p(x \mid \mathrm{phrase}) \log p(x \mid \mathrm{phrase}) \tag{3}$$
where strings x and y denote the left-adjacent string and right-adjacent string of the candidate feature term phrase, respectively; p(x | phrase) denotes the probability that string x is the left-adjacent string of phrase, and p(y | phrase) denotes the probability that string y is the right-adjacent string of phrase;
The mutual information of a candidate feature term phrase is computed by formula (4):
$$I(\mathrm{phrase}) = \sum_{y \in \mathrm{phrase}} \sum_{x \in \mathrm{phrase}} p(x, y) \log\!\left(\frac{p(x, y)}{p(x)\,p(y)}\right) \tag{4}$$
where strings x and y denote the left-adjacent string and right-adjacent string of the candidate feature term phrase, respectively; p(x, y) denotes the probability that the candidate feature term phrase occurs, and p(x) and p(y) denote the probabilities that strings x and y occur independently; it should be noted that the larger the mutual information I(phrase), the higher the correlation between strings x and y, i.e., the higher the probability that x and y join into a word;
meanwhile, w_i ∈ phrase denotes a substring w_i of phrase (1 ≤ i ≤ 3), and Σp(phrase) is a normalization factor with 0 ≤ p(phrase) ≤ 1;
meanwhile, p(topic | w_i) can be obtained from Bayes' formula, as shown in formula (5):
$$p(\mathrm{topic} \mid w_i) = \frac{p(w_i \mid \mathrm{topic})\,p(\mathrm{topic})}{p(w_i)} \tag{5}$$
where p(topic) and p(w_i) are counted from the topic corpus, and p(w_i | topic) is obtained from step 1;
Step 2.2 According to the ranking of candidate feature terms by probability value p(phrase) from high to low, screen the candidate feature terms that meet the threshold conditions; specifically, set different threshold parameters according to formulas (1) and (5), and compute and select the top K candidate feature terms.
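As an illustrative sketch of formulas (2) to (4) (not part of the claim), the left or right entropy of a candidate phrase and one term of its mutual information sum can be computed as follows; the neighbor strings and counts are hypothetical:

```python
import math
from collections import Counter

def boundary_entropy(neighbors):
    """Left or right information entropy of a candidate phrase, formulas
    (2)/(3): the entropy of the distribution of strings adjacent to it."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())

def mi_term(p_xy, p_x, p_y):
    """One term of the mutual information sum of formula (4): how much more
    often parts x and y co-occur than they would if independent."""
    return p_xy * math.log(p_xy / (p_x * p_y))

# Hypothetical left-neighbor counts of a phrase; diverse neighbors mean
# high entropy, i.e. the phrase has a free boundary on that side.
left_neighbors = Counter({"the": 3, "a": 2, "this": 2})
hl = boundary_entropy(left_neighbors)
```

A phrase is kept as a candidate feature term only if HL > a_1, HR > a_2 and I > a_3, with the thresholds a_1, a_2, a_3 tuned on the corpus.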
4. The topic model optimization method based on information entropy according to claim 1, characterized in that the retraining in step 3 specifically comprises the following sub-steps:
Step 3.1 Random initialization: randomly assign a topic number z_i to each word w_i of every document in the topic corpus;
Step 3.2 Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update it in the topic corpus, where the Gibbs sampling formula is computed as shown in (6):
$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \hat{\theta}_{m,k} \cdot \hat{\varphi}_{k,t} = \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left(n_{m,\neg i}^{(k)} + \alpha_k\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V} \left(n_{k,\neg i}^{(t)} + \beta_t\right)} \tag{6}$$
where the term w_i = t is assumed; z_i denotes the topic variable corresponding to the i-th word; ¬i denotes excluding the i-th item; n_{k,¬i}^{(t)} denotes the number of occurrences of term t in topic k; β_t is the Dirichlet prior of term t; n_{m,¬i}^{(k)} denotes the number of occurrences of topic k in document m; α_k is the Dirichlet prior of topic k; φ̂_{k,t} denotes the probability of term t in topic k; θ̂_{m,k} is the probability of topic k in document m; V is the total number of terms in the topic corpus, K is the total number of topics in the topic dictionary, and M is the number of documents;
Step 3.3 Repeat the sampling process of step 3.2 until Gibbs sampling converges;
here, Gibbs sampling convergence means that the probability values obtained by sampling with formula (6) approach the joint distribution of words and topics;
Step 3.4 Count the co-occurrence frequency matrix of document topics and terms in the topic corpus to obtain the topic model.
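As an illustrative sketch (not part of the claim), the per-topic sampling weights of formula (6) can be computed with numpy as below. Since the document-side denominator of formula (6) does not depend on k, it cancels under the proportionality and is omitted; the toy counts are assumptions:

```python
import numpy as np

def resample_weights(n_m, n_kt, t, alpha, beta):
    """Normalized per-topic sampling weights of formula (6) for term t in
    one document, with symmetric priors alpha and beta.
    n_m[k]     = count of topic k in the document (current word removed);
    n_kt[k][t] = count of term t in topic k (current word removed)."""
    K, V = n_kt.shape
    doc_part = n_m + alpha                                   # n_{m,!i}^{(k)} + alpha
    word_part = (n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta)
    p = doc_part * word_part
    return p / p.sum()                                       # normalize, then sample z_i

# Toy counts: K = 2 topics, V = 2 terms, resampling a word with term id 0.
p = resample_weights(np.array([3.0, 1.0]),
                     np.array([[2.0, 0.0], [1.0, 1.0]]),
                     t=0, alpha=0.5, beta=0.1)
```

The new topic z_i would then be drawn from this distribution, the counts updated, and the scan continued until convergence (step 3.3).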
5. The topic model optimization method based on information entropy according to claim 1, characterized in that the documents in step 4 are different from the documents in the topic corpus of steps 1 to 3: the latter are training documents, used to train the topic model; the former are prediction documents, to which the topic model is applied.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810029097.7A CN108090231A (en) | 2018-01-12 | 2018-01-12 | A kind of topic model optimization method based on comentropy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108090231A true CN108090231A (en) | 2018-05-29 |
Family
ID=62183108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810029097.7A Pending CN108090231A (en) | 2018-01-12 | 2018-01-12 | A kind of topic model optimization method based on comentropy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108090231A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324700A (en) * | 2013-06-08 | 2013-09-25 | 同济大学 | Noumenon concept attribute learning method based on Web information |
Non-Patent Citations (4)
Title |
---|
JULY: "An Accessible Explanation of the LDA Topic Model" (通俗理解LDA主题模型), 《HTTPS://BLOG.CSDN.NET/V_JULY_V/ARTICLE/DETAILS/41209515》 * |
LIN YULAN: "Research on Interactive Text Topic Mining Based on LDA Model, Taking Customer Service Chat Records as an Example", 《2017 INTERNATIONAL CONFERENCE ON COMPUTER TECHNOLOGY, ELECTRONICS AND COMMUNICATION (ICCTEC)》 * |
HANKCS (码农场 > Natural Language Processing): "Phrase Extraction and Identification Based on Mutual Information and Left/Right Information Entropy" (基于互信息和左右信息熵的短语提取识别), 《HTTP://WWW.HANKCS.COM/NLP/EXTRACTION-AND-IDENTIFICATION-OF-MUTUAL-INFORMATION-ABOUT-THE-PHRASE-BASED-ON-INFORMATION-ENTROPY.HTML》 * |
HUANG YONG: "Research on a Feature Dimensionality Reduction Method Combining Improved Mutual Information with LDA" (改进的互信息与LDA结合的特征降维方法研究), 《China Masters' Theses Full-text Database (Electronic Journal), Information Science and Technology Series》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271623A (en) * | 2018-08-16 | 2019-01-25 | 龙马智芯(珠海横琴)科技有限公司 | Text emotion denoising method and system |
CN109344252A (en) * | 2018-09-12 | 2019-02-15 | 东北大学 | Microblogging file classification method and system based on high-quality topic expansion |
CN109376347A (en) * | 2018-10-16 | 2019-02-22 | 北京信息科技大学 | A kind of HSK composition generation method based on topic model |
CN109614626A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Keyword Automatic method based on gravitational model |
CN109919427A (en) * | 2019-01-24 | 2019-06-21 | 平安科技(深圳)有限公司 | Model subject under discussion duplicate removal appraisal procedure, server and computer readable storage medium |
WO2020199591A1 (en) * | 2019-03-29 | 2020-10-08 | 平安科技(深圳)有限公司 | Text categorization model training method, apparatus, computer device, and storage medium |
CN110347977A (en) * | 2019-06-28 | 2019-10-18 | 太原理工大学 | A kind of news automated tag method based on LDA model |
CN111507098A (en) * | 2020-04-17 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111507098B (en) * | 2020-04-17 | 2023-03-21 | 腾讯科技(深圳)有限公司 | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN112100492A (en) * | 2020-09-11 | 2020-12-18 | 河北冀联人力资源服务集团有限公司 | Batch delivery method and system for resumes of different versions |
CN113032573A (en) * | 2021-04-30 | 2021-06-25 | 《中国学术期刊(光盘版)》电子杂志社有限公司 | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm |
CN113032573B (en) * | 2021-04-30 | 2024-01-23 | 同方知网数字出版技术股份有限公司 | Large-scale text classification method and system combining topic semantics and TF-IDF algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180529 |